CN116167354B - Medical term feature extraction model training and standardization method and device - Google Patents

Medical term feature extraction model training and standardization method and device

Info

Publication number
CN116167354B
Authority
CN
China
Prior art keywords
medical
medical term
terms
term
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310422988.XA
Other languages
Chinese (zh)
Other versions
CN116167354A (en)
Inventor
赵礼悦
齐综擎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Asiainfo Data Co ltd
Original Assignee
Beijing Asiainfo Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Asiainfo Data Co ltd filed Critical Beijing Asiainfo Data Co ltd
Priority to CN202310422988.XA priority Critical patent/CN116167354B/en
Publication of CN116167354A publication Critical patent/CN116167354A/en
Application granted granted Critical
Publication of CN116167354B publication Critical patent/CN116167354B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a medical term feature extraction model training and standardization method and device. The method comprises the following steps: acquiring a plurality of client medical terms and a plurality of standard medical terms; according to the mapping relation between terms, combining the client medical terms and the standard medical terms as training samples to generate a sample set; inputting the sample set into a twin network model for feature extraction, optimizing and adjusting the model parameters according to the difference between the feature vectors extracted from each training sample, and taking the adjusted model as the medical term feature extraction model. The invention greatly improves the efficiency and accuracy of the medical term standardization process.

Description

Medical term feature extraction model training and standardization method and device
Technical Field
The invention relates to the field of medical insurance risk control, and in particular to a method and a device for training a medical term feature extraction model and for standardizing medical terms.
Background
Because different hospitals use different medical information systems, the formats and ways in which information is recorded also differ. When information needs to be exchanged among multiple hospitals, computers can identify only codes and identifiers, so the information cannot be exchanged at the semantic level, and medical resource sharing such as cross-regional and cross-system medical treatment cannot be realized. Standardizing medical terms is therefore of great importance for achieving medical resource sharing.
The information systems of hospitals have accumulated a wide variety of medical terms. These terms may consist of common or customary expressions, colloquial names, and the like that medical staff have adopted over long-term medical practice, and they are an important source of information in medical insurance management data. Standardizing the various medical terms used by clients and collected in the original information systems is therefore very important for the subsequent analysis and use of the data.
Disclosure of Invention
In view of the foregoing, the present invention is proposed to provide a medical term feature extraction model training method, a medical term standardization method, and corresponding devices that overcome or at least partially solve the foregoing problems.
In a first aspect, an embodiment of the present invention provides a training method for a medical term feature extraction model, including:
acquiring a plurality of client medical terms and a plurality of standard medical terms;
according to the mapping relation between terms, combining the client medical terms and the standard medical terms as training samples to generate a sample set;
inputting the sample set into a twin network model for feature extraction, carrying out model parameter optimization adjustment according to differences among feature vectors extracted by each training sample in the twin network model, and taking the adjusted model as a medical term feature extraction model.
In one embodiment, the combining the client medical term and the standard medical term according to the mapping relationship between terms as a training sample includes:
respectively combining the client medical terms with the correct mapping relation with the standard medical terms to serve as positive samples, and generating a positive sample set;
and shuffling the client medical terms and/or standard medical terms of each positive sample in the positive sample set to generate a negative sample set.
In one embodiment, shuffling the client medical terms and/or standard medical terms of each positive sample in the positive sample set to generate a negative sample set includes:
shuffling the client medical terms and/or standard medical terms of each positive sample in the positive sample set;
and selecting, from the shuffled positive sample set, the combinations of a client medical term and a standard medical term that do not have the correct mapping relation and whose similarity value is higher than a preset similarity threshold, as the negative sample set.
In a second aspect, an embodiment of the present invention provides a method for standardization of medical terms, including:
inputting medical terms to be standardized into a medical term feature extraction model to perform feature extraction, and obtaining corresponding feature vectors serving as first feature vectors;
acquiring feature vectors corresponding to standard medical terms in the standard medical term word stock as second feature vectors;
calculating a similarity value between the first feature vector and each second feature vector;
extracting a standard medical term corresponding to the maximum similarity value as a standard medical term matched with the medical term to be standardized;
the medical term feature extraction model is obtained through the training method of the medical term feature extraction model.
In one embodiment, before the step of inputting the medical term to be standardized into the medical term feature extraction model for feature extraction, the method further includes:
respectively inputting each term in the standard medical term word stock into the medical term feature extraction model to perform feature extraction, obtaining the feature vector corresponding to each term in the standard medical term word stock, and storing the feature vectors;
correspondingly, the acquiring of the feature vectors corresponding to the standard medical terms in the standard medical term word stock includes:
extracting the pre-stored feature vectors corresponding to the terms in the standard medical term word stock.
In one embodiment, the calculating the similarity value between the first feature vector and each of the second feature vectors includes:
and respectively carrying out cosine similarity calculation on the first feature vector and each second feature vector to obtain a cosine similarity value between the first feature vector and each second feature vector.
In a third aspect, an embodiment of the present invention provides a training device for a medical term feature extraction model, including:
the acquisition module is used for acquiring a plurality of client medical terms and a plurality of standard medical terms;
the generation module is used for combining the client medical terms and the standard medical terms as training samples according to the mapping relation among terms to generate a sample set;
the training module is used for inputting the sample set into a pre-built twin network model to perform feature extraction, performing model parameter optimization adjustment according to differences among feature vectors extracted in each training sample, and taking the adjusted model as a medical term feature extraction model.
In a fourth aspect, an embodiment of the present invention provides an apparatus for standardization of medical terms, including:
the first feature extraction module is used for inputting medical term to be standardized into the medical term feature extraction model to perform feature extraction, and obtaining a corresponding feature vector as a first feature vector;
the feature vector obtaining module is used for obtaining feature vectors corresponding to each standard medical term in the standard medical term word stock to be used as second feature vectors;
the similarity calculation module is used for calculating a similarity value between the first feature vector and each second feature vector;
the normalization module is used for extracting a standard medical term corresponding to the maximum similarity value as a standard medical term matched with the medical term to be normalized;
the medical term feature extraction model is obtained through the training method of the medical term feature extraction model.
In a fifth aspect, embodiments of the present invention provide a computer storage medium having stored therein computer-executable instructions that when executed by a processor implement a method of training a medical term feature extraction model as described above or a method of medical term normalization as described above.
In a sixth aspect, embodiments of the present invention provide an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the training method of the medical term feature extraction model as described above or the method of medical term normalization as described above.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
according to the training method of the medical term feature extraction model provided by the embodiment of the invention, client medical terms and standard medical terms are combined into training samples according to the mapping relation between terms, and the training samples are used to train the twin network model and adjust its parameters, yielding the medical term feature extraction model. Within the twin network model, semantically similar texts are found by comparing cosine similarity, Manhattan distance, or the like, so the model fully learns the matching relation between the feature vectors of client medical terms and those of standard medical terms. As a result, when a medical term to be standardized is input into the medical term feature extraction model, the output feature vector is as close as possible to the feature vector of the corresponding standard medical term, and the standard medical term that truly corresponds to the medical term to be standardized can then be selected using the similarity values between feature vectors, thereby standardizing the medical term.
When the amount of training data is insufficient, training with both positive and negative samples effectively increases the number of samples and avoids overfitting of the model, so that the output of the trained medical term feature extraction model is more accurate.
According to the medical term standardization method provided by the embodiment of the invention, each term in the standard medical term library is input in advance into the trained medical term feature extraction model for feature extraction, and the resulting feature vectors are stored. When the model is later called to standardize a term, the stored feature vectors of the standard medical terms are read directly instead of being extracted again, which avoids repeated feature extraction on the standard medical term library and improves the efficiency of medical term standardization.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a training method of a medical term feature extraction model in an embodiment of the invention;
FIG. 2 is a flowchart showing the implementation of the steps for generating a sample set in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the internal encoder and decoder of the SBERT model according to the present invention;
FIG. 4 is a flow chart of a method of normalization of medical terms in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training device for a medical term feature extraction model according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a device for standardizing medical terms according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the problem that in the prior art, it is difficult to efficiently and conveniently standardize the medical terms of clients in the medical field, an embodiment of the present invention provides a training method for a feature extraction model of the medical terms, where the flow is shown in fig. 1, and the method includes the following steps:
step S1: acquiring a plurality of client medical terms and a plurality of standard medical terms;
step S2: according to the mapping relation between terms, combining the client medical terms and the standard medical terms as training samples to generate a sample set;
step S3: inputting the sample set into a twin network model for feature extraction, carrying out model parameter optimization adjustment according to the difference between feature vectors extracted by each training sample in the twin network model, and taking the adjusted model as a medical term feature extraction model.
In an embodiment of the present invention, a medical term derived from a medical facility information system is referred to as a customer medical term.
In the embodiment of the present invention, terms used in standards and specifications in the medical field are called standard medical terms, while colloquial terms, customary terms, abbreviations of terms, and the like are called non-standard medical terms.
The customer medical terms may be non-standard medical terms, or may include both non-standard medical terms and standard medical terms.
Because different hospitals use different information systems and record information differently, a large proportion of the client medical terms in the diagnosis and treatment records of these systems may be non-standard medical terms, and the same medical term may be recorded in several different ways. This hinders medical resource sharing such as cross-regional and cross-system medical treatment, so the client medical terms need to be standardized.
In the above step S2, each of the client medical terms has a corresponding standard medical term, and if the acquired client medical term itself is the standard medical term, the corresponding standard medical term is itself.
The embodiment of the invention adopts a twin network model (Siamese network) to extract the features of medical terms. Compared with other feature extraction approaches, a twin network model can produce semantically meaningful sentence vectors. Within the twin network model, a measure such as cosine similarity or Manhattan distance is used to compare the feature vectors of each input pair of training sample data, and the model parameters are optimized and adjusted according to the difference. The model thereby fully learns the matching relation between the feature vectors of client medical terms and those of standard medical terms, so that the feature vector it outputs is closer to the feature vector of the true corresponding standard medical term, that is, the feature vector characterizes the term more accurately. This provides the data basis for standardizing medical terms with the trained medical term feature extraction model.
In some alternative embodiments, the plurality of customer medical terms and the plurality of standard medical terms acquired in step S1 have a mapping relationship, which may take several forms: 1. one customer medical term maps to exactly one standard medical term, and that standard medical term maps to exactly one customer medical term; 2. multiple customer medical terms map to the same standard medical term. (Typically, one customer medical term does not correspond to multiple standard medical terms at the same time.)
The method for acquiring the medical terms of the clients may be various, for example, the medical terms of the clients may be directly acquired from diagnosis records of a plurality of hospital information systems, or a plurality of medical terms of the clients may be manually selected in advance, which is not limited in the embodiment of the present invention.
In some alternative embodiments, as shown in fig. 2, the step of generating the sample set in the step S2 may be implemented, for example, by the following steps:
step S21: respectively combining the client medical terms with the correct mapping relation with the standard medical terms to serve as positive samples, and generating a positive sample set;
step S22: and shuffling the customer medical terms and/or standard medical terms of each positive sample in the positive sample set to generate a negative sample set.
In a positive sample of the positive sample set, the client medical term and the standard medical term have a correct mapping relation; in a negative sample of the negative sample set, by contrast, the client medical term and the standard medical term are combined with an incorrect mapping relation.
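Steps S21 and S22 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the term pairs other than the "typhoid sepsis" example are hypothetical, only the standard-term column is shuffled, and the 1/0 labels are an assumed convention for marking positive and negative samples.

```python
import random

# Hypothetical positive samples: (customer term, standard term) with a correct mapping.
positive_samples = [
    ("typhoid sepsis (provincial unified)", "typhoid sepsis"),
    ("typhoid pneumonia", "typhoid pneumonia"),
    ("acute bronchitis, unspecified", "acute bronchitis"),
    ("hypertension (hospital code)", "hypertension"),
]

def make_negative_samples(positives, seed=0):
    """Shuffle the standard-term column and keep only pairs that are not correct mappings."""
    rng = random.Random(seed)
    correct = set(positives)
    customers = [c for c, _ in positives]
    standards = [s for _, s in positives]
    rng.shuffle(standards)
    # A shuffle can leave some rows in place, so correct pairs are filtered out.
    return [(c, s) for c, s in zip(customers, standards) if (c, s) not in correct]

negative_samples = make_negative_samples(positive_samples)

# Assumed labelling: 1 marks a correct mapping, 0 an incorrect one.
sample_set = [(c, s, 1) for c, s in positive_samples] + \
             [(c, s, 0) for c, s in negative_samples]
```

Filtering against the set of correct pairs matters because a random shuffle may leave a term in its original row, which would otherwise put a correct pair into the negative sample set.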
In some alternative embodiments, in the step S21, the combination of the customer medical term and the standard medical term under the same type may be placed in one sample set (positive sample set or negative sample set) according to the type of medical term. Of course, it is also possible to place combinations of different types of customer medical terms and standard medical terms in one sample set without distinguishing the types of medical terms.
Table 1 below is an example of the content of a positive sample set composed of client medical terms of the same type (all of the type "disease") and the corresponding standard medical terms. Table 1 has two columns of data: the left column holds the client medical terms, and the right column holds the standard medical terms that have the correct mapping relationship with the client medical terms on the left. Taking the first row of data as an example, the client medical term "typhoid sepsis (provincial unified)" has the correct mapping relationship with the standard medical term "typhoid sepsis".
Table 1:
Figure SMS_1
The positive sample set may also be composed of client medical terms of multiple types and the corresponding standard medical terms. As a specific example, Table 2 below shows the content of a positive sample set composed of client medical terms of multiple types (including "disease", "medicine", "medical service item", etc.) and the corresponding standard medical terms. Table 2 has two columns of data: the left column holds the client medical terms, and the right column holds the standard medical terms that have the correct correspondence with the client medical terms on the left. Taking the first row of data as an example, the standard medical term corresponding to the client medical term "typhoid sepsis (provincial unified)" is "typhoid sepsis".
Table 2:
Figure SMS_2
In some alternative embodiments, in order to enhance what the model learns from negative samples, pairs of terms that are highly similar but do not actually have the correct mapping relation (confusable medical terms) may be used as enhanced negative samples.
Specifically, the enhanced negative samples are obtained, for example, by the following steps:
first, shuffling the customer medical terms and/or standard medical terms of each positive sample in the positive sample set;
Either the order of the standard medical term column or the order of the customer medical term column may be shuffled, or the two columns may be shuffled separately. The content of the negative sample set is described with a specific example: taking the content of the positive sample set shown in Table 2 as an example, shuffling the order of the standard medical term column gives the result shown in Table 3, shuffling the order of the customer medical term column gives the result shown in Table 4, and shuffling the two columns separately gives the result shown in Table 5:
Table 3:
Figure SMS_3
Table 4:
Figure SMS_4
Table 5:
Figure SMS_5
and secondly, selecting, from the shuffled positive sample set, the combinations of a client medical term and a standard medical term that do not have a correct mapping relation and whose similarity value is higher than a preset similarity threshold, and taking these combinations as the negative sample set.
The similarity value between the client medical term and the standard medical term can be calculated by referring to a similarity calculation method in the prior art, such as calculating cosine similarity or a similar algorithm, and the like, which is not described in detail in the embodiment of the present invention.
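The threshold filtering described above might look like the following sketch. The source does not fix a similarity measure, so `difflib.SequenceMatcher` from the Python standard library is used here purely as a stand-in; the threshold of 0.5, the function names, and the candidate pairs are likewise hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in string similarity in [0, 1]; the source leaves the measure open."""
    return SequenceMatcher(None, a, b).ratio()

def select_hard_negatives(shuffled_pairs, correct_pairs, threshold=0.5):
    """Keep shuffled pairs that are mismatched yet more similar than the threshold."""
    correct = set(correct_pairs)
    return [
        (customer, standard)
        for customer, standard in shuffled_pairs
        if (customer, standard) not in correct
        and similarity(customer, standard) > threshold
    ]

# Hypothetical correct mappings and one possible shuffle of the standard-term column.
correct_pairs = [
    ("typhoid sepsis (provincial unified)", "typhoid sepsis"),
    ("typhoid pneumonia", "typhoid pneumonia"),
    ("acute bronchitis, unspecified", "acute bronchitis"),
]
shuffled_pairs = [
    ("typhoid sepsis (provincial unified)", "acute bronchitis"),
    ("typhoid pneumonia", "typhoid sepsis"),
    ("acute bronchitis, unspecified", "typhoid pneumonia"),
]

hard_negatives = select_hard_negatives(shuffled_pairs, correct_pairs)
```

Pairs such as "typhoid pneumonia" and "typhoid sepsis" share a long prefix, so they pass the threshold and become the confusable negatives the passage describes, while clearly unrelated pairs are discarded.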
In some alternative embodiments, refer to the framework of the SBERT model shown in FIG. 3, which is one example of the twin network model. The structure includes two input paths: the BERT and pooling on the left output the vector u, and the BERT and pooling on the right output the vector v, where u and v are text vectors. The cosine similarity of u and v is calculated by the cosine-sim(u, v) module inside the model. The training objective of the twin network model is that, for different texts representing the same thing, the cosine similarity between the text vectors output by the BERT and pooling of the left and right paths is as close to 1 as possible (ideally 1, meaning identical or very similar). For the embodiment of the invention, the process of training the preset twin network model with the sample set is as follows:
Each sample in the sample set, i.e., a pair of medical terms consisting of a customer medical term and a standard medical term, is input as Sentence A and Sentence B in FIG. 3 into the BERTs of the two paths, and each term is converted into a corresponding feature vector through the BERT and pooling of its path, for example a 512-dimensional feature vector for the customer medical term and a 512-dimensional feature vector for the standard medical term; a cosine similarity value is then calculated. Taking the cosine similarity value between the feature vector of the customer medical term and the feature vector of the standard medical term as the target, the parameters of the twin network model are fine-tuned. After 30 rounds of training (30 epochs), the loss function value of the twin network model on the sample set no longer decreases, which indicates that the parameter adjustment has reached a relatively optimal solution, and training is stopped to obtain the medical term feature extraction model. The finally obtained medical term feature extraction model is the model formed by the trained BERT and pooling of one path in FIG. 3 (for example, the left branch).
The above takes the SBERT model as an example of the twin network model. Compared with other models, the SBERT model computes faster and more efficiently in semantic similarity tasks. However, the embodiment of the invention is not limited to training with this particular twin network framework; for example, the BERT may be replaced by another encoder such as a Transformer, and the similarity value of the text vectors may be calculated inside the model by Euclidean distance. Training through the SBERT model also enables the BERT in the medical term feature extraction model to better capture the relation between texts and to generate better text vectors.
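The per-pair computation in this training process can be illustrated with a small sketch. It is not the actual model: the token vectors stand in for BERT outputs, and both mean pooling and the squared-error loss are assumptions, since the source names neither the pooling type nor the loss function.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors into one fixed-size text vector (assumed pooling type)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_loss(tokens_a, tokens_b, label):
    """Assumed loss: squared error between cos(u, v) and the label (1 or 0)."""
    u = mean_pool(tokens_a)  # left path output, vector u
    v = mean_pool(tokens_b)  # right path output, vector v
    return (cosine(u, v) - label) ** 2

# Hypothetical token vectors for one (customer term, standard term) pair.
customer_tokens = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
standard_tokens = [[0.8, 0.2, 0.0]]
loss = pair_loss(customer_tokens, standard_tokens, label=1)
```

In a real setup the gradient of such a loss would be propagated back through the shared encoder; here only the forward computation is shown, which is enough to see why a correct pair with similar vectors yields a loss near zero.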
Accordingly, after the medical term feature extraction model is trained, the feature vector corresponding to the medical term of the customer can be output through the trained medical term feature extraction model.
Based on this, the embodiment of the invention provides a method for standardizing medical terms, the flow of which is shown in fig. 4, comprising the following steps:
step S41: inputting medical terms to be standardized into a medical term feature extraction model to perform feature extraction, and obtaining corresponding feature vectors serving as first feature vectors;
step S42: acquiring feature vectors corresponding to standard medical terms in the standard medical term word stock as second feature vectors;
step S43: calculating a similarity value between the first feature vector and each second feature vector;
step S44: extracting a standard medical term corresponding to the maximum similarity value as a standard medical term matched with the medical term to be standardized;
the medical term feature extraction model is obtained through the training method of the medical term feature extraction model.
In some alternative embodiments, before step S41, that is, before the medical term to be standardized is input into the medical term feature extraction model for feature extraction, each term in the standard medical term word stock is input into the medical term feature extraction model for feature extraction, and the resulting feature vectors are stored. In the subsequent step S42, the pre-stored feature vectors corresponding to the terms in the standard medical term word stock are read as the second feature vectors, so the standard medical term vectors do not have to be extracted during the standardization process itself, which simplifies the implementation and improves the efficiency of medical term standardization.
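The precompute-then-look-up flow of steps S41 to S44 can be sketched as below. The `encode` function is a hypothetical stand-in for the trained medical term feature extraction model (it builds a normalized character-bigram vector so the example runs without a model), and the class and term names are illustrative only.

```python
import math
from collections import Counter

def encode(term):
    """Hypothetical stand-in for the trained model: unit-length character-bigram vector."""
    counts = Counter(term[i:i + 2] for i in range(len(term) - 1))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {gram: c / norm for gram, c in counts.items()}

class TermStandardizer:
    def __init__(self, standard_terms):
        # Encode the standard term library once and keep the vectors for reuse.
        self.vectors = {term: encode(term) for term in standard_terms}

    def standardize(self, term):
        query = encode(term)  # first feature vector (S41)

        def score(item):
            # Dot product of unit vectors equals their cosine similarity (S43).
            return sum(w * item[1].get(gram, 0.0) for gram, w in query.items())

        # The stored vectors serve as the second feature vectors (S42);
        # the term with the maximum similarity is returned (S44).
        best_term, _ = max(self.vectors.items(), key=score)
        return best_term

standardizer = TermStandardizer(
    ["typhoid sepsis", "typhoid pneumonia", "acute bronchitis"]
)
result = standardizer.standardize("typhoid sepsis (provincial unified)")
```

Because the library vectors are computed in the constructor, each later call to `standardize` encodes only the incoming term, which is the efficiency gain the passage describes.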
In some optional embodiments, in step S43, the cosine similarity value between the first feature vector and each second feature vector may be obtained by performing the cosine similarity calculation on the first feature vector and each second feature vector according to the following formula:
$$\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}}$$
In the above formula, A and B represent two different feature vectors, namely a first feature vector and a second feature vector; $A_i$ and $B_i$ denote the components of A and B, respectively.
In some alternative embodiments, in the step S44, the standard medical term corresponding to the maximum similarity value is extracted as the standard medical term matched with the medical term to be standardized, where the similarity value is used to characterize the matching degree between the medical term to be standardized and each standard medical term.
In some alternative embodiments, several standard medical terms with the highest similarity values to the medical term to be standardized may be selected, and the final standardization result is then chosen from among them with the help of manual review.
The following description is given by way of a specific example: as shown in Table 6 below, the standard medical term "lotus leaf" is taken as the standardization result of the medical term "lotus leaf" to be standardized, and the result can be manually reviewed by medical experts to further ensure its accuracy.
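Selecting several top-scoring candidates for manual review could be done along these lines. The similarity values, the function name, and the choice of k = 3 are hypothetical; the sketch only shows the ranking step, not the review itself.

```python
import heapq

def top_candidates(similarities, k=3):
    """Return the k standard terms with the highest similarity values, best first."""
    return heapq.nlargest(k, similarities.items(), key=lambda item: item[1])

# Hypothetical similarity values between one term to be standardized
# and each standard medical term in the library.
scores = {
    "typhoid sepsis": 0.93,
    "typhoid pneumonia": 0.71,
    "acute bronchitis": 0.12,
    "sepsis": 0.64,
}

shortlist = top_candidates(scores, k=3)
for term, value in shortlist:
    print(f"{term}\t{value:.2f}")
```

A reviewer would then pick the final standardization result from this shortlist rather than from the whole library.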
Table 6: [table supplied as an image (Figure SMS_9) in the original publication]
According to the medical term standardization method provided by the embodiment of the invention, the similarity value between the feature vector of the medical term to be standardized and the feature vector of each standard medical term is calculated, and the standard medical term with the highest similarity value is taken as the standardization result. By combining a machine learning model with a similarity algorithm, the method improves the efficiency and accuracy of medical term standardization and addresses the low efficiency and low accuracy of manual processing.
Based on the same inventive concept, an embodiment of the present invention further provides a training device for a medical term feature extraction model. The structure of the device is shown in fig. 5, and the device includes:
an acquisition module 51, configured to acquire a plurality of client medical terms and a plurality of standard medical terms;
a generating module 52, configured to combine the client medical terms and the standard medical terms into training samples according to the mapping relationship between terms, and to generate a sample set;
a training module 53, configured to input the sample set into a pre-built twin (Siamese) network model for feature extraction, to optimize the model parameters according to the differences between the feature vectors extracted for each training sample, and to use the adjusted model as the medical term feature extraction model.
The specific manner in which each module of the training device performs its operations has been described in detail in the method embodiments and is not repeated here.
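The pairing performed by the generating module can be sketched as follows, assuming the shuffle-based negative sampling recited in claim 2. The labels 1.0 and 0.0, the function name `build_sample_set`, the fixed seed, and the example mapping are hypothetical; the similarity-threshold filter of claim 3 is not implemented here.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, str, float]  # (client term, standard term, label)


def build_sample_set(mapping: Dict[str, str], seed: int = 0) -> List[Sample]:
    """Pair each client term with its mapped standard term (label 1.0), then
    shuffle the standard terms to create mismatched pairs (label 0.0)."""
    positives = [(client, standard, 1.0) for client, standard in mapping.items()]

    shuffled = [standard for _, standard, _ in positives]
    random.Random(seed).shuffle(shuffled)

    negatives = [
        (client, wrong, 0.0)
        for (client, _, _), wrong in zip(positives, shuffled)
        if mapping[client] != wrong  # drop pairs the shuffle left correct
    ]
    return positives + negatives


# Hypothetical mapping, for illustration only.
mapping = {
    "heart attack": "myocardial infarction",
    "high blood pressure": "hypertension",
    "sugar diabetes": "diabetes mellitus",
}
samples = build_sample_set(mapping)
```

Filtering out pairs that the shuffle happened to leave correct keeps a true mapping from being labeled as a negative sample.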
Based on the same inventive concept, an embodiment of the present invention further provides a medical term standardization device. The structure of the device is shown in fig. 6, and the device includes:
a first feature extraction module 61, configured to input a medical term to be standardized into a medical term feature extraction model for feature extraction, and to obtain the corresponding feature vector as a first feature vector;
a feature vector obtaining module 62, configured to obtain the feature vector corresponding to each standard medical term in the standard medical term word stock as a second feature vector;
a similarity calculating module 63, configured to calculate a similarity value between the first feature vector and each second feature vector;
a normalization module 64, configured to extract the standard medical term corresponding to the maximum similarity value as the standard medical term matching the medical term to be standardized;
wherein the medical term feature extraction model is obtained by the training method of the medical term feature extraction model described above.
The specific manner in which each module of the standardization device performs its operations has been described in detail in the method embodiments and is not repeated here.
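One way to picture how modules 61 to 64 fit together is the small class below. It is a sketch under stated assumptions: the class name `TermStandardizer`, the injectable `encoder`, and the toy `letter_counts` encoder (letter frequencies instead of the trained model) are all hypothetical, chosen only so the example runs without external libraries.

```python
import math
import string
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Encoder = Callable[[str], List[float]]


def letter_counts(term: str) -> List[float]:
    """Toy encoder (an assumption): a 26-dimensional letter-frequency vector."""
    lowered = term.lower()
    return [float(lowered.count(ch)) for ch in string.ascii_lowercase]


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


@dataclass
class TermStandardizer:
    encoder: Encoder
    standard_terms: List[str]
    _vectors: Dict[str, List[float]] = field(init=False)

    def __post_init__(self) -> None:
        # Module 62: second feature vectors for the standard term word stock.
        self._vectors = {t: self.encoder(t) for t in self.standard_terms}

    def standardize(self, term: str) -> str:
        first = self.encoder(term)                                   # module 61
        scores = {t: _cosine(first, v) for t, v in self._vectors.items()}  # module 63
        return max(scores, key=scores.get)                           # module 64


standardizer = TermStandardizer(letter_counts, ["hypertension", "diabetes mellitus"])
result = standardizer.standardize("hypertention")
```

Passing the encoder in as a parameter means the toy function could later be swapped for the trained feature extraction model without changing the matching logic.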
Based on the same inventive concept, an embodiment of the present invention further provides a computer storage medium storing computer-executable instructions which, when executed by a processor, implement the training method of the medical term feature extraction model described above or the medical term standardization method described above.
Based on the same inventive concept, an embodiment of the present invention further provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the training method of the medical term feature extraction model or the medical term standardization method described above.
Unless specifically stated otherwise, terms such as processing, computing, calculating, determining, displaying, or the like, may refer to an action and/or process of one or more processing or computing systems, or similar devices, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the processing system's registers or memories into other data similarly represented as physical quantities within the processing system's memories, registers or other such information storage, transmission or display devices. Information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. These software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "including" is intended to be inclusive in a manner similar to the term "comprising," as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".

Claims (10)

1. A training method for a medical term feature extraction model, comprising:
acquiring a plurality of client medical terms and a plurality of standard medical terms;
according to the mapping relation between terms, combining the client medical terms and the standard medical terms as training samples to generate a sample set;
inputting the client medical term and the standard medical term of each training sample in the sample set into the left and right BERT branches of an SBERT model, respectively; generating a feature vector corresponding to the client medical term and a feature vector corresponding to the standard medical term through the BERT and pooling layers on the left side of the SBERT model and the BERT and pooling layers on the right side of the SBERT model, respectively; calculating the cosine similarity value between the feature vector corresponding to the client medical term and the feature vector corresponding to the standard medical term through a cosine similarity module in the SBERT model; fine-tuning the parameters of the SBERT model with the goal of continuously increasing the cosine similarity value; and, after a plurality of training rounds, stopping training when the loss function value of the SBERT model on the sample set no longer decreases, to obtain the medical term feature extraction model.
2. The method of claim 1, wherein combining the client medical terms and the standard medical terms as training samples according to the mapping relationship between terms comprises:
combining each client medical term with the standard medical term to which it correctly maps to form positive samples, and generating a positive sample set;
and shuffling the client medical terms and/or standard medical terms of each positive sample in the positive sample set to generate a negative sample set.
3. The method of claim 2, wherein shuffling the client medical terms and/or standard medical terms of each positive sample in the positive sample set to generate a negative sample set comprises:
shuffling the client medical terms and/or standard medical terms of each positive sample in the positive sample set;
and determining, from the shuffled positive sample set, the combinations of client medical terms and standard medical terms that do not have the correct mapping relationship and whose similarity value is higher than a preset similarity threshold, as the negative sample set.
4. A method of medical term standardization, comprising:
inputting medical terms to be standardized into a medical term feature extraction model to perform feature extraction, and obtaining corresponding feature vectors serving as first feature vectors;
acquiring feature vectors corresponding to standard medical terms in the standard medical term word stock as second feature vectors;
calculating a similarity value between the first feature vector and each second feature vector;
extracting a standard medical term corresponding to the maximum similarity value as a standard medical term matched with the medical term to be standardized;
wherein the medical term feature extraction model is obtained by a training method of the medical term feature extraction model according to any one of claims 1 to 3.
5. The method of claim 4, further comprising, before the step of inputting the medical term to be standardized into the medical term feature extraction model for feature extraction:
inputting each term in the standard medical term word stock into the medical term feature extraction model for feature extraction, obtaining the feature vector corresponding to each term in the standard medical term word stock, and storing the feature vectors;
correspondingly, the acquiring of the feature vector corresponding to each standard medical term in the standard medical term word stock comprises:
extracting the pre-stored feature vector corresponding to each term in the standard medical term word stock.
6. The method of claim 4, wherein said calculating a similarity value between said first feature vector and each of said second feature vectors comprises:
performing a cosine similarity calculation on the first feature vector and each second feature vector to obtain a cosine similarity value between the first feature vector and each second feature vector.
7. A training device for a medical term feature extraction model, comprising:
the acquisition module is used for acquiring a plurality of client medical terms and a plurality of standard medical terms;
the generation module is used for combining the client medical terms and the standard medical terms as training samples according to the mapping relation among terms to generate a sample set;
the training module is used for inputting the client medical term and the standard medical term of each training sample in the sample set into the left and right BERT branches of an SBERT model, respectively; generating a feature vector corresponding to the client medical term and a feature vector corresponding to the standard medical term through the BERT and pooling layers on the left side of the SBERT model and the BERT and pooling layers on the right side of the SBERT model, respectively; calculating the cosine similarity value between the feature vector corresponding to the client medical term and the feature vector corresponding to the standard medical term through a cosine similarity module in the SBERT model; fine-tuning the parameters of the SBERT model with the goal of continuously increasing the cosine similarity value; and, after a plurality of training rounds, stopping training when the loss function value of the SBERT model on the sample set no longer decreases, to obtain the medical term feature extraction model.
8. A medical term standardization device, comprising:
the first feature extraction module is used for inputting a medical term to be standardized into the medical term feature extraction model to perform feature extraction, and obtaining a corresponding feature vector as a first feature vector;
the feature vector obtaining module is used for obtaining feature vectors corresponding to each standard medical term in the standard medical term word stock to be used as second feature vectors;
the similarity calculation module is used for calculating a similarity value between the first feature vector and each second feature vector;
the normalization module is used for extracting the standard medical term corresponding to the maximum similarity value as the standard medical term matching the medical term to be standardized;
the medical term feature extraction model is obtained by the training method of the medical term feature extraction model according to any one of claims 1 to 3.
9. A computer storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the training method of the medical term feature extraction model of any one of claims 1-3 or the medical term standardization method of any one of claims 4-6.
10. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the training method of the medical term feature extraction model according to any one of claims 1-3 or the medical term standardization method according to any one of claims 4-6.
CN202310422988.XA 2023-04-19 2023-04-19 Medical term feature extraction model training and standardization method and device Active CN116167354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310422988.XA CN116167354B (en) 2023-04-19 2023-04-19 Medical term feature extraction model training and standardization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310422988.XA CN116167354B (en) 2023-04-19 2023-04-19 Medical term feature extraction model training and standardization method and device

Publications (2)

Publication Number Publication Date
CN116167354A CN116167354A (en) 2023-05-26
CN116167354B true CN116167354B (en) 2023-07-07

Family

ID=86416627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310422988.XA Active CN116167354B (en) 2023-04-19 2023-04-19 Medical term feature extraction model training and standardization method and device

Country Status (1)

Country Link
CN (1) CN116167354B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant