CN114580354B - Information coding method, device, equipment and storage medium based on synonym - Google Patents
- Publication number
- CN114580354B (application CN202210478341.4A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- semantic representation
- descriptions
- medical record
- coding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Epidemiology (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The application provides a synonym-based information coding method, device, equipment and storage medium. The method comprises: coding words in a medical record text to obtain a first semantic representation corresponding to the medical record text; acquiring multiple descriptions corresponding to a preset disease code identifier, the multiple descriptions comprising a standard description and synonym descriptions corresponding to the disease code identifier; determining a second semantic representation corresponding to the disease code identifier according to the multiple descriptions; determining, according to the multiple descriptions and the first semantic representation, a third semantic representation of the medical record text corresponding to the disease code identifier; and determining whether the medical record text is marked with the disease code identifier according to the similarity between the third semantic representation and the second semantic representation. Because the synonym descriptions of disease names are fully used during automatic coding, the medical record text can be coded automatically and accurately.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for encoding information based on synonyms.
Background
When a medical institution manages medical record texts, a coder must map each medical record text to standard code identifiers such as those of the International Classification of Diseases (ICD), for example ICD-9 or ICD-10. This manual coding process is error-prone and labor-intensive.
Disclosure of Invention
The embodiment of the invention provides a synonym-based information encoding method, a synonym-based information encoding device, synonym-based information encoding equipment and a synonym-based storage medium, which are used for improving the accuracy of an information encoding result.
In a first aspect, an embodiment of the present invention provides a method for encoding information based on synonyms, where the method includes:
encoding words in a medical record text to obtain a first semantic representation corresponding to the medical record text;
acquiring multiple descriptions corresponding to a preset disease code identifier, wherein the multiple descriptions comprise standard descriptions and synonym descriptions corresponding to the disease code identifier;
determining a second semantic representation corresponding to the disease coding identification according to the plurality of descriptions;
determining a third semantic representation of the medical record text corresponding to the disease coding identification according to the plurality of descriptions and the first semantic representation;
and determining whether the medical record text is marked with the disease coding identifier according to the similarity between the third semantic representation and the second semantic representation.
In a second aspect, an embodiment of the present invention provides an apparatus for encoding information based on synonyms, where the apparatus includes:
the medical record encoding module is used for encoding words in a medical record text to obtain a first semantic representation corresponding to the medical record text;
the description acquisition module is used for acquiring multiple descriptions corresponding to a preset disease code identifier, wherein the multiple descriptions comprise standard descriptions and synonym descriptions corresponding to the disease code identifier;
the semantic processing module is used for determining a second semantic representation corresponding to the disease coding identification according to the multiple descriptions; determining a third semantic representation of the medical record text corresponding to the disease-encoding identification based on the plurality of descriptions and the first semantic representation; and determining whether the medical record text is marked with the disease coding identification according to the similarity of the third semantic representation and the second semantic representation.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to implement at least the synonym-based information encoding method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the synonym-based information encoding method of the first aspect.
In a fifth aspect, an embodiment of the present invention provides a method for encoding information based on synonyms, where the method includes:
coding words in a target text to obtain a first semantic representation corresponding to the target text;
acquiring multiple category descriptions corresponding to preset category identifications, wherein the multiple category descriptions comprise standard descriptions and synonym descriptions corresponding to the category identifications;
determining a second semantic representation corresponding to the category identification according to the plurality of category descriptions;
determining, from the plurality of category descriptions and the first semantic representation, a third semantic representation of the target text corresponding to the category identification;
and determining whether the target text is marked with the category identification according to the similarity of the third semantic representation and the second semantic representation.
The embodiment of the invention can automatically code medical record texts according to the diseases they contain. Specifically, each word included in the medical record text is first semantically encoded to obtain a first semantic representation corresponding to the medical record text. For each known disease code identifier (such as the code identifiers contained in ICD-9), the standard description corresponding to the identifier, namely the standard disease name, is obtained, and the synonym descriptions corresponding to that standard description are also obtained, so that multiple descriptions formed by the standard description and the synonym descriptions of the same disease code identifier are available. Each description corresponding to the same disease code identifier is then semantically encoded, and the semantic encoding results of the descriptions are combined to obtain a second semantic representation corresponding to the disease code identifier. Next, according to the multiple descriptions corresponding to any disease code identifier and the first semantic representation, a third semantic representation of the medical record text corresponding to that disease code identifier is determined, namely the semantic representation of the medical record text with the disease code identifier taken as a label. Finally, whether the medical record text should be marked with the disease code identifier is determined according to the similarity between the third semantic representation and the second semantic representation.
In the automatic coding process of the medical record text, the synonym description of the disease name is fully utilized, so that the automatic and accurate coding processing of the medical record text can be realized.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for encoding information based on synonyms according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a medical record encoding process according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for encoding information based on synonyms according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for encoding information based on synonyms according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating an application of a synonym-based information encoding method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a synonym-based information encoding device according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device corresponding to the synonym-based information encoding device provided in the embodiment shown in FIG. 6.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The features of the embodiments and examples described below may be combined with each other without conflict between the embodiments. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
The synonym-based information encoding method provided by the embodiment of the invention can be executed by an electronic device, wherein the electronic device can be a server or a user terminal, and the server can be a physical server or a virtual server (virtual machine) of a cloud.
FIG. 1 is a flowchart of a method for encoding information based on synonyms according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
101. Code the words in the medical record text to obtain a first semantic representation corresponding to the medical record text.
102. Acquire multiple descriptions corresponding to a preset disease code identifier, wherein the multiple descriptions comprise standard descriptions and synonym descriptions corresponding to the disease code identifier.
103. Determine a second semantic representation corresponding to the disease code identifier according to the multiple descriptions.
104. Determine a third semantic representation of the medical record text corresponding to the disease code identifier according to the multiple descriptions and the first semantic representation.
105. Determine whether the medical record text is marked with the disease code identifier according to the similarity between the third semantic representation and the second semantic representation.
The scheme provided by the embodiment of the invention can be applied to disease coding of medical record texts. Disease coding of a medical record text means determining, according to descriptive content in the text such as disease names, the general disease code identifiers with which the text should be marked. For example, if the medical record text contains "paratyphoid B", the corresponding disease code identifier is A10.2. Once the disease code identifiers of a medical record text are determined automatically, the text can be classified, filed and queried, which makes it convenient for doctors to learn a patient's past medical history.
In practical applications, the medical record may be an outpatient medical record or an inpatient medical record. It may be obtained by scanning a handwritten medical record, or generated automatically when a medical record form is filled in directly on a terminal such as a computer. Because the scheme provided by the embodiment of the invention mainly processes the text content of the medical record, medical records are collectively referred to as medical record text in the embodiment of the invention.
In order to realize disease coding of medical record texts, firstly, a medical record text needs to be coded to obtain a semantic representation corresponding to the medical record text, which is called as a first semantic representation.
Specifically, the medical record text may describe information such as disease conditions and disease names. Word segmentation of this content yields a plurality of words (or characters), and each word can be given a word vector encoding (for example with word2vec) so that it is mapped into a mathematical vector form that a computer can process. The word vectors of the words are then input into a neural network model, and the hidden states that the model outputs for the words are taken as the semantic vectors of the corresponding words. Finally, the semantic vectors of the words form the first semantic representation corresponding to the medical record text.
In practical applications, word segmentation may also split the text character by character, so that each unit is a single character. The neural network model may be a Bi-directional Long Short-Term Memory (Bi-LSTM) model, an LSTM model, a Recurrent Neural Network (RNN) model, or the like.
For ease of understanding, suppose a piece of medical record text consists of a number of words (or characters), represented as the set $\{w_1, w_2, \dots, w_N\}$, where $N$ is the total number of words and $w_i$ is one of them. Suppose further that word vector encoding of each word yields the word vector set $\{x_1, x_2, \dots, x_N\}$, where $x_i$ is the word vector corresponding to word $w_i$.
Each word vector in the set is then semantically encoded, for example by a Bi-LSTM model, giving the encoding result $H = \mathrm{Enc}(x_1, x_2, \dots, x_N) = [h_1, h_2, \dots, h_N]$, where $\mathrm{Enc}$ denotes the semantic encoding computation and $h_i$ is the semantic vector corresponding to word vector $x_i$, i.e., the hidden state vector output by the model after $x_i$ is input. $H$ is the matrix formed by the $N$ semantic vectors, i.e., the first semantic representation.
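The shapes involved in this step can be sketched in Python. This is a minimal illustration under stated assumptions, not the encoder of the embodiment: the toy table `EMBEDDINGS` and the function `encode_words` are hypothetical names, and a simple neighbor-averaging step stands in for the Bi-LSTM purely to show that every word yields one context-dependent semantic vector.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical toy word vectors (a stand-in for word2vec output).
EMBEDDINGS: Dict[str, Vector] = {
    "patient": [1.0, 0.0],
    "has": [0.0, 1.0],
    "fever": [1.0, 1.0],
}


def encode_words(words: List[str]) -> List[Vector]:
    """Return one semantic vector per word (the rows of the first semantic representation).

    Assumption: averaging each word vector with its immediate neighbors
    replaces the Bi-LSTM hidden states; it only mimics context mixing.
    """
    vectors = [EMBEDDINGS[w] for w in words]
    encoded = []
    for i in range(len(vectors)):
        window = vectors[max(0, i - 1): i + 2]  # left neighbor, self, right neighbor
        dim = len(vectors[i])
        encoded.append([sum(v[k] for v in window) / len(window) for k in range(dim)])
    return encoded


first_repr = encode_words(["patient", "has", "fever"])
```

The result has one vector per input word, which is the only property the later steps rely on; any sequence encoder with that property could take the place of the stand-in.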
Disease coding of a medical record text in effect determines, from a plurality of known disease code identifiers, the disease code identifiers corresponding to the current medical record text, that is, those with which the current medical record text should be marked. Therefore, by querying a general disease code identifier database, each disease code identifier and its corresponding standard disease description, which is usually a standard disease name, can be obtained. The description content corresponding to each disease code identifier is then semantically encoded.
In the embodiment of the invention, in order to improve the accuracy of the disease coding result of the medical record text, for any disease code identifier, not only its standard description in the database but also its synonym descriptions are considered. For example, assuming that the standard description corresponding to a disease code identifier in the above database is "typhoid", the synonym descriptions corresponding to the disease code identifier, such as "cold" and "wind chill", can be determined by querying a known medical knowledge graph or the like. The creation of the knowledge graph is not the focus of the embodiments of the present invention and is not described in detail.
That is, in the embodiment of the present invention, because the same disease may appear under names of very different forms, the synonym information of disease names is fully used when the medical record text is automatically coded, so that the coding is both automatic and accurate.
Since it is not known which diseases are included in the current medical record text when the medical record text is coded, it is necessary to perform a determination process of corresponding semantic representation for each known disease code identifier in the database, and finally determine the disease code identifier included in the medical record text based on the semantic representation corresponding to each disease code identifier.
Since the processing procedure is the same for every disease code identifier, for convenience of description any one disease code identifier, denoted $l$, is taken as an example.
Suppose the database shows that the standard description corresponding to disease code identifier $l$ is $l^1$, and that the synonym descriptions found by query are $l^2, \dots, l^M$. These $M$ descriptions constitute the description set corresponding to disease code identifier $l$. $M$ is a preset value that can be set as required. Note that if fewer than $M$ descriptions can be found for a certain disease code identifier, the set may be completed by copying its standard description as many times as needed.
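The completion rule for the description set can be sketched as follows. This is a minimal sketch: the function name `build_description_set` is hypothetical, and dropping surplus synonyms when more than the preset number are available is an assumption, since the text only covers the case of too few descriptions.

```python
from typing import List


def build_description_set(standard: str, synonyms: List[str], m: int) -> List[str]:
    """Return exactly m descriptions: the standard one first, then synonyms.

    Assumptions: synonyms beyond m - 1 are dropped, and missing slots are
    filled with copies of the standard description.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    descriptions = [standard] + synonyms[: m - 1]
    descriptions += [standard] * (m - len(descriptions))
    return descriptions


# The example descriptions come from the text; the preset size 4 is illustrative.
padded = build_description_set("typhoid", ["cold", "wind chill"], 4)
```

Keeping the set at a fixed size means every disease code identifier yields the same number of encoded descriptions, which keeps the later pooling and attention steps uniform.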
Each description $l^j$ ($j = 1, \dots, M$) of the disease code identifier $l$ is assumed to consist of $N_j$ words (or characters), expressed as $\{l^j_1, l^j_2, \dots, l^j_{N_j}\}$.
Then, determining the second semantic representation corresponding to disease code identifier $l$ according to its $M$ descriptions may optionally be implemented as:
encoding the $M$ descriptions respectively to obtain $M$ fourth semantic representations corresponding to the $M$ descriptions;
determining the second semantic representation corresponding to disease code identifier $l$ according to the $M$ fourth semantic representations.
Optionally, encoding the $M$ descriptions respectively to obtain the $M$ fourth semantic representations may be implemented as: for any description, encoding each word in the description to obtain a semantic representation corresponding to each word; and performing maximum pooling on the semantic representations corresponding to the words to obtain the fourth semantic representation corresponding to that description.
Optionally, determining the second semantic representation corresponding to disease code identifier $l$ according to the $M$ fourth semantic representations may be implemented as: performing maximum pooling on the $M$ fourth semantic representations to obtain the second semantic representation corresponding to disease code identifier $l$.
The processing of each description may be expressed as:
$$q^j = \mathrm{MaxPool}\big(\mathrm{Enc}(x^j_1, x^j_2, \dots, x^j_{N_j})\big)$$
where, taking any description $l^j$ among the $M$ descriptions as an example, $x^j_1, \dots, x^j_{N_j}$ denote the word vectors corresponding to the $N_j$ words contained in description $l^j$. These $N_j$ word vectors can be input in sequence into the neural network model used for semantic coding of the medical record text, such as the Bi-LSTM model, to obtain the semantic coding result for each word vector, that is, the $N_j$ semantic representations corresponding to the $N_j$ words.
Maximum pooling (the $\mathrm{MaxPool}$ operation above) of these $N_j$ semantic representations then yields the fourth semantic representation $q^j$ corresponding to description $l^j$.
Next, maximum pooling of the fourth semantic representations corresponding to the $M$ descriptions of disease code identifier $l$ yields the second semantic representation $q_l$ corresponding to $l$. The process can be expressed as:
$$q_l = \mathrm{MaxPool}(q^1, q^2, \dots, q^M)$$
As the semantic coding of the multiple descriptions corresponding to disease code identifier $l$ shows, the resulting second semantic representation contains the semantic information of every description, not only that of the standard description.
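The two pooling stages described above can be sketched as follows. This is a minimal sketch with made-up numbers: `max_pool` and `encoded_descriptions` are hypothetical names, and the vectors stand in for encoder outputs rather than coming from a trained model.

```python
from typing import List

Vector = List[float]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


# Hypothetical encoder outputs: one list of word-level semantic vectors per description.
encoded_descriptions = [
    [[0.1, 0.9], [0.4, 0.2]],              # standard description, 2 words
    [[0.7, 0.3], [0.2, 0.5], [0.6, 0.1]],  # synonym description, 3 words
]

# Fourth semantic representations: pool over the words of each description.
fourth_reprs = [max_pool(words) for words in encoded_descriptions]
# Second semantic representation: pool over the descriptions.
second_repr = max_pool(fourth_reprs)
```

Because the maximum is taken per dimension, descriptions of different lengths all reduce to vectors of the same size, and the second semantic representation can pick up its strongest feature in each dimension from whichever description supplies it.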
Then, a third semantic representation of the medical record text corresponding to disease code identifier $l$ is determined according to the multiple descriptions corresponding to $l$ and the first semantic representation corresponding to the medical record text. Because the relation between the medical record text and each disease code identifier is taken into account when the medical record text is semantically coded, the "third semantic representation of the medical record text corresponding to disease code identifier $l$" can be understood as the semantic representation of the medical record text based on label $l$ (a disease code identifier is regarded as a classification label), and determining it establishes the association between the medical record text and each disease code identifier. The association may be implemented by an attention mechanism.
In general terms, determining the third semantic representation of the medical record text corresponding to disease code identifier $l$ may be implemented as:
determining, according to the multiple fourth semantic representations corresponding to the multiple descriptions and the first semantic representation, the attention coefficient vector of the words in the medical record text with respect to each fourth semantic representation; and determining the third semantic representation of the medical record text corresponding to disease code identifier $l$ according to the attention coefficient vectors and the first semantic representation.
Take any description $l^j$ among the $M$ descriptions corresponding to disease code identifier $l$ as an example. As described above, the fourth semantic representation corresponding to this description is $q^j$, and the first semantic representation is $H = [h_1, \dots, h_N]$. Determining the attention coefficient vector of the words in the medical record text with respect to $q^j$ follows the principle of the attention mechanism: $q^j$ serves as the query, and attention is computed over the first semantic representation $H$ of the medical record text. Computing the attention coefficients means computing the attention coefficient value for each of the $N$ words contained in the medical record text, that is, for each of the $N$ semantic vectors $h_1, \dots, h_N$ that make up $H$. These $N$ attention coefficient values constitute the attention coefficient vector $\alpha^j$ of the words in the medical record text with respect to the fourth semantic representation $q^j$.
The physical meaning of the attention coefficient vector $\alpha^j$ can be understood as the degree to which each word in the medical record text contributes to judging that the medical record text contains description $l^j$; this degree of contribution is reflected by the attention coefficient.
After the attention coefficient vector $\alpha^j$ is obtained, it is used to take a weighted sum of the $N$ semantic vectors $h_1, \dots, h_N$ contained in the first semantic representation $H$, giving the third semantic representation of the medical record text corresponding to disease code identifier $l$.
In fact, $\alpha^j$ is a vector of dimension $N$. Its $N$ elements are multiplied one by one with the $N$ semantic vectors, and the products are then added, finally giving a vector with the same dimension as a single semantic vector, which is the third semantic representation.
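The attention step for a single description can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not fix the scoring function at this point, so a dot product between the query and each semantic vector followed by a softmax is assumed here, and all names (`attention_coefficients`, `weighted_sum`, `first_repr`, `query`) are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def attention_coefficients(query: Vector, semantic_vectors: List[Vector]) -> Vector:
    """One coefficient per word: softmax over dot-product scores (assumed scoring)."""
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in semantic_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(coefficients: Vector, semantic_vectors: List[Vector]) -> Vector:
    """Scale each semantic vector by its coefficient and add the results."""
    dim = len(semantic_vectors[0])
    return [
        sum(c * vec[k] for c, vec in zip(coefficients, semantic_vectors))
        for k in range(dim)
    ]


# Hypothetical first semantic representation (two words) and description query.
first_repr = [[1.0, 0.0], [0.0, 1.0]]
query = [2.0, 0.0]

coeffs = attention_coefficients(query, first_repr)
label_aware_repr = weighted_sum(coeffs, first_repr)
```

The coefficient vector has one entry per word and the output has the dimension of a single semantic vector, matching the dimensions stated in the text. With this query the first word receives the larger coefficient, so it dominates the weighted sum.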
Finally, the similarity between the third semantic representation of the medical record text corresponding to the disease code identifier and the second semantic representation corresponding to that disease code identifier is calculated, and when the similarity meets a set condition, it is determined that the medical record text should be marked with the disease code identifier.
In the automatic coding process of the medical record text, the synonym description of the disease name is fully utilized, so that the automatic and accurate coding processing of the medical record text can be realized.
To facilitate understanding of the above-described automatic encoding process, it is schematically illustrated in conjunction with fig. 2.
As shown in FIG. 2, disease coding of the medical record text can be realized with a coding system comprising the functional modules illustrated in the figure. The coding system in effect forms a coding model comprising a semantic coding module, a maximum pooling processing module, an attention calculation module and a similarity output module.
The semantic coding module may be the Bi-LSTM model introduced above. The maximum pooling processing module implements the maximum pooling ($\mathrm{MaxPool}$) operation described above. The similarity output module is in effect the output layer of the coding model and is used to calculate the loss function in the training stage, where the loss function is defined by the similarity between the third semantic representation and the second semantic representation.
As shown in FIG. 2, for the medical record text mentioned above, the word vectors corresponding to the words it contains are input into the semantic coding module, which outputs the first semantic representation $H = [h_1, \dots, h_N]$. For any disease code identifier $l$, the word vectors contained in each of its descriptions are input into the semantic coding module, and the semantic vectors of the words of one description output by the semantic coding module are input into the maximum pooling processing module to obtain the fourth semantic representation corresponding to that description. As described above, the fourth semantic representations corresponding to the $M$ descriptions of disease code identifier $l$ are $q^1, q^2, \dots, q^M$. These fourth semantic representations are further processed by the maximum pooling processing module to obtain the second semantic representation corresponding to disease code identifier $l$: $q_l = \mathrm{MaxPool}(q^1, q^2, \dots, q^M)$.
For each fourth semantic representation, the attention calculation module calculates, in combination with the first semantic representation, the attention coefficient corresponding to each word in the medical record text, giving the attention coefficient vector corresponding to each fourth semantic representation: $\alpha^1, \alpha^2, \dots, \alpha^M$. Then, based on each attention coefficient vector, a weighted sum is taken over the semantic vectors contained in the first semantic representation $H$, giving multiple weighted semantic representations: $\sum_{i=1}^{N} \alpha^j_i h_i$ for $j = 1, \dots, M$. Finally, maximum pooling of the weighted semantic representations gives the third semantic representation $v_l$ of the medical record text corresponding to disease code identifier $l$.
Thereafter, the similarity between the third semantic representation $v_l$ and the second semantic representation $q_l$ is computed.
As shown in FIG. 2, the similarity may be defined as the log probability that the medical record text contains label $l$ (i.e., disease code identifier $l$): $\hat{y}_l = \sigma(v_l^{\mathsf{T}} W q_l)$, where $\sigma$ denotes the Sigmoid function, $\mathsf{T}$ denotes the transpose, and $W$ denotes a biaffine transformation matrix.
In the training stage of the coding model, when a medical record text is used as a training sample, the disease code identifiers it contains are labeled in advance and serve as supervision information. The similarity defined by the log probability reflects the similarity between the medical record text and any one disease code identifier. By traversing every disease code identifier contained in the disease code identifier database, a similarity value between the medical record text and each disease code identifier is obtained. A similarity threshold can be set, and if the similarity value between the medical record text and a certain disease code identifier is greater than the threshold, the medical record text is considered to contain that disease code identifier. The disease code identifiers determined in this way are compared with the pre-labeled supervision information to obtain the loss function value, the coding model parameters are adjusted according to the loss function value, and when the model is trained to convergence, a biaffine transformation matrix suitable for the various diseases is obtained. Because this matrix is trained, the coding model can reduce its dependence on long-tail data, that is, overcome the influence of sample imbalance, which mainly manifests as only a small number of descriptions being collectable for some disease code identifiers.
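The threshold decision over all disease code identifiers can be sketched as follows. This is a minimal sketch: `select_codes`, the placeholder identifiers and the similarity values are hypothetical, and the threshold of 0.5 is an illustrative choice that the text leaves open.

```python
from typing import Dict, List


def select_codes(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Return, in sorted order, the identifiers whose similarity exceeds the threshold."""
    return sorted(code for code, value in similarities.items() if value > threshold)


# Hypothetical similarity values for three placeholder code identifiers.
similarities = {"CODE_A": 0.91, "CODE_B": 0.12, "CODE_C": 0.67}
assigned = select_codes(similarities, threshold=0.5)
```

Each identifier is judged independently, so a medical record text can receive several identifiers or none, which fits records that mention more than one disease.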
As mentioned above, after the multiple descriptions corresponding to the disease code identifier are separately encoded to obtain the corresponding fourth semantic representations, the attention coefficient vector of the words in the medical record text with respect to each fourth semantic representation is determined from the plurality of fourth semantic representations and the first semantic representation of the medical record text. An embodiment of the present invention provides an optional way of determining the attention coefficient vector, as shown in Fig. 3.
Fig. 3 is a flowchart of a synonym-based information coding method according to an embodiment of the present invention. As shown in Fig. 3, the method may include the following steps:
301. Encode a plurality of words in the medical record text to obtain a first semantic representation corresponding to the medical record text, where the first semantic representation consists of a plurality of semantic vectors corresponding to the plurality of words.
302. Obtain multiple descriptions, consisting of a standard description and synonym descriptions, corresponding to a preset disease code identifier; encode the multiple descriptions separately to obtain a plurality of fourth semantic representations corresponding to them; and determine a second semantic representation corresponding to the disease code identifier from the plurality of fourth semantic representations.
The execution process of the above steps can refer to the related description in the foregoing embodiments, which is not described herein again.
303. Split the first semantic representation into a plurality of semantic blocks, where each semantic block contains a plurality of sub-semantic vectors corresponding to the plurality of words, each sub-semantic vector consists of a subset of the dimensions of the corresponding semantic vector, and the number of semantic blocks equals the number of descriptions.
Continuing with the first semantic representation and the descriptions corresponding to any disease code identifier, the first semantic representation is split into as many equally sized semantic blocks as there are descriptions. The medical record text contains a number of words, each with a corresponding semantic vector.
The splitting works as follows. Suppose the semantic vectors that make up the first semantic representation form a matrix with one row per word and K columns, where each semantic vector is assumed to be K-dimensional. The K columns are divided equally into as many groups as there are descriptions, and each group forms one semantic block. For example, with K = 100 and 10 descriptions, every 10 columns form a group, giving 10 semantic blocks. Each semantic block contains a subset of the dimensions of every row's semantic vector, and each such slice is called a sub-semantic vector.
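The column-wise split can be sketched as below. The function name `split_into_blocks` and the list-of-lists matrix layout are assumptions made for illustration; the sketch also assumes K is exactly divisible by the number of descriptions.

```python
def split_into_blocks(first_repr, num_blocks):
    """Split a (words x K) matrix into num_blocks column groups of equal width.

    Each returned block keeps one row per word; each row of a block is a
    sub-semantic vector holding K / num_blocks dimensions.
    """
    k = len(first_repr[0])
    if k % num_blocks != 0:
        raise ValueError("K must be divisible by the number of descriptions")
    width = k // num_blocks
    return [
        [row[b * width:(b + 1) * width] for row in first_repr]
        for b in range(num_blocks)
    ]

# Three words, K = 100, 10 descriptions: ten blocks of 10 columns each
first_repr = [[float(word * 100 + col) for col in range(100)] for word in range(3)]
blocks = split_into_blocks(first_repr, num_blocks=10)
```

Block number i is later paired with description number i, which is why the number of blocks has to match the number of descriptions.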
304. Determine the attention coefficient vector of the plurality of sub-semantic vectors in a target semantic block with respect to a target fourth semantic representation, where the target fourth semantic representation has the same sequence number as the target semantic block and is any one of the plurality of fourth semantic representations.
Continuing the example above, for any fourth semantic representation, the attention coefficient vector of the target semantic block with respect to that fourth semantic representation is computed. That is, the fourth semantic representation is used as the query (Query) to compute an attention coefficient for each sub-semantic vector in the target semantic block. The sequence number of the target semantic block is the same as that of the fourth semantic representation. In summary, the fourth semantic representations of the descriptions and the equally many semantic blocks are paired one to one, and attention is computed within each pair. With this calculation, the trained coding model can better focus on the semantic information that matters most for predicting the disease code identifier, that is, it assigns larger attention coefficients to the semantic information that is more important for predicting the disease code identifier accurately.
Taking one fourth semantic representation as an example, its attention calculation with the target semantic block yields the attention coefficient vector to be solved. The calculation uses tanh, the hyperbolic tangent function, which can be replaced by a ReLU function or the like, together with two weight coefficient matrices.
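One possible form of this query-based attention is sketched below. It is an assumption, not the formula of the patent: each sub-semantic vector is projected by a hypothetical matrix `w1` and passed through tanh, the query is projected by a hypothetical matrix `w2`, the two are combined by a dot product, and the scores are normalized over the words with a softmax.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def attention_coefficients(block, query, w1, w2):
    """Return one attention coefficient per word of a semantic block.

    Assumed form: score_i = (w2 * query) . tanh(w1 * block[i]),
    followed by a softmax over the words.
    """
    projected_query = matvec(w2, query)
    scores = []
    for sub_vector in block:
        hidden = [math.tanh(z) for z in matvec(w1, sub_vector)]
        scores.append(sum(h * p for h, p in zip(hidden, projected_query)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy block with three words and 2-dimensional sub-semantic vectors
block = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
query = [1.0, 0.0, 0.0]                   # a fourth semantic representation
w1 = [[1.0, 0.0], [0.0, 1.0]]             # hypothetical weights
w2 = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # hypothetical weights
alpha = attention_coefficients(block, query, w1, w2)
```

In this toy setup the first word aligns with the projected query and therefore receives the largest coefficient, while the other two words tie.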
305. Using the determined attention coefficient vectors corresponding to the plurality of fourth semantic representations, perform a weighted summation over the plurality of semantic vectors contained in the first semantic representation to obtain a plurality of weighted semantic representations, and apply maximum pooling to the weighted semantic representations to obtain the third semantic representation of the medical record text corresponding to the disease code identifier.
The third semantic representation is thus the maximum pooling of the weighted semantic representations, where each weighted semantic representation is the sum of the semantic vectors of the first semantic representation weighted by the attention coefficient vector of one fourth semantic representation.
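Step 305 can be illustrated with a short sketch. The function names are hypothetical, and max pooling is assumed to be element-wise across the weighted semantic representations.

```python
def weighted_sum(vectors, coefficients):
    """Sum the word vectors, each scaled by its attention coefficient."""
    dim = len(vectors[0])
    return [
        sum(c * vec[d] for c, vec in zip(coefficients, vectors))
        for d in range(dim)
    ]

def max_pool(vectors):
    """Element-wise maximum across a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

def third_representation(first_repr, attention_vectors):
    """One weighted representation per description, then max pooling."""
    weighted = [weighted_sum(first_repr, a) for a in attention_vectors]
    return max_pool(weighted)

first_repr = [[1.0, 0.0], [0.0, 2.0]]           # two words, 2 dimensions
attention_vectors = [[1.0, 0.0], [0.5, 0.5]]    # one vector per description
third = third_representation(first_repr, attention_vectors)
```

Here the two weighted representations are `[1.0, 0.0]` and `[0.5, 1.0]`, so the pooled result is `[1.0, 1.0]`.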
306. Determine whether to label the medical record text with the disease code identifier according to the similarity between the third semantic representation and the second semantic representation.
Optionally, the similarity between the third semantic representation and the second semantic representation can be determined from the third semantic representation, the second semantic representation and the trained biaffine (dual affine) transformation matrix. If the similarity is greater than the set threshold, the medical record text is considered to include the disease corresponding to the disease code identifier, and the medical record text is labeled with that disease code identifier.
The above embodiments describe a scheme for disease coding of medical record texts. In practice, similar requirements exist in many other application scenarios. The requirement can be summarized as follows: given a plurality of preset category identifiers, each with a standard category (name) description set in advance, the category identifier corresponding to a currently input text must be determined. If only the correspondence between each category identifier and its standard category description is established in advance, and category identifiers are labeled on texts based on that correspondence alone, accuracy is limited, because a text is labeled with a category identifier only when it contains the standard category description for that identifier.
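The limitation of labeling by standard description alone can be seen in a small sketch. Everything here is illustrative: the category identifiers `CAT_POTATO` and `CAT_TOMATO`, the dictionary layout and the function names are hypothetical, and simple substring matching stands in for the baseline being criticized.

```python
# Hypothetical category table: one standard description plus synonyms
CATEGORY_DESCRIPTIONS = {
    "CAT_POTATO": {"standard": "potato", "synonyms": ["yam egg"]},
    "CAT_TOMATO": {"standard": "tomato", "synonyms": []},
}

def label_by_standard_description(text):
    """Baseline: label a category only if its standard description occurs."""
    lowered = text.lower()
    return sorted(
        category
        for category, entry in CATEGORY_DESCRIPTIONS.items()
        if entry["standard"] in lowered
    )

def all_descriptions(category_id):
    """Standard description followed by the synonym descriptions."""
    entry = CATEGORY_DESCRIPTIONS[category_id]
    return [entry["standard"], *entry["synonyms"]]
```

With this table, `label_by_standard_description("Fresh yam eggs, 1 kg")` returns an empty list even though the text describes potatoes, which is the kind of miss that motivates using all descriptions of a category rather than the standard one alone.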
In order to improve the accuracy of labeling a text with its associated category identifiers, an embodiment of the present invention provides a general solution based on the scheme of the above embodiments. As shown in Fig. 4, the method includes the following steps:
401. Encode the words in the target text to obtain a first semantic representation corresponding to the target text.
402. Obtain multiple category descriptions corresponding to a preset category identifier, where the multiple category descriptions include a standard description and synonym descriptions corresponding to the category identifier.
403. Determine a second semantic representation corresponding to the category identifier according to the multiple category descriptions.
404. Determine a third semantic representation of the target text corresponding to the category identifier according to the multiple category descriptions and the first semantic representation.
405. Determine whether to label the target text with the category identifier according to the similarity between the third semantic representation and the second semantic representation.
The target text can be, for example, the medical record text in the foregoing embodiments, in which case the category identifiers are the various disease code identifiers. The target text may also be a description text of a product, in which case the category identifier may be a category name of the product. For example, the category identifier for potato may correspond to multiple descriptions: potatoes, yam eggs, and the like.
The implementation of this embodiment may refer to the related descriptions in the foregoing other embodiments, and will not be described herein.
As described above, the synonym-based information coding method provided by the present invention can be executed in the cloud, where a plurality of computing nodes may be deployed, each having processing resources such as computation and storage. In the cloud, several computing nodes may be organized to provide one service, and a single computing node may also provide one or more services. The cloud may provide a service by exposing a service interface, which the user calls to use the corresponding service. The service interface may take the form of a Software Development Kit (SDK), an Application Programming Interface (API), or other forms.
For the scheme provided by the embodiments of the present invention, the cloud can provide a service interface for the information coding service. The user calls the service interface through user equipment to send a call request containing a medical record text to the cloud. The cloud determines the computing node that responds to the request and performs the following steps using the processing resources of that computing node:
encoding words in a medical record text to obtain a first semantic representation corresponding to the medical record text;
acquiring multiple descriptions corresponding to preset disease code identifiers, wherein the multiple descriptions comprise standard descriptions and synonym descriptions corresponding to the disease code identifiers;
determining a second semantic representation corresponding to the disease coding identification according to the plurality of descriptions;
determining a third semantic representation of the medical record text corresponding to the disease coding identification according to the plurality of descriptions and the first semantic representation;
and determining whether the medical record text is marked with the disease coding identification according to the similarity of the third semantic representation and the second semantic representation.
In addition, the model training task described in the foregoing embodiment may also be completed by the computing node in the cloud.
For ease of understanding, an example is described with reference to Fig. 5. The user can call the information coding service interface (the API interface in the figure) through the user equipment E1 shown in Fig. 5 and upload a service request containing a medical record text through that interface. In the cloud, as shown in the figure, a management node E2 running a management and control service is deployed in addition to the plurality of computing nodes. After receiving the service request sent by the user equipment E1, the management node E2 determines a computing node E3 to respond to it. After receiving the medical record text, the computing node E3 executes the above steps and outputs the disease code identifiers associated with the medical record text, which are sent to the user equipment E1. The user equipment E1 then displays the final detection result. For the detailed implementation process, refer to the descriptions in the foregoing embodiments, which are not repeated here.
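The request flow between the user equipment, the management node and a computing node might be sketched as follows. The class names, the request dictionary layout, the round-robin node selection and the `toy_encoder` placeholder are all assumptions for illustration; the text does not specify how the management node chooses a computing node.

```python
from itertools import cycle

class ComputeNode:
    """Hypothetical computing node that runs the coding steps on a text."""

    def __init__(self, name, encode_fn):
        self.name = name
        self.encode_fn = encode_fn

    def handle(self, medical_record_text):
        return {"node": self.name, "codes": self.encode_fn(medical_record_text)}

class ManagementNode:
    """Hypothetical management node; round-robin selection is an assumption."""

    def __init__(self, compute_nodes):
        self._next_node = cycle(compute_nodes)

    def dispatch(self, request):
        node = next(self._next_node)
        return node.handle(request["medical_record_text"])

def toy_encoder(text):
    # Placeholder for the real coding model; returns a made-up identifier.
    return ["CODE_A"] if "cough" in text.lower() else []

manager = ManagementNode([ComputeNode("E3-a", toy_encoder),
                          ComputeNode("E3-b", toy_encoder)])
first = manager.dispatch({"medical_record_text": "Persistent cough for two weeks"})
second = manager.dispatch({"medical_record_text": "Routine check-up"})
```

The response dictionary is what would be sent back to the user equipment for display.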
The synonym-based information encoding device according to one or more embodiments of the present invention is described in detail below. Those skilled in the art will appreciate that these devices can each be constructed from commercially available hardware components configured through the steps taught in this scheme.
Fig. 6 is a schematic structural diagram of a synonym-based information encoding device according to an embodiment of the present invention. As shown in Fig. 6, the device includes: a medical record encoding module 11, a description obtaining module 12 and a semantic processing module 13.
The medical record encoding module 11 is configured to encode words in a medical record text to obtain a first semantic representation corresponding to the medical record text.
The description obtaining module 12 is configured to obtain multiple descriptions corresponding to a preset disease code identifier, where the multiple descriptions include a standard description and a synonym description corresponding to the disease code identifier.
The semantic processing module 13 is configured to determine, according to the multiple descriptions, a second semantic representation corresponding to the disease coding identifier; determining a third semantic representation of the medical record text corresponding to the disease coding identification according to the plurality of descriptions and the first semantic representation; and determining whether the medical record text is marked with the disease coding identifier according to the similarity between the third semantic representation and the second semantic representation.
Optionally, in the process of determining the second semantic representation corresponding to the disease coding identifier, the semantic processing module 13 is specifically configured to: respectively coding the multiple descriptions to obtain multiple fourth semantic representations corresponding to the multiple descriptions; determining the second semantic representation corresponding to the disease coding identification according to the plurality of fourth semantic representations.
Optionally, the semantic processing module 13 is specifically configured to: for any description, encode each word in that description to obtain a semantic representation corresponding to each word; perform maximum pooling on the semantic representations corresponding to the words to obtain the fourth semantic representation corresponding to that description; and perform maximum pooling on the plurality of fourth semantic representations to obtain the second semantic representation corresponding to the disease coding identifier.
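The two pooling stages can be sketched as below. The function names are hypothetical, word vectors are given directly instead of coming from an encoder, and max pooling is assumed to be element-wise.

```python
def max_pool(vectors):
    """Element-wise maximum across a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

def encode_code(descriptions):
    """Return (fourth representations, second representation).

    descriptions: one entry per description, each a list of word vectors.
    """
    # Pool over the words of each description: fourth semantic representations
    fourth = [max_pool(word_vectors) for word_vectors in descriptions]
    # Pool over the descriptions: second semantic representation
    second = max_pool(fourth)
    return fourth, second

# Two descriptions: the first has two words, the second has one word
descriptions = [
    [[1.0, 5.0], [3.0, 2.0]],
    [[0.0, 7.0]],
]
fourth, second = encode_code(descriptions)
```

For this input the fourth semantic representations are `[3.0, 5.0]` and `[0.0, 7.0]`, and the second semantic representation is `[3.0, 7.0]`.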
Optionally, in the process of determining, according to the multiple descriptions and the first semantic representation, the third semantic representation of the medical record text corresponding to the disease coding identifier, the semantic processing module 13 is specifically configured to: determine an attention coefficient vector of the words in the medical record text corresponding to each fourth semantic representation according to the plurality of fourth semantic representations and the first semantic representation; and determine the third semantic representation of the medical record text corresponding to the disease coding identifier based on the attention coefficient vectors and the first semantic representation.
Optionally, the medical record text includes a plurality of words, and the first semantic representation is formed by a plurality of semantic vectors corresponding to the plurality of words. Based on this, in the process of determining the attention coefficient vector corresponding to each fourth semantic representation of the words in the medical record text, the semantic processing module 13 is specifically configured to: segmenting the first semantic representation into a plurality of semantic blocks, wherein each semantic block comprises a plurality of sub-semantic vectors corresponding to the plurality of words, each sub-semantic vector is composed of partial dimensions in the corresponding semantic vector, and the number of the semantic blocks is equal to that of the plurality of descriptions; determining an attention coefficient vector for a plurality of sub-semantic vectors in a target semantic block corresponding to a target fourth semantic representation, wherein the target fourth semantic representation has the same sequence number as the target semantic block, and the target fourth semantic representation is any one of the plurality of fourth semantic representations.
Optionally, in the process of determining that the medical record text corresponds to the third semantic representation of the disease coding identifier, the semantic processing module 13 is specifically configured to: respectively carrying out weighted summation on a plurality of semantic vectors contained in the first semantic representation by using a plurality of determined attention coefficient vectors corresponding to the fourth semantic representations to obtain a plurality of weighted semantic representations; performing maximal pooling on the weighted semantic representations to obtain a third semantic representation of the medical record text corresponding to the disease coding identification.
Optionally, the semantic processing module 13 is specifically configured to: and determining the similarity of the third semantic representation and the second semantic representation according to the third semantic representation, the second semantic representation and a trained dual affine transformation matrix.
The apparatus shown in fig. 6 can perform the steps provided in the foregoing embodiments, and the detailed performing process and technical effects refer to the description in the foregoing embodiments, which are not described herein again.
In one possible design, the structure of the synonym-based information encoding apparatus shown in Fig. 6 may be implemented as an electronic device. As shown in Fig. 7, the electronic device may include: a processor 21, a memory 22, and a communication interface 23. The memory 22 stores executable code which, when executed by the processor 21, causes the processor 21 to at least implement the synonym-based information encoding method provided in the foregoing embodiments.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to implement at least the synonym-based information encoding method as provided in the foregoing embodiments.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
From the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by means of a necessary general-purpose hardware platform, and of course also by a combination of hardware and software. With this understanding, the part of the above technical solutions that is essential, or that contributes over the prior art, may be embodied in the form of a computer program product, which may be implemented on one or more computer-usable storage media containing computer-usable program code, including but not limited to disk storage, CD-ROM, optical storage, and the like.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A synonym-based information coding method is characterized by comprising the following steps:
coding words in a medical record text to obtain a first semantic representation corresponding to the medical record text;
acquiring multiple descriptions corresponding to preset disease code identifiers, wherein the multiple descriptions comprise standard descriptions and synonym descriptions corresponding to the disease code identifiers;
determining a second semantic representation corresponding to the disease coding identification according to the plurality of descriptions;
determining a third semantic representation of the medical record text corresponding to the disease coding identification according to the plurality of descriptions and the first semantic representation;
determining whether the medical record text is marked with the disease coding identifier according to the similarity between the third semantic representation and the second semantic representation;
wherein the determining of the third semantic representation comprises:
respectively coding the multiple descriptions to obtain multiple fourth semantic representations corresponding to the multiple descriptions;
determining an attention coefficient vector of a word in the medical record text corresponding to each fourth semantic representation according to the plurality of fourth semantic representations and the first semantic representation;
respectively carrying out weighted summation on a plurality of semantic vectors contained in the first semantic representation by using a plurality of determined attention coefficient vectors corresponding to the fourth semantic representations to obtain a plurality of weighted semantic representations; performing maximal pooling on the weighted semantic representations to obtain a third semantic representation of the medical record text corresponding to the disease coding identification.
2. The method of claim 1, wherein determining a second semantic representation corresponding to the disease coding identifier from the plurality of descriptions comprises:
determining the second semantic representation corresponding to the disease coding identifier according to the plurality of fourth semantic representations.
3. The method according to claim 2, wherein said separately encoding said plurality of descriptions to obtain a plurality of fourth semantic representations corresponding to said plurality of descriptions comprises:
aiming at any description, coding each word in any description to obtain semantic representation corresponding to each word;
and performing maximum pooling on the semantic representation corresponding to each word to obtain a fourth semantic representation corresponding to any description.
4. The method according to claim 2, wherein the determining the second semantic representation to which the disease coding identifier corresponds according to the plurality of fourth semantic representations comprises:
performing maximal pooling on the fourth semantic representations to obtain the second semantic representation corresponding to the disease coding identifier.
5. The method of claim 1, wherein the medical record text includes a plurality of words, and the first semantic representation is formed by a plurality of semantic vectors corresponding to the plurality of words;
the determining, from the plurality of fourth semantic representations and the first semantic representation, an attention coefficient vector for a word in the medical record text corresponding to each fourth semantic representation includes:
segmenting the first semantic representation into a plurality of semantic blocks, wherein each semantic block comprises a plurality of sub-semantic vectors corresponding to the plurality of words, each sub-semantic vector is formed by partial dimensions in the corresponding semantic vector, and the number of the semantic blocks is equal to that of the plurality of descriptions;
determining attention coefficient vectors of a plurality of sub-semantic vectors in a target semantic block corresponding to a target fourth semantic representation, wherein the target fourth semantic representation has the same sequence number as the target semantic block, and the target fourth semantic representation is any one of the plurality of fourth semantic representations.
6. The method of claim 1, further comprising:
and determining the similarity of the third semantic representation and the second semantic representation according to the third semantic representation, the second semantic representation and a trained double affine transformation matrix.
7. An information encoding device based on synonyms, comprising:
the medical record encoding module is used for encoding words in a medical record text to obtain a first semantic representation corresponding to the medical record text;
the description acquisition module is used for acquiring a plurality of descriptions corresponding to a preset disease code identifier, wherein the plurality of descriptions comprise standard descriptions and synonym descriptions corresponding to the disease code identifier;
the semantic processing module is used for determining a second semantic representation corresponding to the disease coding identification according to the plurality of descriptions; determining a third semantic representation of the medical record text corresponding to the disease coding identification according to the plurality of descriptions and the first semantic representation; determining whether the medical record text is marked with the disease coding identifier according to the similarity between the third semantic representation and the second semantic representation;
wherein, in the process of determining the third semantic representation, the semantic processing module is specifically configured to: respectively coding the multiple descriptions to obtain multiple fourth semantic representations corresponding to the multiple descriptions; determining an attention coefficient vector of a word in the medical record text corresponding to each fourth semantic representation according to the plurality of fourth semantic representations and the first semantic representation; respectively carrying out weighted summation on a plurality of semantic vectors contained in the first semantic representation by using a plurality of determined attention coefficient vectors corresponding to the fourth semantic representations to obtain a plurality of weighted semantic representations; performing maximal pooling on the weighted semantic representations to obtain a third semantic representation of the medical record text corresponding to the disease coding identification.
8. An electronic device, comprising: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the synonym-based information encoding method of any one of claims 1-6.
9. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the synonym-based information encoding method of any one of claims 1-6.
10. A synonym-based information coding method is characterized by comprising the following steps:
coding words in a target text to obtain a first semantic representation corresponding to the target text;
acquiring multiple category descriptions corresponding to preset category identifications, wherein the multiple category descriptions comprise standard descriptions and synonym descriptions corresponding to the category identifications;
determining a second semantic representation corresponding to the category identification according to the plurality of category descriptions;
determining a third semantic representation of the target text corresponding to the category identification according to the multiple category descriptions and the first semantic representation;
determining whether the category identification is marked in the target text or not according to the similarity of the third semantic representation and the second semantic representation;
wherein the determining of the third semantic representation comprises:
respectively encoding the multiple category descriptions to obtain multiple fourth semantic representations corresponding to the multiple category descriptions;
determining, from the plurality of fourth semantic representations and the first semantic representation, an attention coefficient vector for a word in the target text corresponding to each fourth semantic representation;
respectively carrying out weighted summation on a plurality of semantic vectors contained in the first semantic representation by using a plurality of determined attention coefficient vectors corresponding to the fourth semantic representations to obtain a plurality of weighted semantic representations; and performing maximum pooling on the plurality of weighted semantic representations to obtain a third semantic representation of the target text corresponding to the category identifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210478341.4A CN114580354B (en) | 2022-05-05 | 2022-05-05 | Information coding method, device, equipment and storage medium based on synonym |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114580354A (en) | 2022-06-03 |
CN114580354B (en) | 2022-10-28 |
Family
ID=81778842
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210478341.4A Active CN114580354B (en) | 2022-05-05 | 2022-05-05 | Information coding method, device, equipment and storage medium based on synonym |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114580354B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116305285B (en) * | 2023-03-30 | 2024-04-05 | 肇庆学院 | Patient information desensitization processing method and system combining artificial intelligence |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239166A (en) * | 2021-05-24 | 2021-08-10 | 清华大学深圳国际研究生院 | Automatic man-machine interaction method based on semantic knowledge enhancement |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107818169B (en) * | 2017-11-13 | 2021-09-07 | 医渡云(北京)技术有限公司 | Electronic medical record retrieval and storage method and device, storage medium and electronic terminal |
CN109785959A (en) * | 2018-12-14 | 2019-05-21 | 平安医疗健康管理股份有限公司 | A kind of disease code method and apparatus |
CN111563209B (en) * | 2019-01-29 | 2023-06-30 | 株式会社理光 | Method and device for identifying intention and computer readable storage medium |
US20200301953A1 (en) * | 2019-03-20 | 2020-09-24 | Microstrategy Incorporated | Indicating synonym relationships using semantic graph data |
CN111506673A (en) * | 2020-03-27 | 2020-08-07 | 泰康保险集团股份有限公司 | Medical record classification code determination method and device |
CN112148871B (en) * | 2020-09-21 | 2024-04-12 | 北京百度网讯科技有限公司 | Digest generation method, digest generation device, electronic equipment and storage medium |
CN112183026B (en) * | 2020-11-27 | 2021-11-23 | 北京惠及智医科技有限公司 | ICD (International Classification of Diseases) encoding method and device, electronic device and storage medium |
CN112489740A (en) * | 2020-12-17 | 2021-03-12 | 北京惠及智医科技有限公司 | Medical record detection method, training method of related model, related equipment and device |
CN112632910A (en) * | 2020-12-21 | 2021-04-09 | 北京惠及智医科技有限公司 | Operation encoding method, electronic device and storage device |
Also Published As
Publication number | Publication date |
---|---|
CN114580354A (en) | 2022-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111695033B (en) | Enterprise public opinion analysis method, enterprise public opinion analysis device, electronic equipment and medium | |
CN111222305B (en) | Information structuring method and device | |
CN112241626A (en) | Semantic matching and semantic similarity model training method and device | |
CN112232024A (en) | Dependency syntax analysis model training method and device based on multi-labeled data | |
CN112580328A (en) | Event information extraction method and device, storage medium and electronic equipment | |
CN110348012B (en) | Method, device, storage medium and electronic device for determining target character | |
CN111159485A (en) | Tail entity linking method, device, server and storage medium | |
CN111274822A (en) | Semantic matching method, device, equipment and storage medium | |
CN112182167B (en) | Text matching method and device, terminal equipment and storage medium | |
CN113761219A (en) | Knowledge graph-based retrieval method and device, electronic equipment and storage medium | |
CN113297351A (en) | Text data labeling method and device, electronic equipment and storage medium | |
CN112613293A (en) | Abstract generation method and device, electronic equipment and storage medium | |
CN114580354B (en) | Information coding method, device, equipment and storage medium based on synonym | |
CN114741468A (en) | Text duplicate removal method, device, equipment and storage medium | |
CN110262906B (en) | Interface label recommendation method and device, storage medium and electronic equipment | |
CN110852066B (en) | Multi-language entity relation extraction method and system based on confrontation training mechanism | |
CN116402166A (en) | Training method and device of prediction model, electronic equipment and storage medium | |
CN116629423A (en) | User behavior prediction method, device, equipment and storage medium | |
CN113705692B (en) | Emotion classification method and device based on artificial intelligence, electronic equipment and medium | |
CN115203206A (en) | Data content searching method and device, computer equipment and readable storage medium | |
CN114547313A (en) | Resource type identification method and device | |
CN110442767B (en) | Method and device for determining content interaction platform label and readable storage medium | |
CN115526176A (en) | Text recognition method and device, electronic equipment and storage medium | |
CN111611981A (en) | Information identification method and device and information identification neural network training method and device | |
CN114792086A (en) | Information extraction method, device, equipment and medium supporting text cross coverage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||