CN114580354B

CN114580354B - Information coding method, device, equipment and storage medium based on synonym

Info

Publication number: CN114580354B
Application number: CN202210478341.4A
Authority: CN
Inventors: 袁正; 谭传奇; 黄松芳
Original assignee: Alibaba Damo Institute Hangzhou Technology Co Ltd
Current assignee: Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date: 2022-05-05
Filing date: 2022-05-05
Publication date: 2022-10-28
Anticipated expiration: 2042-05-05
Also published as: CN114580354A

Abstract

The application provides a synonym-based information encoding method, apparatus, device, and storage medium. The method comprises: encoding the words in a medical record text to obtain a first semantic representation corresponding to the medical record text; acquiring multiple descriptions corresponding to a preset disease code identifier, wherein the multiple descriptions comprise a standard description and synonym descriptions corresponding to the disease code identifier; determining a second semantic representation corresponding to the disease code identifier according to the multiple descriptions; determining a third semantic representation of the medical record text corresponding to the disease code identifier according to the multiple descriptions and the first semantic representation; and determining whether the medical record text is to be labeled with the disease code identifier according to the similarity between the third semantic representation and the second semantic representation. Because the synonym descriptions of disease names are fully used during automatic coding, the medical record text can be coded automatically and accurately.

Description

Information coding method, device, equipment and storage medium based on synonym

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for encoding information based on synonyms.

Background

When a medical institution manages medical record texts, a coder must map each medical record text to standard code identifiers, such as those of the International Classification of Diseases (ICD), for example ICD-9 or ICD-10. This manual coding process is error-prone and labor-intensive.

Disclosure of Invention

The embodiment of the invention provides a synonym-based information encoding method, a synonym-based information encoding device, synonym-based information encoding equipment and a synonym-based storage medium, which are used for improving the accuracy of an information encoding result.

In a first aspect, an embodiment of the present invention provides a method for encoding information based on synonyms, where the method includes:

encoding words in a medical record text to obtain a first semantic representation corresponding to the medical record text;

acquiring multiple descriptions corresponding to a preset disease code identifier, wherein the multiple descriptions comprise standard descriptions and synonym descriptions corresponding to the disease code identifier;

determining a second semantic representation corresponding to the disease coding identification according to the plurality of descriptions;

determining a third semantic representation of the medical record text corresponding to the disease coding identification according to the plurality of descriptions and the first semantic representation;

and determining whether the medical record text is marked with the disease coding identifier according to the similarity between the third semantic representation and the second semantic representation.

In a second aspect, an embodiment of the present invention provides an apparatus for encoding information based on synonyms, where the apparatus includes:

the medical record encoding module is used for encoding words in a medical record text to obtain a first semantic representation corresponding to the medical record text;

the description acquisition module is used for acquiring multiple descriptions corresponding to a preset disease code identifier, wherein the multiple descriptions comprise a standard description and synonym descriptions corresponding to the disease code identifier;

the semantic processing module is used for determining a second semantic representation corresponding to the disease coding identification according to the multiple descriptions; determining a third semantic representation of the medical record text corresponding to the disease-encoding identification based on the plurality of descriptions and the first semantic representation; and determining whether the medical record text is marked with the disease coding identification according to the similarity of the third semantic representation and the second semantic representation.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to implement at least the synonym-based information encoding method of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the synonym-based information encoding method of the first aspect.

In a fifth aspect, an embodiment of the present invention provides a method for encoding information based on synonyms, where the method includes:

coding words in a target text to obtain a first semantic representation corresponding to the target text;

acquiring multiple category descriptions corresponding to preset category identifications, wherein the multiple category descriptions comprise standard descriptions and synonym descriptions corresponding to the category identifications;

determining a second semantic representation corresponding to the category identification according to the plurality of category descriptions;

determining, from the plurality of category descriptions and the first semantic representation, a third semantic representation of the target text corresponding to the category identification;

and determining whether the target text is marked with the category identification according to the similarity of the third semantic representation and the second semantic representation.

The embodiment of the invention enables automatic coding of medical record texts according to the diseases they contain. Specifically, semantic encoding is first performed on each word in the medical record text to obtain a first semantic representation corresponding to the medical record text. For each known disease code identifier (such as the code identifiers contained in ICD-9), the standard description corresponding to the identifier, namely the standard disease name, is obtained, together with the synonym descriptions corresponding to that standard description, giving multiple descriptions for the same disease code identifier. Each of these descriptions is then semantically encoded, and the encoding results are combined to obtain a second semantic representation corresponding to the disease code identifier. Next, according to the multiple descriptions corresponding to a disease code identifier and the first semantic representation, a third semantic representation of the medical record text corresponding to that disease code identifier is determined, namely the semantic representation of the medical record text with respect to the disease code identifier as a label. Finally, whether the medical record text should be labeled with the disease code identifier is determined according to the similarity between the third semantic representation and the second semantic representation.

In the automatic coding process of the medical record text, the synonym description of the disease name is fully utilized, so that the automatic and accurate coding processing of the medical record text can be realized.

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for encoding information based on synonyms according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a medical record encoding process according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for encoding information based on synonyms according to an embodiment of the present disclosure;

FIG. 4 is a flowchart of a method for encoding information based on synonyms according to an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating an application of a synonym-based information encoding method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a synonym-based information encoding device according to an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of an electronic device corresponding to the synonym-based information encoding device provided in the embodiment shown in FIG. 6.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The features of the embodiments and examples described below may be combined with each other without conflict between the embodiments. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.

The synonym-based information encoding method provided by the embodiment of the invention can be executed by an electronic device, wherein the electronic device can be a server or a user terminal, and the server can be a physical server or a virtual server (virtual machine) of a cloud.

Fig. 1 is a flowchart of a method for encoding information based on synonyms according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:

101. Encode the words in the medical record text to obtain a first semantic representation corresponding to the medical record text.

102. Acquire multiple descriptions corresponding to a preset disease code identifier, wherein the multiple descriptions comprise a standard description and synonym descriptions corresponding to the disease code identifier.

103. Determine a second semantic representation corresponding to the disease code identifier according to the multiple descriptions.

104. Determine a third semantic representation of the medical record text corresponding to the disease code identifier according to the multiple descriptions and the first semantic representation.

105. Determine whether the medical record text is to be labeled with the disease code identifier according to the similarity between the third semantic representation and the second semantic representation.

The scheme provided by the embodiment of the invention is applicable to disease coding of medical record texts. Disease coding of a medical record text means determining, from descriptive content such as the disease names contained in the text, the general disease code identifiers with which the text should be labeled. For example, if the medical record text contains "paratyphoid B", the corresponding disease code identifier is A10.2. Once the disease code identifiers of a medical record text are determined automatically, the text can be classified, archived, and queried, which makes it easier for doctors to learn a patient's medical history.

In practical applications, the medical record may be an outpatient medical record or an inpatient medical record. It may be obtained by scanning a handwritten medical record, or generated automatically when a medical record form is filled in directly on a terminal such as a computer. Because the scheme provided by the embodiment of the invention mainly processes the textual content of the medical record, the medical record is collectively referred to as medical record text in the embodiments of the invention.

In order to realize disease coding of medical record texts, firstly, a medical record text needs to be coded to obtain a semantic representation corresponding to the medical record text, which is called as a first semantic representation.

Specifically, the medical record text may describe information such as disease conditions and disease names. Word segmentation is performed on this content to obtain multiple words (or characters), and each word may be encoded as a word vector (for example with word2vec), mapping it into a numeric vector that a computer can process. A neural network model can then be used: the word vectors of the words are input into the model, and the hidden states that the model outputs for the words serve as the semantic vectors of the corresponding words. Finally, the semantic vectors of the words form the first semantic representation corresponding to the medical record text.

In practical applications, word segmentation may also split the text character by character, so that each unit is a single character. The neural network model may be a Bi-directional Long Short-Term Memory (Bi-LSTM) model, an LSTM model, a Recurrent Neural Network (RNN) model, or the like.

For ease of understanding, suppose a medical record text consists of a number of words (or characters), represented as the set:

$$S = \{w_1, w_2, \ldots, w_N\}$$

where $N$ is the total number of words and $w_i$ denotes one of the words. Further, suppose that word-vector encoding of each word yields the following set of word vectors:

$$X = \{x_1, x_2, \ldots, x_N\}$$

where $x_i$ is the word vector corresponding to word $w_i$.

Then, semantic encoding is performed on each word vector in the set, for example by a Bi-LSTM model, to obtain the following encoding result:

$$H = \mathrm{Enc}(x_1, x_2, \ldots, x_N) = (h_1, h_2, \ldots, h_N)$$

where $\mathrm{Enc}$ denotes the semantic encoding computation and $h_i$ denotes the semantic vector corresponding to word vector $x_i$, that is, the hidden state vector output by the model after $x_i$ is input to it. $H$ denotes the matrix formed by the $N$ semantic vectors, namely the first semantic representation.

Since the medical record text is subjected to the disease coding processing, the disease coding identifier corresponding to the current medical record text, that is, the disease coding identifier that should be included in the current medical record text, is actually determined from a plurality of known disease coding identifiers. Therefore, by querying the general disease code identification database, each disease code identification and its corresponding standard disease description content, which is usually a standard disease name, can be obtained. And then, semantic coding processing is carried out on the description content corresponding to each disease coding identification.

In the embodiment of the invention, to improve the accuracy of the disease coding result of the medical record text, both the standard description in the database and the synonym descriptions are considered for every disease code identifier. For example, if the standard description corresponding to a disease code identifier in the database is "typhoid", the synonym descriptions corresponding to that disease code identifier, such as "cold" and "wind chill", can be determined by querying a known medical knowledge graph or the like. The creation of the knowledge graph is not the focus of the embodiments of the invention and is not described in detail.

That is, because the same disease may appear under terms of very different forms, the embodiment of the invention makes full use of the synonym information of disease names when automatically coding medical record text, so that the coding can be completed automatically and accurately.

Since it is not known which diseases are included in the current medical record text when the medical record text is coded, it is necessary to perform a determination process of corresponding semantic representation for each known disease code identifier in the database, and finally determine the disease code identifier included in the medical record text based on the semantic representation corresponding to each disease code identifier.

Since the processing for every disease code identifier is the same, for convenience only an arbitrary disease code identifier $l$ is described below as an example.

Suppose that, according to the database, the standard description corresponding to disease code identifier $l$ is $l^1$, and that the synonym descriptions found by query are $l^2, \ldots, l^M$. These $M$ descriptions then constitute the set of descriptions corresponding to disease code identifier $l$.

The number of descriptions, $M$, is a preset value that can be set as required. Note that if fewer than $M$ descriptions can be found for a disease code identifier, the set may be completed by duplicating its standard description.
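The completion rule above can be sketched as follows. This is a minimal illustration only: the function name `complete_descriptions` is hypothetical, and truncating to the first entries when more synonyms than needed are available is an assumption, since the text only describes padding.

```python
def complete_descriptions(standard, synonyms, m):
    """Return exactly m descriptions for one disease code identifier.

    The standard description comes first, followed by the synonym
    descriptions. If fewer than m descriptions are known, the list is
    padded with copies of the standard description.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    descriptions = [standard] + list(synonyms)
    # Assumption: keep only the first m entries when too many are known.
    descriptions = descriptions[:m]
    while len(descriptions) < m:
        descriptions.append(standard)
    return descriptions


# One synonym is known, but four descriptions are required.
padded = complete_descriptions("typhoid", ["cold"], 4)
```

Here `padded` holds the standard description, the single synonym, and two further copies of the standard description.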

For each description $l^j$ among them, suppose it consists of $N_j$ words (or characters), expressed as:

$$l^j = \{l^j_1, l^j_2, \ldots, l^j_{N_j}\}$$

Then, determining the second semantic representation corresponding to disease code identifier $l$ according to its $M$ descriptions may optionally be implemented as follows:

encoding the $M$ descriptions respectively to obtain $M$ fourth semantic representations, one for each description;

determining the second semantic representation corresponding to disease code identifier $l$ according to the $M$ fourth semantic representations.

Optionally, encoding the $M$ descriptions respectively to obtain the $M$ fourth semantic representations may be implemented as: for any description, encoding each word in the description to obtain a semantic representation corresponding to each word, and performing max pooling on the semantic representations of the words to obtain the fourth semantic representation corresponding to that description.

Optionally, determining the second semantic representation corresponding to disease code identifier $l$ according to the $M$ fourth semantic representations may be implemented as: performing max pooling on the $M$ fourth semantic representations to obtain the second semantic representation corresponding to disease code identifier $l$.

The above processing of each description may be expressed as:

$$q^j = \mathrm{MaxPool}\big(\mathrm{Enc}(x^j_1, x^j_2, \ldots, x^j_{N_j})\big)$$

Here, taking any description $l^j$ among the $M$ descriptions as an example, $x^j_1, \ldots, x^j_{N_j}$ denote the word vectors corresponding to the $N_j$ words contained in description $l^j$. These $N_j$ word vectors are input in sequence into the neural network model used for semantic encoding of the medical record text, such as the Bi-LSTM model, to obtain the semantic encoding result for each word vector, that is, the $N_j$ semantic representations corresponding to the words.

Max pooling ($\mathrm{MaxPool}$ above) is then applied to these $N_j$ semantic representations to obtain the fourth semantic representation $q^j$ corresponding to description $l^j$.

Next, max pooling is applied to the $M$ fourth semantic representations corresponding to the $M$ descriptions of disease code identifier $l$ to obtain the second semantic representation $q_l$ corresponding to $l$. This process can be expressed as:

$$q_l = \mathrm{MaxPool}(q^1, q^2, \ldots, q^M)$$

As the above semantic encoding of the multiple descriptions corresponding to disease code identifier $l$ shows, the resulting second semantic representation $q_l$ contains the semantic information of every description, not only that of the standard description.
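The two pooling stages can be illustrated with a small sketch. It assumes that the per-word semantic vectors of each description have already been produced by some encoder (the encoder itself is not modeled), and the function names are hypothetical.

```python
def max_pool(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def second_representation(encoded_descriptions):
    """Pool word vectors per description, then pool across descriptions.

    encoded_descriptions: one entry per description, each a list of
    per-word semantic vectors (assumed to come from the encoder).
    Returns (fourth_representations, second_representation).
    """
    fourth = [max_pool(word_vectors) for word_vectors in encoded_descriptions]
    return fourth, max_pool(fourth)


# Two descriptions of one code: the first has two words, the second has one.
encoded = [
    [[0.1, 0.9], [0.4, 0.2]],
    [[0.7, 0.3]],
]
fourth, second = second_representation(encoded)
```

With these toy values, each dimension of `second` is taken from whichever description has the largest value in that dimension, which is how the pooled vector can carry information from synonym descriptions as well as from the standard description.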

Then, according to the multiple descriptions corresponding to disease code identifier $l$ and the first semantic representation corresponding to the medical record text, the third semantic representation of the medical record text corresponding to disease code identifier $l$ is determined. Because the medical record text has already been semantically encoded and the relation between the medical record text and each disease code identifier must be considered, "the third semantic representation of the medical record text corresponding to disease code identifier $l$" can in effect be understood as the semantic representation of the medical record text with respect to label $l$ (each disease code identifier is regarded as a classification label). Determining this semantic representation establishes the association between the medical record text and each disease code identifier. The association may be implemented by an attention mechanism.

In general terms, determining the third semantic representation of the medical record text corresponding to disease code identifier $l$ according to the multiple descriptions of $l$ and the first semantic representation may be implemented as:

determining, according to the multiple fourth semantic representations corresponding to the multiple descriptions and the first semantic representation, the attention coefficient vector of the words in the medical record text with respect to each fourth semantic representation; and determining, according to the attention coefficient vectors and the first semantic representation, the third semantic representation of the medical record text corresponding to disease code identifier $l$.

Take any description $l^j$ among the $M$ descriptions corresponding to disease code identifier $l$ as an example. As described above, the fourth semantic representation corresponding to this description is $q^j$, and the first semantic representation is $H$. Following the principle of the attention mechanism, determining the attention coefficient vector of the words in the medical record text with respect to the fourth semantic representation $q^j$ in effect uses $q^j$ as the query and computes attention coefficients over the first semantic representation $H$ of the medical record text. That is, an attention coefficient value is computed for each of the $N$ words contained in the medical record text, namely for each of the $N$ semantic vectors $h_1, \ldots, h_N$ that make up $H$. These $N$ attention coefficient values constitute the attention coefficient vector $\alpha^j$ of the words in the medical record text with respect to the fourth semantic representation $q^j$.

The physical meaning of the attention coefficient vector of the words in the medical record text with respect to the fourth semantic representation $q^j$ can be understood as follows: it reflects the degree to which each word contained in the medical record text contributes to judging that the medical record text contains description $l^j$, and this degree of contribution is expressed by the attention coefficient.

After the attention coefficient vector of the words in the medical record text with respect to the fourth semantic representation $q^j$ is obtained, it is used to compute a weighted sum of the $N$ semantic vectors $h_1, \ldots, h_N$ contained in the first semantic representation $H$, yielding the third semantic representation of the medical record text corresponding to disease code identifier $l$.

In fact, the attention coefficient vector of the words in the medical record text with respect to the fourth semantic representation $q^j$ is a vector of dimension $N$. Its $N$ elements are multiplied one by one with the corresponding $N$ semantic vectors, and the products are then added as vectors, finally giving a vector with the same dimension as each semantic vector $h_i$, which is the third semantic representation.
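The weighted summation can be sketched as below. The text does not state how the attention coefficients are scored, so the softmax over dot products used here is purely an assumption for illustration, as are the function names; it also assumes the query and the semantic vectors have the same dimension.

```python
import math


def attention_coefficients(query, semantic_vectors):
    """One coefficient per word, using the fourth representation as query.

    Assumption: scores are dot products normalized with a softmax; the
    source only says that the fourth representation acts as the query.
    """
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in semantic_vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(coefficients, semantic_vectors):
    """Multiply each semantic vector by its coefficient and add the results."""
    dim = len(semantic_vectors[0])
    return [
        sum(a * vec[k] for a, vec in zip(coefficients, semantic_vectors))
        for k in range(dim)
    ]


H = [[2.0, 0.0], [0.0, 2.0]]   # N = 2 semantic vectors
q = [1.0, 0.0]                 # fourth semantic representation (query)
alpha = attention_coefficients(q, H)
representation = weighted_sum(alpha, H)
```

The result has one coefficient per word and a pooled vector with the same dimension as each semantic vector, matching the description above.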

Finally, the similarity between the third semantic representation of the medical record text corresponding to disease code identifier $l$ and the second semantic representation corresponding to disease code identifier $l$ is calculated, and when the similarity satisfies a set condition, it is determined that the medical record text should be labeled with disease code identifier $l$.

To facilitate understanding of the above-described automatic encoding process, it is schematically illustrated in conjunction with fig. 2.

As shown in fig. 2, in order to realize disease coding of medical record text, a coding system comprising a plurality of functional modules illustrated in the figure can be used, and the coding system can actually form a coding model comprising a semantic coding module, a maximum pooling processing module, an attention calculating module and a similarity output module illustrated in the figure.

The semantic coding module may be the Bi-LSTM model introduced above, and the max-pooling module implements the max pooling ($\mathrm{MaxPool}$) described above. The similarity output module is in effect the output layer of the coding model and is used to calculate the loss function in the training stage, where the loss function is defined by the similarity between the third semantic representation and the second semantic representation.

As shown in FIG. 2, for the medical record text described above, the word vectors corresponding to the words it contains are input into the semantic coding module, which outputs the first semantic representation $H$. The word vectors contained in each description corresponding to any disease code identifier are likewise input into the semantic coding module, and the semantic vectors of the words of a description output by the semantic coding module are input into the max-pooling module to obtain the fourth semantic representation corresponding to that description. As described above, the fourth semantic representations corresponding to the $M$ descriptions of disease code identifier $l$ are $q^1, q^2, \ldots, q^M$. These fourth semantic representations are further processed by the max-pooling module to obtain the second semantic representation corresponding to disease code identifier $l$:

$$q_l = \mathrm{MaxPool}(q^1, q^2, \ldots, q^M)$$

For each fourth semantic representation, the attention calculation module computes, in combination with the first semantic representation, the attention coefficient corresponding to each word in the medical record text, giving one attention coefficient vector per fourth semantic representation: $\alpha^1, \alpha^2, \ldots, \alpha^M$. Then, based on each calculated attention coefficient vector, a weighted sum of the semantic vectors contained in the first semantic representation $H$ is computed, giving multiple weighted semantic representations $u^1, u^2, \ldots, u^M$, where $u^j = \sum_{i=1}^{N} \alpha^j_i h_i$. Finally, max pooling is applied to these weighted semantic representations to obtain the third semantic representation $v_l$ of the medical record text corresponding to disease code identifier $l$:

$$v_l = \mathrm{MaxPool}(u^1, u^2, \ldots, u^M)$$
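The last stage of the FIG. 2 flow, one weighted sum per attention coefficient vector followed by max pooling, might look like the sketch below. The attention coefficient vectors are taken as given inputs here, and all names are hypothetical.

```python
def weighted_sum(coefficients, semantic_vectors):
    """Attention-weighted sum of the per-word semantic vectors."""
    dim = len(semantic_vectors[0])
    return [
        sum(a * vec[k] for a, vec in zip(coefficients, semantic_vectors))
        for k in range(dim)
    ]


def max_pool(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def third_representation(attention_vectors, semantic_vectors):
    """One weighted representation per description, then max pooling."""
    weighted = [weighted_sum(a, semantic_vectors) for a in attention_vectors]
    return max_pool(weighted)


H = [[1.0, 0.0], [0.0, 1.0]]        # two words, two dimensions
alphas = [[1.0, 0.0], [0.0, 1.0]]   # one coefficient vector per description
v = third_representation(alphas, H)
```

In this toy case each description attends to a different word, and the pooled vector keeps the strongest response from each of them.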

Thereafter, the similarity between the third semantic representation $v_l$ and the second semantic representation $q_l$ is calculated.

As shown in FIG. 2, the similarity calculation may be defined as calculating the log-odds-based probability that the medical record text contains label $l$ (i.e., disease code identifier $l$):

$$\hat{y}_l = \sigma\big(v_l^{\top} W q_l\big)$$

where $\sigma$ denotes the Sigmoid function, $\top$ denotes the transpose, and $W$ denotes a biaffine transformation matrix.
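A minimal sketch of a Sigmoid applied to a biaffine product of the third and second semantic representations is given below. The matrix values are placeholders rather than trained parameters, and the function name is hypothetical.

```python
import math


def biaffine_probability(third, matrix, second):
    """Sigmoid of third^T * matrix * second.

    third:  vector of length r
    matrix: r rows by c columns (the biaffine transformation matrix)
    second: vector of length c
    """
    projected = [sum(w * s for w, s in zip(row, second)) for row in matrix]
    logit = sum(t * p for t, p in zip(third, projected))
    # Numerically stable Sigmoid for both signs of the logit.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder, not a trained matrix
aligned = biaffine_probability([1.0, 2.0], identity, [1.0, 2.0])
orthogonal = biaffine_probability([1.0, 0.0], identity, [0.0, 1.0])
```

With the placeholder identity matrix, aligned vectors give a probability close to 1, while orthogonal vectors give a logit of 0 and therefore a probability of 0.5.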

In the training stage of the coding model, when a medical record text is used as a training sample, the disease code identifiers it contains are labeled in advance and serve as supervision information. The similarity defined above in effect reflects how similar the medical record text is to any disease code identifier $l$. By traversing every disease code identifier in the disease code identifier database, the similarity between the medical record text and each disease code identifier can be obtained. A similarity threshold can be set: if the similarity between the medical record text and a disease code identifier is greater than the threshold, the medical record text is considered to contain that disease code identifier. The disease code identifiers thus determined for the medical record text are compared with the pre-labeled supervision information, so that the coding model parameters can be adjusted according to the loss function value. When the model is trained to convergence, a biaffine transformation matrix $W$ suitable for various diseases is obtained. Through training of this matrix, the coding model can overcome its dependence on long-tail data, that is, overcome the influence of sample imbalance, which mainly manifests as the small number of descriptions that can be collected for some disease code identifiers.
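The threshold rule described above can be sketched as follows. The similarity values, the identifiers other than the A10.2 example used earlier in the text, and the threshold of 0.5 are illustrative assumptions only.

```python
def select_codes(similarity_by_code, threshold):
    """Return the code identifiers whose similarity exceeds the threshold."""
    return sorted(
        code for code, similarity in similarity_by_code.items()
        if similarity > threshold
    )


# Hypothetical similarity values for one medical record text.
similarities = {"A10.2": 0.91, "CODE_X": 0.12, "CODE_Y": 0.50}
labels = select_codes(similarities, threshold=0.5)
```

Only identifiers strictly above the threshold are kept, so `CODE_Y` at exactly 0.50 is not selected here.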

Regarding the step mentioned above, in which the multiple descriptions corresponding to the disease code identifier are encoded respectively to obtain multiple fourth semantic representations, and the attention coefficient vector of the words in the medical record text with respect to each fourth semantic representation is determined according to the multiple fourth semantic representations and the first semantic representation corresponding to the medical record text: an embodiment of the present invention provides an optional way of determining the attention coefficient vector, as shown in FIG. 3.

Fig. 3 is a flowchart of a synonym-based information encoding method according to an embodiment of the present invention. As shown in fig. 3, the method may include the following steps:

301. Coding a plurality of words in the medical record text to obtain a first semantic representation corresponding to the medical record text, wherein the first semantic representation is formed by a plurality of semantic vectors corresponding to the plurality of words.

302. Acquiring multiple descriptions formed by the standard description and the synonym descriptions corresponding to a preset disease coding identification, coding the multiple descriptions respectively to obtain multiple fourth semantic representations corresponding to the multiple descriptions, and determining a second semantic representation corresponding to the disease coding identification according to the multiple fourth semantic representations.

For the execution of the above steps, reference may be made to the related description in the foregoing embodiments, which is not repeated here.
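As a non-limiting illustration of step 302, the sketch below uses maximum pooling, which this document names as one way of obtaining the fourth and second semantic representations. The word vectors are made-up values standing in for the output of an encoder, and the names `max_pool`, `standard_description` and `synonym_description` are assumptions for this example.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

# Hypothetical word vectors for two descriptions of one disease coding identification.
standard_description = [[0.1, 0.9, -0.2], [0.4, 0.2, 0.0]]
synonym_description = [[0.3, -0.1, 0.5], [0.0, 0.6, 0.1], [0.2, 0.2, 0.2]]

# One fourth semantic representation per description (pooling over its words).
fourth_representations = [max_pool(standard_description),
                          max_pool(synonym_description)]

# The second semantic representation (pooling over the descriptions).
second_representation = max_pool(fourth_representations)
```

Because the pooling is element-wise, descriptions with different numbers of words still yield fourth semantic representations of the same dimension.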

303. Segmenting the first semantic representation into a plurality of semantic blocks, wherein each semantic block comprises a plurality of sub-semantic vectors corresponding to the plurality of words, each sub-semantic vector is formed by part of the dimensions of the corresponding semantic vector, and the number of semantic blocks is equal to the number of descriptions.

Continuing with the first semantic representation described above and the multiple descriptions corresponding to any disease code identification, the first semantic representation is cut into as many semantic blocks of equal size as there are descriptions. The first semantic representation consists of the semantic vectors corresponding to the words included in the medical record text, one semantic vector per word.

The segmentation is performed as follows. Assume that the semantic vectors forming the first semantic representation are arranged as a matrix with one row per word and K columns, where each semantic vector is K-dimensional. The K columns are divided equally into as many groups as there are descriptions, and each group constitutes a semantic block. For example, if K = 100 and there are 10 descriptions, every 10 columns form a group and 10 semantic blocks are obtained. Each semantic block contains part of the dimensions of every row's semantic vector, and these partial vectors are called sub-semantic vectors, one per word.

For convenience of description, the segmentation result of the first semantic representation is referred to below as the sequence of these semantic blocks.
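As a non-limiting illustration, the column-wise segmentation of step 303 can be sketched as follows for the K = 100 example. The variable names `N`, `K`, `M` and the function `split_into_blocks` are assumptions introduced for this sketch, and the matrix entries are placeholder values.

```python
def split_into_blocks(H, num_blocks):
    """Split an N x K matrix (a list of rows) column-wise into equal blocks."""
    K = len(H[0])
    if K % num_blocks != 0:
        raise ValueError("K must be divisible by the number of blocks")
    width = K // num_blocks
    return [[row[j * width:(j + 1) * width] for row in H]
            for j in range(num_blocks)]

N, K, M = 3, 100, 10  # 3 words, 100-dimensional vectors, 10 descriptions
# Placeholder first semantic representation: entry (n, k) holds n * K + k.
H = [[float(n * K + k) for k in range(K)] for n in range(N)]

blocks = split_into_blocks(H, M)
# blocks[j] has N rows; each row is a 10-dimensional sub-semantic vector.
```

Each block keeps all N rows, so every word contributes one sub-semantic vector to every block.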

304. Determining an attention coefficient vector of the plurality of sub-semantic vectors in a target semantic block with respect to a target fourth semantic representation, wherein the target fourth semantic representation has the same sequence number as the target semantic block, and the target fourth semantic representation is any one of the plurality of fourth semantic representations.

Continuing the above example, for any fourth semantic representation, the attention coefficient vector of the target semantic block with respect to that fourth semantic representation is computed. That is, the fourth semantic representation is used as the query (Query) to compute an attention coefficient for each of the sub-semantic vectors in the target semantic block, where the sequence number of the target semantic block is the same as the sequence number of the fourth semantic representation. In summary, given the fourth semantic representations of the descriptions and an equal number of semantic blocks, attention is calculated between the fourth semantic representations and the semantic blocks in one-to-one correspondence. With this calculation, the trained coding model can focus during attention calculation on the semantic information that matters most for predicting the disease coding identification, that is, it assigns larger attention coefficients to the semantic information that is more important for accurately predicting the disease coding identification.

Taking one fourth semantic representation as an example, the attention calculation between it and its target semantic block applies the tanh function together with weight coefficient matrices to solve for the attention coefficient vector. Here, tanh is the hyperbolic tangent function, which may be replaced by a ReLU function or the like.
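As a non-limiting illustration, one common way to realize an attention calculation of this kind is sketched below. It assumes that the fourth semantic representation (the query) is projected by one weight matrix, that each sub-semantic vector is projected by another weight matrix and passed through tanh, and that the resulting scores are normalized with a softmax. The softmax normalization, the matrix shapes, the names `W_H` and `W_Q`, and all numeric values are assumptions of this sketch.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def block_attention(block, query, W_H, W_Q):
    """Attention coefficients of the sub-semantic vectors in one semantic
    block with respect to one fourth semantic representation (the query)."""
    projected_query = matvec(W_Q, query)
    scores = []
    for sub_vector in block:
        hidden = [math.tanh(v) for v in matvec(W_H, sub_vector)]
        scores.append(sum(p * h for p, h in zip(projected_query, hidden)))
    return softmax(scores)

# Toy values: 3 sub-semantic vectors of dimension 2, a query of dimension 3.
block = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5, 1.0]
W_H = [[1.0, 0.0], [0.0, 1.0]]
W_Q = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

alpha = block_attention(block, query, W_H, W_Q)
```

The result has one coefficient per word, and the coefficients sum to 1, so they can be used directly as weights in a weighted summation.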

305. Performing weighted summation on the plurality of semantic vectors contained in the first semantic representation with each of the determined attention coefficient vectors corresponding to the plurality of fourth semantic representations to obtain a plurality of weighted semantic representations, and performing maximum pooling on the plurality of weighted semantic representations to obtain a third semantic representation of the medical record text corresponding to the disease coding identification.

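The third semantic representation is calculated as follows: each of the attention coefficient vectors corresponding to the plurality of fourth semantic representations is used to weight and sum the semantic vectors of the first semantic representation, yielding one weighted semantic representation per fourth semantic representation, and maximum pooling over these weighted semantic representations gives the third semantic representation.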
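As a non-limiting illustration, the weighted summation and maximum pooling of step 305 can be sketched as follows. The matrix `H`, the attention coefficient vectors in `alphas` and the function names are made-up values and names for this example.

```python
def weighted_sum(H, alpha):
    """Sum the full semantic vectors (rows of H) weighted by attention coefficients."""
    K = len(H[0])
    return [sum(a * row[k] for a, row in zip(alpha, H)) for k in range(K)]

def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]

# First semantic representation: 2 words, 4 dimensions (toy values).
H = [[1.0, 0.0, 2.0, 4.0],
     [0.0, 2.0, 2.0, 0.0]]

# One attention coefficient vector per fourth semantic representation (toy values).
alphas = [[1.0, 0.0],
          [0.5, 0.5]]

weighted = [weighted_sum(H, a) for a in alphas]  # one weighted representation each
third_representation = max_pool(weighted)
```

Note that the attention coefficients are computed on semantic blocks, while the weighted summation here is applied to the full semantic vectors, as stated in step 305.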

306. Determining whether the medical record text is marked with the disease coding identification according to the similarity between the third semantic representation and the second semantic representation.

Optionally, the similarity between the third semantic representation and the second semantic representation can be determined from the third semantic representation, the second semantic representation and the trained double affine transformation matrix. If the similarity is greater than the set threshold, the disease corresponding to the disease code identification is considered to be present in the medical record text, that is, the medical record text is marked with the disease code identification.
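As a non-limiting illustration, a double affine (biaffine) similarity followed by a threshold decision can be sketched as follows. The sketch assumes the similarity is the bilinear form of the third semantic representation, the trained matrix and the second semantic representation, mapped to a probability by a sigmoid; the sigmoid, the threshold value and all numbers are assumptions of this example.

```python
import math

def biaffine_score(v, W, q):
    """Compute v^T W q for vectors v, q and matrix W (a list of rows)."""
    Wq = [sum(w * qi for w, qi in zip(row, q)) for row in W]
    return sum(vi * x for vi, x in zip(v, Wq))

def is_labeled(v, W, q, threshold=0.5):
    """Decide whether the text is marked with the code identification."""
    probability = 1.0 / (1.0 + math.exp(-biaffine_score(v, W, q)))
    return probability > threshold

third = [1.0, 2.0]             # third semantic representation (toy values)
second = [0.5, -1.0]           # second semantic representation (toy values)
W = [[2.0, 0.0], [0.0, 1.0]]   # stands in for the trained matrix

score = biaffine_score(third, W, second)
decision = is_labeled(third, W, second)
```

With these toy values the score is negative, so the text is not marked with this code identification; a second semantic representation that aligns better with the third one would yield a positive score.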

The above embodiments introduce a scheme for disease coding of medical record texts. In fact, similar requirements exist in many other application scenarios. The requirement can be summarized as follows: given a plurality of preset category identifications, each with a standard category (name) description set in advance, the category identifications corresponding to a currently input text need to be determined. If only the correspondence between each category identification and its standard category description is established in advance, and the text is labeled with category identifications based on that correspondence alone, accuracy is limited, because a category identification is labeled on the text only when the text contains the standard category description corresponding to that category identification.

To improve the accuracy of labeling a text with its associated category identifications, an embodiment of the present invention provides a general solution based on the scheme of the foregoing embodiments. As shown in fig. 4, the method includes the following steps:

401. Coding the words in the target text to obtain a first semantic representation corresponding to the target text.

402. Acquiring multiple category descriptions corresponding to a preset category identification, wherein the multiple category descriptions comprise a standard description and synonym descriptions corresponding to the category identification.

403. Determining a second semantic representation corresponding to the category identification according to the multiple category descriptions.

404. Determining a third semantic representation of the target text corresponding to the category identification according to the multiple category descriptions and the first semantic representation.

405. Determining whether the target text is marked with the category identification according to the similarity between the third semantic representation and the second semantic representation.

The target text can be, for example, the medical record text in the foregoing embodiments, in which case the category identifications are the various disease code identifiers. The target text may also be a description text of a product, in which case the category identification may be a category name of the product. For example, the category identification for potato may correspond to multiple descriptions: potatoes, yam eggs, and the like.

The implementation of this embodiment may refer to the related descriptions in the foregoing other embodiments, and will not be described herein.
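As a non-limiting illustration, steps 401 to 405 can be sketched end to end on the potato example. The sketch makes several simplifying assumptions that are not part of the described scheme: a hand-written two-dimensional embedding table replaces the encoder, attention is a plain dot product with softmax over full vectors (no semantic blocks and no learned weights), the similarity is a plain dot product, and the threshold 0.4 is arbitrary.

```python
import math

EMBEDDINGS = {  # hypothetical 2-dimensional word vectors
    "potato": [1.0, 0.0], "yam": [0.8, 0.1], "egg": [0.7, 0.2],
    "apple": [0.0, 1.0], "fresh": [0.1, 0.1],
}

def encode(words):
    return [EMBEDDINGS.get(w, [0.0, 0.0]) for w in words]

def max_pool(vectors):
    return [max(col) for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(H, query):
    """Softmax dot-product attention, then weighted sum of the rows of H."""
    scores = [dot(h, query) for h in H]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    return [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(len(H[0]))]

def label_text(text, categories, threshold=0.4):
    H = encode(text.split())                                  # step 401
    labels = []
    for category, descriptions in categories.items():         # step 402
        fourth = [max_pool(encode(d.split())) for d in descriptions]
        second = max_pool(fourth)                              # step 403
        third = max_pool([attend(H, q) for q in fourth])       # step 404
        if dot(third, second) > threshold:                     # step 405
            labels.append(category)
    return labels

categories = {"potato": ["potato", "yam egg"], "apple": ["apple"]}
labels = label_text("fresh yam egg", categories)
```

The text never contains the standard description "potato", yet it is labeled with the potato category because the synonym description contributes to both the second and the third semantic representations.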

As described above, the synonym-based information encoding method provided by the present invention can be executed in the cloud. A plurality of computing nodes may be deployed in the cloud, each having processing resources such as computing and storage. In the cloud, a plurality of computing nodes may be organized to provide one service, and one computing node may also provide one or more services. The cloud may provide a service by exposing a service interface, which the user calls to use the corresponding service. The service interface may take the form of a Software Development Kit (SDK), an Application Programming Interface (API), or other forms.

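For the scheme provided by the embodiments of the present invention, the cloud can provide a service interface for the information coding service. A user calls the service interface through user equipment to send a call request containing a medical record text to the cloud. The cloud determines the computing node that responds to the request, and performs the following steps using the processing resources of that computing node:

coding words in the medical record text to obtain a first semantic representation corresponding to the medical record text;

acquiring multiple descriptions corresponding to a preset disease code identifier, wherein the multiple descriptions comprise a standard description and synonym descriptions corresponding to the disease code identifier;

determining a second semantic representation corresponding to the disease coding identification according to the multiple descriptions;

determining a third semantic representation of the medical record text corresponding to the disease coding identification according to the multiple descriptions and the first semantic representation;

and determining whether the medical record text is marked with the disease coding identification according to the similarity between the third semantic representation and the second semantic representation.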

In addition, the model training task described in the foregoing embodiment may also be completed by the computing node in the cloud.

For ease of understanding, an example is described with reference to fig. 5. The user can call the information coding service interface (the API interface in the figure) through the user equipment E1 illustrated in fig. 5, and upload a service request containing a medical record text through the interface. In the cloud, as shown in the figure, in addition to a plurality of computing nodes, a management node E2 running a management and control service is deployed. After receiving the service request sent by the user equipment E1, the management node E2 determines a computing node E3 to respond to the service request. After receiving the medical record text, the computing node E3 executes the above steps and outputs the disease code identifications associated with the medical record text, which are sent to the user equipment E1, and the user equipment E1 displays the final detection result. For the detailed implementation process, reference is made to the descriptions in the foregoing embodiments, which are not repeated here.
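As a non-limiting illustration, the request flow among the user equipment, the management node and a computing node can be sketched as follows. Everything in this sketch is hypothetical: the class names, the round-robin node selection, the request and response fields, and the code identifiers `CODE_A` and `CODE_B`. A plain substring match stands in for the semantic encoding steps purely to keep the example short; it is not the described method.

```python
from itertools import cycle

class ComputeNode:
    def __init__(self, name, known_codes):
        self.name = name
        self.known_codes = known_codes  # code identifier -> list of descriptions

    def handle(self, record_text):
        # Placeholder for the encoding steps: a substring match stands in
        # for the semantic similarity computation.
        text = record_text.lower()
        return [code for code, descriptions in self.known_codes.items()
                if any(d in text for d in descriptions)]

class ManagementNode:
    def __init__(self, nodes):
        self._nodes = cycle(nodes)  # round-robin selection (an assumption)

    def dispatch(self, request):
        node = next(self._nodes)  # determine the node that responds
        return {"node": node.name,
                "codes": node.handle(request["record_text"])}

codes = {"CODE_A": ["hypertension", "high blood pressure"],
         "CODE_B": ["diabetes"]}
manager = ManagementNode([ComputeNode("E3", codes), ComputeNode("E4", codes)])

# The user equipment uploads a request containing a medical record text.
response = manager.dispatch({"record_text": "Patient has high blood pressure."})
```

The response carries the code identifications back to the caller, which corresponds to the result displayed by the user equipment.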

The synonym-based information encoding device according to one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these devices can each be constructed from commercially available hardware components configured through the steps taught in this scheme.

Fig. 6 is a schematic structural diagram of a synonym-based information encoding device according to an embodiment of the present invention. As shown in fig. 6, the device includes: a medical record encoding module 11, a description obtaining module 12 and a semantic processing module 13.

The medical record encoding module 11 is configured to encode words in a medical record text to obtain a first semantic representation corresponding to the medical record text.

The description obtaining module 12 is configured to obtain multiple descriptions corresponding to a preset disease code identifier, where the multiple descriptions include a standard description and a synonym description corresponding to the disease code identifier.

The semantic processing module 13 is configured to: determine, according to the multiple descriptions, a second semantic representation corresponding to the disease coding identifier; determine a third semantic representation of the medical record text corresponding to the disease coding identifier according to the multiple descriptions and the first semantic representation; and determine whether the medical record text is marked with the disease coding identifier according to the similarity between the third semantic representation and the second semantic representation.

Optionally, in the process of determining the second semantic representation corresponding to the disease coding identifier, the semantic processing module 13 is specifically configured to: code the multiple descriptions respectively to obtain multiple fourth semantic representations corresponding to the multiple descriptions; and determine the second semantic representation corresponding to the disease coding identifier according to the multiple fourth semantic representations.

Optionally, the semantic processing module 13 is specifically configured to: for any description, code each word in the description to obtain a semantic representation corresponding to each word; perform maximum pooling on the semantic representations corresponding to the words to obtain the fourth semantic representation corresponding to the description; and perform maximum pooling on the multiple fourth semantic representations to obtain the second semantic representation corresponding to the disease coding identifier.

Optionally, in the process of determining the third semantic representation of the medical record text corresponding to the disease coding identifier according to the multiple descriptions and the first semantic representation, the semantic processing module 13 is specifically configured to: determine, according to the multiple fourth semantic representations and the first semantic representation, an attention coefficient vector of the words in the medical record text corresponding to each fourth semantic representation; and determine the third semantic representation of the medical record text corresponding to the disease coding identifier according to the attention coefficient vectors and the first semantic representation.

Optionally, the medical record text includes a plurality of words, and the first semantic representation is formed by a plurality of semantic vectors corresponding to the plurality of words. On this basis, in the process of determining the attention coefficient vector of the words in the medical record text corresponding to each fourth semantic representation, the semantic processing module 13 is specifically configured to: segment the first semantic representation into a plurality of semantic blocks, wherein each semantic block comprises a plurality of sub-semantic vectors corresponding to the plurality of words, each sub-semantic vector is composed of part of the dimensions of the corresponding semantic vector, and the number of semantic blocks is equal to the number of descriptions; and determine an attention coefficient vector of the plurality of sub-semantic vectors in a target semantic block with respect to a target fourth semantic representation, wherein the target fourth semantic representation has the same sequence number as the target semantic block, and the target fourth semantic representation is any one of the plurality of fourth semantic representations.

Optionally, in the process of determining the third semantic representation of the medical record text corresponding to the disease coding identifier, the semantic processing module 13 is specifically configured to: perform weighted summation on the plurality of semantic vectors contained in the first semantic representation with each of the determined attention coefficient vectors corresponding to the fourth semantic representations to obtain a plurality of weighted semantic representations; and perform maximum pooling on the weighted semantic representations to obtain the third semantic representation of the medical record text corresponding to the disease coding identifier.

Optionally, the semantic processing module 13 is specifically configured to: determine the similarity between the third semantic representation and the second semantic representation according to the third semantic representation, the second semantic representation and a trained double affine transformation matrix.

The apparatus shown in fig. 6 can perform the steps provided in the foregoing embodiments. For the detailed execution process and technical effects, refer to the description in the foregoing embodiments, which is not repeated here.

In one possible design, the structure of the synonym-based information encoding apparatus shown in fig. 6 may be implemented as an electronic device. As shown in fig. 7, the electronic device may include: a processor 21, a memory 22, and a communication interface 23. The memory 22 stores executable code which, when executed by the processor 21, causes the processor 21 to implement at least the synonym-based information encoding method provided in the foregoing embodiments.

In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to implement at least the synonym-based information encoding method as provided in the foregoing embodiments.

The above described embodiments of the apparatus are merely illustrative, wherein the network elements illustrated as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by means of a necessary general hardware platform, and of course can also be implemented by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a computer program product, which may be implemented on one or more computer-usable storage media containing computer-usable program code, including but not limited to disk storage, CD-ROM, optical storage, and the like.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A synonym-based information coding method is characterized by comprising the following steps:

coding words in a medical record text to obtain a first semantic representation corresponding to the medical record text;

determining whether the medical record text is marked with the disease coding identifier according to the similarity between the third semantic representation and the second semantic representation;

wherein the determining of the third semantic representation comprises:

respectively coding the multiple descriptions to obtain multiple fourth semantic representations corresponding to the multiple descriptions;

determining an attention coefficient vector of a word in the medical record text corresponding to each fourth semantic representation according to the plurality of fourth semantic representations and the first semantic representation;

respectively carrying out weighted summation on a plurality of semantic vectors contained in the first semantic representation by using a plurality of determined attention coefficient vectors corresponding to the fourth semantic representations to obtain a plurality of weighted semantic representations; performing maximal pooling on the weighted semantic representations to obtain a third semantic representation of the medical record text corresponding to the disease coding identification.

2. The method of claim 1, wherein determining a second semantic representation corresponding to the disease coding identifier from the plurality of descriptions comprises:

determining the second semantic representation corresponding to the disease coding identifier according to the plurality of fourth semantic representations.

3. The method according to claim 2, wherein said separately encoding said plurality of descriptions to obtain a plurality of fourth semantic representations corresponding to said plurality of descriptions comprises:

aiming at any description, coding each word in any description to obtain semantic representation corresponding to each word;

and performing maximum pooling on the semantic representation corresponding to each word to obtain a fourth semantic representation corresponding to any description.

4. The method according to claim 2, wherein the determining the second semantic representation to which the disease coding identifier corresponds according to the plurality of fourth semantic representations comprises:

performing maximal pooling on the fourth semantic representations to obtain the second semantic representation corresponding to the disease coding identifier.

5. The method of claim 1, wherein the medical record text includes a plurality of words, and the first semantic representation is formed by a plurality of semantic vectors corresponding to the plurality of words;

the determining, from the plurality of fourth semantic representations and the first semantic representation, an attention coefficient vector for a word in the medical record text corresponding to each fourth semantic representation includes:

segmenting the first semantic representation into a plurality of semantic blocks, wherein each semantic block comprises a plurality of sub-semantic vectors corresponding to the plurality of words, each sub-semantic vector is formed by partial dimensions in the corresponding semantic vector, and the number of the semantic blocks is equal to that of the plurality of descriptions;

determining attention coefficient vectors of a plurality of sub-semantic vectors in a target semantic block corresponding to a target fourth semantic representation, wherein the target fourth semantic representation has the same sequence number as the target semantic block, and the target fourth semantic representation is any one of the plurality of fourth semantic representations.

6. The method of claim 1, further comprising:

and determining the similarity of the third semantic representation and the second semantic representation according to the third semantic representation, the second semantic representation and a trained double affine transformation matrix.

7. An information encoding device based on synonyms, comprising:

a description acquisition module, configured to acquire a plurality of descriptions corresponding to a preset disease code identifier, wherein the plurality of descriptions comprise a standard description and synonym descriptions corresponding to the disease code identifier;

the semantic processing module is used for determining a second semantic representation corresponding to the disease coding identification according to the plurality of descriptions; determining a third semantic representation of the medical record text corresponding to the disease coding identification according to the plurality of descriptions and the first semantic representation; determining whether the medical record text is marked with the disease coding identifier according to the similarity between the third semantic representation and the second semantic representation;

wherein, in the process of determining the third semantic representation, the semantic processing module is specifically configured to: respectively coding the multiple descriptions to obtain multiple fourth semantic representations corresponding to the multiple descriptions; determining an attention coefficient vector of a word in the medical record text corresponding to each fourth semantic representation according to the plurality of fourth semantic representations and the first semantic representation; respectively carrying out weighted summation on a plurality of semantic vectors contained in the first semantic representation by using a plurality of determined attention coefficient vectors corresponding to the fourth semantic representations to obtain a plurality of weighted semantic representations; performing maximal pooling on the weighted semantic representations to obtain a third semantic representation of the medical record text corresponding to the disease coding identification.

8. An electronic device, comprising: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the synonym-based information encoding method of any one of claims 1-6.

9. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the synonym-based information encoding method of any one of claims 1-6.

10. A synonym-based information coding method is characterized by comprising the following steps:

determining a third semantic representation of the target text corresponding to the category identification according to the multiple category descriptions and the first semantic representation;

determining whether the category identification is marked in the target text or not according to the similarity of the third semantic representation and the second semantic representation;

wherein the determining of the third semantic representation comprises:

respectively encoding the multiple category descriptions to obtain multiple fourth semantic representations corresponding to the multiple category descriptions;

determining, from the plurality of fourth semantic representations and the first semantic representation, an attention coefficient vector for a word in the target text corresponding to each fourth semantic representation;

respectively carrying out weighted summation on a plurality of semantic vectors contained in the first semantic representation by using a plurality of determined attention coefficient vectors corresponding to the fourth semantic representations to obtain a plurality of weighted semantic representations; and performing maximum pooling on the plurality of weighted semantic representations to obtain a third semantic representation of the target text corresponding to the category identifier.