CN113051373B

CN113051373B - Text analysis method, text analysis device, electronic equipment and storage medium

Info

Publication number: CN113051373B
Application number: CN202110420438.5A
Authority: CN
Inventors: 甘露; 胡加学; 赵景鹤; 贺志阳
Original assignee: Iflytek Medical Technology Co ltd
Current assignee: Iflytek Medical Technology Co ltd
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2024-02-13
Anticipated expiration: 2041-04-19
Also published as: CN113051373A

Abstract

The invention provides a text analysis method, a text analysis device, electronic equipment and a storage medium, wherein the method comprises the following steps: determining a disease description text to be analyzed; based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively, a disease type corresponding to the disease description text is determined. The invention is based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources, so that the medical knowledge of each source can be fused to determine the disease type corresponding to the disease description text.

Description

Text analysis method, text analysis device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a text analysis method, a text analysis device, an electronic device, and a storage medium.

Background

With the popularization of the internet, a patient can book an online appointment with a doctor of the appropriate department according to the patient's own disease type; to determine the appropriate department accurately, the patient's disease type must be estimated accurately.

At present, the disease type of a patient is determined by acquiring a disease description text of the patient and analyzing the text with an end-to-end model. However, the patient's disease description text tends toward colloquial expression, so an end-to-end model trained on professional medical samples cannot accurately determine the corresponding disease type from such a colloquially expressed disease description text.

Disclosure of Invention

The invention provides a text analysis method, a text analysis device, electronic equipment and a storage medium, which are used for solving the defect that the disease type corresponding to a disease description text cannot be accurately determined in the prior art.

The invention provides a text analysis method, which comprises the following steps:

determining a disease description text to be analyzed;

and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and medical knowledge of a plurality of independent sources respectively.

According to the text analysis method provided by the invention, the disease type corresponding to the disease description text is determined based on the correlation between the disease description text and medical knowledge of a plurality of independent sources, and the method comprises the following steps:

determining a disease representation of the disease description text based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively, and textual representations of the disease description text under each independent source;

and determining the disease type corresponding to the disease description text based on the disease representation.

According to the text analysis method provided by the invention, the determining of the disease representation of the disease description text based on the correlation between the disease description text and medical knowledge of a plurality of independent sources and the text representation of the disease description text under each independent source further comprises:

determining a textual representation of the disease description text under each independent source based on independent text encoding rules for each independent source, the independent text encoding rules being determined based on medical knowledge under the corresponding independent source;

and performing self-attention calculation on the text representations under each independent source to obtain the correlation between the disease description text and medical knowledge of a plurality of independent sources respectively.

According to the text analysis method provided by the invention, the determining of the disease representation of the disease description text based on the correlation between the disease description text and medical knowledge of a plurality of independent sources and the text representation of the disease description text under each independent source comprises the following steps:

taking the correlation between the disease description text and the medical knowledge of each independent source as a weight, and carrying out weighted summation on text representations of the disease description text under each independent source to obtain a first disease representation;

determining a second disease representation of the disease description text based on a generic text encoding rule, the generic text encoding rule determined based on medical knowledge that mixes the plurality of independent sources;

and determining a disease representation of the disease description text based on the first disease representation and the second disease representation.

According to the text analysis method provided by the invention, the disease type corresponding to the disease description text is determined based on the disease representation, and the method comprises the following steps:

determining a correlation between the disease description text and each candidate disease type based on candidate disease representations of each candidate disease type and the disease representations;

and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and each candidate disease type and each candidate disease representation.

According to the text analysis method provided by the invention, the determining of the disease description text to be analyzed comprises the following steps:

determining an initial disease description text to be analyzed;

performing sequence labeling on the initial disease description text, and determining objective description information in the initial disease description text;

extracting the text of the initial disease description text, and determining subjective description information in the initial disease description text;

and determining the disease description text to be analyzed based on the objective description information and the subjective description information.

According to the text analysis method provided by the invention, the medical knowledge of the independent sources comprises at least two of human-computer interaction inquiry knowledge, outpatient medical record data, inpatient medical record data, disease knowledge base and Internet disease encyclopedia knowledge.

The invention also provides a text analysis device, comprising:

a text determining unit for determining a disease description text to be analyzed;

and the text analysis unit is used for determining the disease type corresponding to the disease description text based on the correlation between the disease description text and medical knowledge of a plurality of independent sources.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any one of the text analysis methods described above when executing the computer program.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the text analysis method as described in any of the above.

According to the text analysis method, the device, the electronic equipment and the storage medium, based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources, the disease types corresponding to the disease description text can be determined by fusing the medical knowledge of each source.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a text analysis method provided by the invention;

FIG. 2 is a flow chart of a disease type acquisition method provided by the present invention;

FIG. 3 is a flow chart of a disease representation determination method provided by the present invention;

FIG. 4 is a flow chart of another disease type acquisition method provided by the present invention;

FIG. 5 is a flow chart of a method for obtaining a disease description text provided by the invention;

FIG. 6 is a flow chart of a text analysis model training method provided by the invention;

FIG. 7 is a schematic diagram of a text analysis device according to the present invention;

FIG. 8 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The traditional method obtains a disease description text of a patient and analyzes the text with an end-to-end model to determine the patient's disease type. However, because the patient's disease description text tends toward colloquial expression, an end-to-end model trained on professional medical samples cannot accurately determine the corresponding disease type from such a colloquially expressed text. In addition, the traditional method may take the patient's disease description text as input and use a task-oriented dialogue system (pipeline) to identify the patient's disease type; but the modules in the pipeline are connected in series, so the errors of each module propagate to the final output module and reduce the identification accuracy.

In this regard, the present invention provides a text analysis method. Fig. 1 is a schematic flow chart of a text analysis method provided by the invention, as shown in fig. 1, the method comprises the following steps:

step 110, determining a disease description text to be analyzed;

step 120, determining a disease type corresponding to the disease description text based on correlation between the disease description text and medical knowledge of a plurality of independent sources, respectively.

In particular, the disease description text is used to describe the status of a patient's disease, such as the patient's symptoms, signs, causes, past history, and the like. The disease description text may be an electronic text, a text obtained by performing optical character recognition OCR on a paper text, or a text obtained by sorting according to a recording of a patient, which is not particularly limited in the embodiment of the present invention.

In general, the disease description text is mostly obtained from the patient's oral description of his or her own condition; that is, the disease description text tends toward colloquial expression, contains a large number of divergent colloquial descriptions, and uses few professional terms. For example, a patient whose lungs are uncomfortable but who has no specialized medical knowledge may describe the condition as "itching of the lungs"; medically, however, the lungs have no "itching" symptom, and the professional description should be "lung discomfort".

Because the disease description text tends toward colloquial expression and uses few professional terms, if the end-to-end model of the traditional method, trained on professional medical samples, is used to analyze the disease description text, the model cannot identify the corresponding medical knowledge in the colloquially expressed text and therefore cannot accurately determine the corresponding disease type. If machine learning were instead performed on colloquial disease description texts alone, a large number of colloquial samples would have to be collected for training. This would require extensive manual labeling, and colloquial samples are highly varied: the same meaning may be expressed in many different ways, and the wording is arbitrary. In practice it is therefore difficult to collect enough colloquial samples for training.

Therefore, the embodiment of the invention fuses a large amount of medical knowledge from a plurality of independent sources, already organized in the existing medical field, to determine the disease type corresponding to the disease description text. The medical knowledge of the plurality of independent sources includes not only human-computer interaction corpora with more colloquial expression, but also databases of professional medical knowledge, whose corpora have been processed by professional doctors, are clearer, and contain a large number of professional terms. The medical knowledge of each source is independent of the others, and the medical knowledge of each source has corresponding disease types. The medical knowledge of the plurality of independent sources may include a disease knowledge base, inpatient medical record data, an internet disease encyclopedia knowledge base, and the like, which is not particularly limited in the embodiment of the present invention.

In addition, the correlation between the disease description text and the medical knowledge of each of the plurality of independent sources characterizes how closely the information described in the disease description text relates to the medical information of that source. The higher the correlation, the more similar the corpus of the disease description text is to the corpus of the medical knowledge of the corresponding source, so the more reliable it is to judge the disease type represented by the disease description text based on the medical knowledge of that source, and the more accurate the resulting disease type.

Further, the process of determining the disease type corresponding to the disease description text in step 120 based on the correlation between the disease description text and the medical knowledge of the plurality of independent sources, respectively, may be implemented by a text analysis model. The text analysis model may also be trained in advance before executing step 120, specifically by training as follows: firstly, collecting a plurality of medical knowledge samples from independent sources, and mixing the medical knowledge samples from the independent sources to obtain a mixed sample from the independent sources. Then, training the initial model based on the medical knowledge samples from the independent sources, the mixed samples from the independent sources and the disease types corresponding to the mixed samples, so as to obtain a text analysis model.
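The sample preparation described above can be illustrated with a minimal sketch. It only shows how per-source samples might be kept separate while a mixed sample set is pooled from them; the function name `build_training_sets`, the source names, and the example texts are hypothetical, and the model training itself is not shown.

```python
import random

def build_training_sets(samples_by_source, seed=0):
    """Return (per-source samples, mixed samples).

    samples_by_source maps a source name to a list of (text, disease_type)
    pairs. The mixed set pools every sample from every source and shuffles
    them, so that a generic encoder could be trained on the mixture while
    each per-source encoder is trained on its own source only.
    """
    mixed = [
        (text, disease_type)
        for samples in samples_by_source.values()
        for text, disease_type in samples
    ]
    random.Random(seed).shuffle(mixed)  # fixed seed for reproducibility
    return samples_by_source, mixed

# Hypothetical toy data; real samples would come from the independent sources.
samples_by_source = {
    "disease_knowledge_base": [("persistent elevated blood pressure", "hypertension")],
    "inpatient_records": [("admitted with dizziness for 10 days", "hypertension")],
    "inquiry_dialogue": [("my head feels heavy and I get dizzy", "hypertension")],
}
per_source, mixed = build_training_sets(samples_by_source)
```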

According to the text analysis method provided by the embodiment of the invention, the medical knowledge of each source can be fused, based on the correlation between the disease description text and the medical knowledge of each of a plurality of independent sources, to determine the disease type corresponding to the disease description text. Compared with the traditional methods, in which an end-to-end model cannot accurately identify the disease type of a colloquially expressed disease description text and a pipeline propagates the errors of its modules, combining the correlations between the disease description text and the medical knowledge of each source allows the disease type corresponding to the disease description text to be determined accurately and improves the identification rate of disease types.

It should be noted that the method provided by the embodiment of the invention takes the disease description text, not the patient, as its object, and obtains the disease type corresponding to the disease description text. In addition, the method aims to analyze the disease type corresponding to the disease description text so that the corresponding department can be determined quickly and the patient can book an online appointment with a doctor of that department; it does not aim to directly obtain a disease diagnosis result or a health condition. Therefore, the method provided by the embodiment of the invention is not a disease diagnosis method.

Based on the above embodiment, as shown in fig. 2, step 120 includes:

step 121, determining a disease representation of the disease description text based on correlations between the disease description text and medical knowledge of the plurality of independent sources, respectively, and text representations of the disease description text under the respective independent sources;

step 122, determining the disease type corresponding to the disease description text based on the disease representation.

Specifically, the correlation between the disease description text and the medical knowledge of a plurality of independent sources respectively characterizes the correlation between the information described in the disease description text and the medical information corresponding to the medical knowledge of each source, and meanwhile, the text representation of the disease description text under each independent source retains the independent information of the medical knowledge of each source, so that the disease representation of the disease description text determined based on the two information is fused with both the independent information of the medical knowledge of each source and the correlation information between the disease description text and the medical knowledge of each source. Wherein, the text representation of the disease description text under each independent source can be determined based on the medical knowledge under each independent source, and is used for describing the disease description text in the expression style of the text under each independent source, which is not particularly limited by the embodiment of the present invention. For example, correlations between the disease description text and medical knowledge from a plurality of independent sources, respectively, may be used as weights, text representations of the disease description text under the respective independent sources may be weighted and summed, and the result of the summation may be used as a disease representation of the disease description text.

After the disease representation of the disease description text is determined, it may be compared with the disease representation corresponding to each candidate disease; if the similarity between the disease representation corresponding to any candidate disease and the disease representation of the disease description text exceeds a threshold, the disease type corresponding to that candidate disease may be used as the disease type of the disease description text.
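The threshold comparison described above can be sketched as follows. The text does not name a similarity measure, so cosine similarity is used here purely as an assumption; the function names, the threshold value of 0.8, and the toy vectors are likewise hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def match_disease_types(text_repr, candidate_reprs, threshold=0.8):
    """Return every candidate whose representation is similar enough."""
    return [
        name
        for name, candidate in candidate_reprs.items()
        if cosine_similarity(text_repr, candidate) > threshold
    ]

# Hypothetical 2-dimensional representations for illustration only.
candidates = {"hypertension": [1.0, 0.1], "migraine": [0.0, 1.0]}
matched = match_disease_types([1.0, 0.0], candidates)
```

Because more than one candidate can exceed the threshold, the sketch returns a list rather than a single disease type.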

According to the text analysis method provided by the embodiment of the invention, based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources and the text representation of the disease description text under each independent source, the disease representation of the disease description text is determined, so that the disease representation of the disease description text is fused with the independent information of the medical knowledge of each source and the correlation information between the disease description text and the medical knowledge of each source, and the disease type corresponding to the disease description text can be determined more accurately.

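Based on any of the above embodiments, step 121 further includes:

determining a text representation of the disease description text under each independent source based on the independent text encoding rule of each independent source, the independent text encoding rule being determined based on medical knowledge under the corresponding independent source;

and performing self-attention calculation on the text representations under each independent source to obtain the correlation between the disease description text and medical knowledge of the plurality of independent sources respectively.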

In particular, the textual representation of the disease description text under each individual source refers to representing the disease description text with medical knowledge under each individual source, such that the textual representation may characterize the individual relevance of the disease description text to the medical knowledge of each source. The independent text coding rule can be obtained by adaptively optimizing the expression style of the medical knowledge text under the corresponding source on the basis of the universal coding rule.

For example, when determining the text representation of the disease description text under each independent source, the words of the disease description text are first mapped through the vocabulary of each independent source in turn to obtain a word vector for each word; then, on the basis of the general coding rule and in combination with the expression style of the medical knowledge text under the corresponding source, a vector representation of the disease description text under each independent source is obtained, that is, the text representation of the disease description text under each independent source. For example, a BERT network obtained by transfer learning on the medical knowledge texts under each independent source can be used to determine the text representation of the disease description text under that source. If the text length is limited to 128 words, the input ID vector x has dimension (1, 128); the hidden layer dimension of a 12-layer BERT is 768, so the hidden layer vector ht of the disease description text under each independent source, obtained after passing through the BERT network, has dimension (128, 768). The text representation of the disease description text under each independent source is finally obtained as a vector of dimension n, one for each source i. Assuming that the number of independent sources is 5, i = 0, 1, …, 4, and n is the hidden layer vector dimension; if a 12-layer BERT is used, n = 768.

Since the text representations of the disease description texts under each independent source are independent from each other, in order to improve the recognition rate of the disease type, it is also necessary to obtain the correlation between the text representations of the disease description texts under each independent source, i.e. the correlation between the disease description texts and the medical knowledge of a plurality of independent sources, respectively. Thus, performing a self-attention calculation on the textual representations under each of the independent sources may yield correlations between the disease description text and the medical knowledge of the plurality of independent sources, respectively.

For example, taking medical knowledge of 5 independent sources as an example, the disease description text is encoded under each independent source according to the independent text encoding rule of that source to obtain one vector of dimension n per source, where i = 0, 1, …, 4. The vectors are then concatenated to obtain a [5*n] vector matrix X. Self-attention calculation is then performed on the vector matrix X to obtain a 5*1 weight vector $A = [a_0\ a_1\ \dots\ a_4]$, which gives the correlation between the disease description text and the medical knowledge of each of the 5 independent sources.
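The text does not specify how the self-attention output is reduced to one weight per source, so the following is only one possible sketch under stated assumptions: scaled dot-product attention is computed between the rows of X, and the weight of source j is taken as the average attention that all sources pay to it. The function names and the toy matrix are hypothetical, and no learned projection matrices are included.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def source_attention_weights(X):
    """Reduce self-attention over the rows of X to one weight per row.

    X is a list of k per-source vectors, each of dimension n.
    attention[i][j] is how strongly source i attends to source j; the
    returned weight for source j is the mean of column j (an assumed
    pooling choice), so the k weights sum to 1.
    """
    k, n = len(X), len(X[0])
    scale = math.sqrt(n)
    attention = [
        softmax([sum(a * b for a, b in zip(x_i, x_j)) / scale for x_j in X])
        for x_i in X
    ]
    return [sum(attention[i][j] for i in range(k)) / k for j in range(k)]

# Hypothetical 5 x 3 matrix standing in for the [5*n] matrix X.
X = [
    [0.2, 0.1, 0.0],
    [0.9, 0.8, 0.7],
    [0.1, 0.0, 0.3],
    [0.5, 0.5, 0.5],
    [0.0, 0.2, 0.1],
]
A = source_attention_weights(X)
```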

According to the text analysis method provided by the embodiment of the invention, the text representation of the disease description text under each independent source is determined based on the independent text coding rule of each independent source, so that the text representation can represent the independent correlation degree of the disease description text and the medical knowledge of each source, and then the text representation under each independent source is subjected to self-attention calculation to obtain the correlation between the disease description text and the medical knowledge of a plurality of independent sources, so that the disease representation of the disease description text is fused with the independent information of the medical knowledge of each source and the correlation information of the medical knowledge of each source, and the disease type corresponding to the disease description text is accurately identified.

Based on any of the above embodiments, as shown in fig. 3, step 121 includes:

step 1211, taking the correlation between the disease description text and the medical knowledge of each independent source as a weight, and carrying out weighted summation on text representations of the disease description text under each independent source to obtain a first disease representation;

step 1212, determining a second disease representation of the disease description text based on a generic text encoding rule, the generic text encoding rule being determined based on medical knowledge of the mixed plurality of independent sources;

step 1213, determining a disease representation of the disease description text based on the first disease representation and the second disease representation.

Specifically, the correlation between the disease description text and the medical knowledge of each independent source is used as a weight, and the text representations of the disease description text under each independent source are weighted and summed to obtain a first disease representation, so that the first disease representation can represent independent information of the disease description text under each independent source. Because the universal text encoding rule is determined based on the medical knowledge of the mixed multiple independent sources, the second disease representation determined based on the universal text encoding rule can reflect the disease characteristics characterized by the disease description text in combination with the correlation between the medical knowledge of the sources, so that the disease representation of the disease description text determined based on the first disease representation and the second disease representation is fused with both the independent information of the medical knowledge of the sources and the correlation information between the medical knowledge of the sources.

For example, based on the above embodiment, the matrix X is multiplied by the weight vector A to obtain the first disease representation $U_1 = X * A^{T}$, that is, the weighted sum of the per-source text representations. At the same time, the second disease representation $U_2$ is obtained through BERT encoding based on medical knowledge that mixes the plurality of independent sources. The final input vector representation $U = [U_1; U_2]$ is obtained by concatenation, which is the disease representation of the disease description text.
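The weighted summation and concatenation described above can be sketched as follows, assuming X is stored as one row per source and A holds one weight per source. The function names and the toy values are hypothetical, and the second representation is simply passed in as a given vector rather than produced by an encoder.

```python
def weighted_sum(X, A):
    """First disease representation: sum over sources of a_i * x_i."""
    n = len(X[0])
    return [sum(a * row[d] for a, row in zip(A, X)) for d in range(n)]

def fuse_representations(X, A, U2):
    """Concatenate the weighted sum U1 with the generic representation U2."""
    U1 = weighted_sum(X, A)
    return U1 + list(U2)

# Hypothetical values: two sources, two dimensions each.
X = [[1.0, 0.0], [0.0, 1.0]]
A = [0.25, 0.75]
U2 = [9.0, 9.0]
U = fuse_representations(X, A, U2)
```

With per-source vectors of dimension n and a generic vector of dimension n, the fused representation has dimension 2n.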

According to the text analysis method provided by the embodiment of the invention, the text representations of the disease description text under the independent sources are weighted and summed, the obtained first disease representation can represent the independent information of the disease description text under the independent sources, and the second disease representation determined based on the common coding rule can be combined with the correlation between the medical knowledge of the independent sources to reflect the disease characteristics represented by the disease description text, so that the disease representation of the disease description text is fused with the independent information of the medical knowledge of the independent sources and the correlation information between the medical knowledge of the disease description text and the medical knowledge of the independent sources.

Based on any of the above embodiments, as shown in fig. 4, step 122 includes:

step 1221, determining a correlation between the disease description text and each candidate disease type based on the candidate disease representations of each candidate disease type and the disease representations;

step 1222, determining the disease type corresponding to the disease description text based on the correlation between the disease description text and each candidate disease type, and each candidate disease representation.

Specifically, the correlation between the disease description text and each candidate disease type is used to characterize the weight of the disease type corresponding to the disease description text in each candidate disease type, and the greater the correlation, the higher the probability that the disease type corresponding to the disease description text is identical to the candidate disease type. Based on the correlation between the disease description text and each candidate disease type, and the representation of each candidate disease, a score of the disease description text and each candidate disease type may be obtained, where the higher the score value, the greater the probability that the corresponding candidate disease type is the disease type corresponding to the disease description text, for example, a preset number of candidate disease types with a greater score value may be selected as the disease types of the disease description text, and the candidate disease type with a score value greater than a threshold may also be selected as the disease type of the disease description text.
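The first selection option mentioned above, keeping a preset number of the highest-scoring candidates, can be sketched as follows. The function name `select_top_k` and the example scores are hypothetical.

```python
def select_top_k(scores, k):
    """Return the k candidate disease types with the highest scores.

    scores maps a candidate disease type to its score; ties keep the
    insertion order of the mapping because Python's sort is stable.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical scores for three candidate disease types.
scores = {"hypertension": 0.9, "migraine": 0.4, "anemia": 0.7}
top_two = select_top_k(scores, 2)
```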

For example, the disease representation of the candidate disease types may be represented by an encoding. Assuming there are 100 diseases, the training target vector has 100 dimensions, each dimension representing one disease. For example, for the description "headache and dizziness for 10 days", the corresponding diagnosis is likely to be hypertension and posterior circulation ischemia, so the training target vector is 1 at the positions of hypertension and posterior circulation ischemia; that is, the disease representation of the candidate disease types is expressed as a vector Y. After Attention calculation is performed on the disease representation U, the correlation between the disease description text and each candidate disease type is obtained, namely a correlation weight matrix $V = [v_0\ v_1\ \dots\ v_n]$. The score vector of the disease description text against the candidate disease types is then obtained as $O = V \times Y^{T}$. Each component is converted to a weight through a sigmoid, and the weight is compared with a threshold; if it is larger than the threshold, the corresponding candidate disease type is taken as a disease type of the disease description text.
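The target encoding and the sigmoid thresholding described above can be illustrated with a small sketch. It uses 3 diseases instead of 100 for brevity; the function names, the raw scores, and the threshold of 0.5 are hypothetical, and the attention step that would produce the scores is not shown.

```python
import math

def multi_hot(labels, disease_index):
    """Training target: 1 at the position of each labeled disease, else 0."""
    target = [0] * len(disease_index)
    for label in labels:
        target[disease_index[label]] = 1
    return target

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_disease_types(scores, disease_names, threshold=0.5):
    """Keep each disease whose sigmoid-transformed score exceeds the threshold."""
    return [
        name
        for name, score in zip(disease_names, scores)
        if sigmoid(score) > threshold
    ]

disease_names = ["hypertension", "posterior circulation ischemia", "migraine"]
disease_index = {name: i for i, name in enumerate(disease_names)}

# Target for "headache and dizziness for 10 days" as in the example above.
target = multi_hot(["hypertension", "posterior circulation ischemia"], disease_index)

# Hypothetical raw scores for the three diseases.
predicted = decode_disease_types([2.0, 0.3, -1.5], disease_names)
```

Applying the sigmoid independently per component lets several disease types be selected for one text, which matches the two-diagnosis example.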

Based on any of the above embodiments, as shown in fig. 5, step 110 includes:

step 111, determining an initial disease description text to be analyzed;

step 112, performing sequence labeling on the initial disease description text, and determining objective description information in the initial disease description text;

step 113, extracting text from the initial disease description text, and determining subjective description information in the initial disease description text;

step 114, determining the disease description text to be analyzed based on the objective description information and the subjective description information.

Specifically, the initial disease description text is the patient's own spoken description of his or her condition, and most patients focus on their subjective feelings when giving it. For example, for foot twitching, the initial disease description text may be "my foot moves involuntarily and shakes on and off", whereas the professional medical description is "foot twitching".

Therefore, in order to further improve the efficiency of disease type recognition, the key information, such as symptoms, signs, causes and past history, needs to be obtained from the initial disease description text. The key information can be divided into objective description information and subjective description information. Objective description information is normalized information whose corpus is convergent; for example, for the past-history item of whether a drug allergy exists, the description is usually one of two normalized forms, "present" or "absent". Subjective description information is information that centers on the patient's subjective feelings and whose corpus is divergent; for example, for the symptom of foot twitching, many different spoken descriptions may exist.

Because the objective description information is relatively regular, it can be obtained by performing sequence labeling on the initial disease description text. Because the subjective description information usually centers on the patient's own feelings and its corpus is more divergent, it can be obtained by performing text extraction on the initial disease description text. Once the objective description information and the subjective description information have been extracted, the key information in the initial disease description text has been obtained, so the objective description information and the subjective description information can together be used as the disease description text, and the corresponding disease type can be analyzed quickly from this key information.

For example, causes and past history are generally described fairly regularly in the initial disease description text, so they can be treated as objective description information and obtained directly by sequence labeling. Symptoms, by contrast, are described mainly in terms of the patient's subjective feelings because the patient lacks medical background knowledge, so they can be treated as subjective description information and obtained through an end-to-end model, which learns the correlation between symptoms and patient expressions and outputs the medically standard symptoms.

It should be noted that medical knowledge data from a plurality of independent sources differ in emphasis, so their data distributions differ. Therefore, in order to make better use of the medical knowledge of each source and to fuse the diagnostic knowledge of the sources, the objective description information and the subjective description information in the medical knowledge of each source can be extracted by the same method. The unstructured information in each source is thereby mapped into the same knowledge system and normalized into structured text, so that the correlation between the disease description text and the medical knowledge of each source can be obtained more quickly, the efficiency of disease type recognition is improved, and the patient is helped to determine the corresponding department quickly.
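As a rough illustration of how the two kinds of information might be combined, the sketch below decodes sequence-labeling output into objective spans and joins them with subjective symptoms. The BIO tagging scheme, the `HISTORY` label, the tags themselves and the function names `decode_bio` and `build_description_text` are assumptions made for this example; the source specifies neither the tag scheme nor the output format, and the subjective symptom list stands in for the output of an extraction model.

```python
def decode_bio(tokens, tags):
    """Collect (label, text) spans from BIO tags produced by a sequence labeler."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag closes any open span.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans

def build_description_text(objective_spans, subjective_symptoms):
    """Join objective and subjective information into one structured text."""
    fields = [f"{label}: {text}" for label, text in objective_spans]
    fields.append("symptoms: " + ", ".join(subjective_symptoms))
    return "; ".join(fields)

# Hypothetical labeler output for a spoken description.
tokens = ["no", "drug", "allergy", ",", "foot", "keeps", "shaking"]
tags = ["B-HISTORY", "I-HISTORY", "I-HISTORY", "O", "O", "O", "O"]

objective = decode_bio(tokens, tags)
subjective = ["foot twitching"]  # stand-in for the extraction model's output
print(build_description_text(objective, subjective))
```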

Based on any of the above embodiments, the plurality of independent sources of medical knowledge includes at least two of human-machine interaction inquiry knowledge, outpatient medical record data, inpatient medical record data, disease knowledge base, and internet disease encyclopedia knowledge.

Specifically, the existing medical field contains medical knowledge from a plurality of independent sources, including human-computer interaction inquiry knowledge, which consists largely of spoken expressions, as well as outpatient medical record data, inpatient medical record data, disease knowledge bases and internet disease encyclopedia knowledge, which have been organized by professional doctors. By fusing medical knowledge from a plurality of independent sources, the embodiment of the invention can not only accurately identify the disease type corresponding to the disease description text, but also avoid the large-scale manual labeling that machine learning on spoken disease description text would otherwise require, thereby improving the efficiency of disease type recognition.

The human-computer interaction inquiry knowledge is characterized by many spoken expressions, a divergent rather than convergent corpus, meaning that varies from one expression to the next, and little medical expertise. The outpatient medical record data, inpatient medical record data, disease knowledge base and internet disease encyclopedia knowledge are medical corpora organized by professional doctors; these corpora are clearer and contain more professional terms.

Based on any one of the above embodiments, the present invention provides a text analysis method, including:

inputting the disease description text to be analyzed into the text analysis model to obtain the disease type corresponding to the disease description text output by the text analysis model. The text analysis model is trained based on medical knowledge of a plurality of independent sources, mixed disease description text and the corresponding candidate disease types. The medical knowledge of the plurality of independent sources comprises human-computer interaction inquiry knowledge, outpatient medical record data, inpatient medical record data, a disease knowledge base and internet disease encyclopedia knowledge, and the mixed disease description text is obtained by mixing data from the medical knowledge of the plurality of independent sources.

As shown in fig. 6, the text analysis model may include an input layer, an encoding layer, a fusion layer and an output layer. During training, the structured medical knowledge of the plurality of independent sources, the mixed disease description text and the corresponding candidate disease types are input into the text analysis model. The encoding layer encodes the medical knowledge of each independent source and the mixed disease description text separately using a BERT network. The fusion layer determines the correlation between the mixed disease description text and the medical knowledge of each independent source based on a gating mechanism (Gate), performs a weighted summation over the text representations of the mixed disease description text under each independent source to obtain a first mixed disease representation, and splices the first mixed disease representation with a second mixed disease representation to determine the disease representation of the mixed disease description text. Attention calculation is then performed on the disease representation of the mixed disease description text to determine the correlation between the mixed disease description text and each candidate disease type, from which the disease type corresponding to the mixed disease description text is determined. After model training is completed, the disease description text is input at the position of the mixed disease description text in the input layer, and the output layer outputs the disease type corresponding to the disease description text.
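The fusion step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the model of fig. 6: the gate is approximated by a softmax over dot products with a hypothetical learned vector `gate_vector`, the representations are tiny hand-written lists rather than BERT outputs, and the names `softmax`, `dot` and `fuse` are introduced here only for the example.

```python
import math

def softmax(xs):
    """Normalize scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(source_reprs, generic_repr, gate_vector):
    """Weight per-source representations by gate scores, then splice with the generic one."""
    # Correlation of the text with each source (assumed form of the gate).
    weights = softmax([dot(r, gate_vector) for r in source_reprs])
    dim = len(source_reprs[0])
    # First representation: weighted sum over the per-source text representations.
    first = [sum(w * r[i] for w, r in zip(weights, source_reprs)) for i in range(dim)]
    # Disease representation: first representation spliced with the second one.
    return weights, first + list(generic_repr)

# Two hypothetical sources with 2-dimensional text representations.
source_reprs = [[1.0, 0.0], [0.0, 1.0]]
generic_repr = [0.3, 0.7]   # second representation from the generic encoder
gate_vector = [0.0, 0.0]    # untrained gate: every source gets the same weight

weights, disease_repr = fuse(source_reprs, generic_repr, gate_vector)
print(weights, disease_repr)
```

With a zero gate vector every source receives equal weight; a trained gate would shift weight toward the sources whose knowledge best matches the input text.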

The loss function of the text analysis model is the cross-entropy loss, in which, for a single training sample, $m = 0, 1, \ldots, M$ indexes the M candidate disease types corresponding to the training sample, $y_m$ represents the m-th candidate disease type corresponding to sample $X$, and $p(w_m)$ represents the correlation between the training sample and the m-th candidate disease type. It will be appreciated that the loss function of the text analysis model may be optimized using the BP (back-propagation) algorithm, the Adam method, and the like.
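The loss formula itself is not reproduced in this text. Assuming the standard cross-entropy form and the symbols defined for a single training sample, it would plausibly read as follows; this is a reconstruction from the stated definitions, not the formula as filed:

```latex
\[
  L = -\sum_{m} y_m \,\log p(w_m)
\]
```

Because each score component is passed through a sigmoid and compared with a threshold, a per-label binary cross-entropy, $L = -\sum_{m} \left[ y_m \log p(w_m) + (1 - y_m) \log\bigl(1 - p(w_m)\bigr) \right]$, would also be consistent with the description.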

The text analysis device provided by the present invention is described below; the description of the device below and the description of the text analysis method above may be cross-referenced.

Based on any of the above embodiments, the present invention further provides a text analysis device, as shown in fig. 7, including:

a text determination unit 710 for determining a disease description text to be analyzed;

the text analysis unit 720 is configured to determine a disease type corresponding to the disease description text based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively.

Based on any of the above embodiments, the text analysis unit 720 includes:

a disease representation unit for determining a disease representation of the disease description text based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively, and text representations of the disease description text under each independent source;

a disease determining unit for determining the disease type corresponding to the disease description text based on the disease representation.

Based on any of the above embodiments, further comprising:

a text representation unit for determining, before the disease representation of the disease description text is determined based on the correlations between the disease description text and the medical knowledge of the plurality of independent sources and on the text representations of the disease description text under each independent source, the text representation of the disease description text under each independent source based on the independent text encoding rule of that source, the independent text encoding rule being determined based on the medical knowledge under the corresponding independent source;

a self-attention unit for performing self-attention calculation on the text representations under each independent source to obtain the correlations between the disease description text and the medical knowledge of the plurality of independent sources.

Based on any of the above embodiments, the disease representation unit comprises:

the first disease representation unit is used for taking the correlation between the disease description text and the medical knowledge of each independent source as a weight, and carrying out weighted summation on text representations of the disease description text under each independent source to obtain a first disease representation;

a second disease representation unit for determining a second disease representation of the disease description text based on a generic text encoding rule, the generic text encoding rule being determined based on the mixed medical knowledge of the plurality of independent sources;

a third disease representation unit for determining a disease representation of the disease description text based on the first and second disease representations.

Based on any one of the above embodiments, the disease determination unit includes:

a correlation determination unit for determining a correlation between the disease description text and each candidate disease type based on candidate disease representations of each candidate disease type and the disease representations;

and the type determining unit is used for determining the disease type corresponding to the disease description text based on the correlation between the disease description text and each candidate disease type and each candidate disease representation.

Based on any of the above embodiments, the text determining unit 710 includes:

an initial text determining unit, configured to determine an initial disease description text to be analyzed;

the objective information determining unit is used for carrying out sequence labeling on the initial disease description text and determining objective description information in the initial disease description text;

the subjective information determining unit is used for performing text extraction on the initial disease description text and determining subjective description information in the initial disease description text;

and the descriptive text determining unit is used for determining the disease descriptive text to be analyzed based on the objective descriptive information and the subjective descriptive information.

Fig. 8 is a schematic structural diagram of an electronic device according to the present invention, as shown in fig. 8, the electronic device may include: a processor 810, a memory 820, a communication interface 830 and a communication bus 840, wherein the processor 810, the memory 820 and the communication interface 830 perform communication with each other through the communication bus 840. Processor 810 may invoke logic instructions in memory 820 to perform a text analysis method comprising: determining a disease description text to be analyzed; and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and medical knowledge of a plurality of independent sources respectively.

Further, the logic instructions in the memory 820 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the text analysis method provided by the above methods, the method comprising: determining a disease description text to be analyzed; and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and medical knowledge of a plurality of independent sources respectively.

In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the text analysis method provided above, the method comprising: determining a disease description text to be analyzed; and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and medical knowledge of a plurality of independent sources respectively.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of text analysis, comprising:

determining a disease description text to be analyzed;

determining a disease type corresponding to the disease description text based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively;

the determining the disease type corresponding to the disease description text based on the correlation between the disease description text and medical knowledge of a plurality of independent sources respectively comprises the following steps:

determining a disease representation of the disease description text based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively, and textual representations of the disease description text under each independent source; the text representation of the disease description text under each independent source refers to the use of medical knowledge under each independent source to represent the disease description text;

determining a disease type corresponding to the disease description text based on the disease representation;

said determining a disease representation of said disease description text based on correlations between said disease description text and medical knowledge of a plurality of independent sources, respectively, and textual representations of said disease description text under each independent source, comprising:

2. The text analysis method of claim 1, wherein the determining a disease representation of the disease description text based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively, and textual representations of the disease description text under each independent source, further comprises:

3. The text analysis method according to claim 1, wherein the determining a disease type corresponding to the disease description text based on the disease representation includes:

4. A text analysis method according to any one of claims 1 to 3, wherein the determining of the disease description text to be analyzed comprises:

determining an initial disease description text to be analyzed;

5. A text analysis method according to any one of claims 1 to 3, wherein the medical knowledge of the plurality of independent sources includes at least two of human-computer interaction inquiry knowledge, outpatient medical record data, inpatient medical record data, a disease knowledge base, and internet disease encyclopedia knowledge.

6. A text analysis device, comprising:

a text analysis unit, configured to determine a disease type corresponding to the disease description text based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively;

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the text analysis method according to any one of claims 1 to 5.

8. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the text analysis method according to any one of claims 1 to 5.