CN113051373A - Text analysis method and device, electronic equipment and storage medium

Info

Publication number: CN113051373A
Application number: CN202110420438.5A
Authority: CN
Inventors: 甘露; 胡加学; 赵景鹤; 贺志阳
Original assignee: Anhui Iflytek Medical Information Technology Co ltd
Current assignee: Anhui Iflytek Medical Information Technology Co ltd
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2021-06-29
Anticipated expiration: 2041-04-19
Also published as: CN113051373B

Abstract

The invention provides a text analysis method, a text analysis device, electronic equipment and a storage medium. The method comprises the following steps: determining a disease description text to be analyzed; and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge of each of a plurality of independent sources. Because the method relies on these correlations, the disease type corresponding to the disease description text can be determined by fusing the medical knowledge of each source. Compared with the traditional method, in which an end-to-end model cannot accurately identify the disease type corresponding to a disease description text containing many spoken-language expressions, the method can accurately determine the disease type corresponding to the disease description text and improves the identification rate of disease types.

Description

Text analysis method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a text analysis method, an apparatus, an electronic device, and a storage medium.

Background

With the popularization of the internet, a patient can book an appointment online with a doctor of the department corresponding to the patient's disease type. To determine the corresponding department accurately, the disease type of the patient must be evaluated accurately.

At present, the disease type of a patient is determined by acquiring a disease description text of the patient and analyzing the text with an end-to-end model. However, the patient's disease description text consists mainly of spoken expressions, so an end-to-end model trained on professional medical samples cannot accurately determine the corresponding disease type from such colloquially expressed text.

Disclosure of Invention

The invention provides a text analysis method, a text analysis device, electronic equipment and a storage medium, which are used for solving the defect that the disease type corresponding to a disease description text cannot be accurately determined in the prior art.

The invention provides a text analysis method, which comprises the following steps:

determining a disease description text to be analyzed;

and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources respectively.

According to a text analysis method provided by the invention, the determining of the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources respectively comprises:

determining a disease representation of the disease description text based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively, and text representations of the disease description text under the respective independent sources;

and determining the disease type corresponding to the disease description text based on the disease representation.

According to a text analysis method provided by the present invention, before determining a disease representation of a disease description text based on correlations between the disease description text and medical knowledge from a plurality of independent sources, and text representations of the disease description text from the independent sources, the method further includes:

determining a text representation of the disease description text under each independent source based on independent text encoding rules of each independent source, the independent text encoding rules being determined based on medical knowledge under the corresponding independent source;

and performing self-attention calculation on the text representation under each independent source to obtain the correlation between the disease description text and the medical knowledge of a plurality of independent sources.

According to a text analysis method provided by the invention, the determining of the disease representation of the disease description text based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources respectively and the text representation of the disease description text under each independent source comprises the following steps:

taking the correlation between the disease description text and the medical knowledge of each independent source as a weight, and performing weighted summation on text representations of the disease description text under each independent source to obtain a first disease representation;

determining a second disease representation of the disease description text based on a universal text encoding rule, the universal text encoding rule being determined based on medical knowledge blending the plurality of independent sources;

determining a disease representation of the disease description text based on the first disease representation and the second disease representation.

According to a text analysis method provided by the present invention, the determining a disease type corresponding to the disease description text based on the disease representation includes:

determining a correlation between the disease description text and each candidate disease type based on a candidate disease representation for each candidate disease type and the disease representation;

and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and each candidate disease type and each candidate disease representation.

According to a text analysis method provided by the invention, the determining of the disease description text to be analyzed comprises the following steps:

determining an initial disease description text to be analyzed;

carrying out sequence annotation on the initial disease description text, and determining objective description information in the initial disease description text;

extracting texts from the initial disease description texts, and determining subjective description information in the initial disease description texts;

and determining the disease description text to be analyzed based on the objective description information and the subjective description information.

According to the text analysis method provided by the invention, the medical knowledge of multiple independent sources comprises at least two of man-machine interaction inquiry knowledge, outpatient medical record data, inpatient medical record data, disease knowledge base and internet disease encyclopedia knowledge.

The present invention also provides a text analysis apparatus, comprising:

the text determining unit is used for determining a disease description text to be analyzed;

and the text analysis unit is used for determining the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources respectively.

The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of any one of the text analysis methods.

The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the text analysis method as described in any one of the above.

The text analysis method, the text analysis device, the electronic equipment and the storage medium provided by the invention are based on the correlation between the disease description text and the medical knowledge of each of a plurality of independent sources, so that the disease type corresponding to the disease description text can be determined by fusing the medical knowledge of each source. Compared with the traditional method, in which an end-to-end model cannot accurately identify the disease type corresponding to a disease description text containing many spoken expressions, the invention can accurately determine the disease type corresponding to the disease description text by combining the correlation between the disease description text and the medical knowledge of each source, and improves the identification rate of disease types.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a text analysis method provided by the present invention;

FIG. 2 is a schematic flow chart of a disease type acquisition method provided by the present invention;

FIG. 3 is a schematic flow diagram of a disease representation determination method provided by the present invention;

FIG. 4 is a schematic flow chart of another disease type acquiring method provided by the present invention;

FIG. 5 is a schematic flow chart of a method for acquiring a disease description text according to the present invention;

FIG. 6 is a schematic flow chart of a text analysis model training method provided by the present invention;

FIG. 7 is a schematic structural diagram of a text analysis apparatus provided in the present invention;

FIG. 8 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the traditional method, the disease type of the patient is mostly determined by acquiring the disease description text of the patient and analyzing the text with an end-to-end model. However, the disease description text of the patient is mainly expressed in spoken language, so an end-to-end model trained on professional medical samples cannot accurately determine the corresponding disease type from it. Traditional methods also take the disease description text of the patient as input and identify the disease type of the patient with a task-oriented dialog system (pipeline). Because the modules in the pipeline are connected in series, the errors of each upstream module propagate to the final output module and reduce the identification accuracy.

In view of the above, the present invention provides a text analysis method. Fig. 1 is a schematic flow chart of a text analysis method provided by the present invention, and as shown in fig. 1, the method includes the following steps:

step 110, determining a disease description text to be analyzed;

and step 120, determining the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge of the independent sources respectively.

In particular, the disease description text is used to describe the state of a patient's disease, such as the patient's symptoms, signs, causes, past history, and the like. The disease description text may be an electronic text, an OCR text which is obtained by performing optical character recognition on a paper text, or a text which is obtained by collating a recording of a patient, which is not specifically limited in the embodiment of the present invention.

Generally, most disease description texts are obtained from patients' oral descriptions of their own conditions. That is, the disease description text consists mainly of spoken expressions: it contains a large number of divergent colloquial descriptions and few professional terms. For example, a patient whose lungs are uncomfortable but who has no professional medical knowledge may describe the condition as "itchy lungs"; medically, however, there is no such thing as "itchy" lungs, and the professional description should be "lung discomfort".

Because the disease description text consists mainly of spoken expressions and contains few professional terms, an end-to-end model trained on professional medical samples, as used in the traditional method, cannot identify the corresponding medical knowledge in the colloquial text and therefore cannot accurately determine the corresponding disease type. If machine learning were instead performed on spoken disease description texts alone, a large number of spoken samples would need to be collected for training. This requires extensive manual labeling, and spoken corpora are highly varied: the same meaning may be expressed in many different and arbitrary ways. In practice, it is therefore difficult to collect enough spoken samples for training.

Therefore, the embodiment of the invention fuses a large amount of medical knowledge from a plurality of standardized independent sources in the existing medical field to determine the disease type corresponding to the disease description text. The medical knowledge from multiple independent sources comprises a man-machine interaction corpus with many spoken expressions and databases of professional medical knowledge, where the corpus of a professional medical knowledge database is compiled by professional doctors, is clearly organized, and contains a large number of professional terms. The medical knowledge of each source is independent, and the medical knowledge of each source has corresponding disease types. The medical knowledge from multiple independent sources may include a disease knowledge base, inpatient medical record data, an internet disease encyclopedia knowledge base, and the like, which is not particularly limited in this embodiment of the present invention.

In addition, the correlation between the disease description text and the medical knowledge of each independent source characterizes how closely the information described in the disease description text relates to the medical information in that source. The higher the correlation, the more similar the corpus of the disease description text is to the corpus of that source's medical knowledge, the more reliable it is to judge the disease type represented by the disease description text from that source's medical knowledge, and the more accurate the resulting disease type.

Further, the process in step 120 of determining the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge of each of the multiple independent sources may be implemented by a text analysis model. The text analysis model may be trained in advance, before step 120 is executed, in the following manner: first, medical knowledge samples of a plurality of independent sources are collected, and the medical knowledge samples of the independent sources are mixed to obtain mixed samples. Then, an initial model is trained based on the medical knowledge samples of the independent sources, the mixed samples, and the corresponding disease types, thereby obtaining the text analysis model.
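As a minimal sketch of the sample preparation described above, the following Python code keeps the medical knowledge samples of each independent source separate and also builds a mixed sample set. The function name `build_training_sets`, the source names, and the toy samples are hypothetical and are used for illustration only; the embodiment does not specify how the mixing is performed, so a simple shuffle of the pooled samples is assumed.

```python
import random

def build_training_sets(samples_by_source, seed=0):
    """Return (per-source samples, mixed samples).

    samples_by_source maps a source name to a list of
    (knowledge_text, disease_type) pairs. The mixed set pools the samples
    of all sources and shuffles them (the shuffle is an assumption).
    """
    mixed = [pair for samples in samples_by_source.values() for pair in samples]
    random.Random(seed).shuffle(mixed)
    return samples_by_source, mixed

# Hypothetical toy samples from two independent sources.
samples_by_source = {
    "outpatient_records": [
        ("dizziness, elevated blood pressure", "hypertension"),
    ],
    "disease_encyclopedia": [
        ("stomach pain after meals", "gastritis"),
        ("persistent high blood pressure", "hypertension"),
    ],
}
per_source, mixed = build_training_sets(samples_by_source)
```

The per-source sets would then be available for the source-specific encoders and the mixed set for the universal encoder described in the later embodiments.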

The text analysis method provided by the embodiment of the invention is based on the correlation between the disease description text and the medical knowledge of each of a plurality of independent sources, so that the disease type corresponding to the disease description text can be determined by fusing the medical knowledge of each source. Compared with the traditional method in which an end-to-end model cannot accurately identify the disease type corresponding to a disease description text containing many spoken expressions, and with the traditional method in which the disease type of the patient is identified by a pipeline, the embodiment of the invention combines the correlation between the disease description text and the medical knowledge of each source and can accurately determine the disease type corresponding to the disease description text, thereby improving the identification rate of disease types.

It should be noted that the method provided in the embodiment of the present invention takes the disease description text, not the patient, as its object, and obtains the disease type corresponding to that text. In addition, the method aims to analyze the disease type corresponding to the disease description text so that the corresponding department can be determined quickly and the patient can book an online appointment with a doctor of that department; it does not aim to obtain a disease diagnosis result or health condition directly. Therefore, the method provided by the embodiment of the invention does not belong to a disease diagnosis method.

Based on the above embodiment, as shown in fig. 2, step 120 includes:

step 121, determining disease representation of the disease description text based on correlation between the disease description text and medical knowledge of a plurality of independent sources respectively and text representation of the disease description text under each independent source;

and step 122, determining the disease type corresponding to the disease description text based on the disease representation.

Specifically, the correlations between the disease description text and the medical knowledge of the multiple independent sources characterize how the information described in the disease description text relates to the medical information of each source, while the text representation of the disease description text under each independent source retains the independent information of that source's medical knowledge. The disease representation determined from both therefore fuses the independent information of each source's medical knowledge with the correlation information between the disease description text and each source's medical knowledge. The text representation of the disease description text under each independent source may be determined based on the medical knowledge in that source and describes the disease description text in the expression style of that source's text, which is not specifically limited in the embodiment of the present invention. For example, the correlations between the disease description text and the medical knowledge of the multiple independent sources can be used as weights, the text representations of the disease description text under the independent sources can be weighted and summed, and the result of the summation can be used as the disease representation of the disease description text.

After determining the disease representation of the disease description text, the disease representation of the disease description text may be compared with the disease representation corresponding to each candidate disease, and if the similarity between the disease representation corresponding to any candidate disease and the disease representation of the disease description text exceeds a threshold, the disease type corresponding to the candidate disease may be used as the disease type of the disease description text.
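The threshold comparison described above can be sketched as follows. The embodiment does not name a similarity measure, so cosine similarity is assumed here; the function names, the threshold value of 0.8, and the three-dimensional toy vectors are likewise hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def match_disease_types(text_repr, candidate_reprs, threshold=0.8):
    """Return every candidate disease type whose similarity exceeds the threshold."""
    return [name for name, rep in candidate_reprs.items()
            if cosine_similarity(text_repr, rep) > threshold]

# Hypothetical candidate disease representations.
candidates = {
    "hypertension": [1.0, 0.0, 0.0],
    "gastritis": [0.0, 1.0, 0.0],
}
matched = match_disease_types([0.9, 0.1, 0.0], candidates)
```

In this toy case only the first candidate exceeds the threshold, so `matched` contains a single disease type; more than one type may be returned when several candidates are sufficiently similar.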

The text analysis method provided by the embodiment of the invention determines the disease representation of the disease description text based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources and the text representation of the disease description text under each independent source, so that the disease representation of the disease description text is fused with the independent information of the medical knowledge of each source and the correlation information between the disease description text and the medical knowledge of each source, and the disease type corresponding to the disease description text can be more accurately determined.

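Based on any of the above embodiments, before step 121, the method further includes:

determining text representations of the disease description text under each independent source based on the independent text coding rule of each independent source, wherein the independent text coding rule is determined based on the medical knowledge under the corresponding independent source;

and performing self-attention calculation on the text representation under each independent source to obtain the correlation between the disease description text and the medical knowledge of the plurality of independent sources.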

In particular, the textual representation of the disease description text under each independent source refers to representing the disease description text with medical knowledge under each independent source, such that the textual representation may characterize the independent relevance of the disease description text to the medical knowledge of each source. The independent text coding rule can be obtained by combining the expression style of the medical knowledge text under the corresponding source to perform adaptive optimization on the basis of the general coding rule.

For example, when the text representation of the disease description text under each independent source is determined, for each independent source, the word vectors of the words in the disease description text are obtained in sequence through vocabulary mapping, and the vector characterization of the disease description text under that source, i.e., its text representation under that source, is obtained by combining the expression style of the medical knowledge text under the corresponding source on the basis of a universal coding rule. For example, a BERT network obtained by transfer learning on the medical knowledge texts under each independent source can be used to determine the text representation of the disease description text under that source. If the text length is limited to 128 words, the input ID vector x has dimension (1, 128); with a 12-layer BERT whose hidden-layer dimension is 768, the hidden-layer vector ht of the disease description text under each independent source, with dimension (128, 768), is obtained after passing through the BERT network, and finally the text representation of the disease description text under each independent source is obtained as a vector representation of dimension n, denoted here as $h_i \in \mathbb{R}^{n}$.

Here, assuming that the number of independent sources is 5, i = 0, 1, …, 4, and n is the hidden-layer vector dimension; for a 12-layer BERT, n = 768.

Since the text representations of the disease description texts from the independent sources are independent from each other, in order to improve the recognition rate of the disease type, the correlations between the text representations of the disease description texts from the independent sources, that is, the correlations between the disease description texts and the medical knowledge from the independent sources, respectively, need to be obtained. Therefore, by performing self-attention calculation on the text representation under each independent source, the correlation between the disease description text and the medical knowledge of a plurality of independent sources can be obtained.

For example, taking medical knowledge from 5 independent sources, the disease description text is encoded under each independent source according to the corresponding independent text coding rule to obtain a vector for each source, denoted here as $h_i \in \mathbb{R}^{n}$, where i = 0, 1, …, 4. The vectors are then concatenated to obtain a [5 × n] vector matrix $X = [h_0; h_1; \dots; h_4]$.

Then, self-attention calculation is performed on the vector matrix X to obtain a 5 × 1 weight vector $A = [a_0\ a_1\ \dots\ a_4]$, that is, the correlations between the disease description text and the medical knowledge of the 5 independent sources.

The text analysis method provided by the embodiment of the invention determines the text representation of the disease description text under each independent source based on the independent text coding rule of each independent source, so that the text representation can represent the independent correlation degree of the disease description text and the medical knowledge of each source, and then performs self-attention calculation on the text representation under each independent source to obtain the correlation between the disease description text and the medical knowledge of a plurality of independent sources, so that the disease representation of the disease description text is fused with the independent information of the medical knowledge of each source and the correlation degree information of the medical knowledge of each source, and further accurately identifies the disease type corresponding to the disease description text.
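One possible way to obtain one weight per source from the source-specific text representations is sketched below. The embodiment states only that self-attention over the [5 × n] matrix yields a 5 × 1 weight vector; the use of scaled dot-product scores without learned projections, and the pooling of the attention map by column mean, are assumptions made for this sketch, and the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def source_attention_weights(X):
    """X holds k source vectors of length n; returns k weights that sum to 1."""
    k = len(X)
    n = len(X[0])
    scale = math.sqrt(n)
    # Scaled dot-product scores between every pair of source representations.
    scores = [[sum(a * b for a, b in zip(X[i], X[j])) / scale
               for j in range(k)] for i in range(k)]
    attn = [softmax(row) for row in scores]  # each row sums to 1
    # Assumption: reduce the k x k attention map to one weight per source
    # by averaging the attention that each source receives (column mean).
    return [sum(attn[i][j] for i in range(k)) / k for j in range(k)]

# Toy input: 5 sources, n = 4 (a real encoder would give n = 768).
X = [
    [0.2, 0.1, 0.0, 0.5],
    [0.3, 0.3, 0.1, 0.0],
    [0.0, 0.4, 0.2, 0.1],
    [0.5, 0.0, 0.1, 0.2],
    [0.1, 0.2, 0.3, 0.4],
]
A = source_attention_weights(X)
```

Because each row of the attention map sums to 1, the pooled weights also sum to 1, which makes them directly usable as weights in a weighted summation.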

Based on any of the above embodiments, as shown in fig. 3, step 121 includes:

step 1211, taking the correlation between the disease description text and the medical knowledge of each independent source as a weight, and performing weighted summation on the text representation of the disease description text under each independent source to obtain a first disease representation;

step 1212, determining a second disease representation of the disease description text based on a universal text encoding rule, the universal text encoding rule being determined based on medical knowledge blending a plurality of independent sources;

step 1213 determines a disease representation of the disease description text based on the first disease representation and the second disease representation.

Specifically, the relevance between the disease description text and the medical knowledge of each independent source is used as a weight, and the text representations of the disease description text under each independent source are subjected to weighted summation to obtain a first disease representation, so that the first disease representation can represent the independent information of the disease description text under each independent source. Since the universal text encoding rule is determined based on medical knowledge of a plurality of mixed independent sources, the second disease representation determined based on the universal text encoding rule can reflect the disease characteristics represented by the disease description text by combining the correlation between the medical knowledge of each source, so that the disease representation of the disease description text determined based on the first disease representation and the second disease representation is fused with the independent information of the medical knowledge of each source and the correlation information between the medical knowledge of each source.

For example, based on the above embodiment, the matrix X is multiplied by the weight vector A to obtain the first disease representation $U_1 = X * A^{T}$, i.e., the rows of X summed with the components of A as weights. Meanwhile, the disease description text is encoded by a BERT network trained on the medical knowledge mixed from the plurality of independent sources to obtain the second disease representation $U_2$. The two are spliced together to obtain the final input vector feature $U = [U_1; U_2]$, i.e., the disease representation of the disease description text.

The text analysis method provided by the embodiment of the invention performs weighted summation on the text representations of the disease description texts under each independent source to obtain the first disease representation capable of representing the independent information of the disease description texts under each independent source, and the second disease representation determined based on the universal coding rule can reflect the disease characteristics represented by the disease description texts in combination with the correlation between the medical knowledge of each source, so that the disease representation of the disease description texts is fused with the independent information of the medical knowledge of each source and the correlation information between the medical knowledge of each source.
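The weighted summation and splicing can be illustrated with a short Python sketch. The function names and the two-dimensional toy vectors are hypothetical; in the embodiment the source vectors and the second disease representation would come from the source-specific and universal encoders.

```python
def weighted_sum(X, A):
    """First disease representation: sum of the source vectors in X,
    each weighted by the corresponding correlation in A."""
    n = len(X[0])
    return [sum(a * x[d] for a, x in zip(A, X)) for d in range(n)]

def fuse_representations(X, A, U2):
    """Splice the first representation U1 with the second representation U2."""
    U1 = weighted_sum(X, A)
    return U1 + list(U2)

# Toy example with 2 sources and n = 2.
X = [[1.0, 0.0],
     [0.0, 1.0]]
A = [0.25, 0.75]
U2 = [0.5, 0.5]
U = fuse_representations(X, A, U2)
```

With these inputs the first half of `U` is the weighted sum `[0.25, 0.75]` and the second half is `U2` unchanged, so the fused vector has twice the dimension of a single source vector.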

Based on any of the above embodiments, as shown in fig. 4, step 122 includes:

step 1221, determining the correlation between the disease description text and each candidate disease type based on the candidate disease representation of each candidate disease type and the disease representation;

step 1222, based on the correlation between the disease description text and each candidate disease type and each candidate disease representation, determining the disease type corresponding to the disease description text.

Specifically, the correlation between the disease description text and each candidate disease type is used to characterize the weight of the disease type corresponding to the disease description text in each candidate disease type, and the greater the correlation, the higher the probability that the disease type corresponding to the disease description text is the same as the candidate disease type. Based on the correlation between the disease description text and each candidate disease type and the representation of each candidate disease, scores of the disease description text and each candidate disease type may be obtained, where a higher score indicates that the probability that the corresponding candidate disease type is the disease type corresponding to the disease description text is higher, for example, a preset number of candidate disease types with a higher score may be selected as the disease type of the disease description text, and a candidate disease type with a score greater than a threshold may also be used as the disease type of the disease description text, which is not specifically limited in the embodiment of the present invention.

For example, the candidate disease representation of each candidate disease type may be obtained by encoding. Assuming there are 100 diseases, the training target vector has 100 dimensions, each dimension representing one disease. For example, if the corresponding diagnosis is hypertension and posterior circulation ischemia, the training target vector is set to 1 in the dimensions of hypertension and posterior circulation ischemia; that is, the disease representation of the candidate disease type is expressed as a vector with 1 in those two dimensions and 0 elsewhere.

After an attention calculation is performed on the disease representation U and the candidate disease representation Y, the correlation between the disease description text and each candidate disease type is obtained, namely the correlation weight matrix $V = [v_0, v_1, \ldots, v_n]$. Then, the score vector of the disease description text for the candidate disease types is obtained as $O = V \times Y^{T}$. Each component of $O$ is weighted by a sigmoid, and it is judged whether the weighted component is greater than a threshold; if so, the corresponding candidate disease type is taken as a disease type of the disease description text.
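The scoring step can be illustrated with a short, non-authoritative Python sketch. It assumes that Y is a matrix with one one-hot row per candidate disease type and that V is a row vector already produced by the attention calculation, which is not modeled here; the function names, the weight values, the threshold of 0.5 and the placeholder `candidate_c` are all hypothetical.

```python
import math
from typing import List, Sequence, Set


def multi_hot(candidates: Sequence[str], diagnoses: Set[str]) -> List[int]:
    """Training target: 1 in the dimension of each diagnosed disease."""
    return [1 if c in diagnoses else 0 for c in candidates]


def one_hot_rows(n: int) -> List[List[float]]:
    """Assumed candidate representation Y: one one-hot row per candidate."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def score_vector(v: Sequence[float],
                 y: Sequence[Sequence[float]]) -> List[float]:
    """O = V x Y^T: one score per candidate (per row of Y)."""
    return [sum(v_j * y_ij for v_j, y_ij in zip(v, row)) for row in y]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(candidates: Sequence[str], v: Sequence[float],
            y: Sequence[Sequence[float]], threshold: float = 0.5) -> List[str]:
    """Apply the sigmoid to each score and keep candidates above the threshold."""
    probs = [sigmoid(o) for o in score_vector(v, y)]
    return [c for c, p in zip(candidates, probs) if p > threshold]


candidates = ["hypertension", "posterior circulation ischemia", "candidate_c"]
v = [2.0, 0.5, -1.0]  # hypothetical correlation weights from attention
predicted = predict(candidates, v, one_hot_rows(len(candidates)))
target = multi_hot(candidates, {"hypertension", "posterior circulation ischemia"})
print(predicted)  # ['hypertension', 'posterior circulation ischemia']
print(target)     # [1, 1, 0]
```

With one-hot rows, each score equals the corresponding correlation weight, so the threshold is effectively applied to the sigmoid of each weight.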

Based on any of the above embodiments, as shown in fig. 5, step 110 includes:

step 111, determining an initial disease description text to be analyzed;

step 112, performing sequence labeling on the initial disease description text, and determining objective description information in the initial disease description text;

step 113, performing text extraction on the initial disease description text, and determining subjective description information in the initial disease description text;

and step 114, determining a disease description text to be analyzed based on the objective description information and the subjective description information.

Specifically, the initial disease description text is a text that describes the patient's condition in colloquial language. When describing their condition, most patients focus on their own subjective feelings. For example, to describe foot twitching, the initial disease description text may be "the foot shakes involuntarily and intermittently", whereas the professional medical description is "foot twitching".

Therefore, in order to further improve the recognition efficiency of disease types, it is necessary to obtain the corresponding key information, such as symptoms, signs, causes and past history, from the initial disease description text. The key information can be divided into objective description information and subjective description information. The objective description information is regularized information whose corpus is relatively convergent; for example, in the description of past history, whether a drug allergy exists is usually stated in one of two regular forms, present or absent. The subjective description information is obtained from the patient's description of subjective feelings, and its corpus is relatively divergent; for example, for the symptom of foot twitching, the corresponding description may take many spoken forms.

Because the objective description information is regular, it can be obtained by performing sequence labeling on the initial disease description text. Because the subjective description information generally centers on the patient's own feelings and its corpus is more divergent, it can be obtained by performing text extraction on the initial disease description text. Once the objective description information and the subjective description information have been extracted, the key information in the initial disease description text has been obtained, so the objective description information and the subjective description information can be used as the disease description text, and the corresponding disease type can be analyzed quickly from the key information in the disease description text.

For example, the cause and the past history are generally described in a regular manner in the initial disease description text, so they can be obtained directly by a sequence labeling method as objective description information. For symptoms, because patients lack medical background knowledge, their descriptions center on their own subjective feelings, so the symptoms can be obtained through an end-to-end model as subjective description information, with the medical standard symptoms obtained based on the correlation, learned by the model, between symptoms and patient expressions.
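How steps 111 to 114 fit together can be sketched as below. This is only a structural illustration: `DiseaseDescription`, `build_description`, `toy_labeler` and `toy_extractor` are hypothetical names, and the two toy functions are simple string checks standing in for the trained sequence labeling model and the end-to-end extraction model, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DiseaseDescription:
    objective: Dict[str, str]  # regular fields, e.g. past history
    subjective: List[str]      # standardized symptoms

    def as_text(self) -> str:
        """Structured disease description text built from both parts."""
        parts = [f"{key}: {value}" for key, value in self.objective.items()]
        parts += [f"symptom: {s}" for s in self.subjective]
        return "; ".join(parts)


def build_description(
    initial_text: str,
    label_objective: Callable[[str], Dict[str, str]],
    extract_subjective: Callable[[str], List[str]],
) -> DiseaseDescription:
    """Steps 112 to 114: label, extract, then combine."""
    return DiseaseDescription(label_objective(initial_text),
                              extract_subjective(initial_text))


# Toy stand-ins (hypothetical); a real system would call trained models.
def toy_labeler(text: str) -> Dict[str, str]:
    return {"drug allergy": "absent"} if "no drug allergy" in text else {}


def toy_extractor(text: str) -> List[str]:
    return ["foot twitching"] if "shakes involuntarily" in text else []


initial = "my foot shakes involuntarily and intermittently, no drug allergy"
description = build_description(initial, toy_labeler, toy_extractor)
print(description.as_text())  # drug allergy: absent; symptom: foot twitching
```

Passing the two extractors in as callables keeps the combination step independent of how the objective and subjective information is actually obtained.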

It should be noted that the medical knowledge data from the plurality of independent sources each have a different emphasis, so their data distributions differ. Therefore, in order to make better use of the medical knowledge of each source and fuse the diagnostic knowledge of the sources, the same method can be used to extract the objective description information and subjective description information from the medical knowledge of each source, so that the unstructured information in each source is mapped into the same knowledge system and normalized into structured text. In this way, the correlation between the disease description text and the medical knowledge of each source can be obtained more quickly, the recognition efficiency of disease types is improved, and the patient is further helped to quickly determine the corresponding department.

Based on any of the above embodiments, the medical knowledge from multiple independent sources includes at least two of human-computer interaction inquiry knowledge, outpatient medical record data, inpatient medical record data, a disease knowledge base, and internet disease encyclopedia knowledge.

Specifically, the existing medical field includes medical knowledge from multiple independent sources: human-computer interaction inquiry knowledge, which contains many spoken expressions, as well as outpatient medical record data, inpatient medical record data, the disease knowledge base and internet disease encyclopedia knowledge, which are organized by professional doctors. By fusing medical knowledge from a plurality of independent sources, the embodiment of the invention can accurately identify the disease type corresponding to the disease description text, avoids the large amount of manual labeling required when machine learning relies solely on spoken disease description texts, and improves the recognition efficiency of disease types.

The human-computer interaction inquiry knowledge is rich in spoken expressions; its corpus is divergent rather than convergent, the intent expressed differs from sentence to sentence, and it contains little professional medical knowledge. The outpatient medical record data, the inpatient medical record data, the disease knowledge base and the internet disease encyclopedia knowledge are medical corpora organized by professional doctors; these corpora are clear and contain many professional terms.

Based on any one of the above embodiments, the present invention provides a text analysis method, including:

inputting the disease description text to be analyzed into a text analysis model to obtain the disease type corresponding to the disease description text output by the text analysis model. The text analysis model is trained based on medical knowledge from multiple independent sources, mixed disease description texts and the corresponding candidate disease types. The medical knowledge from multiple independent sources comprises human-computer interaction inquiry knowledge, outpatient medical record data, inpatient medical record data, a disease knowledge base and internet disease encyclopedia knowledge, and the mixed disease description texts are obtained by mixing the data of the medical knowledge from the multiple independent sources.
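One possible reading of how the mixed disease description texts are produced is to pool the texts of all sources and shuffle them. The sketch below follows that reading, which is an assumption; the function name `mix_sources`, the source keys and the sample texts are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def mix_sources(sources: Dict[str, List[str]],
                seed: int = 0) -> List[Tuple[str, str]]:
    """Pool (source name, text) pairs from every source and shuffle them.

    A fixed seed keeps the shuffled order reproducible between runs.
    """
    pooled = [(name, text)
              for name, texts in sources.items()
              for text in texts]
    random.Random(seed).shuffle(pooled)
    return pooled


# Hypothetical sample texts for three of the sources.
sources = {
    "inquiry": ["my foot keeps shaking", "my head hurts a lot"],
    "outpatient_record": ["foot twitching for 3 days"],
    "knowledge_base": ["hypertension: elevated blood pressure"],
}
mixed = mix_sources(sources)
print(len(mixed))  # 4
```

Keeping the source name with each text preserves the information needed to route a sample to its source-specific encoder during training.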

As shown in fig. 6, the text analysis model may include an input layer, an encoding layer, a fusion layer and an output layer. When the text analysis model is trained, the structured medical knowledge from multiple independent sources, the mixed disease description text and the corresponding candidate disease types are input into the text analysis model. The encoding layer encodes the medical knowledge from the multiple independent sources and the mixed disease description text using a BERT network. The fusion layer determines the correlation between the mixed disease description text and the medical knowledge of each independent source based on a gating mechanism (Gate), performs a weighted summation of the text representations of the mixed disease description text under each independent source to obtain a first mixed disease representation, and splices the first mixed disease representation with a second mixed disease representation to determine the disease representation of the mixed disease description text. A self-attention calculation is then performed on the disease representation of the mixed disease description text and the candidate disease representations to determine the correlation between the mixed disease description text and each candidate disease type, and the disease type corresponding to the mixed disease description text is determined based on this correlation and each candidate disease representation. After the training of the model is completed, the disease description text is input at the position of the mixed disease description text in the input layer, and the output layer outputs the disease type corresponding to the disease description text.

Wherein, the cross-entropy loss function of the text analysis model is as follows:

where, for a single training sample X, m = 0, 1, …, M indexes the M candidate disease types corresponding to the training sample, $y_m$ represents the m-th candidate disease type corresponding to the sample X, and $p(w_m)$ represents the correlation between the training sample and the m-th candidate disease type. It is understood that the text analysis model may optimize the loss function based on the BP algorithm, the Adam method, and the like.
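The loss formula itself is not reproduced in the text above. As a hedged illustration only, the sketch below shows two standard cross-entropy forms that are consistent with the symbols $y_m$ and $p(w_m)$; which of the two, if either, matches the embodiment is an assumption, and the function names and the clipping constant `eps` are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(y: Sequence[float], p: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Categorical form: -sum_m y_m * log p(w_m)."""
    return -sum(y_m * math.log(max(p_m, eps)) for y_m, p_m in zip(y, p))


def binary_cross_entropy(y: Sequence[float], p: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Multi-label form: each candidate is an independent yes/no decision."""
    total = 0.0
    for y_m, p_m in zip(y, p):
        p_m = min(max(p_m, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y_m * math.log(p_m) + (1 - y_m) * math.log(1.0 - p_m)
    return total


# Toy sample (hypothetical): two candidates, the first one is the diagnosis.
y = [1, 0]
p = [0.5, 0.5]
ce = cross_entropy(y, p)
bce = binary_cross_entropy(y, p)
print(round(ce, 4))   # 0.6931
print(round(bce, 4))  # 1.3863
```

The multi-label form also penalizes confident scores on candidates that are not diagnosed, which is the usual choice when each component of the score vector is passed through its own sigmoid.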

The following describes the text analysis device provided by the present invention, and the text analysis device described below and the text analysis method described above may be referred to in correspondence with each other.

Based on any of the above embodiments, the present invention further provides a text analysis apparatus, as shown in fig. 7, the apparatus including:

a text determination unit 710 for determining a disease description text to be analyzed;

and a text analysis unit 720, configured to determine a disease type corresponding to the disease description text based on correlations between the disease description text and medical knowledge from multiple independent sources, respectively.

Based on any of the above embodiments, the text analysis unit 720 includes:

a disease representation unit, configured to determine a disease representation of the disease description text based on correlations between the disease description text and medical knowledge of multiple independent sources, respectively, and text representations of the disease description text under the respective independent sources;

and the disease determining unit is used for determining the disease type corresponding to the disease description text based on the disease representation.

Based on any of the above embodiments, the apparatus further includes:

a text representation unit, configured to determine a text representation of the disease description text in each independent source based on an independent text encoding rule of each independent source, before determining a disease representation of the disease description text based on correlations between the disease description text and medical knowledge of a plurality of independent sources, respectively, and a text representation of the disease description text in each independent source, where the independent text encoding rule is determined based on the medical knowledge of the corresponding independent source;

and the self-attention unit is used for performing self-attention calculation on the text representations under the independent sources to obtain the correlation between the disease description texts and the medical knowledge of the independent sources respectively.

In accordance with any of the above embodiments, the disease representation unit comprises:

the first disease representation unit is used for taking the correlation between the disease description text and the medical knowledge of each independent source as a weight, and performing weighted summation on the text representation of the disease description text under each independent source to obtain a first disease representation;

a second disease representation unit for determining a second disease representation of the disease description text based on a universal text encoding rule determined based on medical knowledge mixing the plurality of independent sources;

a third disease representation unit for determining a disease representation of the disease description text based on the first disease representation and the second disease representation.

In any of the above embodiments, the disease determination unit includes:

a correlation determination unit for determining a correlation between the disease description text and each candidate disease type based on a candidate disease representation of each candidate disease type and the disease representation;

and the type determining unit is used for determining the disease type corresponding to the disease description text based on the correlation between the disease description text and each candidate disease type and each candidate disease representation.

Based on any of the above embodiments, the text determining unit 710 includes:

an initial text determination unit for determining an initial disease description text to be analyzed;

the objective information determining unit is used for carrying out sequence marking on the initial disease description text and determining objective description information in the initial disease description text;

a subjective information determining unit, configured to perform text extraction on the initial disease description text, and determine subjective description information in the initial disease description text;

and the description text determining unit is used for determining the disease description text to be analyzed based on the objective description information and the subjective description information.

Based on any of the above embodiments, the plurality of independent sources of medical knowledge includes at least two of human-computer interaction inquiry knowledge, outpatient medical record data, inpatient medical record data, a disease knowledge base, and internet disease encyclopedia knowledge.

Fig. 8 is a schematic structural diagram of an electronic device provided in the present invention. As shown in fig. 8, the electronic device may include: a processor 810, a memory 820, a communication interface 830 and a communication bus 840, wherein the processor 810, the memory 820 and the communication interface 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 820 to perform a text analysis method comprising: determining a disease description text to be analyzed; and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources respectively.

Furthermore, the logic instructions in the memory 820 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a text analysis method provided by the above methods, the method comprising: determining a disease description text to be analyzed; and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources respectively.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the text analysis method provided above, the method comprising: determining a disease description text to be analyzed; and determining the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge of a plurality of independent sources respectively.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method of text analysis, comprising:

determining a disease description text to be analyzed;

2. The method according to claim 1, wherein the determining the disease type corresponding to the disease description text based on the correlation between the disease description text and the medical knowledge from a plurality of independent sources respectively comprises:

3. The text analysis method of claim 2, wherein the determining the disease representation of the disease description text based on the correlations between the disease description text and the medical knowledge from the independent sources, respectively, and the text representations of the disease description text from the independent sources, further comprises:

4. The method of claim 2, wherein determining the disease representation of the disease description text based on the correlations between the disease description text and the medical knowledge from the independent sources, respectively, and the text representations of the disease description text from the independent sources comprises:

5. The text analysis method of claim 2, wherein the determining a disease type corresponding to the disease description text based on the disease representation comprises:

6. The text analysis method according to any one of claims 1 to 5, wherein the determining of the disease description text to be analyzed comprises:

determining an initial disease description text to be analyzed;

7. The text analysis method of any one of claims 1 to 5, wherein the plurality of independently sourced medical knowledge comprises at least two of human-machine interaction inquiry knowledge, outpatient medical record data, in-patient medical record data, a disease knowledge base, and Internet disease encyclopedia knowledge.

8. A text analysis apparatus, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the text analysis method according to any of claims 1 to 7 are implemented when the processor executes the program.

10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the text analysis method according to any one of claims 1 to 7.