CN115688735A - Text processing method, apparatus, medium, and program product


Info

Publication number
CN115688735A
Authority
CN
China
Prior art keywords
text
standard
training
model
representation
Prior art date
Legal status
Pending
Application number
CN202110872803.6A
Other languages
Chinese (zh)
Inventor
许梦竹
田聪
袁亚娜
Current Assignee
Koninklijke Philips NV
Original Assignee
Koninklijke Philips NV
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips NV filed Critical Koninklijke Philips NV
Priority to CN202110872803.6A
Publication of CN115688735A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure relate to text processing methods, apparatuses, media, and program products. A text processing method includes: obtaining a first training text and a first standard text marked as matching the first training text in a standard text library, where the standard text library comprises a plurality of standard texts used in a knowledge domain; generating a second training text by modifying a second standard text in the standard text library, the second training text being marked as matching the second standard text; and training a model configured to generate feature representations of text, using the first and second training texts and the first and second standard texts, in accordance with a training objective determined at least to enable a first feature representation generated by the model for the first training text to be reconstructed into the first standard text and a second feature representation generated by the model for the second training text to be reconstructed into the second standard text. In this way, the resulting model makes text normalization more accurate.

Description

Text processing method, apparatus, medium, and program product
Technical Field
Embodiments of the present disclosure relate generally to the field of computers, and more particularly, to a text processing method, an electronic device, a computer-readable storage medium, and a program product.
Background
Clinical medical terminology is an important component of medical data, and term standardization technology is very important for clinical research and clinical information management systems. However, different medical personnel, or the same medical personnel on different occasions and at different times, may use different expressions for the same medical term, and therefore the text extracted from an electronic medical record (such as a diagnosis report) needs to be converted into standard text in a standard term library (such as the International Classification of Diseases code, ICD-10). Manual labeling usually requires substantial labor and time costs, and medical terms may be standardized incorrectly because of insufficient professional knowledge or operator error.
Existing term normalization techniques typically evaluate the similarity of texts by measuring string-based distances or vector-based distances. However, the accuracy of these techniques is low. Therefore, it is desirable to provide a method that makes the normalization of text more accurate.
Disclosure of Invention
According to an embodiment of the present disclosure, a text processing scheme is provided that improves the accuracy of text normalization by using a hybrid training method.
In a first aspect of the disclosure, a text processing method is provided. The method comprises: obtaining a first training text and a first standard text marked as matching the first training text in a standard text library, where the standard text library comprises a plurality of standard texts used in a knowledge domain; generating a second training text by modifying a second standard text in the standard text library, the second training text being marked as matching the second standard text; and training a model configured to generate feature representations of text, using the first training text, the second training text, the first standard text, and the second standard text, in accordance with a training objective determined at least to enable a first feature representation generated by the model for the first training text to be reconstructed into the first standard text and a second feature representation generated by the model for the second training text to be reconstructed into the second standard text.
According to some alternative embodiments, modifying the second standard text in the standard text library to generate the second training text comprises modifying the second standard text by at least one of: deleting at least one character, word, or phrase in the second standard text; replacing at least one character in the second standard text with a character having the same or a similar pronunciation; replacing a word in the second standard text with a word having the same root; and changing the order of characters, words, or phrases in the second standard text.
According to some optional embodiments, the method further comprises: generating a third training text by modifying the first training text, the third training text being marked as matching the first standard text; and further training the model using the third training text and the first standard text, in accordance with the training objective, the training objective further being determined to enable a third feature representation generated by the model for the third training text to be reconstructed into the first standard text.
According to some optional embodiments, the method further comprises: performing pre-processing on the first training text, the first standard text, and the second standard text to format the first training text, the first standard text, and the second standard text, wherein the second training text is generated based on the pre-processed second standard text.
According to some alternative embodiments, training the model comprises: for each training text of the first training text and the second training text, determining a vectorized representation corresponding to the training text; generating a training feature representation corresponding to the training text by applying the vectorized representation to the model; generating a reconstructed text corresponding to the training text from the training feature representation; and updating the parameter set of the model to meet the training objective by reducing the difference between the reconstructed text and the standard text that the training text matches.
According to some alternative embodiments, determining the corresponding vectorized representation of the training text comprises: extracting a plurality of single-dimensional vectorization representations of the training text on a plurality of dimensions; and determining a vectorized representation by merging the plurality of single-dimensional vectorized representations.
According to some optional embodiments, extracting the plurality of single-dimensional vectorized representations comprises extracting at least one of the following: a semantic vectorized representation corresponding to the training text in a semantic dimension; a plurality of unit vectorized representations corresponding to a plurality of text units included in the training text in a text dimension, wherein the plurality of text units include at least one of characters, words, and phrases; and a pronunciation vectorized representation corresponding to all or part of the pronunciation of the training text in a pronunciation dimension.
In a second aspect of the present disclosure, a text processing method is provided. The method comprises the following steps: determining a target feature representation corresponding to the target text by using the model trained according to the method of the first aspect; obtaining a plurality of standard feature representations corresponding to a plurality of standard texts in a standard text library; determining a plurality of representation similarity scores between the target feature representation and the plurality of standard feature representations; and determining a standard text matching the target text in the plurality of standard texts based on at least the plurality of representation similarity scores.
According to some alternative embodiments, determining the standard text that matches the target text based at least on the plurality of representation similarity scores comprises: selecting a plurality of candidate standard texts for the target text from the plurality of standard texts based on the plurality of representation similarity scores; determining a plurality of text similarity scores between the target text and a plurality of candidate standard texts; determining a plurality of confidence scores between the target text and the plurality of candidate standard texts based on the plurality of representation similarity scores and the plurality of text similarity scores; and selecting a standard text matching the target text from the plurality of candidate standard texts based on the plurality of confidence scores.
According to some alternative embodiments, the plurality of standard feature representations are determined by the model.
In a third aspect of the disclosure, an electronic device is provided. The apparatus comprises: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the apparatus to perform the actions of: obtaining a first training text and a first standard text marked as matching with the first training text in a standard text base, wherein the standard text base comprises a plurality of standard texts used in the knowledge field; generating a second training text by modifying a second standard text in the standard text library, the second training text being marked as matching the second standard text; and training a model configured to generate feature representations of the text using the first training text, the second training text, the first standard text, and the second standard text, and in accordance with a training objective determined at least to enable a first feature representation generated by the model for the first training text to be reconstructed as the first standard text and a second feature representation generated by the model for the second training text to be reconstructed as the second standard text.
According to some optional embodiments, the apparatus may implement various embodiments of the method of the first aspect.
In a fourth aspect of the present disclosure, an electronic device is provided. The apparatus comprises: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the apparatus to perform the actions of: determining a target feature representation corresponding to the target text by using the model trained according to the method of the first aspect; obtaining a plurality of standard feature representations corresponding to a plurality of standard texts in a standard text library; determining a plurality of representation similarity scores between the target feature representation and the plurality of standard feature representations; and determining a standard text matching the target text in the plurality of standard texts based on at least the plurality of representation similarity scores.
According to some alternative embodiments, the device may implement various embodiments of the method of the second aspect.
In a fifth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement various embodiments of the method according to the first aspect or the method according to the second aspect.
In a sixth aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements various embodiments of the method according to the first aspect or the method according to the second aspect.
According to various embodiments of the present disclosure, by using a hybrid training method, a model is enabled to better learn a feature representation of a text, thereby improving the accuracy of the text normalization process.
Drawings
The above and other objects, structures and features of the present disclosure will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 illustrates a block diagram of an environment in which implementations of the present disclosure can be implemented;
FIG. 2 illustrates a schematic diagram of a model training system for a feature representation generative model, according to some embodiments of the present disclosure;
FIG. 3 illustrates a block diagram of an example structure of a feature representation generative model, according to some embodiments of the present disclosure;
FIG. 4 illustrates a flow diagram of a text processing process for training a feature representation generative model, according to some embodiments of the present disclosure;
FIG. 5 illustrates a block diagram of training of a feature representation generative model, according to some embodiments of the present disclosure;
FIG. 6 illustrates a flow diagram of a process of training a feature representation generative model, according to some embodiments of the present disclosure;
FIG. 7 shows a schematic diagram of a process of determining reconstructed text in accordance with some embodiments of the present disclosure;
FIG. 8 illustrates a block diagram of the determination of standard text in accordance with some embodiments of the present disclosure;
FIG. 9 illustrates a flow diagram of a process of determining standard text in accordance with some embodiments of the present disclosure; and
FIG. 10 illustrates a block diagram of a computing device in which one or more embodiments of the disclosure may be implemented.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
In the description of the embodiments of the present disclosure, the word "comprise" and variations such as "comprises" and "comprising" should be understood to be open-ended, i.e., "including but not limited to". The expression "based on" should be understood as "based at least in part on". The expression "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The expressions "first", "second", etc. may refer to different or the same objects. Other explicit and implicit definitions are also possible below.
As used herein, the expression "model" may learn from training data the associations between respective inputs and outputs, such that after training is complete, for a given input, a corresponding output may be generated. The generation of the model may be based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs using multiple layers of processing units. Neural network models are one example of deep learning based models. A "model" may also be referred to herein as a "machine learning model," "machine learning network," or "learning network," which expressions are used interchangeably herein.
A "neural network" is a machine learning network based on deep learning. Neural networks are capable of processing inputs and providing corresponding outputs, which typically include an input layer and an output layer and one or more hidden layers between the input and output layers. Neural networks used in deep learning applications typically include many hidden layers, thereby increasing the depth of the network. The layers of the neural network are connected in sequence such that the output of a previous layer is provided as the input of a subsequent layer, wherein the input layer receives the input of the neural network and the output of the output layer is the final output of the neural network. Each layer of the neural network includes one or more nodes (also referred to as processing nodes or neurons), each node processing an input from a previous layer.
In general, machine learning can roughly include three phases, namely a training phase, a testing phase, and a use phase (also referred to as an inference phase). In the training phase, a given model may be trained using a large amount of training data, with parameter values being updated iteratively until the model is able to derive consistent inferences from the training data that meet the desired objectives. By training, the model may be considered to be able to learn from the training data the association between inputs to outputs (also referred to as input to output mapping). Parameter values of the trained model are determined. In the testing phase, test inputs are applied to the trained model to test whether the model can provide the correct outputs, thereby determining the performance of the model. In the use phase, the model may be used to process the actual input and determine the corresponding output based on the trained parameter values.
As briefly described above, it is desirable to improve the accuracy of text normalization, particularly in fields where there is a high demand for normalized and standardized text. For example, in clinical research and clinical information management, the text in a diagnostic report written by a physician may differ somewhat from the standard text in a standard term library. Such differences may be due to typographical errors, shorthand, abbreviations, or reversed word order. It is desirable to establish a mapping between the input text and the corresponding standard text.
Some lines of research evaluate text similarity by measuring string-based distances or vector-based distances. According to the Minimum Edit Distance (MED) method, the similarity between two texts is measured by counting the steps required to convert one text into the other. This string-based approach is strict about character-level similarity between texts, but it cannot achieve satisfactory accuracy for shorthand and reversed word order (e.g., the input text "lung cancer" and the corresponding standard text "lung malignancy"). Vector-based methods can capture the semantics of the text, although at the cost of heavy and complicated computation; because such methods rely on the corpus used in the training phase, satisfactory accuracy cannot be obtained by using them alone. Furthermore, for the standardization of medical terms, existing attention-based auto-encoding deep learning models can address typographical errors, but cannot achieve satisfactory accuracy on the large amount of non-standardized data written by physicians. Therefore, a more effective way to improve accuracy is needed.
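For readers unfamiliar with the MED method mentioned above, the following minimal sketch (not part of the disclosed invention) shows a standard Levenshtein edit distance and why a purely string-based score favors lexically similar pairs regardless of semantics; the example strings are taken from this description.

```python
# Minimal Levenshtein (minimum edit distance) sketch, for illustration only;
# it is not the patent's own implementation.
def edit_distance(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions to turn a into b."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,       # insertion
                                     prev + (ca != cb))   # substitution
    return dp[-1]

# The lexically similar but semantically different pair scores a smaller distance
# than the semantically identical but differently written pair, which is exactly
# the weakness of string-based methods described above.
print(edit_distance("multiple puncture wound", "multiple cutting wound"))  # lexically close
print(edit_distance("lung cancer", "lung malignancy"))                     # lexically far, same meaning
```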
Example Environment
FIG. 1 illustrates a block diagram of an environment 100 in which implementations of the present disclosure can be implemented. In the environment 100 of FIG. 1, it is desirable to train a machine learning model that can generate a feature representation of input text. The feature representation can better represent the semantics of the text, which helps measure the similarity between texts in a text standardization task and thereby realize the mapping from a target text to a standard text.
As shown in FIG. 1, environment 100 includes a model training system 106 and a model application system 112. The model training system 106 may be configured to train the feature representation generation model 108. The training data for the feature representation generation model 108 may include the standard text library 102 and the documents 104. The standard text library 102 may include a plurality of standard texts used in the knowledge domain. The documents 104 may include one or more non-normalized texts. Through the training process, the feature representation generation model 108 obtains a trained set of parameters. The feature representations generated by the trained feature representation generation model 108 may be used universally for the normalization of various target texts.
In FIG. 1, model application system 112 may be configured to determine standard text 118 corresponding to target text 110 using feature representation generation model 108 and text similarity determination model 116. In particular, the model application system 112 may be configured to utilize the feature representation generation model 108 to determine a target feature representation corresponding to the target text 110 and a plurality of standard feature representations corresponding to a plurality of standard texts in the standard text repository 114. The model application system 112 may be further configured to determine a standard text 118 corresponding to the target text 110 using the text similarity determination model 116 based on the plurality of standard feature representations and the target feature representation corresponding to the target text 110. The plurality of standard feature representations may be stored in a memory of the model application system 112 or other computing device and reused by the model application system 112.
Model training system 106 and model application system 112 may be implemented on a single computing device or on multiple computing devices. Model training system 106 may be implemented on a different device than the device implementing model application system 112. Of course, in some cases, model training and model use may also be implemented on the same device or set of devices, depending on the needs of the actual computing resource deployment.
Standard text 118 corresponding to target text 110 may be included in standard text repository 114. The standard text repository 114 may include a plurality of standard texts used in a particular knowledge domain. The standard texts included in the standard text repository 114 may be identical to, partially identical to, or completely different from the standard texts in the standard text library 102. By way of example, in the medical field, the standard text library 102 and the standard text repository 114 may be the International Classification of Diseases codes (ICD-10), and the documents 104 and target text 110 may be derived from electronic medical records written by physicians and/or may include textual descriptions such as various medical diagnoses and treatment methods.
In this context, a "document" refers to an object that partially or wholly renders text in natural language. Some documents may include images from which text may be recognized. A document in image format may be, for example, a handwritten, printed, or scanned version of a document, or a digitally captured image. Other examples of documents include digitally generated documents, such as text files, PDF files, Extensible Markup Language (XML) files, or other structured or semi-structured documents, as well as other documents from which text strings can be extracted.
The processing of the documents 104 and the target text 110 may be implemented on a text-unit basis. Herein, "text unit" refers to a unit of text used in natural language processing. The granularity of the text units may vary and may be set according to the particular application and/or the language in which the text is written. For example, a text unit may include a character, word, phrase, symbol, a combination of the foregoing, or any other element that may appear in a natural language expression. For Chinese, a text unit may be a single character, word, or phrase. For English, a text unit may include characters, words, phrases made up of multiple words, and the like. The division into text units may be achieved by various word segmentation techniques. The number of characters and/or words in each text unit may depend on the granularity of the word segmentation.
Text standardization can be applied in various fields to implement automated term standardization, data management, data analysis, and the like. By way of example, in the medical field, it is desirable to automatically convert text in electronic medical records written by physicians into standard text in a standard term library (e.g., ICD-10) via text standardization techniques, for archiving to a Hospital Information System (HIS) and for clinical data research. Text normalization may also be applied in various other fields. In the following, some embodiments of the present disclosure are described with reference to the medical field. However, it should be understood that the text normalization methods presented in this disclosure may also be applied to other knowledge domains, such as education, finance, industrial manufacturing, and so forth. The "knowledge domain" here may be divided at various granularities.
The feature representation generation model 108 may be configured to support input text of various lengths. The feature representation of a text generated by the feature representation generation model 108 may generally consist of numerical values of a certain dimensionality. The feature representations of different texts may have the same dimensionality but contain different values.
The feature representation of a text is intended to distinguish, as far as possible, the different semantics of different texts. The accuracy of the feature representation depends mainly on the training of the model. Many training methods have been proposed to train models that generate feature representations. However, many current models are still insufficiently accurate: for texts that share many components but have different semantics, or texts that share few components but have the same semantics, the determined feature representations cannot accurately reflect the semantic difference. As an example, the standard texts "multiple puncture wound" and "multiple cutting wound" look very similar but have different semantics, while the text "lung cancer" and the standard text "lung malignancy" share few components but are semantically identical. Existing models may generate very similar feature representations for "multiple puncture wound" and "multiple cutting wound" because the two texts have many common components (including "multiple" and "wound") and are structurally similar, but may generate very different feature representations for "lung cancer" and "lung malignancy" because the common components of the two texts (including "lung") are few and the text structures differ greatly.
Embodiments of the present disclosure propose a scheme for training a feature representation generative model. According to this scheme, text from a document is utilized as first training text, and standard text that matches the first training text in a standard text library is labeled and used as the first standard text. The standard text from the standard text corpus is modified to serve as the second training text, and the corresponding standard text from the standard text corpus is marked as matching the second training text and is used as the second standard text. The feature representation generation model is supervised trained using the first training text and the first standard text, and unsupervised or self-supervised training is performed using the second training text and the second standard text. The training objectives of the feature representation generation model may be determined at least to enable a first feature representation generated by the model for a first training text to be reconstructed as a first standard text and a second feature representation generated by the model for a second training text to be reconstructed as a second standard text.
In a model training process according to an embodiment of the present disclosure, the first training text has a pre-labeled first standard text, constituting a supervised training sample pair. Second training texts are generated by automatically modifying second standard texts, thereby forming additional training sample pairs. A training sample pair constructed in this way is referred to as an "unsupervised" training sample pair because the second training text is not pre-labeled as matching the second standard text. The model obtained by this hybrid training scheme can determine a feature representation of the input text that is closer to the feature representations of texts with the same semantics and more distinct from the feature representations of texts with different semantics, so that the feature representation of the text is more accurate in semantic discrimination. Owing to these advantages in semantic discrimination, the generated feature representations may enable better performance in subsequent text processing tasks.
Example implementation of model training
Some example embodiments of the disclosure will now be described with continued reference to the accompanying drawings.
FIG. 2 illustrates a schematic diagram of the model training system 106 for the feature representation generation model 108, according to some embodiments of the present disclosure. For ease of discussion, training of the model is discussed with reference to FIG. 1, and FIG. 2 accordingly shows that the model training system 106 is configured to train the feature representation generation model 108. Model training system 106 may include a pre-processing module 202, a text modification module 204, the feature representation generation model 108, and a parameter update module 206.
The pre-processing module 202 may be configured to perform one or more pre-processing operations on text input to the model training system 106, so that the pre-processed text conforms to the same writing conventions. The pre-processing operations may include, but are not limited to: punctuation removal, case conversion, full-width to half-width conversion, number normalization (e.g., uniformly replacing the digit "5" with "five"), shorthand replacement, removal of duplicate identical texts, and the like. The text input to the model training system 106 may include standard text from the standard text library 102 and non-normalized text from the documents 104. The pre-processing operations on the standard text and the non-normalized text may be performed together or separately.
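Purely as an illustration of the kind of work the pre-processing module 202 performs, a routine along the following lines could be used; the regular expressions, the use of Unicode NFKC normalization for full-width to half-width folding, and the number-word table are assumptions, not details disclosed in the description.

```python
import re
import unicodedata

# Hypothetical number-normalization table; the description only gives "5" -> "five" as an example.
NUMBER_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def preprocess(text: str) -> str:
    # Full-width -> half-width conversion (NFKC also folds other compatibility forms).
    text = unicodedata.normalize("NFKC", text)
    # Case normalization.
    text = text.lower()
    # Punctuation removal (keep letters, digits, CJK characters, and spaces).
    text = re.sub(r"[^\w\s\u4e00-\u9fff]", " ", text)
    # Number normalization.
    text = "".join(NUMBER_WORDS.get(ch, ch) for ch in text)
    # Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(texts):
    # Removal of duplicate identical texts, preserving order.
    return list(dict.fromkeys(preprocess(t) for t in texts))

print(preprocess("Lung cancer (malignant), stage 2"))  # -> "lung cancer malignant stage two"
```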
After the pre-processing operation, the non-normalized text from the documents 104 may be used as the first training text of the feature representation generation model 108. The standard text in the standard text library 102 that matches the first training text is labeled and, after the pre-processing operation, is used as the first standard text of the feature representation generation model 108. For example, the first training text may be the colloquial expression "gallstone", and the matching first standard text in the standard text library 102 may be the corresponding standard term for that condition. In some embodiments, the standard text matching the first training text may be identified by manual annotation. Additionally or alternatively, the standard text matching the first training text may also be labeled by means of other tools. Similarly, other standard texts in the standard text library 102 may be pre-processed.
The text modification module 204 may be configured to modify the standard text from the standard text repository 102 to serve as the second training text for the feature representation generation model 108. The corresponding standard text from the standard text library 102 is marked as matching the second training text and is used as the second standard text for the feature representation generative model 108.
In one embodiment, the text modification module 204 may generate the second training text by modifying the second standard text, with a certain probability, in at least one of the following ways: deleting at least one character, word, or phrase in the second standard text; replacing at least one character in the second standard text with a character having the same or a similar pronunciation; replacing a word in the second standard text with a word having the same root; and changing the order of characters, words, or phrases in the second standard text. As an example, the text modification module 204 may delete characters from "acute hemorrhagic necrotizing enteritis", or shorten "ulcerative enterocolitis" to "ulcerative colitis", to simulate a physician missing characters, typing fewer characters, or using shorthand or abbreviation; it may replace a character in "influenza pharyngitis" with a homophone to simulate incorrect input when a physician uses a pinyin input method; or it may reorder the characters of "early gastric cancer" to simulate the writing habits of different physicians. A minimal sketch of such a modification module is shown after the next paragraph.
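The sketch below assumes a character-level treatment and a small hypothetical homophone table; the description does not prescribe a concrete implementation, and the root-word replacement rule is omitted here for brevity.

```python
import random

# Hypothetical homophone table; in practice this would come from a pinyin dictionary
# (for Chinese) or a phonetic resource for the target language.
HOMOPHONES = {"流": ["留"], "型": ["形"], "期": ["其"]}

def modify(standard_text: str, p: float = 0.3, rng: random.Random = random) -> str:
    """Randomly apply one of the modification operations described above."""
    chars = list(standard_text)
    if len(chars) < 2 or rng.random() > p:
        return standard_text                 # used unmodified with probability 1 - p
    op = rng.choice(["delete", "homophone", "swap"])
    i = rng.randrange(len(chars))
    if op == "delete":                       # simulate missed or shorthand input
        del chars[i]
    elif op == "homophone" and chars[i] in HOMOPHONES:
        chars[i] = rng.choice(HOMOPHONES[chars[i]])      # simulate pinyin input errors
    elif op == "swap" and i + 1 < len(chars):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]  # simulate different writing orders
    return "".join(chars)

def make_unsupervised_pairs(standard_texts, n_per_text: int = 2):
    """Each generated training text is labeled as matching its source standard text."""
    return [(modify(s), s) for s in standard_texts for _ in range(n_per_text)]
```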
The text modification module 204 may apply different text modification rules to the text. It should be understood that, in addition to the homophone substitutions that may occur in the text, the text may be modified according to the characteristics of different languages to simulate the various variations that different people may produce for the same text due to different language characteristics, language habits, or input methods. As an example, for English, an English word may also be modified into a word sharing the same root.
In one embodiment, the second standard text may not be modified but may be used directly as the second training text of the feature representation generation model 108. In other words, the standard text in the standard text library may be used as the second training text of the feature representation generation model 108 with or without modification, chosen at random (e.g., with a certain probability). The number and ratio of standard texts that are modified before being used as training texts to standard texts that are used directly as training texts may be preset and may be changed.
The feature representation generation model 108 may be configured to utilize at least the first training text, the second training text, the first standard text, and the second standard text, and to train in accordance with a training goal. The training targets may be determined at least to enable a first feature representation generated by the feature representation generation model 108 for a first training text to be reconstructed as a first standard text and a second feature representation generated by the feature representation generation model 108 for a second training text to be reconstructed as a second standard text.
In some embodiments, the text modification module 204 may be configured to modify the first training text to be used as a third training text for the feature representation generation model 108. The first standard text is marked as matching the third training text. The feature representation generation model 108 may also be configured to utilize the third training text and the first standard text, and to train according to a training goal. The training targets may also be determined to enable a third feature representation generated by the feature representation generation model 108 for a third training text to be reconstructed into the first standard text.
The above describes the different training sample pairs (first training text, first standard text), (second training text, second standard text), and (third training text, first standard text), where the standard text is considered to be the text in the standard text library that matches the training text. It should be appreciated that, to achieve the training objective, a larger number of similar training sample pairs may be generated to train the feature representation generation model 108.
The feature representation generative model 108 is described in detail below in conjunction with FIG. 3.
FIG. 3 illustrates a block diagram of an example structure of a feature representation generative model 108, according to some embodiments of the present disclosure. The feature representation generation model 108 may include a vectorized representation extraction module 304, a feature representation generation module 312, and a reconstruction module 316.
The vectorized representation extraction module 304 may be configured to receive the input text 302 and determine a vectorized representation 310 to which the input text 302 corresponds. The input text 302 may be training text provided to the feature representation generation model 108 during a model training phase, such as the first training text, the second training text, or the third training text described with respect to fig. 2. The input text 302 may also be standard text from a standard text library 114 or text to be standardized (otherwise known as "target text") provided to the feature representation generation model 108 during the model application phase. The vectorized representation extraction module 304 may include a single-dimensional vectorized representation extraction module 306 and a vectorized representation merge module 308.
The single-dimensional vectorized representation extraction module 306 may be configured to receive the input text 302 and extract a plurality of single-dimensional vectorized representations q_1, q_2, ..., q_K of the input text 302 in multiple dimensions, where K is a positive integer greater than or equal to 1.
In one embodiment, the plurality of dimensions may include one or more of a semantic dimension, a text dimension, and a pronunciation dimension. In one embodiment, the single-dimensional vectorized representation extraction module 306 may extract a semantic vectorized representation corresponding to the input text 302 in the semantic dimension. For example, the single-dimensional vectorized representation extraction module 306 may use a pre-trained language model, such as Bidirectional Encoder Representations from Transformers (BERT), to extract the semantic vectorized representation corresponding to the input text 302. The pre-trained language model can be pre-trained on a large amount of full-length text (such as Wikipedia), so as to capture the semantic information corresponding to the text. In one embodiment, the single-dimensional vectorized representation extraction module 306 may directly adopt the parameters obtained by pre-training the pre-trained language model. In another embodiment, the parameters of the pre-trained language model may also be fine-tuned using a knowledge-domain database. For example, in the medical field, the parameters of the pre-trained language model may be fine-tuned using a medical-field database (e.g., standards for diagnosis and treatment of various diseases, professional dictionaries, textbooks, etc.).
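As one possible way to realize the semantic dimension described above, the sketch below uses the Hugging Face transformers library with an assumed checkpoint ("bert-base-chinese") and mean pooling; the description only requires some pre-trained language model such as BERT, so these choices are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; any suitable pre-trained language model could be substituted.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")

def semantic_vector(text: str) -> torch.Tensor:
    """Semantic-dimension vectorized representation: mean-pooled BERT token embeddings."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state       # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (1, hidden_size)
```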
Alternatively or additionally, the single-dimensional vectorized representation extraction module 306 may extract a plurality of cell vectorized representations corresponding to a plurality of text cells included in the input text 302 in a text dimension, the plurality of text cells including at least one of a character, a word, and a phrase. The division of the text units may be achieved by various word segmentation techniques. Alternatively or additionally, the single-dimensional vectorized representation extraction module 306 may extract a pronunciation-vectorized representation corresponding to all or part of the pronunciation of the input text 302 in the pronunciation dimension.
The vectorized representation merging module 308 may determine the vectorized representation 310 corresponding to the input text 302 by merging the plurality of single-dimensional vectorized representations q_1, q_2, ..., q_K of the input text 302 in the multiple dimensions. By merging multiple single-dimensional vectorized representations across multiple dimensions, the accuracy of text normalization can be significantly improved.
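A merging step of this kind could be as simple as concatenation, as in the following sketch; concatenation, the tensor shapes, and the three example dimensions are assumptions rather than details given in the description.

```python
import torch

def merge(representations: list[torch.Tensor]) -> torch.Tensor:
    """Merge single-dimensional vectorized representations q_1, ..., q_K into one vector.

    Concatenation is only one possible merging strategy; a weighted sum or a learned
    projection would also fit the description above.
    """
    return torch.cat([q.flatten() for q in representations], dim=0)

# q1: semantic dimension, q2: text-unit dimension, q3: pronunciation dimension (assumed sizes).
q1, q2, q3 = torch.randn(768), torch.randn(128), torch.randn(64)
vectorized_representation = merge([q1, q2, q3])   # shape: (960,)
```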
The feature representation generation module 312 may be configured to generate a feature representation 314 corresponding to the vectorized representation 310. The reconstruction module 316 may be configured to generate reconstructed text 318 corresponding to the input text 302 from the feature representation 314. In some implementations, the feature representation generation module 312 and the reconstruction module 316 may be implemented by an AutoEncoder model, where the AutoEncoder model includes an encoder and a decoder. The feature representation generation module 312 may be implemented as an encoder in an AutoEncoder model for encoding the input text into a feature representation, and the reconstruction module 316 may be implemented as a decoder in the AutoEncoder model for decoding the corresponding text from the feature representation output by the encoder.
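The following is a deliberately simplified sketch of how modules 312 and 316 could be realized as the encoder and decoder of an AutoEncoder; a real implementation would more likely use a sequence decoder (e.g., a GRU or Transformer decoder), and the layer sizes, maximum output length, and vocabulary handling here are assumptions.

```python
import torch
from torch import nn

class FeatureRepresentationAutoEncoder(nn.Module):
    """Minimal sketch of modules 312 (encoder) and 316 (decoder); all sizes are assumptions."""

    def __init__(self, vocab_size: int, input_dim: int = 960, feature_dim: int = 256):
        super().__init__()
        # Encoder (feature representation generation module 312): maps the merged
        # vectorized representation to a fixed-size feature representation.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU(),
                                     nn.Linear(512, feature_dim))
        # Decoder (reconstruction module 316): decodes the feature representation
        # into per-position token logits for a fixed maximum text length.
        self.max_len = 32
        self.vocab_size = vocab_size
        self.decoder = nn.Sequential(nn.Linear(feature_dim, 512), nn.ReLU(),
                                     nn.Linear(512, self.max_len * vocab_size))

    def forward(self, vectorized_repr: torch.Tensor):
        feature = self.encoder(vectorized_repr)                          # (batch, feature_dim)
        logits = self.decoder(feature)                                   # (batch, max_len * vocab)
        return feature, logits.view(-1, self.max_len, self.vocab_size)   # reconstruction logits
```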
Returning to FIG. 2, the feature representation generation model 108 may be configured to generate a training feature representation corresponding to a training text (e.g., the first training text, the second training text, or the third training text), and to generate a reconstructed text corresponding to the training text from the training feature representation. The parameter update module 206 may be configured to generate an updated set of parameters 208 for the feature representation generation model 108 by reducing the difference between the reconstructed text generated by the feature representation generation model 108 and the standard text that the training text matches (i.e., the first standard text matched by the first training text, the second standard text matched by the second training text, or the first standard text matched by the third training text). Through the training process, the set of parameters of the feature representation generation model 108 is further updated and fine-tuned. Such parameter set updates may be performed iteratively until the training objective is met. After training is complete, the feature representation generation model 108 has trained parameter values. Based on such parameter values, the feature representation generation model 108 can be used to implement the normalization of target text.
The process of training the feature representation generative model is described in detail below in conjunction with FIGS. 4-7.
FIG. 4 illustrates a flow diagram of a text processing procedure 400 for training the feature representation generative model 108, according to some embodiments of the present disclosure. Text process 400 may be implemented by model training system 106.
At block 410, model training system 106 obtains first training text from document 104 and first standard text from standard text library 102 that is marked as matching the first training text.
At block 420, the model training system 106 generates a second training text by modifying a second standard text in the standard text library 102, the second training text being marked as matching the second standard text. Alternatively or additionally, the model training system 106 may also generate a third training text by modifying the first training text, the third training text being marked as matching the first standard text.
At block 430, the model training system 106 utilizes at least the first training text, the second training text, the first standard text, and the second standard text, and trains the feature representation generating model 108 according to training objectives determined at least to enable a first feature representation generated by the feature representation generating model 108 for the first training text to be reconstructed as the first standard text and a second feature representation generated by the feature representation generating model 108 for the second training text to be reconstructed as the second standard text. Alternatively or additionally, the training target may also be determined to enable a third feature representation generated by the feature representation generation model 108 for a third training text to be reconstructed into the first standard text.
FIG. 5 illustrates a block diagram of the training of the feature representation generative model 108, according to some embodiments of the present disclosure. FIG. 6 illustrates a flow diagram of a process 600 of training the feature representation generative model 108, according to some embodiments of the disclosure. Process 600 will be described below in conjunction with fig. 5.
At block 610, the vectorized representation extraction module 304 receives the training text 502 and determines the vectorized representation 504 to which the training text 502 corresponds. The training text 502 may include a first training text, a second training text, or optionally a third training text.
At block 620, the feature representation generation module 312 takes the vectorized representation 504 as input to generate a training feature representation 506 corresponding to the vectorized representation 504.
At block 630, reconstruction module 316 generates reconstructed text 508 corresponding to training text 502 from training feature representations 506.
At block 640, the parameter update module 206 generates the updated set of parameters 208 of the feature representation generation model 108 by reducing the difference between the reconstructed text 508 and the standard text 510 that matches the training text 502. The standard text 510 may include the first standard text, which matches the first training text and the optional third training text, and the second standard text, which matches the second training text. A sketch of one possible parameter update step follows.
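One way to realize block 640, assuming the autoencoder sketch given earlier and a tokenizer that converts the matching standard text 510 into fixed-length token ids, is a cross-entropy training step such as the following; the optimizer, learning rate, and vocabulary size are placeholders, not values from the description.

```python
import torch
from torch import nn

# Assumes the FeatureRepresentationAutoEncoder sketch above; the vocabulary size is an assumption.
model = FeatureRepresentationAutoEncoder(vocab_size=21128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def training_step(vectorized_batch: torch.Tensor, standard_token_ids: torch.Tensor) -> float:
    """One parameter update (module 206): reduce the gap between reconstruction and standard text.

    vectorized_batch: (batch, input_dim) merged vectorized representations of the training texts.
    standard_token_ids: (batch, max_len) token ids of the matching standard texts.
    """
    _, logits = model(vectorized_batch)                            # (batch, max_len, vocab)
    loss = criterion(logits.transpose(1, 2), standard_token_ids)   # (batch, vocab, max_len) vs (batch, max_len)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```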
In one embodiment, the number and proportion of the first training texts, the second training texts, and the optional third training texts included in the training texts 502 may be preset and may be changed.
Fig. 7 shows a schematic diagram of a process 700 of determining reconstructed text according to some embodiments of the disclosure. Fig. 7 shows three exemplary text processes 702, 704, and 706.
In the text processing process 702, the text "acute necrotizing pancreatitis, severe" from the standard text library 102 may first be pre-processed in the pre-processing stage 710 to "acute necrotizing pancreatitis severe" to be used as the standard text; it then passes through the text modification stage 720, where it is modified (for example, by deleting or reordering characters) to be used as the training text; a training feature representation corresponding to the modified training text is generated in the feature representation generation stage 730; and finally a reconstructed text corresponding to the training text is generated in the reconstruction stage 740. The training goal of the feature representation generation model 108 is for the reconstructed text ultimately obtained by the text processing process 702 to be the standard text "acute necrotizing pancreatitis severe".
Alternatively or additionally, in the text processing process 702, the pre-processed text "acute necrotizing pancreatitis severe" may not be modified in the text modification stage 720, so that the text "acute necrotizing pancreatitis severe" itself is used as the training text of the feature representation generation model 108. The training goal of the feature representation generation model 108 is then for the feature representation generated for the training text "acute necrotizing pancreatitis severe" to be reconstructable into the standard text "acute necrotizing pancreatitis severe". In the text processing process 702, the standard text may, at random (e.g., with a certain probability), be modified or left unmodified to generate the training text. The number and proportion of standard texts that are modified before being used as training texts and of standard texts that are used directly as training texts may be preset and may be changed.
In the text processing process 704, the text "lung cancer (malignant)" from the documents 104 may first be pre-processed in the pre-processing stage 710 to "lung cancer malignant"; it then passes through the text modification stage 720, where it is modified (for example, by replacing a character with a homophone) to be used as the training text; a training feature representation corresponding to the training text is generated in the feature representation generation stage 730; and finally a reconstructed text corresponding to the training text is generated in the reconstruction stage 740. The training goal of the feature representation generation model 108 is for the reconstructed text ultimately obtained by the text processing process 704 to be the standard text "lung malignancy", which is labeled in the standard text library 102 as matching the training text "lung cancer malignant".
In the text processing process 706, the text "lung cancer (malignant)" from the documents 104 may first be pre-processed in the pre-processing stage 710 to "lung cancer malignant" to be used as the training text, without being modified in the text modification stage 720; a training feature representation corresponding to the training text "lung cancer malignant" is then generated in the feature representation generation stage 730; and finally a reconstructed text corresponding to the training text "lung cancer malignant" is generated in the reconstruction stage 740. The training goal of the feature representation generation model 108 is for the reconstructed text ultimately obtained by the text processing process 706 to be the standard text "lung malignancy", which is labeled in the standard text library 102 as matching the training text "lung cancer malignant".
In one embodiment, the number and proportion of texts undergoing the text modification process during the text modification stage 720 may be preset and may be changed. Through the text modification processing, the number of training samples for training the feature representation generation model 108 can be increased, the efficiency of acquiring the training samples can be improved, and meanwhile, the diversity of the training samples can be enriched, so that the capability of reconstructing different input texts by the feature representation generation model 108 can be improved.
It should be appreciated that in the text processing process 702 the feature representation generation model 108 is trained in an unsupervised or self-supervised manner, while in the text processing processes 704 and 706 it is trained in a supervised manner. In one embodiment, the number and proportion of the text processing processes 702, 704, and 706 may be preset and may be changed. The model obtained by this hybrid training scheme can determine a feature representation of the input text that is closer to the feature representations of texts with the same semantics and more distinct from the feature representations of texts with different semantics, so that the feature representation of the text is more accurate in semantic discrimination.
Through the training process discussed with reference to FIGS. 2-7, the trained feature representation generation model 108 may be used to implement the normalization of target text and the like. Since the feature representation generation model 108 has learned during the training phase how to extract good feature representations, it can exhibit very good performance in the normalization of specific target texts.
Example implementation of model application
Fig. 8 illustrates a block diagram of determination of standard text in accordance with some embodiments of the present disclosure. Fig. 9 illustrates a flow diagram of a process 900 of determining standard text according to some embodiments of the present disclosure. Process 900 may be implemented by model application system 112. The model application system 112 may include a trained feature representation generation model 108 and a text similarity determination model 116. The feature representation generative model 108 used in the model application system 112 may be implemented as the feature representation generative model 108 described with reference to FIG. 3. It should be appreciated that in the feature representation generation model 108 used in the model application system 112, the reconstruction module 316 may be omitted and the feature representation generation model 108 may be used to generate a corresponding feature representation of the input text (e.g., the target text 110 or standard text from the standard text repository 114). The text similarity determination model 116 may include a representation similarity score calculation module 804 and a text similarity score calculation module 812. In one embodiment, the text similarity determination model 116 may also include a confidence model 816.
At block 910, the model application system 112 uses the trained feature representation generation model 108 to determine a target feature representation 802 corresponding to the target text 110.
At block 920, the model application system 112 obtains a plurality of standard feature representations 806 corresponding to the plurality of standard texts in the standard text repository 114. The plurality of standard feature representations 806 may be determined by the trained feature representation generation model 108. In one embodiment, after the feature representation generation model 108 is trained, the plurality of standard texts in the standard text repository 114 may be input into the feature representation generation model 108 to obtain the plurality of standard feature representations 806 corresponding to the plurality of standard texts. The plurality of standard feature representations 806 may be stored in a memory of the model application system 112 or another computing device and reused during the process 900, thereby increasing the computational efficiency of the model application process.
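A sketch of this precompute-and-reuse step is given below; the callable `encode_fn`, which stands for the trained feature representation generation model applied to a standard text, and the on-disk cache file name are assumptions.

```python
import numpy as np

def precompute_standard_features(standard_texts, encode_fn):
    """Run the trained model once over the standard text library and cache the results.

    `encode_fn` is an assumed callable covering preprocessing, vectorization, and the
    encoder, returning a 1-D feature vector per standard text.
    """
    features = np.stack([encode_fn(t) for t in standard_texts])   # (num_standards, feature_dim)
    np.save("standard_features.npy", features)                    # reused across queries
    return features
```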
At block 930, the model application system 112 determines a plurality of representation similarity scores 808 between the target feature representation 802 and the plurality of standard feature representations 806 using the representation similarity score calculation module 804. A representation similarity score may be used to measure the similarity or correlation between the target feature representation 802 and one of the standard feature representations 806. For example, a higher representation similarity score indicates a higher similarity or correlation between the target feature representation 802 and the standard feature representation 806.
In one embodiment, since the feature representations may be considered multi-dimensional vectors, a representation similarity score between feature representations may be determined by calculating the cosine similarity between the target feature representation 802 and the plurality of standard feature representations 806. Of course, other metrics that characterize the distance or difference between vectors may also be used to calculate the representation similarity score. Because the feature representation generation model 108 is trained such that the generated feature representations better characterize the semantics of the input text, the representation similarity score 808 can more accurately measure the similarity between the target feature representation and the corresponding standard feature representation.
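Concretely, the cosine-similarity computation, together with the selection of a candidate standard text list described below, could look like the following sketch; the top-k cutoff and the threshold are illustrative parameters, not values given in the description.

```python
import numpy as np

def representation_similarity_scores(target_feature: np.ndarray,
                                     standard_features: np.ndarray) -> np.ndarray:
    """Cosine similarity between the target feature and every standard feature."""
    target = target_feature / np.linalg.norm(target_feature)
    standards = standard_features / np.linalg.norm(standard_features, axis=1, keepdims=True)
    return standards @ target                       # (num_standards,)

def top_k_candidates(scores: np.ndarray, standard_texts, k: int = 5, threshold: float = 0.0):
    """Candidate standard texts with the highest representation similarity scores."""
    order = np.argsort(scores)[::-1][:k]
    return [(standard_texts[i], float(scores[i])) for i in order if scores[i] >= threshold]
```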
At block 940, the model application system 112 determines the standard text 118 of the plurality of standard texts that matches the target text based at least on the plurality of representation similarity scores 808.
In one embodiment, the model application system 112 may select a candidate standard text list 810 for the target text 110 from the plurality of standard texts based on the plurality of representation similarity scores 808. The candidate standard text list 810 may include a plurality of candidate standard texts. For example, for the target text "necrotizing enterocolitis," the model application system 112 may determine, based on a plurality of representation similarity scores between the feature representation of the target text "necrotizing enterocolitis" and the plurality of standard feature representations, a list of candidate standard texts with higher representation similarity scores, for example: "acute necrotizing enterocolitis", "neonatal necrotizing enterocolitis", "acute hemorrhagic necrotizing enterocolitis", "acute enterocolitis" and "ulcerative enterocolitis". It will be appreciated that the number of candidate standard texts in the candidate standard text list may be preset or adjusted according to the requirements of a specific application, and the candidates may also be filtered according to whether their representation similarity scores reach a threshold value.
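One possible way to assemble such a candidate list is a top-k ranking by representation similarity score with an optional score threshold; the top_k and threshold parameters below are illustrative and are not fixed by this disclosure:

```python
import numpy as np

def select_candidates(scores, standard_texts, top_k=5, threshold=None):
    """Build the candidate standard text list 810 by keeping the top_k
    standard texts with the highest representation similarity scores 808,
    optionally dropping any whose score is below a threshold."""
    order = np.argsort(scores)[::-1][:top_k]  # indices of highest scores first
    candidates = [(standard_texts[i], float(scores[i])) for i in order]
    if threshold is not None:
        candidates = [(text, s) for text, s in candidates if s >= threshold]
    return candidates
```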
The text similarity score calculation module 812 may determine a plurality of text similarity scores 814 between the target text 110 and the plurality of candidate standard texts. The model application system 112 may select the standard text 118 matching the target text 110 from the plurality of candidate standard texts based on the plurality of text similarity scores 814. The standard text 118 may include one or more standard texts. The number of standard texts included in the standard text 118 may be preset or adjusted according to the requirements of a specific application, and the candidate standard texts may also be filtered according to whether their text similarity scores reach a threshold value.
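The disclosure does not prescribe a particular surface-level metric for the text similarity score calculation module 812; purely as an assumption, the sketch below uses the character-level matching ratio from Python's standard difflib module:

```python
from difflib import SequenceMatcher

def text_similarity_scores(target_text, candidate_texts):
    """Text similarity scores 814 between the target text and each
    candidate standard text, computed here with a character-level
    matching ratio; this is only one possible surface-level metric."""
    return [SequenceMatcher(None, target_text, candidate).ratio()
            for candidate in candidate_texts]
```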
In one embodiment, the text similarity determination model 116 may also include a confidence model 816. The confidence model 816 may determine a plurality of confidence scores 818 between the target text 110 and the plurality of candidate standard texts based on the plurality of representation similarity scores 808 and the plurality of text similarity scores 814. In one embodiment, the confidence model 816 may perform a weighted summation of the plurality of representation similarity scores 808 and the plurality of text similarity scores 814 according to pre-trained parameters to determine a plurality of confidence scores 818. The model application system 112 may select the standard text 118 that matches the target text 110 from the plurality of candidate standard texts based on the plurality of confidence scores 818.
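A minimal sketch of the weighted combination performed by the confidence model 816 is shown below; the weights w_rep and w_text stand in for the pre-trained parameters mentioned above, and their values here are purely illustrative:

```python
import numpy as np

def confidence_scores(rep_scores, text_scores, w_rep=0.6, w_text=0.4):
    """Confidence scores 818 as a weighted sum of representation similarity
    scores 808 and text similarity scores 814 for the candidate list; the
    weights are illustrative placeholders for pre-trained parameters."""
    return (w_rep * np.asarray(rep_scores, dtype=float)
            + w_text * np.asarray(text_scores, dtype=float))

def best_matches(candidate_texts, rep_scores, text_scores, top_n=1):
    """Rank the candidate standard texts by confidence score and return
    the top_n matches as the standard text 118."""
    conf = confidence_scores(rep_scores, text_scores)
    order = np.argsort(conf)[::-1][:top_n]
    return [(candidate_texts[i], float(conf[i])) for i in order]
```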
The model application system 112 can normalize text in a pre-structured manner or a post-structured manner. For example, in the medical field, in the pre-structured manner, a physician enters target text into the model application system 112 after making a diagnosis, and the model application system 112 may provide the physician, in real time, with a list of the best-matching standard texts for the physician to select from based on the confidence scores. The standard texts in the standard text list may be ordered according to their confidence scores. Alternatively or additionally, the model application system 112 may also present the standard text list to the physician together with the corresponding confidence scores. When the correct standard text is not included in the standard text list, the physician may provide feedback to the model application system 112 to instruct the feature representation generation model 108 to learn the mapping between the target text and the correct standard text. In the post-structured manner, the model application system 112 can automatically convert medical text in an electronic medical record (e.g., a diagnostic report) into standard text in a standard term library (e.g., the international disease classification code ICD-10) based on the confidence scores.
Example apparatus
FIG. 10 illustrates a block diagram of a computing device 1000 in which one or more embodiments of the disclosure may be implemented. All or a portion of the components of model training system 106 and model application system 112 of FIG. 1 may be implemented in device 1000.
As shown, device 1000 includes a processing unit 1002 that can perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 1004 or loaded from a storage unit 1016 into a Random Access Memory (RAM) 1006. In the RAM 1006, various programs and data required for the operation of the device 1000 may also be stored. The processing unit 1002, the ROM 1004, and the RAM 1006 are connected to each other by a bus 1008. An input/output (I/O) interface 1010 is also connected to bus 1008.
Various components in device 1000 are connected to I/O interface 1010, including: an input unit 1012 such as a keyboard, a mouse, and the like; an output unit 1014 such as various types of displays, speakers, and the like; a storage unit 1016 such as a magnetic disk, optical disk, or the like; and a communication unit 1018 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1018 allows the device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The processing unit 1002 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processing unit 1002 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. Processing unit 1002 may perform the various methods and processes described above, such as process 400, process 600, and/or process 900. For example, in some embodiments, process 400, process 600, and/or process 900 may be implemented as a computer software program tangibly embodied on a computer-readable medium, such as storage unit 1016. In some embodiments, some or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1004 and/or communications unit 1018. When the computer programs are loaded into RAM 1006 and executed by processing unit 1002, one or more steps of process 400, process 600, and/or process 900 described above may be performed. Alternatively, in other embodiments, processing unit 1002 may be configured to perform process 400, process 600, and/or process 900 in any other suitable manner (e.g., by way of firmware).
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which computer-executable instructions or a program are stored, wherein the computer-executable instructions or the program are executed by a processor to implement the above-described method or function. The computer-readable storage medium may include a non-transitory computer-readable medium. According to an exemplary implementation of the present disclosure, there is also provided a computer program product comprising computer executable instructions or a program, which are executed by a processor to implement the above described method or function. The computer program product may be tangibly embodied on a non-transitory computer-readable medium.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices and computer program products implemented in accordance with the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-executable instructions or programs.
In the context of this disclosure, a computer-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable medium may be a machine readable signal medium or a machine readable storage medium. A computer readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (22)

1. A text processing method, comprising:
obtaining a first training text and a first standard text marked as matching with the first training text in a standard text base, wherein the standard text base comprises a plurality of standard texts used in the knowledge field;
generating a second training text by modifying a second standard text in the standard text library, the second training text being marked as matching the second standard text; and
training a model configured to generate feature representations of text with the first training text, the second training text, the first standard text, and the second standard text, and in accordance with training objectives determined at least to enable a first feature representation generated by the model for the first training text to be reconstructed as the first standard text and a second feature representation generated by the model for the second training text to be reconstructed as the second standard text.
2. The method of claim 1, wherein modifying the second standard text in the standard text corpus to generate second training text comprises modifying the second standard text by at least one of:
deleting at least one character, word or phrase in the second standard text;
replacing at least one character in the second standard text with a character having the same or similar pronunciation;
replacing words in the second standard text with words having the same root word; and
changing the order of characters, words or phrases in the second standard text.
3. The method of claim 1, further comprising:
generating a third training text by modifying the first training text, the third training text being marked as matching the first standard text; and
the model is also trained using the third training text and the first standard text, and in accordance with the training objectives, which are also determined to enable a third feature representation generated by the model for the third training text to be reconstructed as the first standard text.
4. The method of claim 1, further comprising:
performing pre-processing on the first training text, the first standard text, and the second standard text to format the first training text, the first standard text, and the second standard text, wherein the second training text is generated based on the pre-processed second standard text.
5. The method of claim 1, wherein training the model comprises: for each of the first training text and the second training text,
determining a vectorization representation corresponding to the training text;
generating a training feature representation corresponding to the training text by applying the vectorization representation to the model;
generating a reconstructed text corresponding to the training text from the training feature representation; and
updating a set of parameters of the model to meet the training goal by reducing a difference between the reconstructed text and standard text that the training text matches.
6. The method of claim 5, wherein determining the vectorized representation to which the training text corresponds comprises:
extracting a plurality of single-dimensional vectorized representations of the training text in a plurality of dimensions; and
determining the vectorized representation by merging the plurality of single-dimensional vectorized representations.
7. The method of claim 6, wherein extracting the plurality of single-dimensional vectorized representations comprises extracting at least one of:
extracting a semantic vectorization representation corresponding to the training text in a semantic dimension;
extracting a plurality of unit vectorization representations corresponding to a plurality of text units included in the training text in a text dimension, wherein the plurality of text units include at least one of characters, words and phrases; and
extracting a pronunciation vectorization representation corresponding to all or part of the pronunciation of the training text in a pronunciation dimension.
8. A text processing method, comprising:
determining a target feature representation corresponding to a target text by using a model trained according to the method of any one of claims 1 to 7;
obtaining a plurality of standard feature representations corresponding to a plurality of standard texts in the standard text library;
determining a plurality of representation similarity scores between the target feature representation and the plurality of standard feature representations; and
determining a standard text of the plurality of standard texts matching the target text based on at least the plurality of representation similarity scores.
9. The method of claim 8, wherein determining standard text that matches the target text based at least on the plurality of representation similarity scores comprises:
selecting a plurality of candidate standard texts for the target text from the plurality of standard texts based on the plurality of representation similarity scores;
determining a plurality of text similarity scores between the target text and the plurality of candidate standard texts;
determining a plurality of confidence scores between the target text and the plurality of candidate standard texts based on the plurality of representation similarity scores and the plurality of text similarity scores; and
selecting the standard text matching the target text from the plurality of candidate standard texts based on the plurality of confidence scores.
10. The method of claim 8, wherein the plurality of standard feature representations are determined by the model.
11. An electronic device, comprising:
a processing unit; and
a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the apparatus to perform actions comprising:
obtaining a first training text and a first standard text marked as matching with the first training text in a standard text base, wherein the standard text base comprises a plurality of standard texts used in the knowledge field;
generating a second training text by modifying a second standard text in the standard text library, the second training text being marked as matching the second standard text; and
training a model configured to generate feature representations of text with the first training text, the second training text, the first standard text, and the second standard text, and in accordance with training objectives at least determined to enable a first feature representation generated by the model for the first training text to be reconstructed as the first standard text and a second feature representation generated by the model for the second training text to be reconstructed as the second standard text.
12. The device of claim 11, wherein modifying the second standard text in the standard text corpus to generate a second training text comprises modifying the second standard text by at least one of:
deleting at least one character, word or phrase in the second standard text;
replacing at least one character in the second standard text with a character having the same or similar pronunciation;
replacing words in the second standard text with words having the same root word; and
changing the order of characters, words or phrases in the second standard text.
13. The apparatus of claim 11, wherein the actions further comprise:
generating a third training text by modifying the first training text, the third training text being marked as matching the first standard text; and
the model is also trained using the third training text and the first standard text, and in accordance with the training objectives, which are also determined to enable a third feature representation generated by the model for the third training text to be reconstructed as the first standard text.
14. The apparatus of claim 11, wherein the actions further comprise:
performing pre-processing on the first training text, the first standard text, and the second standard text to format the first training text, the first standard text, and the second standard text, wherein the second training text is generated based on the pre-processed second standard text.
15. The apparatus of claim 11, wherein training the model comprises: for each of the first training text and the second training text,
determining a vectorization representation corresponding to the training text;
generating a training feature representation corresponding to the training text by applying the vectorization representation to the model;
generating a reconstructed text corresponding to the training text from the training feature representation; and
updating the set of parameters of the model to meet the training goal by reducing a difference between the reconstructed text and standard text that matches the training text.
16. The apparatus of claim 15, wherein determining the vectorized representation to which the training text corresponds comprises:
extracting a plurality of single-dimensional vectorized representations of the training text in a plurality of dimensions; and
determining the vectorized representation by merging the plurality of single-dimensional vectorized representations.
17. The apparatus of claim 16, wherein extracting the plurality of single-dimensional vectorized representations comprises extracting at least one of:
extracting a semantic vectorization representation corresponding to the training text in a semantic dimension;
extracting a plurality of unit vectorization representations corresponding to a plurality of text units included in the training text in a text dimension, wherein the plurality of text units include at least one of characters, words and phrases; and
extracting a pronunciation vectorization representation corresponding to all or part of the pronunciation of the training text in a pronunciation dimension.
18. An electronic device, comprising:
a processing unit; and
a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the apparatus to perform the actions of:
determining a target feature representation corresponding to a target text by using a model trained according to the method of any one of claims 1 to 7;
obtaining a plurality of standard feature representations corresponding to a plurality of standard texts in the standard text library;
determining a plurality of representation similarity scores between the target feature representation and the plurality of standard feature representations; and
determining a standard text of the plurality of standard texts matching the target text based on at least the plurality of representation similarity scores.
19. The apparatus as recited in claim 18, wherein determining standard text matching said target text based at least on said plurality of representation similarity scores comprises:
selecting a plurality of candidate standard texts for the target text from the plurality of standard texts based on the plurality of representation similarity scores;
determining a plurality of text similarity scores between the target text and the plurality of candidate standard texts;
determining a plurality of confidence scores between the target text and the plurality of candidate standard texts based on the plurality of representation similarity scores and the plurality of text similarity scores; and
selecting the standard text matching the target text from the plurality of candidate standard texts based on the plurality of confidence scores.
20. The apparatus as recited in claim 18, wherein said plurality of standard feature representations are determined by said model.
21. A computer-readable storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions are executed by a processor to implement the method of any one of claims 1 to 7 or the method of claim 8 or 10.
22. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7 or the method of claim 8 or 10.
CN202110872803.6A 2021-07-30 2021-07-30 Text processing method, apparatus, medium, and program product Pending CN115688735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110872803.6A CN115688735A (en) 2021-07-30 2021-07-30 Text processing method, apparatus, medium, and program product


Publications (1)

Publication Number Publication Date
CN115688735A true CN115688735A (en) 2023-02-03

Family

ID=85057927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110872803.6A Pending CN115688735A (en) 2021-07-30 2021-07-30 Text processing method, apparatus, medium, and program product

Country Status (1)

Country Link
CN (1) CN115688735A (en)

Similar Documents

Publication Publication Date Title
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
Banerjee et al. Radiology report annotation using intelligent word embeddings: Applied to multi-institutional chest CT cohort
CN109522552B (en) Normalization method and device of medical information, medium and electronic equipment
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
US11468989B2 (en) Machine-aided dialog system and medical condition inquiry apparatus and method
CN110287337A (en) The system and method for medicine synonym is obtained based on deep learning and knowledge mapping
CN110517767B (en) Auxiliary diagnosis method, auxiliary diagnosis device, electronic equipment and storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN112151183A (en) Entity identification method of Chinese electronic medical record based on Lattice LSTM model
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN113657105A (en) Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN112071429A (en) Medical automatic question-answering system construction method based on knowledge graph
CN114881006A (en) Medical text error correction method and device, storage medium and electronic equipment
CN115205880A (en) Medical image report generation method and device
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN116881470A (en) Method and device for generating question-answer pairs
Spinks et al. Justifying diagnosis decisions by deep neural networks
CN112749277B (en) Medical data processing method, device and storage medium
CN114220505A (en) Information extraction method of medical record data, terminal equipment and readable storage medium
CN117422074A (en) Method, device, equipment and medium for standardizing clinical information text
CN113705207A (en) Grammar error recognition method and device
CN116595189A (en) Zero sample relation triplet extraction method and system based on two stages
WO2023088278A1 (en) Method and apparatus for verifying authenticity of expression, and device and medium
US20220375576A1 (en) Apparatus and method for diagnosing a medical condition from a medical image

Legal Events

Date Code Title Description
PB01 Publication