CN117648923B - Chinese spelling error correction method suitable for medical context - Google Patents
- Publication number
- CN117648923B (application CN202410120343.5A / CN202410120343A)
- Authority
- CN
- China
- Legal status: Active (an assumption by Google Patents, not a legal conclusion; no legal analysis has been performed)
Classifications
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/10—Text processing; G06F40/166—Editing, e.g. inserting or deleting
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/205—Parsing; G06F40/216—Parsing using statistical methods
Abstract
The invention relates to the field of artificial intelligence, in particular to a Chinese spelling error correction method suitable for medical contexts. The method converts a sentence into a Chinese character label sequence and inputs it into a BERT pre-trained Chinese language model to obtain context information features, then applies a linear transformation so that the feature dimension is aligned with the vocabulary; the normalized confidence of the top k candidates at each position is calculated; the visual similarity and the phonetic similarity between each of the top k candidate characters and the input character are calculated and weighted to obtain the similarity; the similarity and the confidence are fused to obtain the comprehensive weight of the top k candidates at each position; and the character with the highest comprehensive weight at each position is taken as the corrected character. By modeling the visual similarity and the phonetic similarity of Chinese characters, the invention solves the problem of similar-character errors.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to a Chinese spelling error correction method suitable for medical contexts.
Background
With China's population growth and aging, the demand on medical staff has increased greatly, so doctors must spend more time seeing patients and cannot fully concentrate on other work such as writing medical records and issuing prescriptions. This heavy working pressure greatly raises the probability of spelling errors at work, causing information to be transmitted incorrectly and even leading to accidents. For example, a misspelled drug name may cause a patient to receive the wrong drug; a careless mistake in a disease name may cause misdiagnosis; an error in registering a surgical procedure can also severely affect the therapeutic effect. An automatic spelling error correction system can find wrongly written characters and propose corrections, helping medical staff reduce spelling errors, improving the accuracy of medical records, and saving doctors' time so that more energy can be devoted to treatment.
Some researchers have attempted spelling correction in the medical domain with deep learning methods. These methods build a neural network model that learns and understands complex patterns of language (context, grammatical structure, semantic meaning, and so on) from a large amount of training data. An encoder encodes the input misspelled text into a fixed-length vector that captures the important information in the text, and a decoder then generates the correctly spelled text from this vector. During training, the neural network model continuously adjusts its internal parameters by comparing the generated text with the ground-truth text, making the generated text increasingly similar to the correct text.
These neural network models rely on local context information for prediction, because their design makes it difficult to handle long-range dependencies; the model may not adequately understand phrases or sentences whose meaning is enriched by the context, and so cannot correct context-dependent errors. For example, when "hyperthyroidism" in a condition description is wrongly written as "hypothyroidism", the correct disease name could be inferred from the specific symptoms in the description, but the model cannot correct the spelling error because it does not sufficiently understand the relevance and semantics of the context information.
On the other hand, for spelling errors between visually or phonetically similar Chinese characters, for example "sinus rhythm" wrongly written as "sinus heart rate", where the characters for rhythm and rate have the same pronunciation and both "heart rate" and "heart rhythm" have their own real meanings, existing deep-learning neural network models have difficulty handling such complex nonlinear relationships, so the model may fail to make a correct prediction when facing spelling errors that are similar in form or pronunciation.
Disclosure of Invention
In order to solve the problems, the invention provides a Chinese spelling error correction method suitable for medical contexts.
The method comprises the following steps:
Step one, dividing the sentence to be corrected into units of single Chinese characters to obtain the characters x_1, x_2, …, x_n, the i-th character being x_i, 1 ≤ i ≤ n; mapping each character through the vocabulary to obtain a sequence (t_1, t_2, …, t_n); adding the [CLS] tag before the sequence and the [SEP] tag after the sequence to obtain the Chinese character label sequence T of the sentence to be corrected;
Step two, inputting the Chinese character label sequence T into the BERT pre-trained Chinese language model to obtain the context information feature H; converting the dimension of H so that each position is aligned with the vocabulary, obtaining the confidence prediction P;
Step three, defining the part of the confidence prediction P corresponding to the character x_i as the character confidence prediction p_i; sorting all values of p_i from large to small and taking the top k values as the candidate character probability set at the i-th position of the sentence to be corrected; normalizing this set, the normalized confidence of the j-th candidate character at the i-th position being c_{i,j};
Step four, calculating, based on the edit distance algorithm, the phonetic similarity a_{i,j} between the i-th character x_i of the sentence to be corrected and the j-th candidate character at the i-th position;
Step five, calculating, based on the edit distance algorithm, the visual similarity b_{i,j} between the i-th character x_i of the sentence to be corrected and the j-th candidate character at the i-th position;
Step six, calculating, based on the phonetic similarity a_{i,j} and the visual similarity b_{i,j}, the similarity s_{i,j} between x_i and the j-th candidate character at the i-th position; calculating, based on the similarity s_{i,j} and the normalized confidence c_{i,j}, the comprehensive weight w_{i,j} of the j-th candidate character at the i-th position; and obtaining from the comprehensive weights the corrected character x̂_i at the i-th position of the sentence to be corrected.
Further, in step two, inputting the Chinese character label sequence T into the BERT pre-trained Chinese language model to obtain the context information feature H specifically refers to:

H = BERT(T);

where BERT(·) represents the feature extraction operation of the BERT pre-trained Chinese language model.
Further, in step two, converting the dimension of the context information feature H to obtain the confidence prediction P specifically refers to:

P = Linear(H);

where Linear(·) represents the linear transformation operation, and each position of the confidence prediction P is a vector of length V, the vocabulary size.
Further, the normalized confidence c_{i,j} in step three is calculated as:

c_{i,j} = p_{i,j} / Σ_{m=1}^{k} p_{i,m};

where p_{i,m} represents the m-th value of the character confidence prediction p_i after sorting its vector from large to small.
Further, step four specifically includes: the pinyin sequence of each Chinese character is composed of its pinyin and a tone code; defining the pinyin sequence of the i-th character x_i of the sentence to be corrected as r_i, the phonetic similarity between x_i and the j-th candidate character at the i-th position is calculated based on the edit distance algorithm as:

a_{i,j} = 1 - d(r_i, r_{i,j}) / max(|r_i|, |r_{i,j}|);

where idx_{i,j} represents the vocabulary index of the j-th candidate character at the i-th position, dec(·) represents the decoding function converting a vocabulary index into the corresponding Chinese character, y_{i,j} = dec(idx_{i,j}) represents the j-th candidate character at the i-th position, r_{i,j} represents the pinyin sequence of y_{i,j}, d(·,·) represents the edit distance calculation function, |·| represents the sequence length (absolute value) operation, and max(·,·) represents the maximizing function.
Further, step five specifically includes: defining the ideographic description sequence of the i-th character x_i of the sentence to be corrected as g_i, the visual similarity between x_i and the j-th candidate character at the i-th position is calculated based on the edit distance algorithm as:

b_{i,j} = 1 - d(g_i, g_{i,j}) / max(|g_i|, |g_{i,j}|);

where idx_{i,j} represents the vocabulary index of the j-th candidate character at the i-th position, dec(·) represents the decoding function converting a vocabulary index into the corresponding Chinese character, y_{i,j} = dec(idx_{i,j}) represents the j-th candidate character at the i-th position, g_{i,j} represents the ideographic description sequence of y_{i,j}, d(·,·) represents the edit distance calculation function, |·| represents the sequence length (absolute value) operation, and max(·,·) represents the maximizing function.
Further, the ideographic description sequence specifically refers to:
Splitting each Chinese character by taking a single character as a unit to obtain an internal character forming part, and combining the split residual strokes and the nearest single character as an internal character forming part for the Chinese characters which cannot be completely split into the single characters;
Splitting each internal character forming part continuously according to the sequence of the Chinese character writing rule until individual strokes are obtained;
According to the splitting sequence, constructing an ideographic description tree of the Chinese characters with a tree structure, wherein the root node of the ideographic description tree is the structural information code for describing the relative positions of the internal character forming components obtained by the first splitting, the leaf nodes are the stroke codes of single strokes, and the middle nodes are the structural information codes for describing the relative positions of the internal character forming components or strokes;
the ideographic description sequence of the Chinese characters is a sequence obtained by traversing the ideographic description tree.
Further, traversing the ideographic description tree specifically refers to: traversing the ideographic description tree in preorder.
Further, step six specifically includes calculating the similarity s_{i,j} between the i-th character x_i of the sentence to be corrected and the j-th candidate character at the i-th position:

s_{i,j} = λ · a_{i,j} + (1 - λ) · b_{i,j};

where λ is the adjustment factor balancing the phonetic similarity and the visual similarity;

combining the similarity and the normalized confidence to obtain the comprehensive weight w_{i,j} of the j-th candidate character at the i-th position:

w_{i,j} = s_{i,j} · c_{i,j};

the corrected character x̂_i at the i-th position of the sentence to be corrected is then:

x̂_i = dec(idx(max_j w_{i,j}));

where max_j(·) represents the function selecting the maximum value among the candidates, idx(·) represents the function converting the selected comprehensive weight into its vocabulary index, and dec(·) represents the decoding function converting the vocabulary index into the corresponding Chinese character.
One or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:
The spelling error correction method provided by the invention is based on context confidence and Chinese character similarity. It introduces a BERT pre-trained Chinese language model, so that background knowledge acquired during pre-training is brought in and the input sentence is encoded on that basis; the current context features are thereby integrated, solving the problem that characters which are correct in themselves but unsuitable in context are hard to identify. At the same time, the character structure, namely the visual similarity and the phonetic similarity of Chinese characters, is modeled, helping the model recognize similar wrongly written characters and solving the problem of similar-character errors.
Drawings
FIG. 1 is a schematic diagram of two internal word forming components according to an embodiment of the present invention in a left-to-right relationship;
FIG. 2 is a schematic diagram of two internal word forming components according to an embodiment of the present invention in a top-down relationship;
FIG. 3 is a schematic diagram of three internal word forming components according to an embodiment of the present invention in a left-to-right relationship;
FIG. 4 is a schematic diagram of three internal word forming components according to an embodiment of the present invention in a top-down relationship;
FIG. 5 is a schematic diagram showing the outside-in relationship of two internal word forming components according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of two internal letter components with three sides surrounding and lower opening relationship according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of two internal letter components with three sides surrounding and upper opening relationship according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of two internal letter components with three-sided surrounding and right opening relationship according to an embodiment of the present invention;
FIG. 9 is a schematic diagram showing two inner letter components in a left-top to right-bottom two-sided surrounding relationship according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of two internal word forming components according to an embodiment of the present invention in a top right to bottom left two-sided surrounding relationship;
FIG. 11 is a schematic diagram showing two internal word forming parts according to an embodiment of the present invention in a surrounding relationship from bottom left to top right;
FIG. 12 is a schematic diagram of a partially overlapping relationship of two internal word forming components according to an embodiment of the present invention;
fig. 13 is a schematic diagram of the ideographic description tree of the Chinese character "由" according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the drawings and detailed embodiments, and before the technical solutions of the embodiments of the present invention are described in detail, the terms and terms involved will be explained, and in the present specification, the components with the same names or the same reference numerals represent similar or identical structures, and are only limited for illustrative purposes.
The invention corrects the sentence input by the user and outputs the most probable correction result. Specifically, the input sentence is first segmented into single Chinese characters, and special identifiers are added to obtain the Chinese character label sequence of the sentence to be corrected; the label sequence is input into the BERT pre-trained Chinese language model to obtain the context information features given by the model, and a linear transformation is applied so that their dimension is aligned with the vocabulary; the normalized confidence of the top k candidates at each position is calculated; the visual similarity and the phonetic similarity between each of the top k candidate characters and the input character are calculated and weighted to obtain the similarity of the top k candidates at each position; the similarity and the confidence are fused to obtain the comprehensive weight of the top k candidates at each position; and the character with the highest comprehensive weight at each position is taken as the corrected character.
The method provided by the invention specifically comprises the following steps:
1. word segmentation of sentences
In order for the BERT pre-trained Chinese language model to process sentences composed of characters, the sentences must first be segmented into individual characters.
The sentence to be corrected X is divided into units of single Chinese characters, and each character obtained by the division is mapped through the vocabulary to obtain the sequence (t_1, t_2, …, t_n), where n represents the number of Chinese characters in the sentence to be corrected and t_i (1 ≤ i ≤ n) represents the digital label obtained by mapping the i-th character x_i. The vocabulary is a table mapping Chinese characters to numbers; the number range of the vocabulary of the BERT pre-trained Chinese language model is 0-21127.
Based on the format requirement of the BERT pre-trained Chinese language model, the [CLS] tag is added before the sequence and the [SEP] tag after it, giving the Chinese character label sequence T = ([CLS], t_1, t_2, …, t_n, [SEP]) of the sentence to be corrected.
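As an illustrative sketch of this segmentation and tagging step (the toy vocabulary and the special-token ids 101 for [CLS] and 102 for [SEP] are assumptions standing in for the real BERT Chinese vocabulary, not part of the patent):

```python
# Sketch of step one: split a sentence into single characters, map them
# through a vocabulary, and wrap the sequence in [CLS]/[SEP] tags.
CLS_ID, SEP_ID = 101, 102  # assumed ids for the [CLS]/[SEP] special tokens

def to_label_sequence(sentence, vocab):
    chars = list(sentence)                # one Chinese character per unit
    ids = [vocab[ch] for ch in chars]     # vocabulary mapping to digital labels
    return [CLS_ID] + ids + [SEP_ID]

toy_vocab = {"医": 1, "生": 2}            # stand-in for the 0-21127 BERT vocab
print(to_label_sequence("医生", toy_vocab))   # [101, 1, 2, 102]
```

In practice the vocabulary and special-token ids would come from the model's own vocabulary file rather than a hand-written dictionary.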
2. Acquiring features containing contextual information
The BERT pre-trained Chinese language model can model contextual semantic information. Inputting the Chinese character label sequence T into the BERT pre-trained Chinese language model yields the context information feature H:

H = BERT(T);

where BERT(·) represents the feature extraction operation of the BERT pre-trained Chinese language model, and each position of H is a vector of dimension D, the output dimension of the BERT pre-trained Chinese language model.
In order to calculate, in subsequent steps, the probability of the possible Chinese characters at each position of the sentence to be corrected, the context information feature H is dimension-converted to obtain the confidence prediction P:

P = Linear(H);

where Linear(·) represents the linear transformation operation, and each position of the confidence prediction P is a vector of length V = 21128, the vocabulary size.
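A minimal numeric sketch of this dimension conversion follows; the toy sizes (D = 3, V = 4) and hand-set weights are assumptions for illustration, whereas the real layer maps the model's D-dimensional features to V = 21128 vocabulary confidences with learned weights:

```python
# Sketch of the linear dimension conversion: each D-dimensional context
# feature is mapped to a V-dimensional confidence vector, P = H·W + b.
def linear(H, W, b):
    # H: L x D features, W: D x V weights, b: length-V bias -> P: L x V
    return [[sum(h[d] * W[d][v] for d in range(len(h))) + b[v]
             for v in range(len(b))] for h in H]

H = [[1.0, 0.0, 2.0]]                       # one position, D = 3
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
b = [0.0, 0.0, 0.0, 0.5]
print(linear(H, W, b))                      # [[1.0, 0.0, 2.0, 2.5]]
```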
3. Calculating confidence
The part of the confidence prediction P corresponding to the i-th character x_i of the sentence to be corrected is defined as the character confidence prediction p_i, a vector of length V; each value in the vector represents the probability that the Chinese character corresponding to that entry of the vocabulary of the BERT pre-trained Chinese language model is the i-th character x_i.
All values in the vector p_i are sorted from large to small, and the top k values form a set taken as the candidate character probability set at the i-th position of the sentence to be corrected. The characters corresponding to this set are the k characters predicted by the BERT pre-trained Chinese language model as most likely to be the i-th character x_i, and the character corresponding to the j-th value of the set is the j-th candidate character at the i-th position. Normalizing the candidate character probability set, the normalized confidence c_{i,j} of the j-th candidate character at the i-th position of the sentence to be corrected is:

c_{i,j} = p_{i,j} / Σ_{m=1}^{k} p_{i,m};

where p_{i,m} represents the m-th value of the character confidence prediction p_i after sorting its vector from large to small.
When calculating the normalized confidence, the invention considers only the k largest values of the character confidence prediction p_i as candidates, not all values over the vocabulary. The purpose of this design is to widen the confidence gaps: the values of the top few candidates are relatively close to one another, and if normalization were performed over all vocabulary values, the computed normalized confidences of the candidates would be too close.
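The top-k selection and normalization described above can be sketched as follows; the confidence vector is made up for illustration:

```python
# Sketch of step three: keep only the k largest confidence values and
# normalize over those k candidates, so the gaps between them stay visible.
def topk_normalized_confidence(p, k):
    # p: confidence vector over the whole vocabulary for one position
    indexed = sorted(enumerate(p), key=lambda t: t[1], reverse=True)[:k]
    total = sum(v for _, v in indexed)
    return [(idx, v / total) for idx, v in indexed]   # (vocab index, c_ij)

p = [0.01, 0.40, 0.35, 0.02, 0.22]   # toy vocabulary of 5 entries
print(topk_normalized_confidence(p, 3))
```

With k = 3 the candidates are the vocabulary indices 1, 2 and 4, and their normalized confidences sum to 1 while preserving the relative gaps between them.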
4. Calculating speech similarity
The pronunciation of a Chinese character can be represented directly by its pinyin and tone. The invention judges the phonetic similarity of Chinese characters by converting them into their corresponding pinyin sequences; the pinyin sequence of each character is defined as the composition of its pinyin and a tone code. In this embodiment, the tone code is a digit. For example, the pinyin sequence of the Chinese character "医" ("medical") is "yi1", where "1" represents the code of the first tone. The pinyin sequence of the i-th character x_i of the sentence to be corrected is defined as r_i.
The phonetic similarity a_{i,j} between the i-th character x_i of the sentence to be corrected and the j-th candidate character at the i-th position is calculated based on the edit distance algorithm:

a_{i,j} = 1 - d(r_i, r_{i,j}) / max(|r_i|, |r_{i,j}|);

where idx_{i,j} represents the vocabulary index of the j-th candidate character at the i-th position, dec(·) represents the decoding function converting a vocabulary index into the corresponding Chinese character, y_{i,j} = dec(idx_{i,j}) represents the j-th candidate character at the i-th position, r_{i,j} represents the pinyin sequence of y_{i,j}, d(·,·) represents the edit distance calculation function, |·| represents the sequence length (absolute value) operation, and max(·,·) represents the maximizing function.
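A sketch of this normalized edit-distance similarity on pinyin sequences follows; the same function applies unchanged to the ideographic description sequences used for visual similarity in the next step. The pinyin strings here use a tone-number convention ("lv4") and are supplied by hand rather than by a pinyin lookup, which a library such as pypinyin could provide in practice:

```python
# Sketch of step four: similarity = 1 - edit_distance / max(sequence lengths).
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance (rolling row)
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def similarity(seq_a, seq_b):
    return 1 - edit_distance(seq_a, seq_b) / max(len(seq_a), len(seq_b))

# identical pinyin sequences give similarity 1.0; unrelated ones score low
print(similarity("lv4", "lv4"))    # 1.0
print(similarity("lv4", "shuai4"))
```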
5. Calculating visual similarity
The invention adopts ideographic character description sequence to represent the visual information of Chinese characters, wherein the ideographic character description sequence comprises a plurality of structural information codes for describing the relative positions of character forming components in the Chinese characters and Chinese character stroke codes, and the strokes are ordered according to the writing rule of the Chinese characters. Based on the Chinese character structure information and the Chinese character stroke information, each ideographic description sequence can be in one-to-one correspondence with the Chinese characters described by the ideographic description sequence.
The invention describes the relative positions of character-forming components in Chinese characters based on twelve structural information codes, thereby accurately representing the structural information of the Chinese characters. Fig. 1 is a schematic diagram of two internal character-forming components in a left-right relationship; Fig. 2, two components in a top-bottom relationship; Fig. 3, three components in a left-to-right relationship; Fig. 4, three components in a top-to-bottom relationship; Fig. 5, two components in an outside-in relationship; Fig. 6, two components in a three-sided surround open at the bottom; Fig. 7, two components in a three-sided surround open at the top; Fig. 8, two components in a three-sided surround open at the right; Fig. 9, two components in a two-sided surround from upper-left to lower-right; Fig. 10, two components in a two-sided surround from upper-right to lower-left; Fig. 11, two components in a two-sided surround from lower-left to upper-right; and Fig. 12, two components in a partially overlapping relationship.
And for Chinese characters which cannot be completely split into individual characters, combining the split residual strokes and the nearest individual character as an internal character forming part, wherein the nearest individual character refers to the individual character closest to the split residual strokes in writing positions. And continuing to split each internal character forming part according to the sequence of the Chinese character writing rule until separate strokes are obtained. According to the splitting sequence, an ideographic description tree of the Chinese characters with a tree structure is constructed, the root node of the ideographic description tree is the structural information code for describing the relative positions of the internal character forming components obtained by splitting for the first time, the leaf nodes are the stroke codes of single strokes, and the middle nodes are the structural information codes for describing the relative positions of the internal character forming components or strokes. The ideographic description sequence of the Chinese characters is a sequence obtained by traversing the ideographic description tree.
In this embodiment, the ideographic description tree is traversed in preorder. Fig. 13 shows the ideographic description tree of the Chinese character "由". Splitting "由" for the first time yields a first internal character-forming component "冂" and a second internal character-forming component "土", which are in the partial-overlap relation of Fig. 12, so the root node of the ideographic description tree of "由" is the structure code for the partial-overlap relation of Fig. 12; splitting the first internal component "冂" yields two strokes in the left-right relation of Fig. 1; splitting the second internal component "土" yields a third internal component "十" and the stroke "一" in the top-bottom relation of Fig. 2; splitting the third internal component "十" yields the strokes "一" and "丨" in the partial-overlap relation of Fig. 12. The character "由" is thus completely split into individual strokes, giving the ideographic description tree shown in Fig. 13; the upper part of Fig. 13 is the sequence obtained by preorder traversal of the tree.
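A sketch of the ideographic description tree of "由" and its preorder traversal follows; the structure-code names ("OVERLAP", "LEFT_RIGHT", "TOP_BOTTOM") and the stroke labels for "冂" are illustrative placeholders for the twelve structure codes of Figs. 1-12 and the patent's stroke codes:

```python
# Sketch: build the ideographic description tree of "由" and emit its
# preorder traversal as the ideographic description sequence.
class Node:
    def __init__(self, code, children=()):
        self.code, self.children = code, list(children)

def preorder(node):
    seq = [node.code]
    for child in node.children:
        seq += preorder(child)
    return seq

# root: partial overlap (Fig. 12) of the components 冂 and 土
tree = Node("OVERLAP", [
    Node("LEFT_RIGHT", [Node("丨"), Node("𠃌")]),                    # 冂
    Node("TOP_BOTTOM", [Node("OVERLAP", [Node("一"), Node("丨")]),   # 十
                        Node("一")]),                                # 土
])
print(preorder(tree))
```

Because the traversal visits each structure code before the components it relates, the resulting sequence corresponds one-to-one with the character it describes.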
Define the ideographic description sequence of the $i$-th Chinese character $x_i$ in the sentence to be corrected as $I_{x_i}$. Based on the edit distance algorithm, the visual similarity $VS_{i,j}$ between $x_i$ and the $j$-th candidate Chinese character at the $i$-th position is calculated as:

$$VS_{i,j} = 1 - \frac{ED(I_{x_i}, I_{c_{i,j}})}{\max(|I_{x_i}|, |I_{c_{i,j}}|)}$$

where $I_{c_{i,j}}$ denotes the ideographic description sequence of the $j$-th candidate Chinese character $c_{i,j}$ at the $i$-th position, $ED(\cdot,\cdot)$ is the edit distance function, $|\cdot|$ is the sequence length, and $\max(\cdot)$ takes the maximum.
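A minimal sketch of the edit-distance-based similarity, assuming the normalization $1 - ED/\max(\text{lengths})$ described above; the function names are illustrative, not from the patent:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, by dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def sequence_similarity(seq_a, seq_b):
    """Similarity in [0, 1] between two ideographic description (or pinyin)
    sequences: 1 - ED / max(len)."""
    if not seq_a and not seq_b:
        return 1.0
    return 1 - edit_distance(seq_a, seq_b) / max(len(seq_a), len(seq_b))
```

The same normalized-edit-distance form serves for the phonetic similarity of step four, applied to pinyin-plus-tone sequences instead of ideographic description sequences.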
6. Sentence correction
Calculate the similarity $S_{i,j}$ between the $i$-th Chinese character $x_i$ in the sentence to be corrected and the $j$-th candidate Chinese character at the $i$-th position:

$$S_{i,j} = \lambda \, PS_{i,j} + (1 - \lambda) \, VS_{i,j}$$

where $\lambda$ is an adjustment factor that balances the phonetic similarity and the visual similarity.
Combining the similarity with the normalized confidence gives the comprehensive weight $W_{i,j}$ of the $j$-th candidate Chinese character at the $i$-th position of the sentence to be corrected:

$$W_{i,j} = S_{i,j} \cdot \hat{p}_{i,j}$$

where $\hat{p}_{i,j}$ is the normalized confidence of that candidate.
The corrected Chinese character $y_i$ at the $i$-th position of the sentence to be corrected is then:

$$y_i = \mathrm{Decode}\big(\arg\max_{j} W_{i,j}\big)$$

where $\max$ selects the largest value in its argument, $\arg\max$ converts the comprehensive weights into the vocabulary index of the largest one, and $\mathrm{Decode}(\cdot)$ converts a vocabulary index into the corresponding Chinese character.
The corrected Chinese character $y_i$ is the character at the $i$-th position of the corrected sentence and may be the same as or different from the original character $x_i$: $y_i = x_i$ means the $i$-th position is left unmodified, while $y_i \neq x_i$ means the $i$-th position of the sentence to be corrected has been modified.
Concatenating the corrected Chinese characters $y_1, y_2, \dots, y_n$ in order yields the corrected sentence.
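The combination and selection steps above can be sketched as follows. This assumes the comprehensive weight is the product of similarity and normalized confidence, and the function names are illustrative, not the patent's:

```python
def combined_similarity(ps, vs, lam=0.5):
    """S = lam * PS + (1 - lam) * VS; lam is the adjustment factor
    balancing phonetic (ps) and visual (vs) similarity."""
    return lam * ps + (1 - lam) * vs

def correct_position(candidates):
    """candidates: list of (character, similarity S, normalized confidence).
    Returns the character whose comprehensive weight W = S * confidence
    is largest, i.e. the corrected character for this position."""
    return max(candidates, key=lambda c: c[1] * c[2])[0]
```

For example, a candidate with similarity 0.9 and confidence 0.6 (weight 0.54) beats one with similarity 0.5 and confidence 0.9 (weight 0.45); applying `correct_position` at every position and concatenating the results yields the corrected sentence.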
The above embodiments are merely illustrative of the preferred embodiments of the present invention and are not intended to limit the scope of the present invention, and various modifications and improvements made by those skilled in the art to the technical solution of the present invention should fall within the protection scope defined by the claims of the present invention without departing from the design spirit of the present invention.
Claims (9)
1. A method of Chinese spelling error correction for medical contexts, comprising the steps of:
step one, dividing the sentence to be corrected into units of single Chinese characters to obtain $n$ Chinese characters, the $i$-th Chinese character being $x_i$, $i \in \{1, 2, \dots, n\}$; mapping each Chinese character through a vocabulary to obtain a sequence, and adding the tag [CLS] before the sequence and the tag [SEP] after it, to obtain the Chinese character label sequence $T$ of the sentence to be corrected;
step two, inputting the Chinese character label sequence $T$ into a BERT pre-trained Chinese language model to obtain the context information features $H$, and converting the dimension of $H$ to the vocabulary size $|V|$ to obtain the confidence prediction $P$;
step three, defining the part of the confidence prediction $P$ corresponding to the Chinese character $x_i$ as the character confidence prediction $p_i$; selecting the $K$ largest values of $p_i$, from largest to smallest, as the probability set of candidate Chinese characters at the $i$-th position of the sentence to be corrected, and normalizing this set, the normalized confidence of the $j$-th candidate Chinese character at the $i$-th position being $\hat{p}_{i,j}$;
step four, calculating, based on the edit distance algorithm, the phonetic similarity $PS_{i,j}$ between the $i$-th Chinese character $x_i$ in the sentence to be corrected and the $j$-th candidate Chinese character at the $i$-th position;
step five, calculating, based on the edit distance algorithm, the visual similarity $VS_{i,j}$ between the $i$-th Chinese character $x_i$ in the sentence to be corrected and the $j$-th candidate Chinese character at the $i$-th position;
step six, calculating, from the phonetic similarity $PS_{i,j}$ and the visual similarity $VS_{i,j}$, the similarity $S_{i,j}$ between $x_i$ and the $j$-th candidate Chinese character at the $i$-th position; calculating, from the similarity $S_{i,j}$ and the normalized confidence $\hat{p}_{i,j}$, the comprehensive weight $W_{i,j}$ of the $j$-th candidate Chinese character at the $i$-th position; and obtaining, according to the comprehensive weight $W_{i,j}$, the corrected Chinese character $y_i$ at the $i$-th position of the sentence to be corrected.
2. The method of claim 1, wherein in step two, inputting the Chinese character label sequence $T$ into the BERT pre-trained Chinese language model to obtain the context information features $H$ specifically means:

$$H = \mathrm{BERT}(T)$$

where $\mathrm{BERT}(\cdot)$ denotes the feature extraction operation of the BERT pre-trained Chinese language model.
3. The method of claim 1, wherein in step two, converting the dimension of the context information features $H$ to obtain the confidence prediction $P$ specifically means:

$$P = \mathrm{Linear}(H)$$

where $\mathrm{Linear}(\cdot)$ denotes the linear transformation operation, and the dimension of the confidence prediction $P$ is $n \times |V|$.
4. The method of claim 1, wherein in step three the normalized confidence $\hat{p}_{i,j}$ is calculated as:

$$\hat{p}_{i,j} = \frac{p_{i,(j)}}{\sum_{k=1}^{K} p_{i,(k)}}$$

where $p_{i,(j)}$ denotes the $j$-th value of the character confidence prediction vector $p_i$ after its entries are ordered from largest to smallest, and $K$ is the number of candidates.
5. The method of claim 1, wherein step four specifically comprises: the pinyin sequence of each Chinese character is formed from the pinyin and tone codes of that character; define the pinyin sequence of the $i$-th Chinese character $x_i$ in the sentence to be corrected as $Y_{x_i}$; based on the edit distance algorithm, the phonetic similarity $PS_{i,j}$ between $x_i$ and the $j$-th candidate Chinese character at the $i$-th position is calculated as:

$$PS_{i,j} = 1 - \frac{ED(Y_{x_i}, Y_{c_{i,j}})}{\max(|Y_{x_i}|, |Y_{c_{i,j}}|)}, \qquad c_{i,j} = \mathrm{Decode}(idx_{i,j})$$

where $idx_{i,j}$ denotes the vocabulary index of the $j$-th candidate Chinese character at the $i$-th position, $\mathrm{Decode}(\cdot)$ is the decoding function converting a vocabulary index into the corresponding Chinese character, $c_{i,j}$ is the $j$-th candidate Chinese character at the $i$-th position, $Y_{c_{i,j}}$ is its pinyin sequence, $ED(\cdot,\cdot)$ is the edit distance calculation function, $|\cdot|$ denotes the length operation, and $\max(\cdot)$ is the maximum function.
6. The method of claim 1, wherein step five specifically comprises: define the ideographic description sequence of the $i$-th Chinese character $x_i$ in the sentence to be corrected as $I_{x_i}$; based on the edit distance algorithm, the visual similarity $VS_{i,j}$ between $x_i$ and the $j$-th candidate Chinese character at the $i$-th position is calculated as:

$$VS_{i,j} = 1 - \frac{ED(I_{x_i}, I_{c_{i,j}})}{\max(|I_{x_i}|, |I_{c_{i,j}}|)}, \qquad c_{i,j} = \mathrm{Decode}(idx_{i,j})$$

where $idx_{i,j}$ denotes the vocabulary index of the $j$-th candidate Chinese character at the $i$-th position, $\mathrm{Decode}(\cdot)$ is the decoding function converting a vocabulary index into the corresponding Chinese character, $c_{i,j}$ is the $j$-th candidate Chinese character at the $i$-th position, $I_{c_{i,j}}$ is its ideographic description sequence, $ED(\cdot,\cdot)$ is the edit distance calculation function, $|\cdot|$ denotes the length operation, and $\max(\cdot)$ is the maximum function.
7. The method of claim 6, wherein the ideographic description sequence is obtained as follows:

splitting each Chinese character into standalone component characters to obtain internal character-forming components, and, for a Chinese character that cannot be split completely into standalone characters, merging the leftover strokes with the nearest standalone character into one internal character-forming component;

continuing to split each internal character-forming component, in the order given by the rules of Chinese character writing, until individual strokes are obtained;

constructing, according to the splitting order, a tree-structured ideographic description tree of the Chinese character, in which the root node is the structure code describing the relative positions of the components produced by the first split, the leaf nodes are the stroke codes of individual strokes, and the intermediate nodes are structure codes describing the relative positions of components or strokes;

the ideographic description sequence of the Chinese character is the sequence obtained by traversing the ideographic description tree.
8. The method of claim 7, wherein traversing the ideographic description tree specifically means traversing the ideographic description tree in pre-order.
9. The method of claim 1, wherein step six specifically comprises: calculating the similarity $S_{i,j}$ between the $i$-th Chinese character $x_i$ in the sentence to be corrected and the $j$-th candidate Chinese character at the $i$-th position:

$$S_{i,j} = \lambda \, PS_{i,j} + (1 - \lambda) \, VS_{i,j}$$

where $\lambda$ is an adjustment factor balancing the phonetic similarity and the visual similarity;
combining the similarity with the normalized confidence to obtain the comprehensive weight $W_{i,j}$ of the $j$-th candidate Chinese character at the $i$-th position of the sentence to be corrected:

$$W_{i,j} = S_{i,j} \cdot \hat{p}_{i,j};$$
the corrected Chinese character $y_i$ at the $i$-th position of the sentence to be corrected is then:

$$y_i = \mathrm{Decode}\big(\arg\max_{j} W_{i,j}\big)$$

where $\max$ denotes taking the largest value in its argument, $\arg\max$ converts the comprehensive weights into the vocabulary index of the largest one, and $\mathrm{Decode}(\cdot)$ is the decoding function converting a vocabulary index into the corresponding Chinese character.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410120343.5A CN117648923B (en) | 2024-01-29 | 2024-01-29 | Chinese spelling error correction method suitable for medical context |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117648923A CN117648923A (en) | 2024-03-05 |
CN117648923B true CN117648923B (en) | 2024-05-10 |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111310443A (en) * | 2020-02-12 | 2020-06-19 | 新华智云科技有限公司 | Text error correction method and system |
CN112530597A (en) * | 2020-11-26 | 2021-03-19 | 山东健康医疗大数据有限公司 | Data table classification method, device and medium based on Bert character model |
CN113657098A (en) * | 2021-08-24 | 2021-11-16 | 平安科技(深圳)有限公司 | Text error correction method, device, equipment and storage medium |
CN113673228A (en) * | 2021-09-01 | 2021-11-19 | 阿里巴巴达摩院(杭州)科技有限公司 | Text error correction method, text error correction device, computer storage medium and computer program product |
CN113935317A (en) * | 2021-09-26 | 2022-01-14 | 平安科技(深圳)有限公司 | Text error correction method and device, electronic equipment and storage medium |
CN114881006A (en) * | 2022-03-30 | 2022-08-09 | 医渡云(北京)技术有限公司 | Medical text error correction method and device, storage medium and electronic equipment |
CN115081430A (en) * | 2022-05-24 | 2022-09-20 | 中国科学院自动化研究所 | Chinese spelling error detection and correction method and device, electronic equipment and storage medium |
CN115114919A (en) * | 2021-03-19 | 2022-09-27 | 富士通株式会社 | Method and device for presenting prompt information and storage medium |
CN115862040A (en) * | 2022-12-12 | 2023-03-28 | 杭州恒生聚源信息技术有限公司 | Text error correction method and device, computer equipment and readable storage medium |
CN116522905A (en) * | 2023-07-03 | 2023-08-01 | 腾讯科技(深圳)有限公司 | Text error correction method, apparatus, device, readable storage medium, and program product |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11593560B2 (en) * | 2020-10-21 | 2023-02-28 | Beijing Wodong Tianjun Information Technology Co., Ltd. | System and method for relation extraction with adaptive thresholding and localized context pooling |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||