CN111783458A - Method and device for detecting overlapping character errors - Google Patents
Method and device for detecting overlapping character errors
- Publication number
- CN111783458A CN111783458A CN202010842426.7A CN202010842426A CN111783458A CN 111783458 A CN111783458 A CN 111783458A CN 202010842426 A CN202010842426 A CN 202010842426A CN 111783458 A CN111783458 A CN 111783458A
- Authority
- CN
- China
- Prior art keywords
- word segmentation
- word
- sentence
- model
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
Embodiments of the present specification provide methods and apparatus for detecting doubled-character errors, i.e., erroneously repeated adjacent characters, in a sentence. In the method, word segmentation is performed on a sentence containing a doubled character. For sentences in which the two copies of the doubled character fall into adjacent segments, segmentation information of the segments in which the doubled character lies is acquired, the segmentation information including the part of speech and the pinyin of each segment. The acquired segmentation information is then used to detect doubled-character errors in the sentence.
Description
Technical Field
The embodiments of the present disclosure relate generally to the field of natural language processing, and more particularly to a method and apparatus for detecting doubled-character errors in a sentence.
Background
When sentences are analyzed, a doubled-character phenomenon is frequently observed: the same character appears in two adjacent positions of the same sentence. Such a doubling may be the result of erroneous repeated input, for example the doubled character in "billing according to the validly checked amount" is an error; but a doubling may also be correct, for example the "ba-ba" in "Alibaba Network Technology Co., Ltd.". In formal documents, an erroneously doubled character can leave a poor impression on the other party and may even create legal risk or legal disputes. For example, a doubled character in the amount of a contract clause such as "Party A shall pay Party B ten thousand yuan (tax included)" can render the clause incorrect and subsequently give rise to a legal dispute.
Disclosure of Invention
In view of the foregoing, embodiments of the present specification provide a method and apparatus for detecting doubled-character errors in a sentence. With this method and apparatus, doubled-character error detection uses the part of speech and the pinyin of the different segments in which the doubled character lies, which can improve both the efficiency and the accuracy of the detection.
According to an aspect of embodiments of the present specification, there is provided a method for detecting a doubled-character error in a sentence, including: performing word segmentation on a sentence containing a doubled character; when the two copies of the doubled character fall into adjacent segments, acquiring segmentation information of the segments in which the doubled character lies, the segmentation information including the part of speech and the pinyin of each segment; and detecting a doubled-character error in the sentence using the segmentation information.
Optionally, in one example of the above aspect, the segmentation information further includes the number of constituent characters of a segment.
Optionally, in one example of the above aspect, detecting a doubled-character error in the sentence using the segmentation information comprises: determining a model feature vector for a doubled-character discrimination model from the segmentation information; and providing the model feature vector to the doubled-character discrimination model to detect doubled-character errors in the sentence.
Optionally, in an example of the above aspect, determining the model feature vector of the doubled-character discrimination model from the segmentation information includes: determining, from the segmentation information, the part-of-speech agreement, the pinyin agreement and/or the number of constituent characters of the adjacent segments in which the doubled character lies; and generating the model feature vector of the doubled-character discrimination model from that part-of-speech agreement, pinyin agreement and/or number of constituent characters.
Optionally, in an example of the above aspect, performing word segmentation on the sentence containing the doubled character includes: segmenting the sentence using a text segmentation algorithm.
Optionally, in an example of the above aspect, the text segmentation algorithm includes: a dictionary-based segmentation algorithm; a statistics-based segmentation algorithm; a rule-based segmentation algorithm; a model-based segmentation algorithm; or a character-tagging-based segmentation algorithm.
Optionally, in one example of the above aspect, the method further comprises: determining the change in the perplexity score of the sentence before and after the doubled character is removed, wherein detecting a doubled-character error in the sentence using the segmentation information comprises: detecting the doubled-character error using the segmentation information together with the perplexity change value.
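The perplexity-change signal in this optional step can be sketched with a toy add-alpha-smoothed bigram character model. This is an illustrative assumption: the specification does not fix a particular language model, and the function names and smoothing scheme here are not from the patent.

```python
import math
from collections import Counter

def bigram_logprob(sentence, bigrams, unigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a character sequence."""
    lp = 0.0
    for a, b in zip(sentence, sentence[1:]):
        lp += math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab_size))
    return lp

def perplexity(sentence, bigrams, unigrams, vocab_size):
    """Per-bigram perplexity of the sentence under the smoothed model."""
    n = max(len(sentence) - 1, 1)
    return math.exp(-bigram_logprob(sentence, bigrams, unigrams, vocab_size) / n)

def perplexity_change(sentence, i, bigrams, unigrams, vocab_size):
    """Perplexity of the sentence minus the perplexity after dropping one copy
    of the doubled character at position i; a large positive change suggests
    the doubling is an error."""
    reduced = sentence[:i] + sentence[i + 1:]
    return (perplexity(sentence, bigrams, unigrams, vocab_size)
            - perplexity(reduced, bigrams, unigrams, vocab_size))
```

With counts built from a reference corpus such as `["abc", "abc", "abd"]`, `perplexity_change("abbc", 1, ...)` is positive, because removing one "b" restores a frequent sequence.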
Optionally, in an example of the above aspect, the doubled-character discrimination model includes one of the following models: a random forest model; a decision tree model; a gradient-boosted tree model; a neural network model; a support vector machine; and a perceptron.
Optionally, in one example of the above aspect, the method further comprises: dividing input text into sentences; and determining, from the divided sentences, the sentences that contain a doubled character.
According to another aspect of embodiments of the present specification, there is provided an apparatus for detecting a doubled-character error in a sentence, including: a word segmentation unit that performs word segmentation on a sentence containing a doubled character; a segmentation information acquisition unit that, when the two copies of the doubled character fall into adjacent segments, acquires segmentation information of the segments in which the doubled character lies, the segmentation information including the part of speech and the pinyin of each segment; and a doubled-character error detection unit that detects a doubled-character error in the sentence using the segmentation information.
Optionally, in an example of the above aspect, the doubled-character error detection unit includes: a model input determination module that determines a model feature vector for the doubled-character discrimination model from the segmentation information; and a doubled-character error detection module that provides the model feature vector to the doubled-character discrimination model to detect doubled-character errors in the sentence.
Optionally, in an example of the above aspect, the segmentation information further includes the number of constituent characters of a segment, and the model input determination module: determines, from the segmentation information, the part-of-speech agreement, the pinyin agreement and/or the number of constituent characters of the adjacent segments in which the doubled character lies; and generates the model feature vector of the doubled-character discrimination model from those values.
Optionally, in one example of the above aspect, the word segmentation unit segments the sentence containing the doubled character using a text segmentation algorithm.
Optionally, in an example of the above aspect, the apparatus further comprises: a perplexity change determination unit that determines the change in the sentence's perplexity score before and after the doubled character is removed, wherein the doubled-character error detection unit detects the doubled-character error in the sentence using the segmentation information and the perplexity change value.
Optionally, in an example of the above aspect, the apparatus further comprises: a sentence division unit that divides input text into sentences; and a doubled-character sentence determination unit that determines, from the divided sentences, the sentences that contain a doubled character.
According to another aspect of embodiments of the present specification, there is provided an electronic apparatus including: at least one processor, and a memory coupled with the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the doubled-character error detection method described above.
According to another aspect of embodiments of the present specification, there is provided a machine-readable storage medium storing executable instructions that, when executed, cause a machine to perform the doubled-character error detection method described above.
Drawings
A further understanding of the nature and advantages of the present disclosure may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.
FIG. 1 shows an example schematic of sentences containing doubled characters in accordance with an embodiment of the present specification.
FIG. 2 illustrates an example flow diagram of a method for detecting doubled-character errors in a sentence in accordance with an embodiment of the present description.
FIG. 3 shows an example schematic of input text according to an embodiment of the present description.
FIG. 4 illustrates an example schematic diagram of the segmentation information of a sentence according to an embodiment of the present specification.
FIG. 5 illustrates an example flow diagram of a training process for the doubled-character discrimination model in accordance with an embodiment of the present description.
FIG. 6 illustrates an example flow diagram of a model input determination process for the doubled-character discrimination model in accordance with embodiments of the present description.
FIG. 7 illustrates a block diagram of an apparatus for detecting doubled-character errors in a sentence in accordance with an embodiment of the present description.
FIG. 8 shows a block diagram of one example of the doubled-character error detection unit according to embodiments of the present description.
FIG. 9 shows a schematic diagram of an electronic device for implementing the doubled-character error detection process in accordance with an embodiment of the present specification.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and thereby implement the subject matter described herein, and are not intended to limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as needed. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may also be combined in other examples.
As used herein, the term "include" and its variants are open-ended terms meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other definitions, whether explicit or implicit, may be included below. The definition of a term is consistent throughout the specification unless the context clearly dictates otherwise.
Text sentence processing requires analyzing the text sentences. Such analysis shows that some sentences contain doubled characters, for example: "billing according to the validly checked amount", "Alibaba Network Technology Co., Ltd.", and so on. A doubling may be correct, or it may be caused by a text input error; in this specification, a doubling caused by a text input error is referred to as a "doubled-character error". In some cases, a doubled-character error in a text sentence can leave an unfavorable impression on the other party and may even create legal risk or legal disputes. For example, if a signed contract contains the erroneous clause "payment of fourteen ten thousand yuan (tax included) by Party A to Party B" with a doubled character, the contract clause is wrong and may subsequently give rise to a legal dispute.
In view of the above, embodiments of the present specification propose methods and apparatus for detecting doubled-character errors in a sentence. With this method and apparatus, a sentence containing a doubled character is segmented into words, the part of speech and the pinyin of the different segments in which the doubled character lies are extracted, and doubled-character error detection is performed using them, which can improve both the efficiency and the accuracy of the detection.
A method and apparatus for detecting doubled-character errors in a sentence according to embodiments of the present specification are described below with reference to the accompanying drawings.
FIG. 1 shows an example schematic of sentences containing doubled characters in accordance with an embodiment of the present specification.
FIG. 1 shows three example sentences containing doubled characters: the doubled "amount" character in sentence 1, "billed according to the validly checked amount"; the doubled "wan" (ten thousand) in sentence 2, "pay Party B ... yuan in full (tax included)"; and the doubled character in sentence 3, "all the target unit price amounts are the same".
It is noted that each example sentence shown in FIG. 1 contains only one doubling. In other examples of this specification, a single sentence may contain multiple doublings.
FIG. 2 illustrates an example flow diagram of a method 200 for detecting doubled-character errors in a sentence in accordance with an embodiment of the present description.
As shown in FIG. 2, at 210, sentences containing a doubled character are determined from the input text. In an example of the present specification, the input text may be entered by a user in real time, or may be acquired from another text processing system or apparatus, for example from a contract text database of a contract text storage device. FIG. 3 shows an example schematic of input text according to an embodiment of the present description.
In one example, after the input text is retrieved, it may be divided into sentences using sentence delimiters, examples of which may include, but are not limited to, periods, semicolons, question marks, and exclamation marks. For example, the input text in FIG. 3 is divided into three sentences: "all the target unit price amounts are the same", "billed according to the validly checked amount", and "payment of the fee is completed within one month of receipt into Party A's account".
Then, the sentences containing a doubled character are determined from the divided sentences: for each sentence, it is checked whether a doubled character is present, and sentences without one are discarded. Of the three sentences above, "payment of the fee is completed within one month of receipt into Party A's account" contains no doubled character and is therefore discarded, while "all the target unit price amounts are the same" and "billed according to the validly checked amount" are retained as sentences containing a doubled character.
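The division-and-filtering step just described can be sketched in a few lines of Python. The delimiter set and function names are illustrative assumptions, not the specification's:

```python
import re

# Periods, semicolons, question marks, and exclamation marks
# (both full-width and ASCII forms), as listed in the text.
SENTENCE_DELIMITERS = "[。；？！.;?!]"

def split_sentences(text):
    """Divide the input text into sentences on the delimiters above."""
    return [s for s in re.split(SENTENCE_DELIMITERS, text) if s.strip()]

def has_doubled_char(sentence):
    """True if the same character appears in two adjacent positions."""
    return any(a == b for a, b in zip(sentence, sentence[1:]))

def sentences_with_doubling(text):
    """Keep only the sentences that contain at least one doubled character."""
    return [s for s in split_sentences(text) if has_doubled_char(s)]
```

Sentences with no doubling (like the third sentence in the example) are dropped before any further processing.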
At 220, the sentence containing the doubled character is segmented into words. In one example, a text segmentation algorithm is used. Examples of text segmentation algorithms may include, but are not limited to: a dictionary-based segmentation algorithm; a statistics-based segmentation algorithm; a rule-based segmentation algorithm; a model-based segmentation algorithm; or a character-tagging-based segmentation algorithm.
In the present specification, the segmentation dictionary may be, for example, a custom dictionary built from a corpus. A statistics-based segmentation algorithm segments text according to the probability or frequency with which adjacent characters co-occur as words; examples include N-gram-based and hidden-Markov-model-based segmentation algorithms. A rule-based algorithm performs semantic and syntactic analysis of a sentence and uses the resulting syntactic and semantic information to segment it. A model-based algorithm may be, for example, one based on a trained text segmentation model.
The character-tagging-based segmentation algorithm is in essence a word-construction method: the segmentation task is treated as a tagging problem over the characters of a string. Each character occupies a definite word-formation position (lexeme) when a specific word is constructed; if each character is allowed at most four such positions, namely B (word begin), M (word middle), E (word end), and S (single-character word), then a segmentation result such as (1) below can be expressed directly in the character-tagged form (2):
(1) segmentation result: Shanghai / plans / N / this / century / end / to realize / per-capita / domestic / production / total value / five thousand dollars / . ;
(2) character-tagged form: Shang/B hai/E ji/B hua/E N/S ben/S shi/B ji/E mo/S shi/B xian/E ren/B jun/E guo/B nei/E sheng/B chan/E zong/B zhi/E wu/B qian/M mei/M yuan/E ./S
It is to be noted that, in the present specification, the term "character" is not limited to Chinese characters; it may also include foreign letters, Arabic numerals, punctuation marks, and the like.
Treating segmentation as a character-tagging problem allows dictionary words and out-of-vocabulary words (such as person, place, and organization names) to be handled uniformly: both are recognized by the same character-tagging process. The learning framework then needs neither special emphasis on dictionary-word information nor a dedicated out-of-vocabulary recognition module, which greatly simplifies the design of the text segmentation system. During character tagging, word-position features of every character are learned from predefined features to obtain a probability model; the word-position tags of the string to be segmented are then derived from how tightly adjacent characters bind together; and finally the segmentation result follows directly from the lexeme definitions.
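The four-position tagging scheme above can be illustrated with a small helper that converts a segmentation result into per-character tags. This is a sketch of the notation only, not of the probability model:

```python
def bmes_tags(segments):
    """Assign each character of each segment its word-formation position:
    B = word begin, M = word middle, E = word end, S = single-character word."""
    tags = []
    for seg in segments:
        if len(seg) == 1:
            tags.append((seg, "S"))
        else:
            tags.append((seg[0], "B"))
            tags.extend((ch, "M") for ch in seg[1:-1])
            tags.append((seg[-1], "E"))
    return tags
```

For a four-character segment, this produces the B M M E pattern seen at the end of example (2) above.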
At 230, for sentences in which the two copies of the doubled character fall into adjacent segments, segmentation information of the segments in which the doubled character lies is acquired, the segmentation information including the part of speech and the pinyin of each segment.
For example, after the sentence containing the doubled character has been segmented as described above, the segments in which the doubled character lies are extracted for each such sentence. If both copies of the doubled character fall inside the same segment, the sentence is discarded. For example, if a sentence contains several doublings, say of "amount" and of "of", and each doubling lies entirely within a single segment ("amount amount" inside one segment and "of of" inside another), the sentence is discarded; if at least one doubling spans two segments, the sentence is retained. The segmented sentence is then annotated with parts of speech and with pinyin, which yields the segmentation information of the segments in which the doubled character lies. Part-of-speech and pinyin annotation can be implemented with any suitable tagging method or algorithm in the art. In another example, the segmentation information may further include the number of constituent characters of each segment. FIG. 4 illustrates an example schematic diagram of the segmentation information of a sentence according to an embodiment of the present specification.
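The same-segment versus adjacent-segment check at this step can be sketched as follows; the function name and return convention are illustrative, not from the specification:

```python
def locate_cross_segment_doubling(segments, char):
    """Given a sentence's segments in order and a doubled character, return the
    (left, right) pair of adjacent segments that the doubling spans, or None
    when the doubling lies entirely inside one segment (in which case the
    sentence is discarded)."""
    for left, right in zip(segments, segments[1:]):
        if left.endswith(char) and right.startswith(char):
            return left, right
    return None
```

A `None` result means no doubling crosses a segment boundary, so the sentence carries no signal for the detector and is dropped.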
At 240, for sentences whose doubled character spans adjacent segments, the acquired segmentation information is used to detect doubled-character errors in the sentences. In one example, a doubled-character discrimination model is used. In one example of the present specification, the discrimination model may be any model capable of classification prediction, for example a machine learning model or a deep learning model. Examples of the doubled-character discrimination model may include, but are not limited to, one of the following: a random forest model; a decision tree model; a gradient-boosted tree model; a neural network model; a support vector machine; and a perceptron. In this case, the doubled-character discrimination model is trained in advance on a corpus.
FIG. 5 illustrates an example flow diagram of a training process for the doubled-character discrimination model in accordance with an embodiment of the present description.
As shown in FIG. 5, at 510, sentences containing a doubled character are determined from the corpus. In one example of the present specification, the corpus may consist of text sentences obtained from other text processing systems or devices and/or text sentences crawled from the web, for example text sentences obtained from a contract text database of a contract text store.
In one example, the corpus is first divided into sentences, for example using sentence delimiters such as periods, semicolons, question marks, and exclamation marks. Sentences containing a doubled character are then determined from the divided sentences: each sentence is checked for a doubled character, and sentences without one are discarded.
At 520, the sentences containing a doubled character are segmented, for example using one of the text segmentation algorithms listed above (dictionary-based, statistics-based, rule-based, model-based, or character-tagging-based). Furthermore, if both copies of the doubled character fall inside the same segment, the sentence is discarded. For example, if a sentence contains several doublings, say of "amount" and of "of", and each doubling lies entirely within a single segment, the sentence is discarded; if at least one doubling spans two segments, the sentence is retained.
At 530, for the sentences whose doubled character spans adjacent segments, the model feature vectors of the sentences are determined and labels are assigned.
In one example, the model feature vector is determined from the segmentation information of the segments in which the doubled character lies. For example, the part-of-speech agreement, the pinyin agreement and/or the number of constituent characters of the adjacent segments can be determined from the segmentation information, and the model feature vector of the doubled-character discrimination model is then generated from those values. In this case, the model feature vector may be a 4-dimensional vector {a1, a2, a3, a4}: a1 encodes part-of-speech agreement (1 if the parts of speech of the two segments agree, 0 otherwise); a2 encodes pinyin agreement (1 if the pinyins of the doubled characters agree, 0 otherwise); a3 encodes whether the first segment consists of a single character (1 if so, 0 otherwise); and a4 encodes the same for the second segment. Under this scheme, the feature vector of the doubled-character sentence "billed according to the validly checked amount" is [1, 1, 0, 1], and that of the doubled-character sentence "all the target unit price amounts are the same" is [0, 0, 0, 1].
Because a doubled-character error occurs in "billed according to the validly checked amount", that sentence is labeled "1"; because no doubled-character error occurs in "all the target unit price amounts are the same", that sentence is labeled "0". This completes the sentence labeling.
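The 4-dimensional feature vector {a1, a2, a3, a4} just described can be computed as below. The dict-based segment representation is an assumption for illustration; its keys simply mirror the segmentation information named in the text:

```python
def model_feature_vector(left, right):
    """left/right describe the adjacent segments containing the doubled character
    (keys: 'pos', 'pinyin' as a per-character list, 'length' in characters).
    a1: POS agreement, a2: pinyin agreement of the doubled characters,
    a3/a4: whether the first/second segment is a single character."""
    return [
        1 if left["pos"] == right["pos"] else 0,
        # The doubled character is the last of the left segment
        # and the first of the right segment.
        1 if left["pinyin"][-1] == right["pinyin"][0] else 0,
        1 if left["length"] == 1 else 0,
        1 if right["length"] == 1 else 0,
    ]
```

For a two-character left segment and a single-character right segment with matching parts of speech and pinyin, this yields [1, 1, 0, 1], as in the first worked example above.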
Then, at 540, the labeled sentences are used to train the doubled-character discrimination model. Specifically, each sentence's model feature vector is supplied as model input together with its label, and training proceeds until a training end condition is satisfied, thereby producing the trained doubled-character discrimination model.
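As a minimal, self-contained illustration of step 540, a perceptron (one of the candidate discrimination models listed above) can be trained on the labeled feature vectors. This toy training loop stands in for whichever model and training setup an implementation actually uses:

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Train a simple perceptron on feature vectors with 0/1 labels."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            if err:  # update weights only on misclassification
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b

def predict(w, b, x):
    """1 = doubled-character error predicted, 0 = doubling judged correct."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

Trained on the two labeled examples from the text ([1, 1, 0, 1] labeled 1 and [0, 0, 0, 1] labeled 0), the perceptron separates them after a couple of epochs.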
Returning to fig. 2, when overlapping-word error detection is performed using the overlapping-word discrimination model as described above, in one example, a model feature vector of the overlapping-word discrimination model is determined for each sentence from the word segmentation information of the sentence. The model feature vector may be determined as described above. The determined model feature vector is then supplied to the overlapping-word discrimination model for model discrimination, thereby realizing overlapping-word error detection for the sentence.
FIG. 6 illustrates an example flow diagram of a model input determination process for an overlapping-word discrimination model in accordance with embodiments of the present description.
As shown in fig. 6, at 610, the part-of-speech consistency, the pinyin consistency and/or the component word counts of the adjacent word segments in which the overlapping word is located are determined according to the word segmentation information. Then, at 620, the model feature vector of the overlapping-word discrimination model is generated according to the part-of-speech consistency, the pinyin consistency and/or the component word counts of the adjacent word segments in which the overlapping word is located.
A method for detecting an overlapping-word error in a sentence according to an embodiment of the present specification is described above with reference to fig. 1 to 6.
With this method, word segmentation processing is performed on a sentence containing an overlapping word, the word segmentation parts of speech and word segmentation pinyins of the different word segments in which the overlapping word is located are extracted, and the extracted parts of speech and pinyins are used for overlapping-word error detection, so that the efficiency and accuracy of overlapping-word error detection can be improved.
In addition, with this method, when word segmentation processing is performed on a sentence containing an overlapping word, the component word counts of the word segments in which the overlapping word is located are also extracted, and the extracted parts of speech, pinyins and component word counts are used together for overlapping-word error detection, so that the efficiency and accuracy of overlapping-word error detection can be further improved.
In addition, with this method, using a text word segmentation algorithm based on a word segmentation dictionary can improve the accuracy of text word segmentation. Moreover, using a random forest model as the classification model can improve both model training efficiency and model prediction efficiency.
In addition, because model training uses the part-of-speech consistency and pinyin consistency of the word segments as the model feature vector, training can be completed with a relatively small number of training samples (e.g., about 1000 pieces of data).
Further, optionally, in another example, the method may further include: determining a confusion score change value of the sentence before and after removing the overlapping word, and detecting an overlapping-word error in the sentence using the obtained word segmentation information and the confusion score change value.
In this specification, the term "perplexity (ppl)" is used to indicate whether a sentence is fluent and conforms to human speaking logic. Perplexity is typically characterized using a perplexity score: the more fluent the sentence, the lower the ppl score. The ppl score may be predicted using a language model.
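The specification does not fix a particular language model for scoring perplexity. As a toy illustration only, the following add-one-smoothed bigram model (a hypothetical stand-in; the corpus and sentences are invented) shows the property relied on here: a more fluent token sequence receives a lower ppl score.

```python
import math
from collections import Counter

# Tiny illustrative corpus for the toy bigram language model.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)  # vocabulary size for add-one smoothing

def ppl(tokens):
    """Perplexity of a token sequence under the add-one-smoothed bigram model."""
    log_p = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
        log_p += math.log(p)
    return math.exp(-log_p / (len(tokens) - 1))

fluent = "the cat sat on the mat".split()
odd = "the the cat cat sat on".split()
# The fluent sentence receives the lower perplexity score.
```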
For example, the ppl score may be computed by applying the language model to the original sentence and to the sentence with the overlapping word removed, respectively. If the ppl score becomes lower after removal, the probability of an overlapping-word error is considered to be high. Accordingly, the ppl score change value may be used as an additional model feature of the overlapping-word discrimination model, so that the dimension of the model feature vector increases from 4 to 5 compared with the previously described example. When the ppl score becomes lower, the corresponding dimension value is "1"; when the ppl score is unchanged or becomes higher, the corresponding dimension value is "0". The 5-dimensional model feature vector is then supplied to the overlapping-word discrimination model as its model input to perform model prediction.
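Extending the feature vector with the ppl change dimension can be sketched as follows. The function names and the concrete ppl values are illustrative only; any language model's perplexity scorer could supply the two scores.

```python
def ppl_feature(original_ppl, ppl_without_overlap):
    """Return 1 when removing the overlapping word lowers the ppl score
    (the sentence reads more fluently without it), otherwise 0."""
    return 1 if ppl_without_overlap < original_ppl else 0

def extend_features(base_vector, original_ppl, ppl_without_overlap):
    """Append the ppl change dimension to the 4-dimensional base vector."""
    return base_vector + [ppl_feature(original_ppl, ppl_without_overlap)]

# Hypothetical scores: removing the overlapping word lowers ppl,
# so the 5th dimension is 1.
vec5 = extend_features([1, 1, 0, 1], original_ppl=85.2, ppl_without_overlap=41.7)
```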
With this method, using the ppl score change value as an additional dimension for overlapping-word error discrimination can further improve the accuracy of overlapping-word error discrimination.
Fig. 7 shows a block diagram of an apparatus for detecting an overlapping-word error in a sentence (hereinafter referred to as "overlapping-word error detection apparatus") 700 according to an embodiment of the present specification. As shown in fig. 7, the overlapping-word error detection apparatus 700 includes a word segmentation processing unit 710, a word segmentation information acquisition unit 720, and an overlapping-word error detection unit 730.
The word segmentation processing unit 710 is configured to perform word segmentation processing on a sentence containing an overlapping word. The operation of the word segmentation processing unit 710 may refer to the operation of 220 described above with reference to fig. 2.
The word segmentation information acquisition unit 720 is configured to, when the characters of the overlapping word are located in adjacent word segments respectively, acquire the word segmentation information of the word segments in which the overlapping word is located, where the word segmentation information includes a word segmentation part of speech and a word segmentation pinyin. The operation of the word segmentation information acquisition unit 720 may refer to the operation of 230 described above with reference to fig. 2.
The overlapping-word error detection unit 730 is configured to detect an overlapping-word error in a sentence using the word segmentation information. The operation of the overlapping-word error detection unit 730 may refer to the operation of 240 described above with reference to fig. 2.
FIG. 8 shows a block diagram of one example of an overlapping-word error detection unit 800 according to embodiments of the present description. As shown in FIG. 8, the overlapping-word error detection unit 800 includes a model input determination module 810 and an overlapping-word error detection module 820.
The model input determination module 810 is configured to determine a model feature vector of the overlapping-word discrimination model from the word segmentation information. The operation of the model input determination module 810 may refer to the operation of 610 described above with reference to fig. 6.
The overlapping-word error detection module 820 is configured to provide the determined model feature vector to the overlapping-word discrimination model to detect overlapping-word errors in the sentence. The operation of the overlapping-word error detection module 820 may refer to the operation of 620 described above with reference to FIG. 6.
Further, optionally, in another example, the word segmentation information may further include the component word counts of the word segments. Accordingly, the model input determination module 810 determines the part-of-speech consistency, the pinyin consistency and/or the component word counts of the adjacent word segments in which the overlapping word is located according to the word segmentation information, and generates the model feature vector of the overlapping-word discrimination model accordingly.
Further, optionally, in another example, the word segmentation processing unit 710 may perform word segmentation processing on a sentence containing an overlapping word using a text word segmentation algorithm. Examples of the text word segmentation algorithm may include, but are not limited to: a text word segmentation algorithm based on a word segmentation dictionary; a text word segmentation algorithm based on statistics; a rule-based text word segmentation algorithm; a model-based text word segmentation algorithm; or a text word segmentation algorithm based on character labeling.
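Of the listed families, a dictionary-based text word segmentation algorithm can be sketched with forward maximum matching: greedily match the longest dictionary entry at each position, falling back to a single character when nothing matches. The dictionary and input string here are illustrative placeholders, not examples from the specification.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary entry (or a single character if no entry matches)."""
    segments, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                segments.append(candidate)
                i += length
                break
    return segments

# With dictionary entries "ab", "abc", "de", the string "abcde"
# is segmented by longest match first.
segments = fmm_segment("abcde", {"ab", "abc", "de"})
```

Forward maximum matching trades some ambiguity handling for simplicity; statistics- or model-based segmenters from the other listed families resolve crossing ambiguities that a greedy dictionary match cannot.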
Further, optionally, in another example, the overlapping-word error detection apparatus 700 may further include a confusion change determination unit (not shown). The confusion change determination unit is configured to determine a confusion score change value of the sentence before and after removing the overlapping word. Accordingly, the overlapping-word error detection unit 730 detects an overlapping-word error in the sentence using the word segmentation information and the confusion score change value.
Further, optionally, in another example, the overlapping-word error detection apparatus 700 may further include a sentence division unit (not shown) and an overlapping-word sentence determination unit (not shown). The sentence division unit is configured to perform sentence division on an input sentence. Then, the overlapping-word sentence determination unit determines the sentences containing an overlapping word from the divided sentences.
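These two optional units can be sketched as follows: split the input text on sentence-ending punctuation, then keep only sentences with an immediately repeated character as candidates for the discrimination model. The function names and the punctuation set are assumptions for this sketch; note that the repeated-character check only finds candidates, since legitimate reduplications also repeat characters and the discrimination model makes the final decision.

```python
import re

def split_sentences(text):
    """Divide input text on common Chinese/Western sentence-ending punctuation."""
    return [s for s in re.split(r"[。！？.!?]", text) if s]

def has_overlapping_char(sentence):
    """True when any character is immediately repeated."""
    return any(a == b for a, b in zip(sentence, sentence[1:]))

def candidate_sentences(text):
    """Sentences containing an overlapping character, to be passed on
    for word segmentation and model discrimination."""
    return [s for s in split_sentences(text) if has_overlapping_char(s)]
```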
The overlapping-word error detection method and the overlapping-word error detection apparatus according to the embodiments of the present specification are described above with reference to fig. 1 to 8. The overlapping-word error detection apparatus may be implemented in hardware, in software, or in a combination of hardware and software.
FIG. 9 shows a schematic diagram of an electronic device 900 for implementing a sentence overlapping-word error detection process in accordance with embodiments of the present specification. As shown in fig. 9, the electronic device 900 may include at least one processor 910, a storage (e.g., non-volatile storage) 920, a memory 930, and a communication interface 940, which are connected together via a bus 960. The at least one processor 910 executes at least one computer-readable instruction (i.e., an element described above as being implemented in software) stored or encoded in the memory.
In one embodiment, computer-executable instructions are stored in the memory that, when executed, cause the at least one processor 910 to: perform word segmentation processing on a sentence containing an overlapping word; when the characters of the overlapping word are located in adjacent word segments respectively, acquire word segmentation information of the word segments in which the overlapping word is located, where the word segmentation information includes a word segmentation part of speech and a word segmentation pinyin; and detect an overlapping-word error in the sentence using the word segmentation information.
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 910 to perform the various operations and functions described above in connection with fig. 1-8 in the various embodiments of the present description.
According to one embodiment, a program product, such as a machine-readable medium (e.g., a non-transitory machine-readable medium), is provided. The machine-readable medium may have instructions (i.e., elements described above as being implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with fig. 1-8 in the various embodiments of the present specification. Specifically, a system or apparatus may be provided with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, and a computer or processor of the system or apparatus reads out and executes the instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium can realize the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
Examples of the readable storage medium include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or from the cloud via a communications network.
It will be understood by those skilled in the art that various changes and modifications may be made in the above-disclosed embodiments without departing from the spirit of the invention. Accordingly, the scope of the invention should be determined from the following claims.
It should be noted that not all steps and units in the above flows and system structure diagrams are necessary, and some steps or units may be omitted according to actual needs. The execution order of the steps is not fixed, and can be determined as required. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by a plurality of physical entities, or some units may be implemented by some components in a plurality of independent devices.
In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may comprise permanently dedicated circuitry or logic (such as a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware units or processors may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The specific implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments but does not represent all embodiments that may be practiced or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous" over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (17)
1. A method for detecting an overlapping-word error in a sentence, comprising:
performing word segmentation processing on a sentence containing an overlapping word;
when the characters of the overlapping word are located in adjacent word segments respectively, acquiring word segmentation information of the word segments in which the overlapping word is located, wherein the word segmentation information comprises a word segmentation part of speech and a word segmentation pinyin; and
detecting an overlapping-word error in the sentence using the word segmentation information.
2. The method of claim 1, wherein the word segmentation information further comprises a component word count of a word segment.
3. The method of claim 1 or 2, wherein using the word segmentation information to detect an overlapping-word error in the sentence comprises:
determining a model feature vector of an overlapping-word discrimination model according to the word segmentation information; and
providing the model feature vector to the overlapping-word discrimination model to detect an overlapping-word error in the sentence.
4. The method of claim 3, wherein determining the model feature vector of the overlapping-word discrimination model according to the word segmentation information comprises:
determining the part-of-speech consistency, the pinyin consistency and/or the component word counts of the adjacent word segments in which the overlapping word is located, according to the word segmentation information; and
generating the model feature vector of the overlapping-word discrimination model according to the part-of-speech consistency, the pinyin consistency and/or the component word counts of the adjacent word segments in which the overlapping word is located.
5. The method of claim 1, wherein performing word segmentation processing on the sentence containing the overlapping word comprises:
performing word segmentation processing on the sentence containing the overlapping word by using a text word segmentation algorithm.
6. The method of claim 5, wherein the text word segmentation algorithm comprises:
a text word segmentation algorithm based on a word segmentation dictionary;
text word segmentation algorithm based on statistics;
a rule-based text word segmentation algorithm;
a model-based text word segmentation algorithm; or
a text word segmentation algorithm based on character labeling.
7. The method of claim 1, further comprising:
determining a confusion score change value of the sentence before and after removing the overlapping word,
wherein detecting an overlapping-word error in the sentence using the word segmentation information comprises:
detecting an overlapping-word error in the sentence using the word segmentation information and the confusion score change value.
8. The method of claim 3, wherein the overlapping-word discrimination model comprises one of:
a random forest model;
a decision tree model;
a gradient boosting tree model;
a neural network model;
a support vector machine;
a perceptron.
9. The method of claim 1, further comprising:
performing sentence division on an input sentence; and
determining the sentence containing the overlapping word from the divided sentences.
10. An apparatus for detecting an overlapping-word error in a sentence, comprising:
a word segmentation processing unit configured to perform word segmentation processing on a sentence containing an overlapping word;
a word segmentation information acquisition unit configured to, when the characters of the overlapping word are located in adjacent word segments respectively, acquire word segmentation information of the word segments in which the overlapping word is located, wherein the word segmentation information comprises a word segmentation part of speech and a word segmentation pinyin; and
an overlapping-word error detection unit configured to detect an overlapping-word error in the sentence using the word segmentation information.
11. The apparatus of claim 10, wherein the overlapping-word error detection unit comprises:
a model input determination module configured to determine a model feature vector of an overlapping-word discrimination model according to the word segmentation information; and
an overlapping-word error detection module configured to provide the model feature vector to the overlapping-word discrimination model to detect an overlapping-word error in the sentence.
12. The apparatus of claim 11, wherein the word segmentation information further comprises a component word count of a word segment, and the model input determination module is configured to:
determine the part-of-speech consistency, the pinyin consistency and/or the component word counts of the adjacent word segments in which the overlapping word is located, according to the word segmentation information; and
generate the model feature vector of the overlapping-word discrimination model according to the part-of-speech consistency, the pinyin consistency and/or the component word counts of the adjacent word segments in which the overlapping word is located.
13. The apparatus of claim 10, wherein the word segmentation processing unit performs word segmentation processing on the sentence containing the overlapping word by using a text word segmentation algorithm.
14. The apparatus of claim 10, further comprising:
a confusion change determination unit that determines a confusion score change value of the sentence before and after removing the overlapping word,
wherein the overlapping-word error detection unit detects an overlapping-word error in the sentence using the word segmentation information and the confusion score change value.
15. The apparatus of claim 10, further comprising:
a sentence division unit that performs sentence division on an input sentence; and
an overlapping-word sentence determination unit that determines a sentence containing an overlapping word from the divided sentences.
16. An electronic device, comprising:
at least one processor, and
a memory coupled with the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1-9.
17. A machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the method of any of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010842426.7A CN111783458A (en) | 2020-08-20 | 2020-08-20 | Method and device for detecting overlapping character errors |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111783458A true CN111783458A (en) | 2020-10-16 |
Family
ID=72762169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010842426.7A Pending CN111783458A (en) | 2020-08-20 | 2020-08-20 | Method and device for detecting overlapping character errors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111783458A (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050060150A1 (en) * | 2003-09-15 | 2005-03-17 | Microsoft Corporation | Unsupervised training for overlapping ambiguity resolution in word segmentation |
US20110196670A1 (en) * | 2010-02-09 | 2011-08-11 | Siemens Corporation | Indexing content at semantic level |
CN102324233A (en) * | 2011-08-03 | 2012-01-18 | 中国科学院计算技术研究所 | Method for automatically correcting identification error of repeated words in Chinese pronunciation identification |
CN104375986A (en) * | 2014-12-02 | 2015-02-25 | 江苏科技大学 | Automatic acquisition method of Chinese reduplication words |
CN105279149A (en) * | 2015-10-21 | 2016-01-27 | 上海应用技术学院 | Chinese text automatic correction method |
US9405743B1 (en) * | 2015-05-13 | 2016-08-02 | International Business Machines Corporation | Dynamic modeling of geospatial words in social media |
CN106527756A (en) * | 2016-10-26 | 2017-03-22 | 长沙军鸽软件有限公司 | Method and device for intelligently correcting input information |
CN106776549A (en) * | 2016-12-06 | 2017-05-31 | 桂林电子科技大学 | A kind of rule-based english composition syntax error correcting method |
CN108090045A (en) * | 2017-12-20 | 2018-05-29 | 珠海市君天电子科技有限公司 | A kind of method for building up of marking model, segmenting method and device |
CN111144100A (en) * | 2019-12-24 | 2020-05-12 | 五八有限公司 | Question text recognition method and device, electronic equipment and storage medium |
US20200193217A1 (en) * | 2017-02-27 | 2020-06-18 | Yutou Technology (Hangzhou) Co., Ltd. | Method for determining sentence similarity |
Non-Patent Citations (2)
Title |
---|
Zhang Chunxia, Hao Tianyong: "Research Status and Difficulties of Chinese Automatic Word Segmentation", Journal of System Simulation, no. 01, 20 January 2005 (2005-01-20) *
Gan Rong: "An Algorithm for Resolving Crossing Ambiguity in Chinese Word Segmentation", Journal of Xihua University (Natural Science Edition), no. 06, 20 November 2018 (2018-11-20) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||