CN111783458B - Method and device for detecting character overlapping errors - Google Patents
Method and device for detecting character overlapping errors
- Publication number
- Publication CN111783458B; application CN202010842426.7A
- Authority
- CN
- China
- Prior art keywords
- word
- word segmentation
- sentence
- overlapped
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 47
- 230000011218 segmentation Effects 0.000 claims abstract description 191
- 238000012545 processing Methods 0.000 claims abstract description 27
- 238000004422 calculation algorithm Methods 0.000 claims description 47
- 238000001514 detection method Methods 0.000 claims description 37
- 239000013598 vector Substances 0.000 claims description 33
- 239000000470 constituent Substances 0.000 claims description 18
- 230000008859 change Effects 0.000 claims description 15
- 238000002372 labelling Methods 0.000 claims description 11
- 238000007637 random forest analysis Methods 0.000 claims description 4
- 238000003066 decision tree Methods 0.000 claims description 3
- 238000003062 neural network model Methods 0.000 claims description 3
- 238000012706 support-vector machine Methods 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 14
- 230000008569 process Effects 0.000 description 12
- 238000012549 training Methods 0.000 description 8
- 238000004458 analytical method Methods 0.000 description 5
- 230000006870 function Effects 0.000 description 5
- 238000004891 communication Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000012937 correction Methods 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000013145 classification model Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000008520 organization Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
Embodiments of the present description provide methods and apparatus for detecting character-overlapping errors in a sentence. In the method, word segmentation is performed on a sentence containing overlapping characters. For a sentence whose overlapping characters fall in adjacent segmented words, word segmentation information of the words in which the overlapping characters are located is obtained, the information including the part of speech and the pinyin of each segmented word. The obtained word segmentation information is then used to detect character-overlapping errors in the sentence.
Description
Technical Field
Embodiments of the present disclosure relate generally to the field of language processing, and more particularly, to a method and apparatus for detecting character-overlapping errors in a sentence.
Background
When sentences are analyzed, a character-overlapping phenomenon is sometimes found: characters in adjacent positions in the same sentence are identical. Such overlapping characters may be caused by erroneous repeated input, for example the duplicated "quantity" character in "billing by the effective check quantity". Some overlapping characters, however, are correct, for example the "baba" in "Alibaba Network Technology Co., Ltd.". In formal documents, an erroneously entered overlapping character may leave a bad impression on a business partner and may even create legal risk or legal disputes. For example, when a contract is signed, a duplicated character in the written payment amount of a clause such as "Party A pays Party B the amount (tax included)" renders the clause erroneous and creates a risk of later legal disputes.
Disclosure of Invention
In view of the foregoing, embodiments of the present disclosure provide a method and apparatus for detecting character-overlapping errors in a sentence. With the method and apparatus, error detection is performed using the part of speech and the pinyin of the different segmented words in which the overlapping characters are located, which improves both the efficiency and the accuracy of character-overlapping error detection.
According to one aspect of the embodiments of the present specification, a method for detecting character-overlapping errors in a sentence is provided, including: performing word segmentation on a sentence containing overlapping characters; when the overlapping characters are located in adjacent segmented words, obtaining word segmentation information of the words in which the overlapping characters are located, the information including the part of speech and the pinyin of each segmented word; and detecting a character-overlapping error in the sentence using the word segmentation information.
Optionally, in one example of the above aspect, the word segmentation information further includes the number of characters composing each segmented word.
Optionally, in one example of the above aspect, detecting a character-overlapping error in the sentence using the word segmentation information includes: determining a model feature vector for an overlapping-character discrimination model from the word segmentation information; and providing the model feature vector to the discrimination model to detect character-overlapping errors in the sentence.
Optionally, in one example of the above aspect, determining the model feature vector from the word segmentation information includes: determining, from the word segmentation information, the part-of-speech consistency, the pinyin consistency, and/or the character counts of the adjacent segmented words containing the overlapping characters; and generating the model feature vector of the discrimination model from these values.
Optionally, in one example of the above aspect, performing word segmentation on the sentence containing overlapping characters includes using a text word segmentation algorithm to segment the sentence.
Optionally, in one example of the above aspect, the text word segmentation algorithm includes: a dictionary-based text word segmentation algorithm; a statistics-based text word segmentation algorithm; a rule-based text word segmentation algorithm; a model-based text word segmentation algorithm; or a character-labeling-based text word segmentation algorithm.
Optionally, in one example of the above aspect, the method further includes determining the change in the sentence's perplexity score before and after the overlapping character is removed, and detecting the character-overlapping error in the sentence using the word segmentation information includes: detecting the error using both the word segmentation information and the perplexity change value.
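The perplexity change value mentioned here can be sketched with a toy add-one-smoothed character-bigram language model; a real implementation would use a much stronger language model, and all names, the toy corpus, and the delimiter-free example below are illustrative assumptions, not from the patent:

```python
import math
from collections import Counter

def char_bigram_model(corpus):
    """Fit an add-one-smoothed character-bigram model on a toy corpus."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        marked = "^" + sent  # "^" marks the sentence start
        vocab.update(marked)
        for a, b in zip(marked, marked[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    v = len(vocab)

    def prob(a, b):
        return (bigrams[(a, b)] + 1) / (unigrams[a] + v)

    return prob

def perplexity(prob, sentence):
    marked = "^" + sentence
    log_p = sum(math.log(prob(a, b)) for a, b in zip(marked, marked[1:]))
    return math.exp(-log_p / (len(marked) - 1))

def perplexity_change(prob, sentence, dup_index):
    """Perplexity score change of the sentence before vs. after
    removing the duplicated character at position dup_index."""
    cleaned = sentence[:dup_index] + sentence[dup_index + 1:]
    return perplexity(prob, sentence) - perplexity(prob, cleaned)

prob = char_bigram_model(["charge by the check quantity", "pay the amount"])
corrupted = "charge by the check quantity" + "y"  # erroneous duplicated "y"
delta = perplexity_change(prob, corrupted, len(corrupted) - 1)
```

A positive change value (perplexity drops once the duplicate is deleted) is evidence that the overlap was an input error, which is why it is useful as an extra detection feature.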
Optionally, in one example of the above aspect, the overlapping-character discrimination model includes one of the following models: a random forest model; a decision tree model; a gradient boosting tree model; a neural network model; a support vector machine; or a perceptron.
Optionally, in one example of the above aspect, the method further includes: dividing an input text into sentences; and determining, from the divided sentences, the sentences that contain overlapping characters.
According to another aspect of the embodiments of the present specification, an apparatus for detecting character-overlapping errors in a sentence is provided, including: a word segmentation unit that performs word segmentation on a sentence containing overlapping characters; a word segmentation information acquisition unit that, when the overlapping characters are located in adjacent segmented words, obtains word segmentation information of the words in which the overlapping characters are located, the information including the part of speech and the pinyin of each segmented word; and a character-overlapping error detection unit that detects character-overlapping errors in the sentence using the word segmentation information.
Optionally, in one example of the above aspect, the character-overlapping error detection unit includes: a model input determination module that determines a model feature vector for the overlapping-character discrimination model from the word segmentation information; and an error detection module that provides the model feature vector to the discrimination model to detect character-overlapping errors in the sentence.
Optionally, in one example of the above aspect, the word segmentation information further includes the character count of each segmented word, and the model input determination module: determines, from the word segmentation information, the part-of-speech consistency, the pinyin consistency, and/or the character counts of the adjacent segmented words containing the overlapping characters; and generates the model feature vector of the discrimination model from these values.
Optionally, in one example of the above aspect, the word segmentation unit uses a text word segmentation algorithm to segment the sentence containing overlapping characters.
Optionally, in one example of the above aspect, the apparatus further includes a perplexity change determination unit that determines the change in the sentence's perplexity score before and after the overlapping character is removed, and the character-overlapping error detection unit detects character-overlapping errors in the sentence using both the word segmentation information and the perplexity change value.
Optionally, in one example of the above aspect, the apparatus further includes: a sentence division unit that divides an input text into sentences; and an overlapping-character sentence determination unit that determines, from the divided sentences, the sentences containing overlapping characters.
According to another aspect of the embodiments of the present specification, an electronic device is provided, including: at least one processor, and a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the character-overlapping error detection method described above.
According to another aspect of the embodiments of the present description, a machine-readable storage medium is provided, storing executable instructions that, when executed, cause the machine to perform the character-overlapping error detection method described above.
Drawings
A further understanding of the nature and advantages of the present description may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.
Fig. 1 shows an example schematic diagram of sentences containing overlapping characters according to an embodiment of the present specification.
FIG. 2 illustrates an example flow chart of a method for detecting character-overlapping errors in a sentence according to an embodiment of this specification.
Fig. 3 shows an example schematic diagram of an input sentence according to an embodiment of the present specification.
Fig. 4 shows an exemplary schematic diagram of word segmentation information of a sentence according to an embodiment of the present specification.
Fig. 5 shows an example flowchart of the training process of the overlapping-character discrimination model according to an embodiment of the present specification.
Fig. 6 shows an example flowchart of the model input determination process of the overlapping-character discrimination model according to an embodiment of the present specification.
Fig. 7 shows a block diagram of an apparatus for detecting character-overlapping errors in a sentence according to an embodiment of the present specification.
Fig. 8 shows a block diagram of one example of the character-overlapping error detection unit according to an embodiment of the present specification.
Fig. 9 shows a schematic diagram of an electronic device for implementing the sentence character-overlapping error detection process according to an embodiment of the present description.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be appreciated that these embodiments are discussed only to enable a person skilled in the art to better understand and thereby practice the subject matter described herein, and are not limiting of the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure as set forth in the specification. Various examples may omit, replace, or add various procedures or components as desired. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may be combined in other examples as well.
As used herein, the term "comprising" and its variations are open-ended, meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly indicates otherwise, the definition of a term is consistent throughout this specification.
Text sentence processing requires text sentence analysis, and such analysis sometimes reveals a character-overlapping phenomenon in the analyzed sentence, for example in "billing by the effective check quantity" or "Alibaba Network Technology Co., Ltd.". These overlapping characters may be correct, or may result from text input errors. In this specification, an overlapping character caused by a text input error is referred to as a "character-overlapping error". In some cases, a character-overlapping error in a text sentence may leave a bad impression on a business partner and may even create legal risk or legal disputes. For example, when a contract is signed, a duplicated character in the written payment amount of a clause stating what Party A pays Party B renders the clause erroneous and creates a risk of later legal disputes.
In view of the foregoing, embodiments of the present disclosure propose a method and apparatus for detecting character-overlapping errors in a sentence. With the method and apparatus, the sentence containing overlapping characters is segmented into words, the part of speech and the pinyin of the different segmented words in which the overlapping characters are located are extracted, and the extracted information is used for error detection, improving both the efficiency and the accuracy of character-overlapping error detection.
A method and apparatus for detecting character-overlapping errors in a sentence according to embodiments of the present specification are described below with reference to the accompanying drawings.
Fig. 1 shows an example schematic diagram of sentences containing overlapping characters according to an embodiment of the present specification.
Fig. 1 shows three example sentences, each containing an overlapping character: the duplicated "quantity" character in "billing by the effective check quantity"; the duplicated character in the written payment amount of "Party A pays the amount (tax included) to Party B"; and the duplicated character in "all the target unit price amounts are the same".
It is to be noted that each of the example sentences shown in Fig. 1 contains only one overlapping character. In other examples of the present specification, a single sentence may contain multiple overlapping characters.
FIG. 2 illustrates an example flow chart of a method 200 for detecting character-overlapping errors in a sentence according to an embodiment of this specification.
As shown in Fig. 2, at 210, sentences containing overlapping characters are determined from the input text. In one example of the present specification, the input may be text entered by a user in real time, or text obtained from another text processing system or apparatus, for example from the contract text database of a contract text storage device. Fig. 3 shows an example schematic diagram of an input sentence according to an embodiment of the present specification.
In one example, upon receiving the input text, it may be divided into sentences at sentence delimiters, examples of which include, but are not limited to, periods, semicolons, question marks, and exclamation marks. For example, the input text in Fig. 3 is divided into three sentences: "all the target unit price amounts are the same", "charge by the effective check quantity", and "complete the fee payment within one month of receipt into Party A's account".
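The delimiter-based sentence division described above can be sketched as follows; the delimiter set and function name are illustrative, not prescribed by the patent:

```python
import re

# Split input text into sentences at common Chinese/Western sentence
# delimiters: period, semicolon, question mark, exclamation mark.
SENTENCE_DELIMITERS = r"[。；？！.;?!]"

def split_sentences(text):
    """Divide `text` into sentences, dropping empty fragments."""
    parts = re.split(SENTENCE_DELIMITERS, text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences(
    "All target unit prices are the same. "
    "Charge by the effective check quantity; complete payment within one month."
)
```

Each resulting sentence is then checked individually for overlapping characters, as the next step describes.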
Then, the sentences containing overlapping characters are determined from the divided sentences. For example, after the divided sentences are obtained, each sentence is checked for overlapping characters, and any sentence containing none is discarded. Of the three sentences above, "complete the fee payment within one month of receipt into Party A's account" contains no overlapping characters and is discarded, while "all the target unit price amounts are the same" and "charge by the effective check quantity" are determined to be sentences containing overlapping characters.
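The keep-or-discard check above reduces to finding a character that is immediately repeated; a minimal sketch (function names are illustrative):

```python
def find_overlapping_chars(sentence):
    """Return every index i where sentence[i] is immediately repeated."""
    return [i for i in range(len(sentence) - 1) if sentence[i] == sentence[i + 1]]

def contains_overlap(sentence):
    """True if the sentence contains at least one overlapping character."""
    return bool(find_overlapping_chars(sentence))
```

For example, `contains_overlap` is true for a string with any doubled character and false otherwise, so sentences without overlaps can be filtered out before the more expensive segmentation step.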
At 220, word segmentation is performed on each sentence containing overlapping characters. In one example, a text word segmentation algorithm is used. Examples of text word segmentation algorithms include, but are not limited to: a dictionary-based text word segmentation algorithm; a statistics-based text word segmentation algorithm; a rule-based text word segmentation algorithm; a model-based text word segmentation algorithm; or a character-labeling-based text word segmentation algorithm.
In this specification, the word segmentation dictionary may be, for example, a custom dictionary determined from a word library. A statistics-based algorithm segments text according to the probability or frequency with which characters occur adjacent to one another; examples include N-gram-based and hidden-Markov-model-based algorithms. A rule-based algorithm performs semantic and syntactic analysis on the sentence and uses the resulting syntactic and semantic information to segment the text. A model-based algorithm segments text using a trained text word segmentation model.
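As one concrete instance of a dictionary-based algorithm, the following sketch implements forward maximum matching against a custom dictionary; the patent does not prescribe this particular algorithm, and the function name and `max_len` default are assumptions:

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching against a word-segmentation dictionary.
    Characters not covered by any dictionary word become single-character tokens."""
    tokens, i = [], 0
    while i < len(sentence):
        match = sentence[i]  # fall back to a single character
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens

tokens = forward_max_match("thisisatest", {"this", "is", "test"})
```

The greedy longest-first choice is what makes this "maximum" matching; single characters outside the dictionary (like the standalone "a" here) survive as their own tokens.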
A character-labeling-based text word segmentation algorithm is in effect a word-formation method: it treats the segmentation process as a labeling problem over the characters of a string. Since each character occupies a definite word position when it helps form a word, at most four word positions are assumed per character: B (word beginning), M (word middle), E (word end), and S (single-character word). A segmentation result such as (1) can then be expressed directly in the character-labeled form (2): (1) segmentation result: Shanghai / plans / to achieve / by the end of / this century / a per-capita / gross domestic product / of five thousand US dollars /.; (2) character-labeled form: the first character of each multi-character word is labeled B, each interior character M, the last character E, and each single-character word (including the final period) S. In this specification the term "character" is not limited to Chinese characters; it also includes foreign letters, Arabic numerals, punctuation marks, and the like.
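The B/M/E/S word-position scheme above can be illustrated by converting a segmentation result into per-character labels (a sketch with illustrative names):

```python
def to_bmes(tokens):
    """Label every character of each token with its word position:
    B (begin), M (middle), E (end), S (single-character word)."""
    labels = []
    for tok in tokens:
        if len(tok) == 1:
            labels.append((tok, "S"))
        else:
            labels.append((tok[0], "B"))
            labels.extend((c, "M") for c in tok[1:-1])
            labels.append((tok[-1], "E"))
    return labels

bmes = to_bmes(["this", "is", "a"])
```

Running in the reverse direction, a sequence labeler that predicts B/M/E/S tags per character yields the segmentation directly, which is the idea behind this family of algorithms.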
Treating segmentation as a character labeling problem balances the recognition of in-vocabulary words and out-of-vocabulary words (such as person names, place names, and organization names): both are handled by the same unified labeling process. Under this learning framework, vocabulary-word information needs no special emphasis and no dedicated out-of-vocabulary recognition module has to be designed, which greatly simplifies the design of the text segmentation system. During labeling, a probability model is learned from predefined character features; the word-position labels of the string to be segmented are then obtained from the tightness of association between characters; and finally the segmentation result follows directly from the word-position definitions.
At 230, for each sentence whose overlapping characters are located in adjacent segmented words, the word segmentation information of the words in which the overlapping characters are located is obtained, the information including the part of speech and the pinyin of each segmented word.
For example, after the sentences containing overlapping characters are segmented as described above, the segmented words in which the overlapping characters are located are extracted for each such sentence. If the overlapping characters fall within the same segmented word, the sentence is discarded; for example, if a sentence contains several pairs of overlapping characters and every pair lies inside a single segmented word, the sentence is discarded. If at least one pair is not inside the same segmented word, the sentence is kept. Part-of-speech tagging and pinyin tagging are then performed on the segmented sentence to obtain the word segmentation information of the words in which the overlapping characters are located. Any suitable tagging means or tagging algorithm in the art may be used. In another example, the word segmentation information may further include the character count of each segmented word. Fig. 4 shows an exemplary schematic diagram of the word segmentation information of a sentence according to an embodiment of the present specification.
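A minimal sketch of this step: checking whether a repeated character straddles a token boundary, and collecting per-token information. The toy part-of-speech and pinyin tables are stand-ins for whatever tagging algorithm is actually used; the tokenization of "按有效核销数量量计费" shown here is an illustrative assumption:

```python
# Toy lookup tables stand in for real part-of-speech and pinyin taggers.
POS = {"数量": "n", "量": "q"}
PINYIN = {"数量": ["shu", "liang"], "量": ["liang"]}

def overlap_spans_adjacent_tokens(tokens):
    """Find a repeated character whose two copies fall in *adjacent* tokens.
    Returns the (left_token, right_token) pair, or None."""
    for left, right in zip(tokens, tokens[1:]):
        if left[-1] == right[0]:
            return left, right
    return None

def token_info(token):
    """Word segmentation information for one token."""
    return {"pos": POS.get(token), "pinyin": PINYIN.get(token), "length": len(token)}

pair = overlap_spans_adjacent_tokens(["按", "有效", "核销", "数量", "量", "计费"])
```

When `overlap_spans_adjacent_tokens` returns `None` for every overlap in a sentence, the overlaps all lie inside single tokens and the sentence is discarded, matching the rule above.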
At 240, for each sentence whose overlapping characters are located in adjacent segmented words, the obtained word segmentation information is used to detect character-overlapping errors in the sentence. In one example, an overlapping-character discrimination model is used. In one example of the present specification, the discrimination model may be any model capable of classification prediction, for example a machine learning model or a deep learning model. Examples include, but are not limited to, one of the following: a random forest model; a decision tree model; a gradient boosting tree model; a neural network model; a support vector machine; and a perceptron. In this case, the discrimination model must be trained in advance on a labeled database.
Fig. 5 shows an example flowchart of the training process of the overlapping-character discrimination model according to an embodiment of the present specification.
As shown in Fig. 5, at 510, sentences containing overlapping characters are determined from a corpus. In one example of the present specification, the corpus may consist of text sentences obtained from other text processing systems or devices and/or text sentences crawled from the web, for example text sentences obtained from the contract text database of a contract text storage device.
In one example, the corpus may be divided into sentences at sentence delimiters, examples of which include, but are not limited to, periods, semicolons, question marks, and exclamation marks. Sentences containing overlapping characters are then determined from the divided sentences: each sentence is checked for overlapping characters, and any sentence containing none is discarded.
At 520, word segmentation is performed on the sentences containing overlapping characters, for example using any of the text word segmentation algorithms listed above (dictionary-based, statistics-based, rule-based, model-based, or character-labeling-based). Furthermore, if the overlapping characters fall within a single segmented word, the sentence is discarded; for example, if a sentence contains several pairs of overlapping characters and every pair lies inside a single segmented word, the sentence is discarded. If at least one pair spans two segmented words, the sentence is kept.
At 530, for each sentence whose overlapping characters are located in adjacent segmented words, the model feature vector of the sentence is determined and the sentence is labeled.
In one example, the model feature vector may be determined from the word segmentation information of the segmented words in which the overlapped characters are located. For example, the part-of-speech consistency, the pinyin consistency, and/or the number of constituent words of the overlapped characters in the adjacent segmented words may be determined from the word segmentation information. The model feature vector of the overlapped-word discrimination model is then generated from the part-of-speech consistency, the pinyin consistency, and/or the number of constituent words of the overlapped characters in the adjacent segmented words. For example, in the above case, the model feature vector may be a 4-dimensional vector {a1, a2, a3, a4}: a1 corresponds to the part-of-speech consistency, taking the value 1 if the parts of speech are consistent and 0 otherwise; a2 corresponds to the pinyin consistency, taking the value 1 if the pinyin is consistent and 0 otherwise; a3 corresponds to the number of constituent words of the first segmented word, taking the value 1 if that number is 1 and 0 otherwise; and a4 corresponds to the number of constituent words of the second segmented word, likewise taking the value 1 if that number is 1 and 0 otherwise. Under this scheme, the model feature vector of the overlapped-word sentence "billing according to the effective check quantity" is [1, 0, 1], and the model feature vector of the overlapped-word sentence "all the target unit price amounts are the same" is [0, 1]. Since the overlapped-word sentence "billing according to the effective check quantity" contains an overlapped-word error, the label of that sentence is "1", identifying that the sentence has an overlapped-word error.
Since the overlapped-word sentence "all the target unit price amounts are the same" contains no overlapped-word error, the label of that sentence is "0", identifying that the sentence has no overlapped-word error. This completes the sentence labeling process.
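For illustration, the 4-dimensional feature construction described above can be sketched in Python as follows. The dict-based representation of the word segmentation information and the helper name are assumptions made for this sketch, not part of the patent.

```python
def build_feature_vector(seg1, seg2):
    """Build the 4-dimensional feature vector {a1, a2, a3, a4} for the two
    adjacent segmented words that contain the overlapped characters.

    Each segment is a dict with keys 'pos' (part of speech), 'pinyin'
    (list of syllables), and 'words' (number of constituent words) --
    a simplified stand-in for the word segmentation information.
    """
    a1 = 1 if seg1["pos"] == seg2["pos"] else 0        # part-of-speech consistency
    a2 = 1 if seg1["pinyin"] == seg2["pinyin"] else 0  # pinyin consistency
    a3 = 1 if seg1["words"] == 1 else 0                # first segment has one constituent word
    a4 = 1 if seg2["words"] == 1 else 0                # second segment has one constituent word
    return [a1, a2, a3, a4]
```

For example, two adjacent single-constituent-word segments with the same part of speech and the same pinyin yield the vector [1, 1, 1, 1].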
Then, at 540, the overlapped-word discrimination model is trained using the sentences that have undergone the sentence labeling process. Specifically, the model feature vector corresponding to each sentence is used as the model input and the assigned label is provided to the overlapped-word discrimination model for model training, until the model training end condition is met, thereby obtaining the trained overlapped-word discrimination model.
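As a concrete illustration of this training step, the sketch below fits a random forest classifier (one of the candidate discrimination models listed in claim 5) on a handful of synthetic labeled feature vectors. The data and the choice of scikit-learn are assumptions made for this sketch, not part of the patent.

```python
from sklearn.ensemble import RandomForestClassifier

# Synthetic labeled data (illustrative only): each row is a 4-dimensional
# model feature vector {a1, a2, a3, a4}; each label is 1 if the sentence
# has an overlapped-word error, 0 otherwise.
X = [
    [1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 1],
    [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]

# Train the overlapped-word discrimination model on the labeled sentences.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Discriminate a new sentence from its model feature vector.
prediction = int(model.predict([[1, 1, 0, 1]])[0])
```

At prediction time (operation 240 / fig. 6), the same feature extraction feeds the trained model, and the returned class is the error decision for the sentence.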
Returning to fig. 2, when the overlapped-word discrimination model is used for overlapped-word error detection as described above, in one example, for each sentence, a model feature vector of the overlapped-word discrimination model is determined from the word segmentation information of the sentence. The model feature vector may be determined as described above. The determined model feature vector is then provided to the overlapped-word discrimination model for model discrimination, thereby implementing overlapped-word error detection for the sentence.
Fig. 6 shows an example flowchart of a model input determination process of the overlapped-word discrimination model according to an embodiment of the present specification.
As shown in fig. 6, at 610, the part-of-speech consistency, the pinyin consistency, and/or the number of constituent words of the overlapped characters in the adjacent segmented words are determined from the word segmentation information. Then, at 620, the model feature vector of the overlapped-word discrimination model is generated from the part-of-speech consistency, the pinyin consistency, and/or the number of constituent words of the overlapped characters in the adjacent segmented words.
A method for detecting an overlapped-word error in a sentence according to an embodiment of the present specification is described above with reference to figs. 1 to 6.
With this method, a sentence containing overlapped characters is subjected to word segmentation processing, the part of speech and the pinyin of the different segmented words in which the overlapped characters are located are extracted, and the extracted parts of speech and pinyin are used for overlapped-word error detection, so that the efficiency and accuracy of overlapped-word error detection can be improved.
In addition, with this method, when the sentence containing the overlapped characters is segmented, the number of constituent words of the segmented words in which the overlapped characters are located is also extracted, and the extracted part of speech, pinyin, and number of constituent words are used together for overlapped-word error detection, which can further improve the efficiency and accuracy of overlapped-word error detection.
In addition, with this method, performing text segmentation with a dictionary-based text word segmentation algorithm can improve the accuracy of the text segmentation. Moreover, using a random forest model for classification model prediction and training can improve both model training efficiency and model prediction efficiency.
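As one concrete instance of a dictionary-based text word segmentation algorithm (the patent names the family without prescribing a specific algorithm), classic forward maximum matching can be sketched as follows; the toy dictionaries in the tests are illustrative.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Segment `sentence` by forward maximum matching against `dictionary`:
    at each position, take the longest dictionary word that matches,
    falling back to a single character when nothing matches."""
    segments, i = [], 0
    while i < len(sentence):
        # Try candidate lengths from longest to shortest at position i.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                segments.append(candidate)
                i += length
                break
    return segments
```

Forward maximum matching is greedy and fast, which fits the patent's observation that dictionary-based segmentation is accurate when the dictionary covers the domain vocabulary.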
Furthermore, model training that uses the part-of-speech consistency and the pinyin consistency of the segmented words as model features can be accomplished with a small number of training samples (e.g., about 1000 pieces of data).
Further, optionally, in another example, the method may further include: determining a perplexity score change value of the sentence before and after the overlapped character is removed, and detecting the overlapped-word error in the sentence using the obtained word segmentation information together with the perplexity score change value.
In this specification, the term perplexity (ppl) indicates how fluent a sentence is, i.e., whether it conforms to natural human language usage. Perplexity is typically characterized using a perplexity score: the more fluent the sentence, the lower the ppl score. The ppl score may be predicted using a language model.
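For reference, the ppl score of a sentence can be computed from the per-token probabilities assigned by a language model as ppl = exp(-(1/N) Σ log p_i). The sketch below assumes such probabilities are already available; the probability values are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Sentence perplexity from language-model token probabilities:
    ppl = exp(-(1/N) * sum(log p_i)).  Fluent sentences score lower."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A fluent sentence whose tokens each get probability 0.5 -> ppl of about 2;
# a less fluent one at probability 0.1 per token -> ppl of about 10.
```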
For example, ppl scoring may be performed with a language model on the original sentence and on the sentence after the overlapped character is removed, respectively. If the ppl score becomes lower after removal, the probability that an overlapped-word error is present is considered greater. Accordingly, the ppl score change value may be used as one more model feature of the overlapped-word discrimination model, whereby the dimension of the model feature vector is increased from 4 to 5 compared with the previously described example. The corresponding dimension value is "1" when the ppl score becomes lower, and "0" when the ppl score is unchanged or becomes higher. The 5-dimensional model feature vector is then supplied as the model input to the overlapped-word discrimination model for model prediction.
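A minimal sketch of extending the feature vector with the ppl score change value, under the 1/0 encoding described above; the function names are illustrative.

```python
def ppl_change_feature(ppl_original, ppl_without_overlap):
    """Extra feature dimension: 1 if removing the overlapped character
    lowers the ppl score (suggesting an overlapped-word error), 0 if the
    score is unchanged or higher."""
    return 1 if ppl_without_overlap < ppl_original else 0

def extend_feature_vector(base_vector, ppl_original, ppl_without_overlap):
    """Extend the 4-dimensional feature vector to 5 dimensions with the
    ppl-change feature."""
    return base_vector + [ppl_change_feature(ppl_original, ppl_without_overlap)]
```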
With this method, using the ppl score change value as an additional feature dimension can further improve the accuracy of overlapped-word error detection.
Fig. 7 shows a block diagram of an apparatus for detecting an overlapped-word error in a sentence (hereinafter referred to as an "overlapped-word error detection apparatus") 700 according to an embodiment of the present specification. As shown in fig. 7, the overlapped-word error detection apparatus 700 includes a word segmentation processing unit 710, a word segmentation information acquisition unit 720, and an overlapped-word error detection unit 730.
The word segmentation processing unit 710 is configured to perform word segmentation processing on a sentence containing overlapped characters. For the operation of the word segmentation processing unit 710, refer to operation 220 described above with reference to fig. 2.
The word segmentation information acquisition unit 720 is configured to acquire, when the overlapped characters are located in adjacent segmented words, word segmentation information of the segmented words in which the overlapped characters are located, the word segmentation information including the part of speech and the pinyin of the segmented words. For the operation of the word segmentation information acquisition unit 720, refer to operation 230 described above with reference to fig. 2.
The overlapped-word error detection unit 730 is configured to detect an overlapped-word error in the sentence using the word segmentation information. For the operation of the overlapped-word error detection unit 730, refer to operation 240 described above with reference to fig. 2.
Fig. 8 shows a block diagram of one example of an overlapped-word error detection unit 800 according to an embodiment of the present specification. As shown in fig. 8, the overlapped-word error detection unit 800 includes a model input determination module 810 and an overlapped-word error detection module 820.
The model input determination module 810 is configured to determine the model feature vector of the overlapped-word discrimination model from the word segmentation information. For the operation of the model input determination module 810, refer to operation 610 described above with reference to fig. 6.
The overlapped-word error detection module 820 is configured to provide the determined model feature vector to the overlapped-word discrimination model to detect an overlapped-word error in the sentence. For the operation of the overlapped-word error detection module 820, refer to operation 620 described above with reference to fig. 6.
Further, optionally, in another example, the word segmentation information may further include the number of constituent words of the segmented words. Accordingly, the model input determination module 810 determines the part-of-speech consistency, the pinyin consistency, and/or the number of constituent words of the overlapped characters in the adjacent segmented words from the word segmentation information, and generates the model feature vector of the overlapped-word discrimination model from the part-of-speech consistency, the pinyin consistency, and/or the number of constituent words.
Further, optionally, in another example, the word segmentation processing unit 710 may perform word segmentation processing on the sentence containing the overlapped characters using a text word segmentation algorithm. Examples of the text word segmentation algorithm may include, but are not limited to: a dictionary-based text word segmentation algorithm; a statistics-based text word segmentation algorithm; a rule-based text word segmentation algorithm; a model-based text word segmentation algorithm; or a word-labeling-based text word segmentation algorithm.
Further, optionally, in another example, the overlapped-word error detection apparatus 700 may further include a perplexity change determination unit (not shown). The perplexity change determination unit is configured to determine the perplexity score change value of the sentence before and after the overlapped character is removed. Accordingly, the overlapped-word error detection unit 730 detects the overlapped-word error in the sentence using the word segmentation information and the perplexity score change value.
Further, optionally, in another example, the overlapped-word error detection apparatus 700 may further include a sentence division unit (not shown) and an overlapped-word sentence determination unit (not shown). The sentence division unit is configured to perform sentence division on an input sentence. The overlapped-word sentence determination unit then determines the sentences containing overlapped characters from among the divided sentences.
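A minimal sketch of the sentence division and overlapped-word sentence determination steps. Treating any two identical adjacent characters as a candidate overlapped word is an assumption of this sketch, as is the punctuation-based splitting rule; the patent does not fix either detail here.

```python
import re

def split_sentences(text):
    """Split input text into sentences after Chinese/Western end punctuation."""
    parts = re.split(r"(?<=[。！？.!?])", text)
    return [p for p in parts if p.strip()]

def has_overlapped_chars(sentence):
    """Treat a sentence as containing overlapped characters when the same
    non-space character appears twice in a row."""
    return any(a == b for a, b in zip(sentence, sentence[1:]) if not a.isspace())

def sentences_with_overlap(text):
    """Return only the divided sentences that contain overlapped characters."""
    return [s for s in split_sentences(text) if has_overlapped_chars(s)]
```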
The overlapped-word error detection method and the overlapped-word error detection apparatus according to the embodiments of the present specification are described above with reference to figs. 1 to 8. The overlapped-word error detection apparatus may be implemented in hardware, in software, or in a combination of hardware and software.
Fig. 9 shows a schematic diagram of an electronic device 900 for implementing sentence overlapped-word error detection according to an embodiment of the present specification. As shown in fig. 9, the electronic device 900 may include at least one processor 910, a storage (e.g., a non-volatile storage) 920, a memory 930, and a communication interface 940, which are connected together via a bus 960. The at least one processor 910 executes at least one computer-readable instruction (i.e., an element implemented in software as described above) stored or encoded in the storage.
In one embodiment, computer-executable instructions are stored in the storage that, when executed, cause the at least one processor 910 to: perform word segmentation processing on a sentence containing overlapped characters; when the overlapped characters are located in adjacent segmented words, acquire word segmentation information of the segmented words in which the overlapped characters are located, the word segmentation information including the part of speech and the pinyin of the segmented words; and detect an overlapped-word error in the sentence using the word segmentation information.
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 910 to perform the various operations and functions described above in connection with fig. 1-8 in various embodiments of the present description.
According to one embodiment, a program product such as a machine-readable medium (e.g., a non-transitory machine-readable medium) is provided. The machine-readable medium may have instructions (i.e., elements described above implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with fig. 1-8 in various embodiments of the specification. In particular, a system or apparatus provided with a readable storage medium having stored thereon software program code implementing the functions of any of the above embodiments may be provided, and a computer or processor of the system or apparatus may be caused to read out and execute instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium may implement the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or cloud by a communications network.
It will be appreciated by those skilled in the art that various changes and modifications can be made to the embodiments disclosed above without departing from the spirit of the invention. Accordingly, the scope of the invention should be limited only by the attached claims.
It should be noted that not all the steps and units in the above flowcharts and the system configuration diagrams are necessary, and some steps or units may be omitted according to actual needs. The order of execution of the steps is not fixed and may be determined as desired. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by multiple physical entities, or may be implemented jointly by some components in multiple independent devices.
In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may include permanently dedicated circuitry or logic (e.g., a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware unit or processor may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The particular implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments, but does not represent all embodiments that may be implemented or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (12)
1. A method for detecting an overlapped-word error in a sentence, comprising:
performing word segmentation processing on a sentence containing overlapped characters;
when the overlapped characters are respectively located in adjacent segmented words, acquiring word segmentation information of the segmented words in which the overlapped characters are located, the word segmentation information including the part of speech, the pinyin, and the number of constituent words of the segmented words;
determining the part-of-speech consistency, the pinyin consistency, and the number of constituent words of the overlapped characters in the adjacent segmented words according to the word segmentation information;
generating a model feature vector of an overlapped-word discrimination model according to the part-of-speech consistency, the pinyin consistency, and the number of constituent words of the overlapped characters in the adjacent segmented words; and
providing the model feature vector to the overlapped-word discrimination model to detect an overlapped-word error in the sentence.
2. The method of claim 1, wherein performing word segmentation processing on the sentence containing the overlapped characters comprises:
performing word segmentation processing on the sentence containing the overlapped characters using a text word segmentation algorithm.
3. The method of claim 2, wherein the text word segmentation algorithm comprises:
a dictionary-based text word segmentation algorithm;
a statistics-based text word segmentation algorithm;
a rule-based text word segmentation algorithm;
a model-based text word segmentation algorithm; or
a word-labeling-based text word segmentation algorithm.
4. The method of claim 1, further comprising:
determining a perplexity score change value of the sentence before and after the overlapped character is removed,
wherein generating the model feature vector of the overlapped-word discrimination model according to the part-of-speech consistency, the pinyin consistency, and the number of constituent words of the overlapped characters in the adjacent segmented words comprises:
generating the model feature vector of the overlapped-word discrimination model according to the part-of-speech consistency, the pinyin consistency, the number of constituent words, and the perplexity score change value of the overlapped characters in the adjacent segmented words.
5. The method of claim 1, wherein the overlapped-word discrimination model comprises one of:
a random forest model;
a decision tree model;
a gradient boosting tree model;
a neural network model;
a support vector machine; and
a perceptron.
6. The method of claim 1, further comprising:
performing sentence division on an input sentence; and
determining the sentence containing the overlapped characters from among the divided sentences.
7. An apparatus for detecting an overlapped-word error in a sentence, comprising:
a word segmentation processing unit configured to perform word segmentation processing on a sentence containing overlapped characters;
a word segmentation information acquisition unit configured to acquire, when the overlapped characters are respectively located in adjacent segmented words, word segmentation information of the segmented words in which the overlapped characters are located, the word segmentation information including the part of speech, the pinyin, and the number of constituent words of the segmented words; and
an overlapped-word error detection unit configured to detect an overlapped-word error in the sentence using the word segmentation information,
wherein the overlapped-word error detection unit comprises:
a model input determination module configured to determine the part-of-speech consistency, the pinyin consistency, and the number of constituent words of the overlapped characters in the adjacent segmented words according to the word segmentation information, and to generate a model feature vector of an overlapped-word discrimination model according to the part-of-speech consistency, the pinyin consistency, and the number of constituent words; and
an overlapped-word error detection module configured to provide the model feature vector to the overlapped-word discrimination model to detect an overlapped-word error in the sentence.
8. The apparatus of claim 7, wherein the word segmentation processing unit performs word segmentation processing on the sentence containing the overlapped characters using a text word segmentation algorithm.
9. The apparatus of claim 7, further comprising:
a perplexity change determination unit that determines a perplexity score change value of the sentence before and after the overlapped character is removed,
wherein the model input determination module generates the model feature vector of the overlapped-word discrimination model according to the part-of-speech consistency, the pinyin consistency, the number of constituent words, and the perplexity score change value.
10. The apparatus of claim 7, further comprising:
a sentence division unit that performs sentence division on an input sentence; and
an overlapped-word sentence determination unit that determines the sentence containing the overlapped characters from among the divided sentences.
11. An electronic device, comprising:
at least one processor; and
A memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1 to 6.
12. A machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010842426.7A CN111783458B (en) | 2020-08-20 | 2020-08-20 | Method and device for detecting character overlapping errors |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111783458A CN111783458A (en) | 2020-10-16 |
CN111783458B true CN111783458B (en) | 2024-05-03 |
Family
ID=72762169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010842426.7A Active CN111783458B (en) | 2020-08-20 | 2020-08-20 | Method and device for detecting character overlapping errors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111783458B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102324233A (en) * | 2011-08-03 | 2012-01-18 | 中国科学院计算技术研究所 | Method for automatically correcting identification error of repeated words in Chinese pronunciation identification |
CN104375986A (en) * | 2014-12-02 | 2015-02-25 | 江苏科技大学 | Automatic acquisition method of Chinese reduplication words |
CN105279149A (en) * | 2015-10-21 | 2016-01-27 | 上海应用技术学院 | Chinese text automatic correction method |
US9405743B1 (en) * | 2015-05-13 | 2016-08-02 | International Business Machines Corporation | Dynamic modeling of geospatial words in social media |
CN106527756A (en) * | 2016-10-26 | 2017-03-22 | 长沙军鸽软件有限公司 | Method and device for intelligently correcting input information |
CN106776549A (en) * | 2016-12-06 | 2017-05-31 | 桂林电子科技大学 | A kind of rule-based english composition syntax error correcting method |
CN108090045A (en) * | 2017-12-20 | 2018-05-29 | 珠海市君天电子科技有限公司 | A kind of method for building up of marking model, segmenting method and device |
CN111144100A (en) * | 2019-12-24 | 2020-05-12 | 五八有限公司 | Question text recognition method and device, electronic equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050060150A1 (en) * | 2003-09-15 | 2005-03-17 | Microsoft Corporation | Unsupervised training for overlapping ambiguity resolution in word segmentation |
US8751218B2 (en) * | 2010-02-09 | 2014-06-10 | Siemens Aktiengesellschaft | Indexing content at semantic level |
CN108509408B (en) * | 2017-02-27 | 2019-11-22 | 芋头科技(杭州)有限公司 | A kind of sentence similarity judgment method |
Non-Patent Citations (2)
Title |
---|
A cross-type ambiguity resolution algorithm for Chinese word segmentation; Gan Rong; Journal of Xihua University (Natural Science Edition); 2018-11-20 (No. 06); full text *
Research status and difficulties of automatic Chinese word segmentation; Zhang Chunxia, Hao Tianyong; Journal of System Simulation; 2005-01-20 (No. 01); full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |