CN111783458B - Method and device for detecting character overlapping errors - Google Patents

Method and device for detecting character overlapping errors

Info

Publication number
CN111783458B
Authority
CN
China
Prior art keywords
word
word segmentation
sentence
overlapped
words
Prior art date
Legal status
Active
Application number
CN202010842426.7A
Other languages
Chinese (zh)
Other versions
CN111783458A (en)
Inventor
余红 (Yu Hong)
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010842426.7A
Publication of CN111783458A
Application granted
Publication of CN111783458B


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/35 — Clustering; Classification
    • G06F16/355 — Class or cluster creation or modification
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/205 — Parsing
    • G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/237 — Lexical tools
    • G06F40/242 — Dictionaries
    • G06F40/279 — Recognition of textual entities
    • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 — Semantic analysis

Abstract

Embodiments of the present description provide methods and apparatus for detecting overlapping-word errors in a sentence. In the method, word segmentation is performed on a sentence containing overlapping characters. For a sentence whose overlapping characters fall in adjacent segments, the segmentation information of the segments containing the overlapping characters is obtained, the segmentation information including each segment's part of speech and pinyin. The obtained segmentation information is then used to detect an overlapping-word error in the sentence.

Description

Method and device for detecting character overlapping errors
Technical Field
Embodiments of the present disclosure relate generally to the field of language processing, and more particularly, to a method and apparatus for detecting an overlapping-word error in a sentence.
Background
When analyzing sentences, a word-overlap phenomenon is often observed: identical characters or words occupy adjacent positions in the same sentence. Some of these overlaps are caused by erroneous repeated input; for example, the doubled "quantity" in "billing by the effective check quantity" is an error. Others are correct, such as the "ba ba" in "Alibaba Network Technology Co., Ltd.". In official documents, an erroneously repeated word may leave a bad impression on a business partner and can even create legal risk or legal disputes. For example, when a contract is signed, a doubled character in the written payment amount of the clause "Party A pays Party B four four thousand yuan (tax-inclusive price)" makes the contract term incorrect, creating a risk of later legal disputes.
Disclosure of Invention
In view of the foregoing, embodiments of the present disclosure provide a method and apparatus for detecting overlapping-word errors in a sentence. With the method and apparatus, overlapping-word error detection is performed using the part of speech and the pinyin of the different segments in which the overlapping characters are located, which can improve both the efficiency and the accuracy of detection.
According to an aspect of embodiments of the present specification, there is provided a method for detecting an overlapping-word error in a sentence, including: performing word segmentation on a sentence containing overlapping characters; when the overlapping characters fall in adjacent segments, obtaining the segmentation information of the segments containing the overlapping characters, the segmentation information including segment part of speech and segment pinyin; and detecting an overlapping-word error in the sentence using the segmentation information.
Optionally, in one example of the above aspect, the segmentation information further includes the number of characters making up each segment.
Optionally, in one example of the above aspect, detecting an overlapping-word error in the sentence using the segmentation information includes: determining a model feature vector for the overlapping-word discrimination model from the segmentation information; and providing the model feature vector to the overlapping-word discrimination model to detect overlapping-word errors in the sentence.
Optionally, in one example of the above aspect, determining the model feature vector of the overlapping-word discrimination model from the segmentation information includes: determining the part-of-speech consistency, the pinyin consistency, and/or the character counts of the adjacent segments containing the overlapping characters from the segmentation information; and generating the model feature vector of the overlapping-word discrimination model from that part-of-speech consistency, pinyin consistency, and/or those character counts.
Optionally, in one example of the above aspect, performing word segmentation processing on the sentence containing the superimposed word includes: a text word segmentation algorithm is used to segment sentences containing overlapping words.
Optionally, in one example of the above aspect, the text word segmentation algorithm includes: a dictionary-based text word segmentation algorithm; a statistics-based text word segmentation algorithm; a rule-based text word segmentation algorithm; a model-based text word segmentation algorithm; or a character-labeling-based text word segmentation algorithm.
Optionally, in one example of the above aspect, the method further comprises: determining the change in the sentence's perplexity score before and after the overlapping character is removed, and detecting the overlapping-word error in the sentence using the segmentation information comprises: detecting the overlapping-word error in the sentence using the segmentation information together with the perplexity change value.
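The perplexity-change signal described in this aspect can be sketched as follows. The unigram counts, the add-one smoothing, and the helper names are illustrative assumptions of mine, since the specification does not fix a particular language model:

```python
import math

# Toy unigram character model -- the counts are hypothetical stand-ins for a
# model trained on a real corpus.
UNIGRAM = {"按": 10, "有": 30, "效": 8, "核": 5, "对": 20,
           "数": 15, "量": 12, "计": 9, "费": 6}
TOTAL = sum(UNIGRAM.values())

def perplexity(sentence):
    """Unigram perplexity with add-one smoothing over the toy vocabulary."""
    vocab = len(UNIGRAM) + 1  # +1 slot for unseen characters
    log_prob = 0.0
    for ch in sentence:
        p = (UNIGRAM.get(ch, 0) + 1) / (TOTAL + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(sentence), 1))

def perplexity_change(sentence, overlap_index):
    """Perplexity score change when the character at overlap_index is removed."""
    reduced = sentence[:overlap_index] + sentence[overlap_index + 1:]
    return perplexity(sentence) - perplexity(reduced)
```

A large score change after removing one copy of the overlap suggests the repetition was spurious; the change value is then used alongside the segmentation information described above.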
Optionally, in one example of the above aspect, the overlapping-word discrimination model includes one of the following models: a random forest model; a decision tree model; a gradient-boosted tree model; a neural network model; a support vector machine; or a perceptron.
Optionally, in one example of the above aspect, the method further comprises: performing sentence division on an input sentence; and determining the sentence containing the overlapped word from the divided sentences.
According to another aspect of embodiments of the present specification, there is provided an apparatus for detecting an overlapping-word error in a sentence, comprising: a word segmentation unit that performs word segmentation on a sentence containing overlapping characters; a segmentation information acquisition unit that, when the overlapping characters fall in adjacent segments, obtains the segmentation information of the segments containing the overlapping characters, the segmentation information including segment part of speech and segment pinyin; and an overlapping-word error detection unit that detects an overlapping-word error in the sentence using the segmentation information.
Optionally, in one example of the above aspect, the overlapping-word error detection unit includes: a model input determination module that determines a model feature vector for the overlapping-word discrimination model from the segmentation information; and an overlapping-word error detection module that provides the model feature vector to the overlapping-word discrimination model to detect overlapping-word errors in the sentence.
Optionally, in one example of the above aspect, the segmentation information further includes the character count of each segment, and the model input determination module: determines the part-of-speech consistency, the pinyin consistency, and/or the character counts of the adjacent segments containing the overlapping characters from the segmentation information; and generates the model feature vector of the overlapping-word discrimination model from them.
Optionally, in one example of the above aspect, the word segmentation processing unit uses a text word segmentation algorithm to perform word segmentation processing on the sentence containing the superimposed word.
Optionally, in one example of the above aspect, the apparatus further includes: a perplexity change determination unit that determines the change in the sentence's perplexity score before and after the overlapping character is removed, and the overlapping-word error detection unit detects the overlapping-word error in the sentence using the segmentation information and the perplexity change value.
Optionally, in one example of the above aspect, the apparatus further includes: a sentence dividing unit that performs sentence division on an input sentence; and a superimposed word sentence determination unit that determines a sentence containing superimposed words from the divided sentences.
According to another aspect of embodiments of the present specification, there is provided an electronic device including: at least one processor, and a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the overlapping-word error detection method described above.
According to another aspect of embodiments of the present description, there is provided a machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the overlapping-word error detection method described above.
Drawings
A further understanding of the nature and advantages of the present description may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.
Fig. 1 shows an example schematic diagram of sentences containing overlapping words according to an embodiment of the present specification.
FIG. 2 illustrates an example flow chart of a method for detecting an overlapping-word error in a sentence according to an embodiment of this specification.
Fig. 3 shows an example schematic diagram of an input sentence according to an embodiment of the present specification.
Fig. 4 shows an exemplary schematic diagram of word segmentation information of a sentence according to an embodiment of the present specification.
Fig. 5 shows an example flowchart of the training process of the overlapping-word discrimination model according to an embodiment of the present specification.
Fig. 6 shows an example flowchart of the model input determination process of the overlapping-word discrimination model according to an embodiment of the present specification.
Fig. 7 shows a block diagram of an apparatus for detecting an overlapping-word error in a sentence according to an embodiment of the present specification.
Fig. 8 shows a block diagram of one example of the overlapping-word error detection unit according to an embodiment of the present specification.
Fig. 9 shows a schematic diagram of an electronic device for implementing the sentence overlapping-word error detection process according to an embodiment of the present description.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be appreciated that these embodiments are discussed only to enable a person skilled in the art to better understand and thereby practice the subject matter described herein, and are not limiting of the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure as set forth in the specification. Various examples may omit, replace, or add various procedures or components as desired. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may be combined in other examples as well.
As used herein, the term "comprising" and variations thereof are open-ended, meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly indicates otherwise, the definition of a term is consistent throughout this specification.
Text sentence processing requires text sentence analysis, and such analysis often reveals an overlap phenomenon in the analyzed sentences, for example in "billing by the effective check quantity" or "Alibaba Network Technology Co., Ltd.". These overlaps may be correct, or they may result from text input errors. In this specification, an overlap caused by a text input error is referred to as an "overlapping-word error". In some cases an overlapping-word error in a text sentence may leave a bad impression on a business partner and can even create legal risk or legal disputes. For example, if a signed contract contains an overlapping-word error such as a doubled character in the written payment amount "Party A pays Party B four four thousand yuan", the contract clause is wrong, creating a risk of subsequent legal disputes.
In view of the foregoing, embodiments of the present disclosure propose a method and apparatus for detecting overlapping-word errors in a sentence. With the method and apparatus, a sentence containing overlapping characters is segmented into words, the part of speech and the pinyin of the different segments in which the overlapping characters are located are extracted, and the extracted parts of speech and pinyin are used to detect overlapping-word errors, which can improve both the efficiency and the accuracy of detection.
A method and apparatus for detecting an overlapping-word error in a sentence according to embodiments of the present specification are described below with reference to the accompanying drawings.
Fig. 1 shows an example schematic diagram of sentences containing overlapping words according to an embodiment of the present specification.
Fig. 1 shows three example sentences containing overlapping words: in "billing by the effective check quantity", the doubled "quantity" is the overlapping word; in "Party A pays Party B four four thousand yuan (tax-inclusive price)", the doubled amount character is the overlapping word; and in "all the target unit price amounts are the same", the doubled "same" is the overlapping word.
Note that each example sentence shown in Fig. 1 contains only one overlap. In other examples of the present specification, a single sentence may contain multiple overlapping words.
FIG. 2 illustrates an example flow chart of a method 200 for detecting a stitch word error in a sentence according to an embodiment of this specification.
As shown in fig. 2, at 210, a sentence containing overlapping words is determined from the input sentences. In one example of the present specification, an input sentence may be a text sentence entered by the user in real time, or a text sentence obtained from another text processing system or apparatus, for example from a contract text database of a contract text storage device. Fig. 3 shows an example schematic diagram of an input sentence according to an embodiment of the present specification.
In one example, after an input sentence is obtained, it may be divided into sentences at sentence delimiters, examples of which include, but are not limited to, periods, semicolons, question marks, and exclamation marks. For example, the input sentence in fig. 3 is divided into 3 sentences: "all the target unit price amounts are the same", "billing by the effective check quantity", and "payment is completed within one month of receipt into Party A's account".
Then, sentences containing overlapping words are determined from the divided sentences. After the divided sentences are obtained, each sentence is checked for overlapping words; if a sentence contains none, it is discarded. Of the 3 sentences above, "payment is completed within one month of receipt into Party A's account" contains no overlapping words and is therefore discarded, while "all the target unit price amounts are the same" and "billing by the effective check quantity" are determined to be sentences containing overlapping words.
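Step 210's sentence division and overlap screening can be sketched as below; the delimiter set and the helper names are my own choices, not taken from the specification:

```python
import re
from typing import List

# End-of-sentence delimiters named above: periods, semicolons, question marks
# and exclamation marks (both Chinese full-width and ASCII forms).
SENTENCE_DELIMITERS = r"[。；？！.;?!]"

def split_sentences(text: str) -> List[str]:
    """Divide input text into sentences at the delimiters."""
    return [p.strip() for p in re.split(SENTENCE_DELIMITERS, text) if p.strip()]

def has_overlap(sentence: str) -> bool:
    """True when two identical characters occupy adjacent positions."""
    return any(a == b for a, b in zip(sentence, sentence[1:]))

def overlap_sentences(text: str) -> List[str]:
    """Keep only the divided sentences that contain an overlap."""
    return [s for s in split_sentences(text) if has_overlap(s)]
```

Sentences without any overlap are dropped at this stage, exactly as the text above describes.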
At 220, word segmentation is performed on the sentences containing overlapping words. In one example, a text word segmentation algorithm is used. Examples of text word segmentation algorithms include, but are not limited to: a dictionary-based text word segmentation algorithm; a statistics-based text word segmentation algorithm; a rule-based text word segmentation algorithm; a model-based text word segmentation algorithm; or a character-labeling-based text word segmentation algorithm.
In this specification, the word segmentation dictionary may be, for example, a custom dictionary built from a word library. A statistics-based text word segmentation algorithm segments text according to the probability or frequency with which characters co-occur as neighbors; examples include algorithms based on N-gram models and on hidden Markov models. A rule-based text word segmentation algorithm performs semantic and syntactic analysis of the sentence and segments it using the syntactic and semantic information. A model-based text word segmentation algorithm may, for example, rely on a trained text segmentation model.
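As one concrete instance of the dictionary-based family, a forward-maximum-matching segmenter can be sketched as follows. The mini-dictionary is hypothetical, standing in for the full custom lexicon the text mentions:

```python
from typing import List

# Hypothetical mini-dictionary; a production segmenter loads a full lexicon.
WORD_DICT = {"有效", "核对", "数量", "计费"}
MAX_WORD_LEN = max(len(w) for w in WORD_DICT)

def fmm_segment(sentence: str) -> List[str]:
    """Forward maximum matching: greedily take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in WORD_DICT:
                tokens.append(candidate)  # unmatched single chars become tokens
                i += size
                break
    return tokens
```

On the running example, `fmm_segment("按有效核对数量量计费")` yields `["按", "有效", "核对", "数量", "量", "计费"]`: the doubled character ends up in a segment adjacent to "数量", which is exactly the configuration the later steps examine.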
A character-labeling-based text word segmentation algorithm is in fact a word-building method: the segmentation process is treated as a labeling problem over the characters of a string. Since each character occupies a definite position when it forms part of a particular word, at most four positions are assumed per character: B (word-initial), M (word-medial), E (word-final), and S (single-character word). For example, the segmentation result (1) of the classic sentence 上海计划到本世纪末实现人均国内生产总值五千美元。 ("Shanghai plans to reach a per-capita GDP of five thousand US dollars by the end of this century.") can be expressed directly in the character-labeled form shown in (2): (1) segmentation result: 上海/计划/到/本/世纪/末/实现/人均/国内/生产/总值/五千美元/。; (2) character-labeled form: 上/B 海/E 计/B 划/E 到/S 本/S 世/B 纪/E 末/S 实/B 现/E 人/B 均/E 国/B 内/E 生/B 产/E 总/B 值/E 五/B 千/M 美/M 元/E 。/S. In this specification, the term "character" is not limited to Chinese characters; it also covers foreign letters, Arabic numerals, punctuation marks, and the like.
Treating the segmentation process as a character labeling problem allows the recognition of dictionary words and of out-of-vocabulary words (such as person names, place names, and organization names) to be balanced: both are handled by the same unified character labeling process. The learning architecture needs no special emphasis on dictionary-word information and no dedicated out-of-vocabulary recognition module, which greatly simplifies the design of the text segmentation system. During labeling, a probability model is learned over all characters from predefined features; the position labels of the string to be segmented are then obtained according to how tightly adjacent characters bind together; and finally the segmentation result follows directly from the word-position definitions.
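The B/M/E/S scheme can be made concrete with a small converter between segmentations and character labels (the helper names are mine):

```python
def bmes_labels(words):
    """Convert a word segmentation into per-character B/M/E/S position labels."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append((w, "S"))          # single-character word
        else:
            labels.append((w[0], "B"))       # word-initial
            labels.extend((c, "M") for c in w[1:-1])  # word-medial
            labels.append((w[-1], "E"))      # word-final
    return labels

def words_from_labels(labels):
    """Inverse mapping: recover the segmentation from B/M/E/S labels."""
    words, buf = [], ""
    for ch, tag in labels:
        buf += ch
        if tag in ("E", "S"):
            words.append(buf)
            buf = ""
    return words
```

A sequence labeler trained on such pairs predicts the tags; the final segmentation is then read off with `words_from_labels`.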
At 230, for a sentence whose overlapping characters fall in adjacent segments, the segmentation information of the segments containing the overlapping characters is obtained, the segmentation information including segment part of speech and segment pinyin.
For example, after segmenting the sentences containing overlapping words as described above, the segments containing the overlapping characters are extracted for each such sentence. If every overlap lies entirely within a single segment, the sentence is discarded; if at least one overlap straddles two adjacent segments, the sentence is kept. Part-of-speech tagging and pinyin tagging are then performed on the segmented sentence to obtain the segmentation information of the segments containing the overlapping characters. Any suitable tagging tool or algorithm in the art may be used for the part-of-speech and pinyin tagging. In another example, the segmentation information may further include the number of characters in each segment. Fig. 4 shows an exemplary schematic diagram of segmentation information of a sentence according to an embodiment of the present specification.
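A sketch of the adjacency check and the assembly of segmentation information follows. The POS and pinyin lookup tables here are hypothetical placeholders; a real system would call a POS tagger and a pinyin converter (for instance the jieba and pypinyin libraries):

```python
# Hypothetical per-segment annotations standing in for real tagger output.
POS = {"数量": "n", "量": "n", "计费": "v"}
PINYIN = {"数量": "shu-liang", "量": "liang", "计费": "ji-fei"}

def overlap_segment_pairs(tokens):
    """Return (left, right) segment pairs whose boundary splits an overlap,
    i.e. the last character of one segment equals the first of the next."""
    pairs = []
    for left, right in zip(tokens, tokens[1:]):
        if left and right and left[-1] == right[0]:
            pairs.append((left, right))
    return pairs

def segment_features(left, right):
    """Collect the segmentation information used for feature construction."""
    return {
        "pos": (POS.get(left), POS.get(right)),
        "pinyin": (PINYIN.get(left), PINYIN.get(right)),
        "lengths": (len(left), len(right)),
    }
```

A sentence with no boundary-straddling pair is discarded at this point, matching the filtering rule described above.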
At 240, for a sentence whose overlapping characters fall in adjacent segments, the obtained segmentation information is used to detect an overlapping-word error in the sentence. In one example, an overlapping-word discrimination model may be used to detect the error. In one example of the present specification, the discrimination model may be any model capable of classification prediction, for example a machine learning model or a deep learning model. Examples of the discrimination model include, but are not limited to, one of the following: a random forest model; a decision tree model; a gradient-boosted tree model; a neural network model; a support vector machine; or a perceptron. In this case, the discrimination model needs to be trained in advance on a database.
Fig. 5 shows an example flowchart of the training process of the overlapping-word discrimination model according to an embodiment of the present specification.
As shown in fig. 5, at 510, sentences containing overlapping words are determined from a corpus. In one example of the present specification, the corpus may consist of text sentences obtained from other text processing systems or devices and/or text sentences crawled by web crawlers, for example text sentences obtained from a contract text database of a contract text store.
In one example, the corpus may be divided into sentences at sentence delimiters, examples of which include, but are not limited to, periods, semicolons, question marks, and exclamation marks. Sentences containing overlapping words are then determined from the divided sentences: each divided sentence is checked for overlapping words, and any sentence containing none is discarded.
At 520, word segmentation is performed on the sentences containing overlapping words. In one example, a text word segmentation algorithm is used, examples of which include, but are not limited to: a dictionary-based text word segmentation algorithm; a statistics-based text word segmentation algorithm; a rule-based text word segmentation algorithm; a model-based text word segmentation algorithm; or a character-labeling-based text word segmentation algorithm. Furthermore, if every overlap in a sentence lies entirely within a single segment, the sentence is discarded; if at least one overlap straddles two segments, the sentence is kept.
At 530, for each sentence whose overlapping characters fall in adjacent segments, the model feature vector of the sentence is determined and the sentence is labeled.
In one example, the model feature vector may be determined from the segmentation information of the segments containing the overlapping characters. For example, the part-of-speech consistency, the pinyin consistency, and/or the character counts of the adjacent segments may be determined from the segmentation information, and the model feature vector of the overlapping-word discrimination model generated from them. The feature vector may then be a 4-dimensional vector {a1, a2, a3, a4}: a1 encodes part-of-speech consistency (1 if the two segments' parts of speech agree, 0 otherwise); a2 encodes pinyin consistency (1 if the pinyin agree, 0 otherwise); a3 encodes the character count of the first segment (1 if it is a single character, 0 otherwise); and a4 encodes the character count of the second segment (1 if it is a single character, 0 otherwise). Under this scheme, the feature vector of the sentence "billing by the effective check quantity" is [1, 1, 0, 1], and the feature vector of the sentence "all the target unit price amounts are the same" is [0, 1, 0, 1]. Because "billing by the effective check quantity" contains an overlapping-word error, its label is "1", identifying the sentence as erroneous; because the overlap in "all the target unit price amounts are the same" is correct, its label is "0", identifying the sentence as error-free. This completes the sentence labeling process.
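The {a1, a2, a3, a4} encoding described above reduces to a few comparisons (function name mine):

```python
def feature_vector(pos1, pos2, pinyin1, pinyin2, len1, len2):
    """Build the 4-dimensional feature vector {a1, a2, a3, a4}:
    a1: part-of-speech consistency of the two segments,
    a2: pinyin consistency,
    a3/a4: whether the first/second segment is a single character."""
    return [
        1 if pos1 == pos2 else 0,
        1 if pinyin1 == pinyin2 else 0,
        1 if len1 == 1 else 0,
        1 if len2 == 1 else 0,
    ]
```

For the "数量 / 量" pair (both nouns, same pinyin, lengths 2 and 1), this gives [1, 1, 0, 1], matching the worked example above.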
Then, at 540, the overlapping-word discrimination model is trained using the labeled sentences. Specifically, the model feature vectors of the sentences are used as model inputs, together with the assigned labels, and the model is trained until the training termination condition is met, yielding the trained overlapping-word discrimination model.
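The specification leaves the classifier open (random forest, decision tree, gradient-boosted tree, neural network, support vector machine, or perceptron). As the simplest of the listed options, a perceptron trainer over the labeled feature vectors might look like this; a random forest (e.g. scikit-learn's RandomForestClassifier) would expose the same fit/predict shape. The training data below is a tiny illustrative set, not from the specification:

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Train a binary perceptron; the last weight acts as the bias term."""
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x + [1.0]))
            pred = 1 if score > 0 else 0
            if pred != y:  # standard perceptron update on mistakes
                for i, xi in enumerate(x + [1.0]):
                    w[i] += lr * (y - pred) * xi
    return w

def predict(w, x):
    """1 = overlapping-word error, 0 = correct overlap."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else 0
```

Training runs until the data is fit (or the epoch budget, the termination condition here, is exhausted); the learned weights are then used at step 240.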
Returning to fig. 2, when the overlapping-word discrimination model is used for error detection as described above, in one example a model feature vector of the discrimination model is determined for each sentence from the sentence's segmentation information, in the manner described above. The determined feature vector is then provided to the discrimination model for classification, thereby detecting overlapping-word errors in the sentence.
Fig. 6 shows an example flowchart of a model input determination process of the overlapped-word discrimination model according to an embodiment of the present specification.
As shown in fig. 6, at 610, the part-of-speech consistency, the pinyin consistency, and/or the numbers of constituent words of the adjacent segments containing the overlapped word are determined from the word segmentation information. Then, at 620, the model feature vector of the overlapped-word discrimination model is generated from the part-of-speech consistency, the pinyin consistency, and/or the numbers of constituent words.
A method for detecting overlapped-word errors in a sentence according to an embodiment of the present specification has been described above with reference to figs. 1 to 6.
With this method, a sentence containing overlapped words is subjected to word segmentation processing, the part of speech and pinyin of the different segments in which the overlapped words are located are extracted, and the extracted parts of speech and pinyins are used to perform overlapped-word error detection, which can improve the efficiency and accuracy of overlapped-word error detection.
In addition, with this method, when the sentence containing overlapped words is segmented, the number of constituent words of each segment in which the overlapped words are located is also extracted, and the extracted part of speech, pinyin, and constituent word count are used together to perform overlapped-word error detection, which can further improve the efficiency and accuracy of overlapped-word error detection.
In addition, with this method, performing text word segmentation using a dictionary-based text word segmentation algorithm can improve segmentation accuracy. Moreover, using a random forest model for classification-model prediction and training can improve both model training efficiency and model prediction efficiency.
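The dictionary-based segmentation mentioned above can be illustrated with forward maximum matching, a classic dictionary-based algorithm. This is a hedged sketch under assumptions: the tiny dictionary and the duplicated-character example sentence are invented for illustration, and a real system would use a full segmentation dictionary and a mature segmentation tool.

```python
# Forward maximum matching over a toy dictionary (invented entries).
TOY_DICT = {"有效", "核对", "数量", "计费", "计"}
MAX_WORD_LEN = 4

def fmm_segment(sentence, dictionary=TOY_DICT, max_len=MAX_WORD_LEN):
    """Segment by greedily matching the longest dictionary word at each position;
    unmatched single characters pass through as one-character segments."""
    segments, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + length]
            if length == 1 or word in dictionary:
                segments.append(word)
                i += length
                break
    return segments

# With a duplicated "计", the overlapped character lands in adjacent segments
# ("计" and "计费") -- the situation the detection method targets.
print(fmm_segment("有效核对数量计计费"))  # ['有效', '核对', '数量', '计', '计费']
```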
Furthermore, model training that uses the part-of-speech consistency and pinyin consistency of the segments as model features can be accomplished with a small number of training samples (e.g., about 1,000 pieces of data).
Further optionally, in another example, the method may further include: determining the confusion degree (perplexity) score change value of the sentence before and after the overlapped word is removed, and detecting overlapped-word errors in the sentence using the obtained word segmentation information together with the perplexity score change value.
In this specification, the term "confusion" (perplexity, ppl) indicates whether a sentence is fluent and conforms to natural speaking logic. Confusion is typically characterized using a perplexity score: the more fluent the sentence, the lower its ppl score. The ppl score may be predicted using a language model.
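To make the ppl score concrete, the sketch below scores sentences with a toy add-one-smoothed character-bigram language model. This is only a didactic stand-in: the two-sentence corpus is invented, and the specification does not prescribe which language model computes ppl; a real system would use a trained n-gram or neural language model.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count character unigrams/bigrams over the corpus (with <s> sentence start)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        chars = ["<s>"] + list(sent)
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            unigrams[a] += 1
            bigrams[(a, b)] += 1
    return unigrams, bigrams, len(vocab)

def ppl(sentence, model):
    """Perplexity = exp(-mean log P(char | previous char)), add-one smoothed."""
    unigrams, bigrams, v = model
    chars = ["<s>"] + list(sentence)
    log_p = 0.0
    for a, b in zip(chars, chars[1:]):
        log_p += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
    return math.exp(-log_p / (len(chars) - 1))

model = train_bigram(["有效核对数量计费", "按数量计费"])
# The fluent sentence scores lower ppl than the one with the duplicated "计".
print(ppl("数量计费", model) < ppl("数量计计费", model))  # True
```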
For example, ppl scoring may be performed with a language model on the original sentence and on the sentence after the overlapped word is removed. If the ppl score becomes lower after removal, the probability that an overlapped-word error exists is considered greater. Accordingly, the ppl score change value may be treated as one more model feature of the overlapped-word discrimination model, so that, compared with the previously described example, the dimension of the model feature vector increases from 4 to 5. The corresponding dimension value is "1" when the ppl score becomes lower, and "0" when the ppl score is unchanged or becomes higher. The 5-dimensional model feature vector is then supplied as the model input to the overlapped-word discrimination model to perform model prediction.
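The extension from 4 to 5 dimensions can be sketched as follows. The function names and the numeric ppl values are invented for illustration; in practice the two ppl scores come from the language model applied to the sentence with and without the overlapped word.

```python
def ppl_feature(ppl_with, ppl_without):
    """Fifth feature dimension: 1 if removing the overlapped word lowers the
    ppl score, 0 if the score is unchanged or becomes higher."""
    return 1 if ppl_without < ppl_with else 0

def extend_with_ppl(vec4, ppl_with, ppl_without):
    """Append the ppl-change feature to a 4-dimensional feature vector."""
    return vec4 + [ppl_feature(ppl_with, ppl_without)]

# Removing the duplicated character makes the sentence more fluent (invented values):
print(extend_with_ppl([1, 1, 1, 1], ppl_with=230.5, ppl_without=118.2))  # [1, 1, 1, 1, 1]
```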
With this method, the accuracy of overlapped-word error detection can be further improved by using the ppl score change value as an additional feature dimension.
Fig. 7 shows a block diagram of an apparatus for detecting overlapped-word errors in a sentence (hereinafter referred to as "overlapped-word error detection apparatus") 700 according to an embodiment of the present specification. As shown in fig. 7, the overlapped-word error detection apparatus 700 includes a word segmentation processing unit 710, a word segmentation information acquisition unit 720, and an overlapped-word error detection unit 730.
The word segmentation processing unit 710 is configured to perform word segmentation processing on a sentence containing overlapped words. The operation of the word segmentation processing unit 710 may refer to the operation 220 described above with reference to fig. 2.
The word segmentation information acquisition unit 720 is configured to acquire, when the overlapped characters are located in adjacent segments, the word segmentation information of the segments in which the overlapped characters are located, the word segmentation information including the segment part of speech and the segment pinyin. The operation of the word segmentation information acquisition unit 720 may refer to the operation 230 described above with reference to fig. 2.
The overlapped-word error detection unit 730 is configured to detect overlapped-word errors in the sentence using the word segmentation information. The operation of the overlapped-word error detection unit 730 may refer to the operation 240 described above with reference to fig. 2.
Fig. 8 shows a block diagram of one example of an overlapped-word error detection unit 800 according to an embodiment of the present specification. As shown in fig. 8, the overlapped-word error detection unit 800 includes a model input determination module 810 and an overlapped-word error detection module 820.
The model input determination module 810 is configured to determine the model feature vector of the overlapped-word discrimination model from the word segmentation information. The operation of the model input determination module 810 may refer to the operation 610 described above with reference to fig. 6.
The overlapped-word error detection module 820 is configured to provide the determined model feature vector to the overlapped-word discrimination model to detect overlapped-word errors in the sentence. The operation of the overlapped-word error detection module 820 may refer to the operation 620 described above with reference to fig. 6.
Further optionally, in another example, the word segmentation information may further include the number of constituent words of each segment. Accordingly, the model input determination module 810 determines the part-of-speech consistency, the pinyin consistency, and/or the numbers of constituent words of the adjacent segments containing the overlapped word from the word segmentation information, and generates the model feature vector of the overlapped-word discrimination model from these values.
Further alternatively, in another example, the word segmentation processing unit 710 may perform word segmentation processing on sentences containing overlapped words using a text word segmentation algorithm. Examples of text word segmentation algorithms may include, but are not limited to: a dictionary-based text word segmentation algorithm; a statistics-based text word segmentation algorithm; a rule-based text word segmentation algorithm; a model-based text word segmentation algorithm; or a word-labeling-based text word segmentation algorithm.
Further optionally, in another example, the overlapped-word error detection apparatus 700 may further include a confusion degree change determination unit (not shown). The confusion degree change determination unit is configured to determine the perplexity (ppl) score change value of the sentence before and after the overlapped word is removed. Accordingly, the overlapped-word error detection unit 730 detects overlapped-word errors in the sentence using the word segmentation information together with the ppl score change value.
Further optionally, in another example, the overlapped-word error detection apparatus 700 may further include a sentence dividing unit (not shown) and an overlapped-word sentence determination unit (not shown). The sentence dividing unit is configured to perform sentence division on an input text. The overlapped-word sentence determination unit then determines the sentences containing overlapped words from the divided sentences.
The overlapped-word error detection method and the overlapped-word error detection apparatus according to the embodiments of the present specification have been described above with reference to figs. 1 to 8. The overlapped-word error detection apparatus may be implemented in hardware, in software, or in a combination of hardware and software.
Fig. 9 shows a schematic diagram of an electronic device 900 for implementing sentence overlapped-word error detection according to an embodiment of the present specification. As shown in fig. 9, the electronic device 900 may include at least one processor 910, a storage (e.g., a non-volatile storage) 920, a memory 930, and a communication interface 940, which are connected together via a bus 960. The at least one processor 910 executes at least one computer-readable instruction (i.e., an element described above as being implemented in software) stored or encoded in the storage.
In one embodiment, computer-executable instructions are stored in the storage that, when executed, cause the at least one processor 910 to: perform word segmentation processing on a sentence containing overlapped words; when the overlapped characters are located in adjacent segments, acquire the word segmentation information of the segments in which the overlapped characters are located, the word segmentation information including the segment part of speech and the segment pinyin; and detect overlapped-word errors in the sentence using the word segmentation information.
It should be appreciated that the computer-executable instructions stored in the storage, when executed, cause the at least one processor 910 to perform the various operations and functions described above in connection with figs. 1 to 8 in the various embodiments of the present specification.
According to one embodiment, a program product such as a machine-readable medium (e.g., a non-transitory machine-readable medium) is provided. The machine-readable medium may have instructions (i.e., elements described above implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with fig. 1-8 in various embodiments of the specification. In particular, a system or apparatus provided with a readable storage medium having stored thereon software program code implementing the functions of any of the above embodiments may be provided, and a computer or processor of the system or apparatus may be caused to read out and execute instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium may implement the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or cloud by a communications network.
It will be appreciated by those skilled in the art that various changes and modifications can be made to the embodiments disclosed above without departing from the spirit of the invention. Accordingly, the scope of the invention should be limited only by the attached claims.
It should be noted that not all the steps and units in the above flowcharts and the system configuration diagrams are necessary, and some steps or units may be omitted according to actual needs. The order of execution of the steps is not fixed and may be determined as desired. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by multiple physical entities, or may be implemented jointly by some components in multiple independent devices.
In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may include permanently dedicated circuitry or logic (e.g., a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware unit or processor may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The particular implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments, but does not represent all embodiments that may be implemented or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for detecting overlapped-word errors in a sentence, comprising:
performing word segmentation processing on a sentence containing overlapped words;
when the overlapped characters are located in adjacent segments, acquiring the word segmentation information of the segments in which the overlapped characters are located, the word segmentation information including a segment part of speech, a segment pinyin, and a number of constituent words of the segment;
determining, from the word segmentation information, part-of-speech consistency, pinyin consistency, and numbers of constituent words of the adjacent segments containing the overlapped characters;
generating a model feature vector of an overlapped-word discrimination model from the part-of-speech consistency, the pinyin consistency, and the numbers of constituent words; and
providing the model feature vector to the overlapped-word discrimination model to detect overlapped-word errors in the sentence.
2. The method of claim 1, wherein performing word segmentation processing on the sentence containing overlapped words comprises:
using a text word segmentation algorithm to segment the sentence containing overlapped words.
3. The method of claim 2, wherein the text word segmentation algorithm comprises:
a dictionary-based text word segmentation algorithm;
a statistics-based text word segmentation algorithm;
a rule-based text word segmentation algorithm;
a model-based text word segmentation algorithm; or
a word-labeling-based text word segmentation algorithm.
4. The method of claim 1, further comprising:
determining a perplexity score change value of the sentence before and after removing the overlapped word,
wherein generating the model feature vector of the overlapped-word discrimination model from the part-of-speech consistency, the pinyin consistency, and the numbers of constituent words comprises:
generating the model feature vector of the overlapped-word discrimination model from the part-of-speech consistency, the pinyin consistency, the numbers of constituent words, and the perplexity score change value.
5. The method of claim 1, wherein the overlapped-word discrimination model comprises one of:
a random forest model;
a decision tree model;
a gradient boosting tree model;
a neural network model;
a support vector machine; or
a perceptron.
6. The method of claim 1, further comprising:
performing sentence division on an input text; and
determining the sentence containing overlapped words from the divided sentences.
7. An apparatus for detecting overlapped-word errors in a sentence, comprising:
a word segmentation processing unit configured to perform word segmentation processing on a sentence containing overlapped words;
a word segmentation information acquisition unit configured to acquire, when the overlapped characters are located in adjacent segments, the word segmentation information of the segments in which the overlapped characters are located, the word segmentation information including a segment part of speech, a segment pinyin, and a number of constituent words of the segment; and
an overlapped-word error detection unit configured to detect overlapped-word errors in the sentence using the word segmentation information,
wherein the overlapped-word error detection unit comprises:
a model input determination module configured to determine, from the word segmentation information, part-of-speech consistency, pinyin consistency, and numbers of constituent words of the adjacent segments containing the overlapped characters, and to generate a model feature vector of an overlapped-word discrimination model therefrom; and
an overlapped-word error detection module configured to provide the model feature vector to the overlapped-word discrimination model to detect overlapped-word errors in the sentence.
8. The apparatus of claim 7, wherein the word segmentation processing unit performs word segmentation processing on the sentence containing overlapped words using a text word segmentation algorithm.
9. The apparatus of claim 7, further comprising:
a perplexity change determination unit configured to determine a perplexity score change value of the sentence before and after removing the overlapped word,
wherein the model input determination module generates the model feature vector of the overlapped-word discrimination model from the part-of-speech consistency, the pinyin consistency, the numbers of constituent words, and the perplexity score change value.
10. The apparatus of claim 7, further comprising:
a sentence dividing unit configured to perform sentence division on an input text; and
an overlapped-word sentence determination unit configured to determine the sentence containing overlapped words from the divided sentences.
11. An electronic device, comprising:
at least one processor; and
a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 6.
12. A machine-readable storage medium storing executable instructions that, when executed, cause a machine to perform the method of any one of claims 1 to 6.
CN202010842426.7A 2020-08-20 2020-08-20 Method and device for detecting character overlapping errors Active CN111783458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010842426.7A CN111783458B (en) 2020-08-20 2020-08-20 Method and device for detecting character overlapping errors

Publications (2)

Publication Number Publication Date
CN111783458A CN111783458A (en) 2020-10-16
CN111783458B true CN111783458B (en) 2024-05-03

Family

ID=72762169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010842426.7A Active CN111783458B (en) 2020-08-20 2020-08-20 Method and device for detecting character overlapping errors

Country Status (1)

Country Link
CN (1) CN111783458B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324233A (en) * 2011-08-03 2012-01-18 中国科学院计算技术研究所 Method for automatically correcting identification error of repeated words in Chinese pronunciation identification
CN104375986A (en) * 2014-12-02 2015-02-25 江苏科技大学 Automatic acquisition method of Chinese reduplication words
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method
US9405743B1 (en) * 2015-05-13 2016-08-02 International Business Machines Corporation Dynamic modeling of geospatial words in social media
CN106527756A (en) * 2016-10-26 2017-03-22 长沙军鸽软件有限公司 Method and device for intelligently correcting input information
CN106776549A (en) * 2016-12-06 2017-05-31 桂林电子科技大学 A kind of rule-based english composition syntax error correcting method
CN108090045A (en) * 2017-12-20 2018-05-29 珠海市君天电子科技有限公司 A kind of method for building up of marking model, segmenting method and device
CN111144100A (en) * 2019-12-24 2020-05-12 五八有限公司 Question text recognition method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050060150A1 (en) * 2003-09-15 2005-03-17 Microsoft Corporation Unsupervised training for overlapping ambiguity resolution in word segmentation
US8751218B2 (en) * 2010-02-09 2014-06-10 Siemens Aktiengesellschaft Indexing content at semantic level
CN108509408B (en) * 2017-02-27 2019-11-22 芋头科技(杭州)有限公司 A kind of sentence similarity judgment method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cross-type ambiguity resolution algorithm for Chinese word segmentation (中文分词交叉型歧义消解算法); Gan Rong; Journal of Xihua University (Natural Science Edition); 2018-11-20 (No. 06); full text *
Research status and difficulties of Chinese automatic word segmentation (汉语自动分词的研究现状与困难); Zhang Chunxia, Hao Tianyong; Journal of System Simulation; 2005-01-20 (No. 01); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant