CN113887202A - Text error correction method and device, computer equipment and storage medium - Google Patents

Text error correction method and device, computer equipment and storage medium

Info

Publication number
CN113887202A
CN113887202A
Authority
CN
China
Prior art keywords
score
gram model
information
text
dictionary
Prior art date
Legal status
Pending
Application number
CN202111150351.7A
Other languages
Chinese (zh)
Inventor
莫琪 (Mo Qi)
Current Assignee
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN202111150351.7A priority Critical patent/CN113887202A/en
Publication of CN113887202A publication Critical patent/CN113887202A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The application belongs to the field of artificial intelligence and relates to a text error correction method, which comprises: acquiring text information; processing the text information with a forward maximum matching method to obtain a word segmentation set; judging, through an N-gram model trained on background information, whether each participle in the participle set contains wrongly written characters; when a participle contains wrongly written characters, acquiring the target participle position of the error; recalling a candidate word set through a custom dictionary; and screening the candidate word set, taking the candidate word that meets a preset screening condition as the correct candidate word, and replacing the wrongly written characters with the correct candidate word at the target participle position. The application also provides a text error correction device, a computer device, and a storage medium. In addition, the application relates to blockchain technology: the N-gram model and the custom dictionary can be stored in a blockchain. The invention effectively improves the response rate of Chinese text error correction.

Description

Text error correction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text error correction method, apparatus, computer device, and storage medium.
Background
Text error correction refers to the automatic inspection and automatic correction of Chinese sentences, aiming to improve language accuracy and reduce manual verification costs. In the prior art, text error correction is implemented with the pycorrector method. Although a larger vocabulary in the custom dictionary (comprising a near-phonetic dictionary, a near-shape dictionary, and a confusion dictionary) yields a higher error-retrieval success rate, it also greatly prolongs retrieval time, so the response rate of online text error correction is low.
Disclosure of Invention
An embodiment of the present application aims to provide a text error correction method, a text error correction device, a computer device, and a storage medium, so as to solve the problem of a low text error correction response rate in the prior art.
In order to solve the above technical problem, an embodiment of the present application provides a text error correction method, which adopts the following technical solutions:
acquiring text information;
processing the text information by adopting a forward maximum matching method to obtain a word segmentation set, wherein the word segmentation set comprises a plurality of words;
judging whether each participle in the participle set has wrongly-written characters or not through an N-gram model, wherein the N-gram model is obtained through training based on background information, N is not less than 1, and the background information comprises at least one of industry field information, general corpus information and region information;
when the participles in the participle set have wrongly-typed characters, acquiring target participle positions with the wrongly-typed characters;
recalling a set of candidate words through a custom dictionary, wherein the custom dictionary comprises at least one of a near-phonetic dictionary, a near-shape dictionary, and a confusion dictionary, and the set of candidate words comprises at least one candidate word;
and screening the candidate word set to obtain candidate words meeting preset screening conditions as correct candidate words, and replacing the wrongly-written characters with the correct candidate words at the target word segmentation positions.
Further, after the step of obtaining the text information and before the step of processing the text information by using the forward maximum matching method, the method further comprises the following steps:
and performing sentence division processing on the text information.
Further, the step of processing the text information by using a forward maximum matching method includes:
Step A: segmenting the text information according to a preset maximum input character length to obtain a character group and residual text information, wherein the character group comprises at least one character;
Step B: matching the character group through a preset word segmentation dictionary, and judging whether the character group is a phrase in the word segmentation dictionary;
Step C: when the character group is a phrase in the word segmentation dictionary, taking the character group as a word segment, replacing the text information with the residual text information, and repeatedly executing Step A until the terminal character in the text information is segmented;
Step D: when the character group is not a phrase in the word segmentation dictionary, removing the terminal character of the character group, taking the character group with the terminal character removed as a secondary character group, adding the removed character to the head end of the residual text information to obtain secondary residual text information, replacing the character group with the secondary character group and the residual text information with the secondary residual text information, and then repeatedly executing Step B until the terminal character in the text information is segmented;
Step E: when the terminal character in the text information is segmented, combining all the obtained word segments to form the word segmentation set.
Further, the N-gram model comprises a 2-gram model and a 3-gram model, and the step of judging whether each participle in the participle set has wrongly written characters through the N-gram model comprises the following steps:
respectively scoring adjacent participles in the participle set through the 2-gram model and the 3-gram model to obtain a first score of the 2-gram model and a second score of the 3-gram model;
judging whether the first score and the second score are both lower than a preset threshold value;
if the first score and the second score are lower than the preset threshold, determining that wrongly written characters exist in the participle set;
and if the first score and the second score are not lower than the preset threshold, determining that no wrongly written characters exist in the segmentation set.
Further, the N-gram models include at least two 2-gram models and at least two 3-gram models; the step of respectively scoring adjacent participles in the participle set through the 2-gram model and the 3-gram model to obtain a first score of the 2-gram model and a second score of the 3-gram model comprises the following steps:
scoring, by each 2-gram model, adjacent participles in the participle set to obtain first sub-scores, wherein a plurality of first sub-scores are obtained and the background information of each 2-gram model is different;
scoring, by each 3-gram model, adjacent participles in the participle set to obtain second sub-scores, wherein a plurality of second sub-scores are obtained and the background information of each 3-gram model is different.
Further, each 2-gram model scores adjacent participles in the participle set to obtain a first sub-score; each 3-gram model scores adjacent participles in the participle set to obtain a second sub-score, and the steps comprise:
determining whether the industry field information of adjacent participles in the participle set belongs to a preset industry field or not to obtain a judgment result;
and determining whether to perform weighting processing on the obtained first sub-score and the second sub-score based on the judgment result.
Further, the step of performing screening processing on the candidate word set includes:
carrying out coarse screening processing on the candidate word set through a logistic regression model;
and carrying out fine screening processing on the candidate word set subjected to coarse screening through an Xgboost model.
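The two-stage screening described above (a coarse screen followed by a fine re-rank) can be sketched as follows. This is an illustrative sketch only: the hand-written `coarse_score` and `fine_score` functions, their weights, and the candidate features are assumptions standing in for the trained logistic regression and XGBoost models, whose features and parameters the application does not specify here.

```python
import math

def coarse_score(features):
    """Stand-in for the logistic regression coarse screen: a fixed
    linear model squashed through a sigmoid (weights are illustrative)."""
    w, b = [2.0, 1.0], -1.5   # illustrative weights for (similarity, frequency)
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fine_score(features):
    """Stand-in for the XGBoost fine screen: any stronger re-ranking model."""
    similarity, frequency = features
    return 0.7 * similarity + 0.3 * frequency

def screen(candidates, keep_prob=0.5):
    """Two-stage screening: coarse-filter the candidate set, then
    return the top-ranked survivor as the correct candidate word."""
    survivors = [(w, f) for w, f in candidates if coarse_score(f) >= keep_prob]
    return max(survivors, key=lambda wf: fine_score(wf[1]))[0]

# Candidate words with (similarity, frequency) features; values are made up.
candidates = [("账户", (0.9, 0.8)), ("帐户", (0.6, 0.2)), ("账号", (0.8, 0.5))]
print(screen(candidates))  # the surviving, top-ranked candidate
```

In this toy run, "帐户" is dropped by the coarse screen (its sigmoid score falls below 0.5) and "账户" outranks "账号" in the fine screen.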
In order to solve the above technical problem, an embodiment of the present application further provides a text error correction device, which adopts the following technical solutions:
the acquisition module is used for acquiring text information;
the word segmentation module is used for processing the text information by adopting a forward maximum matching method to obtain a word segmentation set, wherein the word segmentation set comprises a plurality of words;
the detection module is used for judging whether each participle in the participle set has a wrongly-typed character or not through an N-gram model, wherein the N-gram model is obtained based on background information training, N is larger than or equal to 1, and the background information comprises at least one of industry field information, general corpus information and regional information;
the positioning module is used for acquiring a target word segmentation position with wrongly-typed characters when the word segmentation in the word segmentation set has wrongly-typed characters;
the candidate recalling module is used for recalling a candidate word set through a self-defined dictionary, wherein the self-defined dictionary comprises at least one of a near-phonetic dictionary, a near-shape dictionary and a confusion dictionary, and the candidate word set comprises at least one candidate word;
and the replacing module is used for screening the candidate word set, acquiring candidate words meeting preset screening conditions as correct candidate words, and replacing the wrongly-written characters with the correct candidate words at the target word segmentation positions.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
The computer device comprises a memory having computer readable instructions stored therein and a processor which, when executing the computer readable instructions, implements the steps of the text error correction method described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the text correction method as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
the method comprises the steps of obtaining text information, processing the text information by adopting a forward maximum matching method to obtain a word segmentation set, wherein the word segmentation set comprises a plurality of words, then whether each word in the word segmentation set has wrongly written characters is judged through an N-gram model, wherein the N-gram model is obtained by training based on background information, N is not less than 1, the background information comprises at least one of industry field information, general corpus information and regional information, when the segmented words in the segmented word set have wrongly-typed characters, the target segmented word position with the wrongly-typed characters is obtained, the candidate word set is recalled through a user-defined dictionary, the candidate word set comprises at least one candidate word, the candidate word set is subjected to screening processing, candidate words meeting preset screening conditions are obtained and serve as correct candidate words, and the wrongly-written characters are replaced by the correct candidate words at the target word segmentation positions; the method and the device adopt a forward maximum matching algorithm to perform word segmentation, and perform error retrieval on each segmented word in a segmented word set through an N-gram model obtained based on background information training so as to achieve the aim of targeted retrieval on the text, reduce time consumption required by error word retrieval, ensure the accuracy of error word retrieval and improve the response rate of error correction of wrongly written characters on a text line.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for use in the description of the embodiments of the present application, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a text correction method according to the present application;
FIG. 3 is a schematic diagram of an embodiment of a text correction device according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
It should be noted that the text error correction method provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the text error correction apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a text correction method according to the present application is shown. The text error correction method comprises the following steps:
step 201: and acquiring text information.
Specifically, the user inputs text information (such as a Chinese-language text) to the above-described terminal device/server through an input device.
Wherein, the input device comprises one or more of a virtual keyboard, a physical keyboard and a voice input device.
Step 202: and processing the text information by adopting a forward maximum matching method to obtain a word segmentation set, wherein the word segmentation set comprises a plurality of words.
Specifically, before the text information is processed with the forward maximum matching method, the maximum input character length is predefined (generally the length of the longest phrase in the dictionary, although it can also be defined by the user). In practical application, the forward maximum matching method performs word segmentation by cutting the text information on the basis of the predefined maximum input character length to obtain a plurality of words, and these words are gathered to form the word segmentation set.
Step 203: and judging whether each participle in the participle set has wrongly-written characters or not through an N-gram model, wherein the N-gram model is obtained through training based on background information, N is not less than 1, and the background information comprises at least one of industry field information, general corpus information and region information.
Specifically, industry field information: in the financial field, for example, the phrase "evaporation" in the stock market means that stock market value evaporates; generally, the stock market value at a certain time point (for example, by year or by month) is compared with the previous value, and the shrinkage of market value caused by a falling stock price is called stock market value evaporation. In daily life, however, "evaporation" describes the phenomenon of liquid converting into gas. Industry field information is therefore introduced to train the N-gram model, so that a finer classification can be made according to industry differences and the error detection accuracy for wrongly written characters in the participles is improved. The industry field information can also be industry vertical field information: a main field is divided into several branch fields (for example, the financial field is subdivided into securities, banking, insurance, and the like) to further improve error detection accuracy.
General corpus information: the phrase "loan" means a form of credit activity in which banks or other financial institutions lend monetary funds at a certain interest rate on condition of repayment; in simple, popular terms, it is borrowed money that bears interest. The meaning of the phrase "loan" is the same or similar across industry fields, so general corpus information is introduced to train the N-gram model and improve its applicability.
Region information: taking the financial field as an example, suppose "deposit" is the common phrase in region A while a different phrase with the same meaning is common in region B; both mean that one party stores a certain fee with the other party to ensure that its own behavior will not damage the other party's interests, the fee being forfeited or offset against compensation if damage is caused. Region information is therefore introduced to train the N-gram model to further improve its applicability.
Further, the background information includes one or more of industry field information, general corpus information, and region information, that is, the background information may be one of the industry field information, the general corpus information, and the region information, or at least two of the industry field information, the general corpus information, and the region information, so as to meet different scene usage requirements and improve the applicability of the N-gram model.
The N-gram model is trained on background information, and the N-gram model meeting the current requirement can be selected according to the current background information, so as to improve the accuracy of wrongly written character retrieval and reduce its time consumption; in the financial domain, for example, N-gram models may be trained for the vertical sub-fields of finance, for general corpora, and for differences in terms across regions.
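As an illustrative sketch of how such a background-specific N-gram model can be obtained from corpus counts (the toy corpus contents, function names, and add-one smoothing are assumptions, not details from this application), a 2-gram model may be trained as:

```python
from collections import Counter
import math

def train_bigram(corpus):
    """Train a 2-gram model from a pre-segmented background corpus.

    corpus: list of sentences, each a list of word segments.
    Returns (unigram_counts, bigram_counts, vocab_size)."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]       # sentence boundary markers
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi, len(uni)

def bigram_logprob(uni, bi, vocab, prev, word):
    # Add-one (Laplace) smoothing so unseen pairs get a small nonzero score.
    return math.log((bi[(prev, word)] + 1) / (uni[prev] + vocab))

def sentence_score(uni, bi, vocab, sent):
    toks = ["<s>"] + sent + ["</s>"]
    return sum(bigram_logprob(uni, bi, vocab, a, b) for a, b in zip(toks, toks[1:]))

# A model trained on finance-domain text scores in-domain phrasing higher
# than a scrambled version of the same words.
finance_corpus = [["market", "value", "evaporated"],
                  ["stock", "price", "fell"],
                  ["market", "value", "fell"]]
uni, bi, v = train_bigram(finance_corpus)
```

Scoring a well-formed in-domain sentence against a scrambled one shows how low joint scores can signal an anomalous word sequence.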
Step 204: and when the participles in the participle set have wrongly-typed characters, acquiring the target participle positions with the wrongly-typed characters.
Specifically, the target participle position with the wrongly written character can be located and marked, such as by one or more of bolding the font, modifying its color (e.g., from black to red), adding line marks (e.g., underlines, wavy lines, zigzag lines), scaling the font (enlarging or reducing its size), and modifying the font theme (e.g., changing the original font to a regular-script theme).
Step 205: recalling a set of candidate words through a custom dictionary, wherein the custom dictionary comprises at least one of a near-phonetic dictionary, a near-shape dictionary, and a confusion dictionary, and the set of candidate words comprises at least one candidate word.
Specifically, the confusion dictionary is an edit-distance confusion dictionary for the 1-gram and 2-gram models. To improve dictionary indexing efficiency and reduce search time, the 1-gram words with their word frequencies and the 1-gram near-phonetic dictionary are stored in a double-array trie, while the 2-gram dictionary is stored in a CSR (compressed sparse row) data structure, from which the 2-gram near-phonetic confusion words can be recovered. To recall the candidate word set within the edit distance, a hierarchical inverted index dictionary is established, which improves search efficiency.
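The recall step can be illustrated with a simplified inverted index over single-character deletions, in the spirit of the inverted index mentioned above (the double-array trie and CSR storage are omitted, and the dictionary contents are illustrative assumptions). Words within edit distance 1 of a typo share a deletion variant with it, so candidates can be looked up without scanning the whole dictionary:

```python
from collections import defaultdict

def deletions(word):
    # All strings obtainable by deleting one character, plus the word itself.
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(dictionary):
    """Inverted index: deletion variant -> dictionary words producing it."""
    index = defaultdict(set)
    for w in dictionary:
        for d in deletions(w):
            index[d].add(w)
    return index

def recall_candidates(index, typo):
    """Recall dictionary words within edit distance 1 of `typo`
    (substitutions, insertions, and deletions all share a variant)."""
    out = set()
    for d in deletions(typo):
        out |= index.get(d, set())
    return out

index = build_index({"loan", "load", "lean", "coin"})
print(recall_candidates(index, "loam"))  # words one edit away from "loam"
```

The same idea extends to edit distance 2 by indexing two-character deletions, at the cost of a larger index.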
Further, a near-phonetic dictionary entry groups characters with the same or similar pronunciation: if an entry contains four homophonous characters and an error is detected on one of them, the remaining three are recalled as candidate words. The near-shape dictionary works likewise for characters with similar shapes, and a confusion dictionary entry groups commonly confused words, the other members of the entry being recalled as candidates when one member is detected as erroneous.
Step 206: and screening the candidate word set to obtain candidate words meeting preset screening conditions as correct candidate words, and replacing the wrongly-written characters with the correct candidate words at the target word segmentation positions.
Specifically, the screening condition is to compare the scores of the candidate words and select the top-ranked candidate word with the highest score as the correct candidate word, which then replaces the wrongly written characters in the target participle.
In some optional implementations of this embodiment, after step 201 and before step 202, further including:
and performing sentence division processing on the text information.
Specifically, the text is divided into sentences based on punctuation marks; further, on a handheld or portable device, sentence splitting can be accomplished through functions built into the Java language.
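A minimal sketch of punctuation-based sentence division follows; it is a regex stand-in rather than the Java built-in functions mentioned above, and the punctuation set is an assumption:

```python
import re

def split_sentences(text):
    """Split text into sentences on Chinese/Western sentence-final
    punctuation, keeping each mark attached to its sentence."""
    # Zero-width lookbehind splits *after* the punctuation character.
    parts = re.split(r"(?<=[。！？!?；;])", text)
    return [p for p in parts if p.strip()]

print(split_sentences("今天天气好。我们去公园！好吗？"))
```

Each resulting sentence is then fed to the word segmentation step independently.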
In some optional implementations of this embodiment, in step 202, the step of processing the text information by using the forward maximum matching method includes:
Step A: segmenting the text information according to a preset maximum input character length to obtain a character group and residual text information, wherein the character group comprises at least one character;
Step B: matching the character group through a preset word segmentation dictionary, and judging whether the character group is a phrase in the word segmentation dictionary;
Step C: when the character group is a phrase in the word segmentation dictionary, taking the character group as a word segment, replacing the text information with the residual text information, and repeatedly executing Step A until the terminal character in the text information is segmented;
Step D: when the character group is not a phrase in the word segmentation dictionary, removing the terminal character of the character group, taking the character group with the terminal character removed as a secondary character group, adding the removed character to the head end of the residual text information to obtain secondary residual text information, replacing the character group with the secondary character group and the residual text information with the secondary residual text information, and then repeatedly executing Step B until the terminal character in the text information is segmented;
Step E: when the terminal character in the text information is segmented, combining all the obtained word segments to form the word segmentation set.
Specifically, take a preset maximum input character LENGTH (MAX_LENGTH) of 3 and an ID-card error prompt in a financial scenario (in the original Chinese, "系统提示身份证号码不正确", "the system prompts that the ID card number is incorrect") as an example. The text information is first cut at MAX_LENGTH to obtain the three-character group "系统提" and the residual text information "示身份证号码不正确". The character group "系统提" is matched against the word segmentation dictionary; since it matches no phrase in the dictionary, its terminal character "提" is removed to obtain the secondary character group "系统", and "提" is added to the head end of the residual text information to obtain the secondary residual text information "提示身份证号码不正确". The character group is then replaced with the secondary character group "系统" and the residual text information with the secondary residual text information, and matching is performed again. Since "系统" ("system") matches a phrase in the dictionary, it is taken as a word segment, the text information is replaced with the residual text information, the replaced text is cut again at MAX_LENGTH, and the dictionary matching steps are repeated. This continues until the terminal character of the text information is segmented, after which all the obtained word segments are combined to form the word segmentation set.
Specifically, the word segmentation dictionary comprises a plurality of phrases used to match the character groups. It can be understood that if a character group fails to match the word segmentation dictionary, its terminal character is removed to obtain a secondary character group, which is then matched against the dictionary; once the match succeeds, the secondary character group is output and recorded as a word segment.
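The forward maximum matching procedure of steps A to E can be sketched as follows; the segmentation dictionary contents and the reconstructed Chinese example sentence are illustrative assumptions:

```python
def forward_max_match(text, dictionary, max_length=3):
    """Forward maximum matching: repeatedly take up to `max_length`
    characters, shrink the group from the tail until it matches a
    dictionary phrase (or a single character remains), then continue
    on the remaining text."""
    segments = []
    while text:
        group = text[:max_length]
        # Step D: drop the terminal character until the group matches
        # the dictionary or only one character is left.
        while len(group) > 1 and group not in dictionary:
            group = group[:-1]
        segments.append(group)        # Step C/E: accept as a word segment
        text = text[len(group):]      # continue on the residual text
    return segments

# Illustrative dictionary for the ID-card prompt example (an assumption,
# not the application's actual word segmentation dictionary).
dic = {"系统", "提示", "身份证", "号码", "正确", "不正确"}
print(forward_max_match("系统提示身份证号码不正确", dic))
```

A single character not found in the dictionary is still emitted as its own segment, the usual fallback in forward maximum matching.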
In some optional implementations, in step 203, the N-gram model includes a 2-gram model and a 3-gram model, and determining whether there is a wrongly-written word in the segmented word set by the N-gram model includes:
respectively scoring adjacent participles in the participle set through the 2-gram model and the 3-gram model to obtain a first score of the 2-gram model and a second score of the 3-gram model;
judging whether the first score and the second score are both lower than a preset threshold value;
if the first score and the second score are both lower than the preset threshold, determining that wrongly written characters exist in the participle set;
and if the first score and the second score are not both lower than the preset threshold, determining that no wrongly written characters exist in the participle set.
Specifically, taking the financial field as an example, if the sentence is 'personal securities account', the word segmentation processing of step S2 yields a participle set comprising 'personal', 'securities' and 'account'. For the 2-gram model, [personal, securities] and [securities, account] are respectively fed into the 2-gram model to obtain a first score A1 and a first score A2; for the 3-gram model, [personal, securities, account] is fed into the 3-gram model to obtain a second score. Whether the first score A1, the first score A2 and the second score are lower than a preset threshold is then judged. The basic rule is that when the first scores and the second score are all lower than the preset threshold, it is determined that a wrongly written character exists in the participles, and the position of the wrongly written character is located, i.e., the position of 'account' in the participle set.
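The detection rule in this example can be sketched as follows. The toy n-gram counts, vocabulary and threshold are hypothetical stand-ins for models trained on background corpora; only the decision structure (a wrongly written character is reported when every 2-gram and 3-gram score falls below the preset threshold) follows the description above.

```python
import math

# Toy counts standing in for trained 2-gram and 3-gram models.
BIGRAM_COUNTS = {("personal", "securities"): 50, ("securities", "account"): 40}
TRIGRAM_COUNTS = {("personal", "securities", "account"): 30}
TOTAL = 1000  # pretend corpus size

def ngram_score(gram, counts):
    # Add-one smoothing keeps unseen n-grams from scoring minus infinity.
    return math.log((counts.get(gram, 0) + 1) / TOTAL)

def has_wrong_character(participles, threshold=-5.0):
    first_scores = [ngram_score(p, BIGRAM_COUNTS)
                    for p in zip(participles, participles[1:])]
    second_scores = [ngram_score(t, TRIGRAM_COUNTS)
                     for t in zip(participles, participles[1:], participles[2:])]
    # An error is reported only when every score (all first scores and the
    # second score) falls below the preset threshold.
    return all(s < threshold for s in first_scores + second_scores)
```

With the counts above, the well-formed sequence scores above the threshold, while a misspelled participle drives all bigram and trigram scores into the smoothed floor and triggers detection.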
In some alternative implementations, the N-gram models include at least two 2-gram models and at least two 3-gram models; the step of respectively performing scoring processing on adjacent participles in the participle set through the 2-gram model and the 3-gram model to obtain a first score of the 2-gram model and a second score of the 3-gram model comprises the following steps of:
each 2-gram model scores adjacent participles in the participle set to obtain a first sub-score, wherein the first sub-score comprises a plurality of first sub-scores, and background information of each 2-gram model is different;
each 3-gram model scores adjacent participles in the participle set to obtain a second sub-score, wherein the second sub-score comprises a plurality of second sub-scores, and background information of each 3-gram model is different.
Specifically, taking two 2-gram models as an example, the background information of the two 2-gram models can be any one of industry field information, general corpus information and region information, provided the two 2-gram models have different background information; the background information of the two 2-gram models can thus be understood as the industry field information and the general corpus information, or the industry field information and the region information, or the general corpus information and the region information. In practical application, for the above [personal, securities], the adjacent participles [personal, securities] are loaded into the two 2-gram models and scored to obtain a first sub-score under each item of background information. When the first scores are compared with the preset threshold, specifically, the first sub-scores among the first scores are compared with the preset threshold one by one to obtain a comparison result of the first scores, and whether a wrongly written character exists in the participle set is judged in combination with the comparison result of the second scores against the preset threshold; for the specific method of comparing the first scores and the second scores with the preset threshold, please refer to the above description.
Similarly, when at least two 3-gram models are respectively trained based on the background information, the principle of the calculation process of each 3-gram model is the same as that of the 2-gram model.
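One way to realize several background-specific models, each producing its own sub-score, is sketched below. The two 2-gram "models" are placeholder functions invented for illustration; in practice each would be an n-gram model trained on a different background corpus (industry, general, or regional).

```python
# Placeholder background-specific 2-gram models: each maps a word pair to a
# log-probability-like score. Real models would be trained on distinct corpora.
MODELS = {
    "industry": lambda pair: -2.0 if pair == ("personal", "securities") else -8.0,
    "general":  lambda pair: -3.5 if pair == ("personal", "securities") else -7.5,
}

def first_sub_scores(pair, models=MODELS):
    """Each background-specific model contributes one first sub-score
    for the same adjacent word pair."""
    return {name: model(pair) for name, model in models.items()}

def all_below_threshold(sub_scores, threshold=-5.0):
    # The sub-scores are compared with the preset threshold one by one;
    # only when every one is below it does this comparison vote for an error.
    return all(score < threshold for score in sub_scores.values())
```

A pair that at least one background corpus recognizes scores above the threshold under that model, so the one-by-one comparison does not flag it, which is how differing backgrounds reduce false detections.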
In some optional implementations, the step in which each 2-gram model scores adjacent participles in the participle set to obtain a first sub-score and each 3-gram model scores adjacent participles in the participle set to obtain a second sub-score comprises the following steps:
determining whether the industry field information of adjacent participles in the participle set belongs to a preset industry field or not to obtain a judgment result;
and determining whether to perform weighting processing on the obtained first sub-score and the second sub-score based on the judgment result.
Specifically, taking two 2-gram models as an example, the background information of the two 2-gram models can be any one of industry field information, general corpus information and region information, provided the two 2-gram models have different background information; the background information of the two 2-gram models can thus be understood as the industry field information and the general corpus information, or the industry field information and the region information, or the general corpus information and the region information. Before adjacent participles in the participle set are scored, whether the industry field information to which the adjacent participles belong is a preset industry field is determined to obtain a judgment result, and the judgment result comprises two types. The first result: if the preset industry field is the financial field, then when the industry field information to which the adjacent participles in the participle set belong is not the financial field, each first sub-score is compared with the preset threshold one by one, the specific comparison process being the comparison of the first sub-scores with the preset threshold (see the description above), so that the influence caused by industry differences, regional differences and poor word bank adaptability is reduced, and the detection error rate of wrongly written characters is further reduced. The second result: when the industry field information to which the adjacent participles in the participle set belong is the financial field, the obtained first sub-scores are first pre-weighted based on the proportions of the industry field information, the general corpus information and the region information in the background information, and are then compared with the preset threshold one by one, the specific comparison process being the comparison of the first scores with the preset threshold (see the description above), so that the influence of each difference in the background information is further reduced, and the detection error rate of wrongly written characters is further reduced. Whether the industry field information to which the adjacent participles in the participle set belong is the preset industry field can be judged by an operator or identified automatically, and when the participles belong to a phrase of the preset industry field information, the processing mode is switched from the first result to the second result.
Similarly, when at least two 3-gram models are respectively trained based on the background information, the background information confirmation and calculation process principle of each 3-gram model is the same as that of the 2-gram model.
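The domain check and the optional pre-weighting described above might be sketched as follows. The finance vocabulary and the weight shares assigned to each kind of background information are illustrative assumptions, not values from this application.

```python
FINANCE_TERMS = {"securities", "account", "loan"}             # hypothetical domain lexicon
WEIGHTS = {"industry": 0.6, "general": 0.3, "regional": 0.1}  # hypothetical shares

def in_preset_industry_field(participles):
    """Judgment result: do the adjacent participles belong to the preset field?"""
    return any(word in FINANCE_TERMS for word in participles)

def combined_score(sub_scores, participles):
    if in_preset_industry_field(participles):
        # Second result: pre-weight the sub-scores by the share of each kind
        # of background information before the threshold comparison.
        return sum(WEIGHTS[name] * score for name, score in sub_scores.items())
    # First result: no weighting; the caller compares each sub-score
    # with the preset threshold one by one instead.
    return None
```

Returning `None` for out-of-domain pairs signals the caller to fall back to the unweighted, one-by-one comparison, matching the switch between the two judgment results described above.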
In some optional implementations, in step 206, the step of performing a filtering process on the candidate word set includes:
carrying out coarse screening processing on the candidate word set through a logistic regression model;
and carrying out fine screening processing on the candidate word set subjected to coarse screening through an Xgboost model.
Specifically, a logistic regression model algorithm is adopted for feature extraction, and obviously wrong candidate words in the candidate word set are filtered out to obtain a candidate word set meeting a preset screening condition; the candidate word set meeting the preset screening condition is then scored through an Xgboost model, the candidate word whose score is greater than a preset threshold and ranks highest in the candidate word set meeting the preset screening condition is taken as the correct candidate word, and the wrongly written character in the target participle is replaced with the correct candidate word. In this way, obviously wrong candidate words are filtered out of the candidate word set by the coarse screening processing, and reducing the number of candidate words in the candidate word set effectively improves the efficiency of the subsequent fine screening processing.
The feature extraction comprises dictionary statistical features (counting the frequency of occurrence of each candidate word in the corpora of the general field and the industry field), edit distance (calculating the edit distance between the original word and the candidate word), pinyin Jaccard distance (calculating the Jaccard distance between the pinyin of the original word and that of the candidate word), and statistical language model scores (calculating the 2-gram model and 3-gram model scores of the candidate word).
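Two of the features listed above have standard definitions that can be sketched directly. Here the Jaccard distance is computed over character sets as a stand-in for pinyin syllable sets, which is an illustrative simplification; a real implementation would first convert each word to pinyin.

```python
def edit_distance(original, candidate):
    """Levenshtein edit distance via one-row dynamic programming."""
    row = list(range(len(candidate) + 1))
    for i, a in enumerate(original, 1):
        diag, row[0] = row[0], i
        for j, b in enumerate(candidate, 1):
            diag, row[j] = row[j], min(row[j] + 1,       # deletion
                                       row[j - 1] + 1,   # insertion
                                       diag + (a != b))  # substitution
    return row[-1]

def jaccard_distance(original, candidate):
    """1 minus the Jaccard similarity of the two character sets."""
    s1, s2 = set(original), set(candidate)
    return 1.0 - len(s1 & s2) / len(s1 | s2)
```

Both functions produce small numeric features per (original word, candidate) pair, which is the form the logistic regression coarse screen and the Xgboost fine screen consume.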
The method and the device adopt a forward maximum matching algorithm for word segmentation, and perform error retrieval on each participle in the participle set through an N-gram model trained on background information, so as to achieve targeted retrieval of the text, reduce the time consumed by erroneous-word retrieval, ensure the accuracy of erroneous-word retrieval, and improve the response speed of wrongly-written-character correction for online text.
It is emphasized that, in order to further ensure the privacy and security of the information, the N-gram model and the custom dictionary information in the above embodiments may also be stored in a node of a blockchain.
The blockchain referred to in the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing information of a batch of network transactions, used for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by computer readable instructions instructing associated hardware; the instructions can be stored in a computer readable storage medium, and when executed, the program can include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in sequence as indicated by the arrows, the steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a text error correction apparatus, which corresponds to the method embodiment shown in fig. 2 and can be applied to various electronic devices.
As shown in fig. 3, the text error correction apparatus 300 according to the present embodiment includes: an acquisition module 301, a segmentation module 302, a detection module 303, a location module 304, a candidate recall module 305, and a replacement module 306. Wherein:
an obtaining module 301, configured to obtain text information;
a word segmentation module 302, configured to process the text information by using a forward maximum matching method to obtain a word segmentation set, where the word segmentation set includes multiple word segments;
a detection module 303, configured to judge whether each participle in the participle set has a wrongly written character or not through an N-gram model, where the N-gram model is obtained through training based on background information, N is greater than or equal to 1, and the background information includes at least one of industry field information, general corpus information, and region information;
a positioning module 304, configured to obtain a target word segmentation position with wrongly written characters when a participle in the participle set has wrongly written characters;
a candidate recall module 305 for recalling a set of candidate words through a self-defined dictionary, wherein the self-defined dictionary comprises at least one of a near-phonetic dictionary, a near-shape dictionary and a confusion dictionary, and the set of candidate words comprises at least one candidate word;
a replacing module 306, configured to perform screening processing on the candidate word set, acquire a candidate word meeting a preset screening condition as a correct candidate word, and replace the wrongly-written character with the correct candidate word at the target word segmentation position.
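The candidate recall performed by module 305 can be sketched as a union of lookups across the custom dictionaries; the dictionary entries below are invented for illustration, and a real near-phonetic dictionary would be keyed on pinyin rather than on the misspelled word itself.

```python
# Hypothetical custom dictionaries mapping a suspect word to candidate words.
NEAR_PHONETIC = {"acount": ["account"]}
NEAR_SHAPE = {"acount": ["amount"]}
CONFUSION = {"acount": ["account", "accounts"]}

def recall_candidates(word):
    """Merge candidate words from every custom dictionary, preserving
    lookup order and dropping duplicates."""
    candidates = []
    for dictionary in (NEAR_PHONETIC, NEAR_SHAPE, CONFUSION):
        for candidate in dictionary.get(word, []):
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates
```

The merged list is the candidate word set that the replacing module then screens; an empty list simply means no dictionary proposed a correction for the word.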
The text error correction device adopts the forward maximum matching algorithm for word segmentation, and performs error retrieval on each participle in the participle set through the N-gram model trained on background information, so as to achieve targeted retrieval of the text, which helps reduce the time consumed by erroneous-word retrieval, ensures the accuracy of erroneous-word retrieval, and improves the response speed of wrongly-written-character correction for online text.
In some optional implementations of this embodiment, the method further includes: and a sentence dividing module 307, configured to perform sentence dividing processing on the text information.
In some optional implementations of this embodiment, the word segmentation module 302 includes:
the segmentation unit is used for segmenting the text information according to a preset maximum input character length to obtain a character group and residual text information, wherein the character group comprises at least one character;
the matching unit is used for matching the character group through a preset word segmentation dictionary and judging whether the character group is a word group in the word segmentation dictionary;
and a first judging unit, configured to, when the character group is a word group in the word segmentation dictionary, use the character group as a participle, replace the text information with the residual text information, and repeatedly execute the step A until the terminal character in the text information has been segmented.
And a second judging unit, configured to, when the character group is not a phrase in the word segmentation dictionary, remove the terminal character of the character group, use the character group from which the terminal character has been removed as a secondary character group, add the removed character to the head end of the residual text information to obtain secondary residual text information, replace the character group with the secondary character group and replace the residual text information with the secondary residual text information, and then repeatedly execute the step B until the terminal character in the text information has been segmented.
And the converging unit is used for converging all the obtained participles to form the participle set after the terminal character in the text information is segmented.
In some optional implementations of this embodiment, the detecting module 303 includes:
a scoring unit, configured to score adjacent segmented words in the segmented word set through the 2-gram model and the 3-gram model respectively to obtain a first score of the 2-gram model and a second score of the 3-gram model;
and the third judging unit is used for judging whether the first score and the second score are both lower than a preset threshold value.
A first determining unit, configured to determine that a wrongly-written or mispronounced character exists in the participle set if the first score and the second score are both lower than the preset threshold;
and the second determining unit is used for determining that no wrongly written characters exist in the participle set if the first score and the second score are not lower than the preset threshold.
In some optional implementations of this embodiment, the scoring unit includes:
a first scoring subunit, configured to score, by each 2-gram model, adjacent segmented words in the segmented word set to obtain a first sub-score, where the first sub-score includes multiple first sub-scores, and background information of each 2-gram model is different;
and the second scoring subunit is used for scoring adjacent participles in the participle set by each 3-gram model to obtain a second sub-score, wherein the second sub-score comprises a plurality of second sub-scores, and the background information of each 3-gram model is different.
In some optional implementation manners of this embodiment, the detecting module 303 further includes:
the third determining unit is used for determining whether the industry field information to which the adjacent participles in the participle set belong is a preset industry field or not to obtain a judgment result;
a fourth determination unit configured to determine whether to perform weighting processing on the obtained first sub-score and second sub-score based on the determination result.
In some optional implementations of this embodiment, the replacing module 306 includes:
the rough screening unit is used for carrying out rough screening processing on the candidate word set through a logistic regression model;
and the fine screening unit is used for performing fine screening processing on the candidate word set subjected to the coarse screening through an Xgboost model.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to each other via a system bus. It is noted that only the computer device 4 having components 41-43 is shown, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) equipped on the computer device 4. Of course, the memory 41 may also include both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as the program code of a text error correction method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or process data, for example, execute the program code of the text error correction method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The method adopts a forward maximum matching algorithm for word segmentation, and performs error retrieval on each participle in the participle set through an N-gram model trained on background information, so as to achieve targeted retrieval of the text, which helps reduce the time consumed by erroneous-word retrieval, ensures the accuracy of erroneous-word retrieval, and improves the response speed of wrongly-written-character correction for online text.
The present application further provides another embodiment, namely a computer-readable storage medium storing a text error correction program, the text error correction program being executable by at least one processor to cause the at least one processor to perform the steps of the text error correction method described above.
The method adopts a forward maximum matching algorithm for word segmentation, and performs error retrieval on each participle in the participle set through an N-gram model trained on background information, so as to achieve targeted retrieval of the text, which helps reduce the time consumed by erroneous-word retrieval, ensures the accuracy of erroneous-word retrieval, and improves the response speed of wrongly-written-character correction for online text.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. The application is capable of embodiments in many different forms; the embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of the features thereof. All equivalent structures made by using the contents of the specification and the drawings of the present application, applied directly or indirectly in other related technical fields, are likewise within the protection scope of the present application.

Claims (10)

1. A text error correction method, comprising the steps of:
acquiring text information;
processing the text information by adopting a forward maximum matching method to obtain a word segmentation set, wherein the word segmentation set comprises a plurality of words;
judging whether each participle in the participle set has wrongly-written characters or not through an N-gram model, wherein the N-gram model is obtained through training based on background information, N is not less than 1, and the background information comprises at least one of industry field information, general corpus information and region information;
when the participles in the participle set have wrongly written characters, acquiring target participle positions with the wrongly written characters;
recalling a set of candidate words through a custom dictionary, wherein the custom dictionary comprises at least one of a near-phonetic dictionary, a near-shape dictionary, and a confusion dictionary, and the set of candidate words comprises at least one candidate word;
and screening the candidate word set to obtain candidate words meeting preset screening conditions as correct candidate words, and replacing the wrongly-written characters with the correct candidate words at the target word segmentation positions.
2. The text error correction method of claim 1, further comprising, after the step of obtaining text information and before the step of processing the text information using a forward maximum matching method:
and performing sentence division processing on the text information.
3. The text correction method of claim 1 wherein the step of processing the text information using forward maximum matching comprises:
step A: segmenting the text information according to a preset maximum input character length to obtain a character group and residual text information, wherein the character group comprises at least one character;
and B: matching the character group through a preset word segmentation dictionary, and judging whether the character group is a phrase in the word segmentation dictionary;
and C: when the character group is a word group in the word segmentation dictionary, taking the character group as a word segmentation, replacing the text information with the residual text information, and repeatedly executing the step A until the final character in the text information is segmented;
step D: when the character group is not a word group in a word segmentation dictionary, removing the terminal character of the character group, taking the character group with the terminal character removed as a secondary character group, adding the removed character to the head end of the residual text information to obtain secondary residual text information, replacing the character group with the secondary character group and replacing the residual text information with the secondary residual text information, and then repeatedly executing the step B until the terminal character in the text information is segmented;
step E: and when the terminal character in the text information is segmented, converging all the obtained participles to form the participle set.
4. The text error correction method according to any one of claims 1 to 3, wherein the N-gram model comprises a 2-gram model and a 3-gram model, and the step of judging whether each participle in the participle set has a wrongly written character or not through the N-gram model comprises:
respectively scoring adjacent participles in the participle set through the 2-gram model and the 3-gram model to obtain a first score of the 2-gram model and a second score of the 3-gram model;
judging whether the first score and the second score are both lower than a preset threshold value;
if the first score and the second score are both lower than the preset threshold, determining that wrongly written characters exist in the participle set;
and if the first score and the second score are not lower than the preset threshold, determining that no wrongly written characters exist in the participle set.
5. The text correction method according to claim 4, wherein the N-gram models include at least two 2-gram models and at least two 3-gram models; the step of respectively scoring adjacent participles in the participle set through the 2-gram model and the 3-gram model to obtain a first score of the 2-gram model and a second score of the 3-gram model comprises the following steps:
each 2-gram model scores adjacent participles in the participle set to obtain a first sub-score, wherein the first sub-score comprises a plurality of first sub-scores, and background information of each 2-gram model is different;
each 3-gram model scores adjacent participles in the participle set to obtain a second sub-score, wherein the second sub-score comprises a plurality of second sub-scores, and background information of each 3-gram model is different.
6. The text error correction method according to claim 5, wherein the step in which each 2-gram model scores adjacent participles in the participle set to obtain a first sub-score and each 3-gram model scores adjacent participles in the participle set to obtain a second sub-score comprises the following steps:
determining whether the industry field information of adjacent participles in the participle set belongs to a preset industry field or not to obtain a judgment result;
and determining whether to perform weighting processing on the obtained first sub-score and the second sub-score based on the judgment result.
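A minimal sketch of the conditional weighting in claim 6, assuming the weighting is a multiplicative boost applied to the sub-scores of models whose background field matches the segments' industry field. The boost factor and the field names are assumptions; the claim only states that weighting is applied or skipped based on the field-membership judgment.

```python
def weight_sub_scores(sub_scores, segment_field, preset_fields, boost=1.5):
    """Claim 6 sketch: if the adjacent word segments' industry field
    belongs to a preset field, up-weight the sub-scores of the models
    trained on that field; otherwise return the scores unchanged.
    `boost` is an assumed factor, not specified by the patent."""
    if segment_field not in preset_fields:   # judgment result: no weighting
        return dict(sub_scores)
    return {field: score * boost if field == segment_field else score
            for field, score in sub_scores.items()}

scores = {"finance": 0.5, "general": 0.8}
print(weight_sub_scores(scores, "finance", {"finance", "insurance"}))  # boosts "finance"
print(weight_sub_scores(scores, "travel", {"finance", "insurance"}))   # unchanged
```

The effect is that domain-specific models get a larger say only when the text is known to come from their domain, which keeps the general-corpus model dominant on out-of-domain input.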
7. The text error correction method according to any one of claims 1 to 3, wherein the step of screening the candidate word set includes:
performing coarse screening on the candidate word set through a logistic regression model;
and performing fine screening on the coarsely screened candidate word set through an XGBoost model.
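The two-stage screening in claim 7 pairs a cheap linear model (high recall) with a stronger ranker (high precision). The sketch below hand-rolls the logistic-regression stage and substitutes a simple scoring function for the trained XGBoost model, which in practice would be an `xgboost` classifier; the candidate features, weights, and ranker are illustrative assumptions.

```python
import math

def coarse_screen(candidates, weights, bias, keep_prob=0.5):
    """Coarse screening: a logistic-regression probability over per-candidate
    features; keep candidates whose probability exceeds keep_prob."""
    kept = []
    for word, feats in candidates:
        z = bias + sum(w * f for w, f in zip(weights, feats))
        prob = 1.0 / (1.0 + math.exp(-z))
        if prob > keep_prob:
            kept.append((word, feats))
    return kept

def fine_screen(candidates, ranker):
    """Fine screening: re-rank the survivors with a stronger model and keep
    the best one. `ranker` stands in for the trained XGBoost model."""
    return max(candidates, key=lambda c: ranker(c[1]))[0] if candidates else None

# Assumed features per candidate: (corpus frequency, phonetic similarity).
cands = [("银行", [0.9, 0.8]), ("很行", [0.1, 1.0]), ("垠行", [0.0, 0.2])]
weights, bias = [2.0, 2.0], -1.0
survivors = coarse_screen(cands, weights, bias)          # drops "垠行"
best = fine_screen(survivors, ranker=lambda f: f[0] + f[1])
print(best)  # 银行
```

The coarse stage is cheap enough to score every recalled candidate, so the expensive fine-stage model only sees the handful of plausible survivors.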
8. A text error correction apparatus, comprising:
an acquisition module, configured to acquire text information;
a word segmentation module, configured to process the text information with a forward maximum matching method to obtain a word segmentation set, wherein the word segmentation set comprises a plurality of word segments;
a detection module, configured to judge, through an N-gram model, whether each word segment in the word segmentation set contains a wrongly written character, wherein the N-gram model is trained on background information, N is greater than or equal to 1, and the background information comprises at least one of industry field information, general corpus information, and regional information;
a positioning module, configured to acquire the target word segment position at which a wrongly written character exists when a word segment in the word segmentation set contains one;
a candidate recall module, configured to recall a candidate word set through a self-defined dictionary, wherein the self-defined dictionary comprises at least one of a near-phonetic dictionary, a near-shape dictionary, and a confusion dictionary, and the candidate word set comprises at least one candidate word;
and a replacement module, configured to screen the candidate word set, take a candidate word meeting a preset screening condition as a correct candidate word, and replace the wrongly written character at the target word segment position with the correct candidate word.
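The forward maximum matching that the word segmentation module relies on is a standard greedy algorithm: at each position, take the longest dictionary entry that matches the text, and fall back to a single character when nothing matches. A minimal sketch, where the window size and vocabulary are assumed parameters:

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation: at each position try the longest
    window first and shrink it until a dictionary word (or, at length 1,
    a single character) matches."""
    segments, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in dictionary:
                segments.append(word)
                i += length
                break
    return segments

vocab = {"平安", "普惠", "文本", "纠错"}
print(forward_max_match("平安普惠文本纠错", vocab))  # ['平安', '普惠', '文本', '纠错']
```

Because matching is greedy and strictly left-to-right, the output is deterministic and linear in the text length, which is why this method is a common cheap front end before the N-gram detection stage.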
9. A computer device, comprising a memory storing computer-readable instructions and a processor, wherein the processor, when executing the computer-readable instructions, implements the steps of the text error correction method of any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the steps of the text error correction method of any one of claims 1 to 7.
CN202111150351.7A 2021-09-29 2021-09-29 Text error correction method and device, computer equipment and storage medium Pending CN113887202A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111150351.7A CN113887202A (en) 2021-09-29 2021-09-29 Text error correction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113887202A true CN113887202A (en) 2022-01-04

Family

ID=79007803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111150351.7A Pending CN113887202A (en) 2021-09-29 2021-09-29 Text error correction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113887202A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341543A (en) * 2023-05-31 2023-06-27 安徽商信政通信息技术股份有限公司 Method, system, equipment and storage medium for identifying and correcting personal names
CN116341543B (en) * 2023-05-31 2023-09-19 安徽商信政通信息技术股份有限公司 Method, system, equipment and storage medium for identifying and correcting personal names
CN117371445A (en) * 2023-12-07 2024-01-09 深圳市慧动创想科技有限公司 Information error correction method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111897970A (en) Text comparison method, device and equipment based on knowledge graph and storage medium
US10997366B2 (en) Methods, devices and systems for data augmentation to improve fraud detection
CN111639489A (en) Chinese text error correction system, method, device and computer readable storage medium
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN106095972B (en) Information classification method and device
CN110555206A (en) named entity identification method, device, equipment and storage medium
CN113887202A (en) Text error correction method and device, computer equipment and storage medium
WO2021218027A1 (en) Method and apparatus for extracting terminology in intelligent interview, device, and medium
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN110046648B (en) Method and device for classifying business based on at least one business classification model
CN114626731A (en) Risk identification method and device, electronic equipment and computer readable storage medium
CN112686053A (en) Data enhancement method and device, computer equipment and storage medium
CN112084779A (en) Entity acquisition method, device, equipment and storage medium for semantic recognition
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN111723870A (en) Data set acquisition method, device, equipment and medium based on artificial intelligence
CN113626576A (en) Method and device for extracting relational characteristics in remote supervision, terminal and storage medium
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN111309901A (en) Short text classification method and device
CN115730237A (en) Junk mail detection method and device, computer equipment and storage medium
CN113051396B (en) Classification recognition method and device for documents and electronic equipment
CN103942188A (en) Method and device for identifying corpus languages
CN112800771A (en) Article identification method and device, computer readable storage medium and computer equipment
CN113505293B (en) Information pushing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination