CN114970502B - Text error correction method applied to digital government - Google Patents
Text error correction method applied to digital government
- Publication number: CN114970502B (application CN202111633076.4A)
- Authority
- CN
- China
- Prior art keywords
- error correction
- character
- confusion
- result
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/146—Coding or compression of tree-structured data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention belongs to the field of computer technology and specifically relates to a text error correction method applied to digital government. The method comprises model training, data acquisition, data cleaning, text error correction, and data storage. Character pronunciation, glyph, and the character itself are added as features to a pre-trained model, which improves correction accuracy for characters with a similar pronunciation or a similar shape and effectively reduces the workload of supervision and inspection staff: the base model corrects with an accuracy of about 70%, while adding pronunciation and glyph features raises the accuracy to 83%.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a text error correction method applied to a digital government.
Background
Website content is typically checked through manual inspection, system monitoring, feedback from the public and the media, and similar means. Because government information disclosure covers many topics, spans a wide scope, involves a huge volume of information, and carries strict timeliness requirements, manual inspection alone cannot meet the demand. Automated detection is therefore the primary method, and the accuracy of its results is critical: both false detections and missed detections increase the workload of inspection staff.
Common methods for detecting wrongly written characters include wrongly-written-character dictionaries, edit distance, and language models. Dictionary-based error correction carries a high manual cost for building the dictionary and suits only narrow vertical domains with a limited set of wrongly written characters. Edit-distance-based correction uses a method similar to fuzzy string matching and can correct some common wrongly written characters and grammatical errors against correct samples, but it lacks generality. It is therefore necessary to study a text error correction method and system applicable to digital government.
Disclosure of Invention
Aiming at the defects and problems of existing approaches, the invention provides a text error correction method applied to digital government, which effectively solves the problems of the high labor cost and poor generality of existing error correction models.
The technical scheme adopted by the invention for solving the technical problems is as follows: a text error correction method applied to digital governments comprises the following steps:
s1, model training
(1) Obtaining a corpus
Integrating resources such as encyclopedia entries, news headlines, and Zhihu content as the resource library to obtain a corpus;
(2) Formulating confusion decision rules
The confusion judgment rules comprise a glyph confusion rule, a pronunciation confusion rule, and a character confusion rule. The glyph confusion rule decomposes each character into its Wubi (five-stroke) radicals by reverse Wubi coding and writes the decomposition into the glyph vector; one or more radicals are then randomly replaced according to a Wubi error-prone radical library to form the corresponding glyph confusion set, where replacing one radical gives a coding distance of 1 and replacing N radicals gives a coding distance of N;
the pronunciation confusion rule obtains each character's pinyin from a pinyin dictionary, splits it further into initial, final, and tone, and writes the result into the pronunciation vector; the following rules are defined over the pinyin dictionary:
(1) same pronunciation, same tone: edit distance equal to 0;
(2) same pronunciation, different tone: edit distance equal to 1;
(3) flat versus rolled tongue, or front versus back nasal: edit distance equal to 1;
(4) either the initial or the final changed: edit distance equal to 1;
(5) both the initial and the final changed: edit distance greater than 1;
pronunciation confusion sets are generated by selecting edit distances of different sizes;
the character confusion rule uses an edit distance of 1; the confusion consists of randomly selecting a replacement character from the character library;
(3) Obtaining a confusion set
Sample characters are randomly extracted from the corpus each time and preprocessed to obtain a sample set; the sample set is then corrupted, in fixed proportions, by the glyph confusion rule, the pronunciation confusion rule, and the character confusion rule to produce the corresponding confusion sets;
(4) Model training
Taking the confusion set as an input set, taking the sample set as a comparison set, forming one-to-one corresponding sentence pairs, and performing model training in an end-to-end mode to finally obtain an error correction model;
s2, data acquisition
Receiving a website domain name, recursively collecting the links in all web pages, removing duplicate links by URL hash, assembling the collection process information and collection results into JSON, and pushing it to the KAFKA message system as the data acquisition result;
s3, data cleaning
Subscribing to a KAFKA message, consuming the data collection results, and performing the steps of:
1. subscribing KAFKA information, consuming data acquisition results, identifying content types and filtering non-HTML types;
2. webpage preprocessing: encoding characters of a page source code by using a charset attribute in the collected information, then serializing a collected result, and analyzing a webpage into a DOM tree;
3. extracting the webpage tags: extracting meta tags from DOM tree and performing classified storage
4. Extracting the web page text: judge whether the meta tag extraction result contains a content page tag to determine whether the page is a content page with body text; take the body out of the content page's HTML source, remove all tags inside it, including style sheets, JavaScript scripts, and comment content, keep the original newline characters, and extract the page's text content after denoising;
5. and (3) outputting a processing result: forming JSON by the tag extraction result and the text extraction result, taking the JSON as a data cleaning result, adding the JSON to the data acquisition result, and pushing the JSON to a KAFKA message system;
s4, text error correction
Subscribing to a KAFKA message, consuming the data cleansing result, performing the steps of:
1. cutting the web page text into sentences by punctuation and paragraph; each sentence is fed into the error correction model to judge whether it contains an error; if so, the corrected result is substituted back into the sentence and fed to the model again, correcting recursively; when two successive correction results are identical, the recursion stops and that result is taken as the model's correction result;
2. forming sentences containing wrongly written characters and error correction results into sentences JSON, and forming sentence arrays by a plurality of JSONs;
3. correcting the wrongly written character position, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic and the fusion characteristic of each sentence in the sentence array to form JSON, and adding a plurality of JSON forming arrays to the corresponding sentences JSON;
the first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second, and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum;
4. and (3) outputting a processing result: adding the extracted text error correction result to a data cleaning result JSON, and pushing to a KAFKA message system;
s5, storing data
Subscribing to KAFKA messages, storing the data acquisition result, the data cleaning result, and the text error correction result into an Elasticsearch storage system, keyed by URL hash as the primary key.
Further, in the process of obtaining the confusion set in S1, 15% of characters are randomly extracted from the corpus each time, and 60% of the characters are subjected to word-sound confusion, 20% of the characters are subjected to font confusion, and 20% of the characters are subjected to character confusion.
Further, in S2, the acquisition process information includes a URL, an IP, a protocol, an agent, a request mode, a request time, acquisition time, an acquisition status, and a server; the acquisition results include a request header, a response header, and corresponding content.
Further, in S3, the rules for performing classification storage are:
(1) website labeling: siteName, siteDomain, siteIDcode, columName
(2) Column label: columnDescription, columnKeywords, columnType
(3) Content page tag: articleTitle, pubDate, contentSource, keywords, author, description, image, url.
Further, in S3, after denoising, text content extraction is performed on the web page using a text extraction algorithm based on the text density and the symbol density of the web page.
The beneficial effects of the invention are as follows: the invention discloses a text error correction method and system for publicly disclosed government information.
To solve the problems above, the invention adds character pronunciation, glyph, and the character itself as features to a pre-trained model for training. This improves correction accuracy for characters with a similar pronunciation or a similar shape and effectively reduces the workload of supervision and inspection staff: the base model's correction accuracy is about 70%, and adding pronunciation and glyph features raises it to 83%.
Meanwhile, corresponding weights are assigned to pronunciation, glyph, and character during error correction. In particular, since most current input habits rely on pinyin input, the pronunciation weight exceeds one half, which allows wrong characters to be judged accurately and improves accuracy.
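The weighting described above can be sketched as a convex combination over the three channels. The text only states that the pronunciation weight exceeds one half, so the 0.6/0.2/0.2 values below are assumptions for illustration:

```python
# Assumed channel weights; only "pronunciation > 0.5" is stated in the text.
CHANNEL_WEIGHTS = {"pronunciation": 0.6, "glyph": 0.2, "character": 0.2}

def weighted_error_score(probabilities):
    """Combine per-channel error probabilities using the channel weights."""
    return sum(CHANNEL_WEIGHTS[name] * probabilities[name]
               for name in CHANNEL_WEIGHTS)
```

Because the weights sum to 1, the combined score stays in [0, 1] whenever the per-channel probabilities do.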
Drawings
FIG. 1 is a schematic diagram of an error correction process according to the present invention.
FIG. 2 is a schematic diagram of fusion error correction of characters, pronunciation and font.
Detailed description of the preferred embodiments
The invention is further illustrated by the following examples in conjunction with the drawings.
Example 1: this embodiment mainly considers character pronunciation, the character itself, and the glyph during error correction, and provides a text error correction method applied to digital government.
When implemented, the embodiment comprises the following steps:
S1, model training
(1) First, a corpus is obtained
Integrating resources such as encyclopedia entries, news headlines, and Zhihu content as the resource libraries to obtain a corpus;
(2) Formulating confusion decision rules
The confusion judgment rules comprise a glyph confusion rule, a pronunciation confusion rule, and a character confusion rule. The glyph confusion rule decomposes each character into its Wubi (five-stroke) radicals by reverse Wubi coding and writes the decomposition into the glyph vector; one or more radicals are then randomly replaced according to a Wubi error-prone radical library to form the corresponding glyph confusion set, where replacing one radical gives a coding distance of 1 and replacing N radicals gives a coding distance of N. Wubi coding splits a character into several independent parts, which effectively reduces dimensionality compared with stroke-level coding and noticeably improves the model's effect and performance.
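The glyph confusion rule can be sketched as follows. The radical decompositions and the error-prone substitution table below are hypothetical placeholders, since real Wubi codes and the error-prone radical library would come from a full input-method dictionary:

```python
import random

# Hypothetical radical decompositions standing in for real Wubi codes.
WUBI = {
    "休": ("W", "S"),       # placeholder radicals, not true Wubi output
    "体": ("W", "S", "G"),
}
# Hypothetical error-prone radical substitutions (the "error-prone library").
ERROR_PRONE = {"W": ["T", "R"], "S": ["A", "D"], "G": ["F"]}

def glyph_confusion(char, n_replace=1):
    """Replace n_replace radicals at random; the coding distance equals
    the number of radicals replaced, as the rule specifies."""
    radicals = list(WUBI[char])
    positions = random.sample(range(len(radicals)),
                              k=min(n_replace, len(radicals)))
    for p in positions:
        radicals[p] = random.choice(ERROR_PRONE.get(radicals[p], [radicals[p]]))
    return tuple(radicals), len(positions)  # (confused glyph code, distance)
```

Replacing one radical yields distance 1; replacing N radicals yields distance N, matching the coding-distance convention in the text.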
The pronunciation confusion rule obtains each character's pinyin from a pinyin dictionary, splits it further into initial, final, and tone, and writes the result into the pronunciation vector. The following rules are defined over the pinyin dictionary, illustrated with the idiom 三心二意 (sān xīn èr yì):
(1) same pronunciation, same tone: edit distance equal to 0, recorded as 0000; for example 三 (sān);
(2) same pronunciation, different tone: edit distance equal to 1, recorded as 0100; for example 散 (sǎn, "scattered");
(3) flat versus rolled tongue, or front versus back nasal: edit distance equal to 1, recorded as 0010; for example 善 (shàn, "good");
(4) either the initial or the final changed: edit distance equal to 1, recorded as 0001; for example 森 (sēn);
(5) both the initial and the final changed: edit distance greater than 1; for example 伤 (shāng, "injury").
Pronunciation confusion sets are generated by selecting edit distances of different sizes.
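The five pronunciation rules can be expressed as a small distance function over (initial, final, tone) triples. The flat/rolled and nasal pair lists below are the standard Mandarin pairs, and rule (5) is represented by the value 2 since the text only requires "greater than 1":

```python
# Mandarin flat-vs-rolled-tongue initials and front-vs-back nasal finals.
FLAT_ROLLED = {("z", "zh"), ("c", "ch"), ("s", "sh")}
NASAL_PAIRS = {("an", "ang"), ("en", "eng"), ("in", "ing")}

def pinyin_distance(a, b):
    """Edit distance between two pinyin triples (initial, final, tone)
    following rules (1)-(5) of the pronunciation confusion rule."""
    ai, af, at = a
    bi, bf, bt = b
    if ai == bi and af == bf:
        return 0 if at == bt else 1            # rules (1) and (2)
    if tuple(sorted((ai, bi))) in FLAT_ROLLED and af == bf:
        return 1                               # rule (3): flat vs. rolled tongue
    if tuple(sorted((af, bf))) in NASAL_PAIRS and ai == bi:
        return 1                               # rule (3): front vs. back nasal
    if ai == bi or af == bf:
        return 1                               # rule (4): one component changed
    return 2                                   # rule (5): both changed (> 1)
```

Applied to the 三心二意 examples: 三/三 gives 0, 三/散 gives 1 (tone), 三/善 gives 1 (s vs. sh), 三/森 gives 1 (final), and 三/伤 gives a distance greater than 1.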
The character confusion rule uses an edit distance of 1; the confusion consists of randomly selecting a replacement character from the character library.
(3) Obtaining a confusion set
Sample characters are randomly extracted from the corpus each time and preprocessed to obtain a sample set; the sample set is then corrupted, in fixed proportions, by the glyph confusion rule, the pronunciation confusion rule, and the character confusion rule to produce the corresponding confusion sets.
Preferably, 15% of the characters are randomly extracted from the corpus each time; of these, 60% undergo pronunciation confusion, 20% glyph confusion, and 20% character confusion.
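The 15% sampling rate and the 60/20/20 split can be sketched as a corruption plan over the characters of a corpus sample, with each selected position assigned one of the three confusion types:

```python
import random

def confusion_plan(chars, rate=0.15, seed=0):
    """Pick ~15% of the positions and assign each a confusion type with
    60/20/20 odds for pronunciation, glyph, and character confusion."""
    rng = random.Random(seed)
    k = round(len(chars) * rate)
    plan = {}
    for i in rng.sample(range(len(chars)), k):
        plan[i] = rng.choices(["pronunciation", "glyph", "character"],
                              weights=[60, 20, 20])[0]
    return plan  # position -> confusion type to apply
```

The returned plan would then drive the three confusion rules to produce the input (confusion) set while the untouched sample set serves as the reference.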
(4) Model training
The confusion set serves as the input set and the sample set as the reference set, forming one-to-one sentence pairs; each pair consists of the correct sentence and the corrupted sentence containing wrong characters, shuffled together. Model training is performed end to end, finally yielding the error correction model.
S2, data acquisition
The URL of the website to be checked is imported into the website data acquisition system, with the scope limited to the website's domain name. Links in all web pages are collected recursively (duplicate links are removed by URL hash). The collection process information (URL, IP, protocol, proxy, request method, request time, acquisition time, acquisition status, server, and so on) and the collection results (request header, response header, and corresponding content) are assembled into JSON and pushed to the KAFKA message system as the data acquisition result.
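The acquisition step can be sketched as a breadth-first crawl with URL-hash deduplication. The fetch_links and fetch_page callables are injected placeholders so the sketch stays network-free, and MD5 stands in for whatever hash function the system actually uses:

```python
import hashlib
import json
from collections import deque

def crawl(seed_url, domain, fetch_links, fetch_page):
    """Collect pages within `domain`, deduplicating by URL hash, and emit
    one JSON record per page as the data acquisition result."""
    seen, results = set(), []
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        if url_hash in seen or domain not in url:
            continue                      # duplicate link or out of scope
        seen.add(url_hash)
        results.append(json.dumps({"url": url, "urlHash": url_hash,
                                   "content": fetch_page(url)}))
        queue.extend(fetch_links(url))    # recurse into the page's links
    return results  # in practice these records are pushed to KAFKA
```

In the described system the results would be published to KAFKA rather than returned; the return value here keeps the sketch self-contained.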
S3, data cleaning
Subscribing to a KAFKA message, consuming a data collection result, performing the steps of:
1. subscribing to KAFKA messages, consuming data collection results, identifying content types, filtering non-HTML types.
2. Web page preprocessing: decode the characters of the page source using the charset attribute from the collected information to prevent garbled characters, then serialize the collected result and parse the page into a DOM tree.
3. Extracting the webpage label: extracting meta tag from DOM tree and storing by classification
(1) Website labeling: siteName, siteDomain, siteIDcode, columName
(2) Column label: columnDescription, columnKeywords, columnType
(3) Content page tag: articleTitle, pubDate, contentSource, keywords, author, description, image, url.
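The classified storage of meta tags amounts to routing each extracted name/content pair into one of the three groups listed above (names outside the lists are ignored):

```python
# Tag groups exactly as listed in the classification rules.
TAG_GROUPS = {
    "site":    {"siteName", "siteDomain", "siteIDcode", "columName"},
    "column":  {"columnDescription", "columnKeywords", "columnType"},
    "content": {"articleTitle", "pubDate", "contentSource", "keywords",
                "author", "description", "image", "url"},
}

def classify_meta(meta):
    """Route {name: content} pairs from <meta> tags into the three groups."""
    grouped = {group: {} for group in TAG_GROUPS}
    for name, content in meta.items():
        for group, names in TAG_GROUPS.items():
            if name in names:
                grouped[group][name] = content
    return grouped
```

The presence of any key in the "content" group can then serve as the content-page test used in the next step.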
4. Extracting the web page text: judge whether the meta tag extraction result contains a content page tag to determine whether the page is a content page with body text. Take the body out of the content page's HTML source and remove all tags inside it, including style sheets, JavaScript scripts, comment content, and the like, while keeping the original newline characters. After noise reduction, extract the page's text content using a text extraction algorithm based on the page's text density and symbol density.
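A minimal version of the body-text extraction using only the standard library: <style> and <script> contents are dropped, comments are discarded (HTMLParser routes them away from handle_data), and the remaining text, including its original newlines, is kept. The density-based refinement mentioned above is beyond this sketch:

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping <style> and <script> contents."""
    SKIP_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)  # newlines inside data are preserved

def extract_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    return "".join(parser.parts)
```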
5. And (3) outputting a processing result: and forming JSON by the tag extraction result and the text extraction result, taking the JSON as a data cleaning result, adding the JSON to the data acquisition result, and pushing the JSON to a KAFKA message system.
S4, text error correction
Subscribing to a KAFKA message, consuming the data cleansing result, performing the steps of:
1. Cutting the web page text into sentences by punctuation and paragraph; each sentence is first fed into the error correction model to judge whether it contains an error; if so, the corrected result is substituted back into the sentence and fed to the model again, correcting recursively; when two successive correction results are identical, the recursion stops and that result is taken as the model's correction result, which includes the error probabilities of the character, pronunciation, and glyph channels (the first, second, and third probabilities) and the corrected characters (the first, second, and third characteristics).
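The recursive correction loop stops when two successive results agree. Sketched with the model behind an injected correct_once callable; the max_rounds guard is an added safety net not stated in the text:

```python
def correct_to_fixpoint(sentence, correct_once, max_rounds=10):
    """Re-run the corrector on its own output until it stops changing."""
    previous = sentence
    for _ in range(max_rounds):
        current = correct_once(previous)
        if current == previous:       # two successive results identical
            return current
        previous = current            # feed the corrected sentence back in
    return previous
```

A corrector that fixes one error per pass still converges, which is exactly the case the recursion is designed for.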
2. Forming sentences containing wrongly written characters and error correction results into sentences JSON, and forming sentence arrays by a plurality of JSONs;
3. correcting the wrongly written character position, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic and the fusion characteristic of each sentence in the sentence array to form JSON, and adding a plurality of JSON forming arrays to the corresponding sentences JSON.
The first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second, and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum.
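The fusion rule keeps the channel with the highest probability; the dictionary shape below is an illustrative representation of the three per-channel results:

```python
def fuse(channels):
    """channels maps channel name -> (probability, corrected characteristic).
    The fusion probability is the maximum of the three probabilities and the
    fusion characteristic is the characteristic of that channel."""
    best = max(channels, key=lambda name: channels[name][0])
    probability, characteristic = channels[best]
    return {"fusionProbability": probability,
            "fusionFeature": characteristic,
            "channel": best}
```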
4. And (3) outputting a processing result: and adding the text extraction error correction result to the data cleaning result JSON and pushing the result to the KAFKA message system.
S5, data storage
Subscribing to KAFKA messages, storing the data acquisition result, the data cleaning result, and the text error correction result into an Elasticsearch storage system, keyed by URL hash as the primary key.
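Storage can be sketched as building one document per page, keyed by the URL hash. The field names and the MD5 choice are illustrative assumptions, and the actual indexing would go through an Elasticsearch client using doc_id as the document _id:

```python
import hashlib
import json

def build_storage_doc(url, acquisition, cleaning, correction):
    """Return (doc_id, body): doc_id is the URL hash used as the primary
    key; body bundles the three pipeline results for indexing."""
    doc_id = hashlib.md5(url.encode("utf-8")).hexdigest()
    body = json.dumps({"url": url,
                       "acquisitionResult": acquisition,
                       "cleaningResult": cleaning,
                       "correctionResult": correction})
    return doc_id, body
```

Using the URL hash as the key makes re-crawls of the same page overwrite the previous document instead of duplicating it.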
Claims (5)
1. A text error correction method applied to digital governments is characterized in that: the method comprises the following steps:
s1, model training
(1) First, a corpus is obtained
Integrating resources such as encyclopedia entries, news headlines, and Zhihu content as the resource library to obtain a corpus;
(2) Formulating confusion decision rules
the confusion judgment rules comprise a glyph confusion rule, a pronunciation confusion rule, and a character confusion rule; the glyph confusion rule decomposes each character into its Wubi (five-stroke) radicals by reverse Wubi coding and writes the decomposition into the glyph vector; one or more radicals are randomly replaced according to a Wubi error-prone radical library to form the corresponding glyph confusion set, wherein replacing one radical gives a coding distance of 1 and replacing N radicals gives a coding distance of N;
the word-sound confusion rule obtains the pinyin of the word through a pinyin dictionary, further obtains the initial consonant, the final sound and the tone of the pinyin of the word, and inputs the result into a word-sound vector; the following rules are formulated according to the pinyin dictionary;
(1) the pronunciation is the same, the tone is the same, the edit distance is equal to 0;
(2) the pronunciation is the same, the tone is different, the edit distance is equal to 1;
(3) flatly rolling the tongue, and making the front and back nasal sounds, wherein the editing distance is equal to 1;
(4) changing one of the initial consonant or the final sound, wherein the editing distance is equal to 1;
(5) the initial consonant and the final are changed, and the editing distance is larger than 1;
selecting editing distances with different lengths to generate a character and sound confusion set;
the editing distance of the character confusion rule is 1, and the confusion rule is that a character is randomly selected from a character library to be replaced;
(3) Obtaining a confusion set
Randomly extracting sample characters from a corpus each time, preprocessing the sample characters to obtain a sample set, replacing the sample set according to a font confusion rule, a character and sound confusion rule and a character confusion rule respectively according to a specific proportion, and replacing the corresponding confusion set;
(4) Model training
Taking the confusion set as an input set, taking the sample set as a comparison set, forming one-to-one corresponding sentence pairs, and performing model training in an end-to-end mode to finally obtain an error correction model;
s2, data acquisition
Receiving a website domain name, recursively acquiring links in all webpages, removing repeated links according to URL HASH, forming acquisition process information and acquisition results into a JSON format, and pushing the JSON format as a data acquisition result to a KAFKA message system;
s3, data cleaning
Subscribing to a KAFKA message, consuming the data collection results, and performing the steps of:
1. subscribing KAFKA information, consuming data acquisition results, identifying content types, and filtering non-HTML types;
2. webpage preprocessing: encoding characters of a page source code by using a charset attribute in the collected information, then serializing a collected result, and analyzing a webpage into a DOM tree;
3. extracting the webpage tags: extracting meta tags from DOM tree and performing classified storage
4. Extracting the webpage text: judging whether the meta tag extraction result contains a content page tag or not, and determining whether the webpage is a content page containing a text or not; taking out the body from the HTML source code of the content page, removing all tags in the body, wherein the tags comprise style styles, javaScript scripts and annotation contents, reserving original line feed characters, and extracting the text contents of the webpage after denoising;
5. and (3) outputting a processing result: forming JSON by the tag extraction result and the text extraction result, taking the JSON as a data cleaning result, adding the JSON to the data acquisition result, and pushing the JSON to a KAFKA message system;
s4, text error correction
Subscribing to the KAFKA message, consuming the data cleansing result, performing the steps of:
1. cutting the webpage text into sentences according to punctuation marks and paragraphs, inputting the sentences into an error correction model, judging whether errors exist, if so, putting the corrected result into the sentences, inputting the error correction model again, performing recursive error correction, and if the two error correction results are the same, stopping the recursion, and obtaining the correction result of the error correction model;
2. forming sentences containing wrongly written characters and error correction results into sentences JSON, and forming sentence arrays by a plurality of JSONs;
3. correcting the wrongly written character position, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic and the fusion characteristic of each sentence in the sentence array to form JSON, and adding a plurality of JSON forming arrays to the corresponding sentences JSON;
the first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second, and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum;
4. and (3) outputting a processing result: adding the extracted text error correction result to a data cleaning result JSON, and pushing to a KAFKA message system;
s5, storing data
Subscribing to KAFKA messages, storing the data acquisition result, the data cleaning result, and the text error correction result into an Elasticsearch storage system, keyed by URL hash as the primary key.
2. The text error correction method applied to the digital government according to claim 1, wherein: in the process of obtaining the confusion set in S1, 15% of characters are randomly extracted from the corpus each time, and 60% of the characters are subjected to word-sound confusion, 20% of the characters are subjected to font confusion, and 20% of the characters are subjected to character confusion.
3. The text error correction method applied to the digital government according to claim 1, wherein: in S2, the acquisition process information comprises URL, IP, protocol, proxy, request mode, request time, acquisition time consumption, acquisition state and server; the acquisition results include a request header, a response header, and corresponding content.
4. The text error correction method applied to the digital government according to claim 1, wherein: in S3, the rules for classified storage are:
(1) Website tags: siteName, siteDomain, siteIDcode, columName;
(2) Column tags: columnDescription, columnKeywords, columnType;
(3) Content-page tags: Articletile, pubDate, contentSource, keywords, author, description, image, url.
5. The text error correction method applied to the digital government according to claim 1, wherein: in S3, after noise reduction, text content is extracted from the web page using a text extraction algorithm based on web-page text density and symbol density.
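A minimal sketch of density-based extraction. The patent names the technique but gives no formula; the score below (visible text length per tag, damped by the count of separator symbols) is one common variant, with the symbol set chosen for illustration:

```python
import re

def density_score(html_block: str) -> float:
    """Score a DOM block: more visible text per tag suggests body text,
    while a high density of separator symbols suggests navigation
    boilerplate."""
    tags = re.findall(r"<[^>]+>", html_block)
    text = re.sub(r"<[^>]+>", "", html_block)
    symbols = sum(text.count(s) for s in "|·»")
    text_density = len(text.strip()) / (len(tags) + 1)
    return text_density / (symbols + 1)

nav = "<a>Home</a>|<a>News</a>|<a>About</a>"
body = "<p>The municipal bureau today announced a new public-service portal for residents.</p>"
assert density_score(body) > density_score(nav)
```

Extraction would then keep the highest-scoring block(s); a production version would walk the DOM tree rather than score raw strings.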
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111633076.4A CN114970502B (en) | 2021-12-29 | 2021-12-29 | Text error correction method applied to digital government |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114970502A (en) | 2022-08-30 |
CN114970502B (en) | 2023-03-28 |
Family
ID=82974441
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111633076.4A Active CN114970502B (en) | 2021-12-29 | 2021-12-29 | Text error correction method applied to digital government |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114970502B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115438650B (en) * | 2022-11-08 | 2023-04-07 | 深圳擎盾信息科技有限公司 | Contract text error correction method, system, equipment and medium fusing multi-source characteristics |
CN117236319B (en) * | 2023-09-25 | 2024-04-19 | 中国—东盟信息港股份有限公司 | Real scene Chinese text error correction method based on transducer generation model |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11328317A (en) * | 1998-05-11 | 1999-11-30 | Nippon Telegr & Teleph Corp <Ntt> | Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded |
CN1687877A (en) * | 2005-04-14 | 2005-10-26 | 刘伊翰 | Chinese character input method capable of using English |
CN104916169A (en) * | 2015-05-20 | 2015-09-16 | 江苏理工学院 | Card type German learning tool |
CN110489760A (en) * | 2019-09-17 | 2019-11-22 | 达而观信息科技(上海)有限公司 | Based on deep neural network text auto-collation and device |
CN110765740A (en) * | 2019-10-11 | 2020-02-07 | 深圳市比一比网络科技有限公司 | DOM tree-based full-type text replacement method, system, device and storage medium |
CN112199945A (en) * | 2020-08-19 | 2021-01-08 | 宿迁硅基智能科技有限公司 | Text error correction method and device |
CN112016310A (en) * | 2020-09-03 | 2020-12-01 | 平安科技(深圳)有限公司 | Text error correction method, system, device and readable storage medium |
WO2021189851A1 (en) * | 2020-09-03 | 2021-09-30 | 平安科技(深圳)有限公司 | Text error correction method, system and device, and readable storage medium |
CN112287670A (en) * | 2020-11-18 | 2021-01-29 | 北京明略软件系统有限公司 | Text error correction method, system, computer device and readable storage medium |
CN113361266A (en) * | 2021-06-25 | 2021-09-07 | 达闼机器人有限公司 | Text error correction method, electronic device and storage medium |
CN113642316A (en) * | 2021-07-28 | 2021-11-12 | 平安国际智慧城市科技股份有限公司 | Chinese text error correction method and device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
Junjie Yu et al. Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape. Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2014, 220-223. * |
Li Jianyi et al. A Method for Data Augmentation in Chinese Spelling Correction. Journal of North China Institute of Aerospace Engineering, 2021, Vol. 31, No. 31, 1-5. * |
Also Published As
Publication number | Publication date |
---|---|
CN114970502A (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109145260B (en) | Automatic text information extraction method | |
CN114970502B (en) | Text error correction method applied to digital government | |
CN110609983B (en) | Structured decomposition method for policy file | |
US20150100304A1 (en) | Incremental computation of repeats | |
CN111783394A (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN110413787B (en) | Text clustering method, device, terminal and storage medium | |
CN102253937A (en) | Method and related device for acquiring information of interest in webpages | |
CN107688630B (en) | Semantic-based weakly supervised microbo multi-emotion dictionary expansion method | |
CN113033185B (en) | Standard text error correction method and device, electronic equipment and storage medium | |
CN113268576B (en) | Deep learning-based department semantic information extraction method and device | |
CN113051356A (en) | Open relationship extraction method and device, electronic equipment and storage medium | |
CN111967267A (en) | XLNET-based news text region extraction method and system | |
CN112527981A (en) | Open type information extraction method and device, electronic equipment and storage medium | |
CN113255331B (en) | Text error correction method, device and storage medium | |
Sagcan et al. | Toponym recognition in social media for estimating the location of events | |
CN117034948B (en) | Paragraph identification method, system and storage medium based on multi-feature self-adaptive fusion | |
CN112818693A (en) | Automatic extraction method and system for electronic component model words | |
CN109472020A (en) | A kind of feature alignment Chinese word cutting method | |
CN111274354B (en) | Referee document structuring method and referee document structuring device | |
CN107451215B (en) | Feature text extraction method and device | |
CN114842982B (en) | Knowledge expression method, device and system for medical information system | |
CN110941703A (en) | Integrated resume information extraction method based on machine learning and fuzzy rules | |
CN107145947B (en) | Information processing method and device and electronic equipment | |
CN112966501B (en) | New word discovery method, system, terminal and medium | |
CN110941713A (en) | Self-optimization financial information plate classification method based on topic model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared |