CN114970502B - Text error correction method applied to digital government - Google Patents
Text error correction method applied to digital government
- Publication number: CN114970502B (application CN202111633076.4A)
- Authority
- CN
- China
- Prior art keywords
- error correction
- character
- confusion
- result
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/146—Coding or compression of tree-structured data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention belongs to the field of computer technology and specifically relates to a text error correction method applied to digital government. The method comprises model training, data acquisition, data cleaning, text error correction, and data storage. Character pronunciation, glyph, and the character itself are added as features to a pre-trained model, which improves correction accuracy for characters with a similar pronunciation or a similar shape and effectively reduces the workload of supervision and inspection staff: the base model corrects with an accuracy of about 70%, while adding pronunciation and glyph features raises the accuracy to 83%.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a text error correction method applied to a digital government.
Background
Website content is typically checked through manual inspection, system monitoring, feedback from the public and the media, and similar means. Because government information disclosure covers many topics, spans a wide scope, involves a huge volume of information, and carries strict timeliness requirements, manual inspection alone cannot meet the demand. Automated detection is therefore the primary method, and the accuracy of its results is critical: both false detections and missed detections increase the workload of inspection staff.
Common methods for detecting wrongly written characters include wrongly-written-character dictionaries, edit distance, and language models. Dictionary-based error correction carries a high manual cost for building the dictionary and suits only narrow vertical domains with a limited set of wrongly written characters. Edit-distance-based correction uses a method similar to fuzzy string matching and can correct some common wrongly written characters and grammatical errors against correct samples, but it lacks generality. It is therefore necessary to study a text error correction method and system applicable to digital government.
Disclosure of Invention
Aiming at the defects and problems of existing approaches, the invention provides a text error correction method applied to digital government, which effectively solves the problems of the high labor cost and poor generality of existing error correction models.
The technical scheme adopted by the invention for solving the technical problems is as follows: a text error correction method applied to digital governments comprises the following steps:
s1, model training
(1) Obtaining a corpus
Integrating resources such as encyclopedia entries, news headlines, and Zhihu content as the resource library to obtain a corpus;
(2) Formulating confusion decision rules
The confusion judgment rules comprise a glyph confusion rule, a pronunciation confusion rule, and a character confusion rule. The glyph confusion rule decomposes each character into its Wubi (five-stroke) radicals by reverse Wubi coding and writes the decomposition into the glyph vector; one or more radicals are then randomly replaced according to a Wubi error-prone radical library to form the corresponding glyph confusion set, where replacing one radical gives a coding distance of 1 and replacing N radicals gives a coding distance of N;
the pronunciation confusion rule obtains each character's pinyin from a pinyin dictionary, splits it further into initial, final, and tone, and writes the result into the pronunciation vector; the following rules are defined over the pinyin dictionary:
(1) same pronunciation, same tone: edit distance equal to 0;
(2) same pronunciation, different tone: edit distance equal to 1;
(3) flat versus rolled tongue, or front versus back nasal: edit distance equal to 1;
(4) either the initial or the final changed: edit distance equal to 1;
(5) both the initial and the final changed: edit distance greater than 1;
pronunciation confusion sets are generated by selecting edit distances of different sizes;
the character confusion rule uses an edit distance of 1; the confusion consists of randomly selecting a replacement character from the character library;
(3) Obtaining a confusion set
Sample characters are randomly extracted from the corpus each time and preprocessed to obtain a sample set; the sample set is then corrupted, in fixed proportions, by the glyph confusion rule, the pronunciation confusion rule, and the character confusion rule to produce the corresponding confusion sets;
(4) Model training
Taking the confusion set as an input set, taking the sample set as a comparison set, forming one-to-one corresponding sentence pairs, and performing model training in an end-to-end mode to finally obtain an error correction model;
s2, data acquisition
Receiving a website domain name, recursively collecting the links in all web pages, removing duplicate links by URL hash, assembling the collection process information and collection results into JSON, and pushing it to the KAFKA message system as the data acquisition result;
s3, data cleaning
Subscribing to a KAFKA message, consuming the data collection results, and performing the steps of:
1. subscribing KAFKA information, consuming data acquisition results, identifying content types and filtering non-HTML types;
2. webpage preprocessing: encoding characters of a page source code by using a charset attribute in the collected information, then serializing a collected result, and analyzing a webpage into a DOM tree;
3. extracting the webpage tags: extracting meta tags from DOM tree and performing classified storage
4. Extracting the web page text: judge whether the meta tag extraction result contains a content page tag to determine whether the page is a content page with body text; take the body out of the content page's HTML source, remove all tags inside it, including style sheets, JavaScript scripts, and comment content, keep the original newline characters, and extract the page's text content after denoising;
5. and (3) outputting a processing result: forming JSON by the tag extraction result and the text extraction result, taking the JSON as a data cleaning result, adding the JSON to the data acquisition result, and pushing the JSON to a KAFKA message system;
s4, text error correction
Subscribing to a KAFKA message, consuming the data cleansing result, performing the steps of:
1. cutting the web page text into sentences by punctuation and paragraph; each sentence is fed into the error correction model to judge whether it contains an error; if so, the corrected result is substituted back into the sentence and fed to the model again, correcting recursively; when two successive correction results are identical, the recursion stops and that result is taken as the model's correction result;
2. forming sentences containing wrongly written characters and error correction results into sentences JSON, and forming sentence arrays by a plurality of JSONs;
3. correcting the wrongly written character position, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic and the fusion characteristic of each sentence in the sentence array to form JSON, and adding a plurality of JSON forming arrays to the corresponding sentences JSON;
the first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second, and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum;
4. and (3) outputting a processing result: adding the extracted text error correction result to a data cleaning result JSON, and pushing to a KAFKA message system;
s5, storing data
Subscribing to KAFKA messages, storing the data acquisition result, the data cleaning result, and the text error correction result into an Elasticsearch storage system, keyed by URL hash as the primary key.
Further, in the process of obtaining the confusion set in S1, 15% of characters are randomly extracted from the corpus each time, and 60% of the characters are subjected to word-sound confusion, 20% of the characters are subjected to font confusion, and 20% of the characters are subjected to character confusion.
Further, in S2, the acquisition process information includes a URL, an IP, a protocol, an agent, a request mode, a request time, acquisition time, an acquisition status, and a server; the acquisition results include a request header, a response header, and corresponding content.
Further, in S3, the rules for performing classification storage are:
(1) website labeling: siteName, siteDomain, siteIDcode, columName
(2) Column label: columnDescription, columnKeywords, columnType
(3) Content page tag: articleTitle, pubDate, contentSource, keywords, author, description, image, url.
Further, in S3, after denoising, text content extraction is performed on the web page using a text extraction algorithm based on the text density and the symbol density of the web page.
The beneficial effects of the invention are as follows: the invention discloses a text error correction method and system for publicly disclosed government information.
To solve the problems above, the invention adds character pronunciation, glyph, and the character itself as features to a pre-trained model for training. This improves correction accuracy for characters with a similar pronunciation or a similar shape and effectively reduces the workload of supervision and inspection staff: the base model's correction accuracy is about 70%, and adding pronunciation and glyph features raises it to 83%.
Meanwhile, corresponding weights are assigned to pronunciation, glyph, and character during error correction. In particular, since most current input habits rely on pinyin input, the pronunciation weight exceeds one half, which allows wrong characters to be judged accurately and improves accuracy.
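The weighting described above can be sketched as a convex combination over the three channels. The text only states that the pronunciation weight exceeds one half, so the 0.6/0.2/0.2 values below are assumptions for illustration:

```python
# Assumed channel weights; only "pronunciation > 0.5" is stated in the text.
CHANNEL_WEIGHTS = {"pronunciation": 0.6, "glyph": 0.2, "character": 0.2}

def weighted_error_score(probabilities):
    """Combine per-channel error probabilities using the channel weights."""
    return sum(CHANNEL_WEIGHTS[name] * probabilities[name]
               for name in CHANNEL_WEIGHTS)
```

Because the weights sum to 1, the combined score stays in [0, 1] whenever the per-channel probabilities do.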
Drawings
FIG. 1 is a schematic diagram of an error correction process according to the present invention.
FIG. 2 is a schematic diagram of fusion error correction of characters, pronunciation and font.
Detailed description of the preferred embodiments
The invention is further illustrated by the following examples in conjunction with the drawings.
Example 1: this embodiment mainly considers character pronunciation, the character itself, and the glyph during error correction, and provides a text error correction method applied to digital government.
When implemented, the embodiment comprises the following steps:
S1, model training
(1) First, a corpus is obtained
Integrating resources such as encyclopedia entries, news headlines, and Zhihu content as the resource libraries to obtain a corpus;
(2) Formulating confusion decision rules
The confusion judgment rules comprise a glyph confusion rule, a pronunciation confusion rule, and a character confusion rule. The glyph confusion rule decomposes each character into its Wubi (five-stroke) radicals by reverse Wubi coding and writes the decomposition into the glyph vector; one or more radicals are then randomly replaced according to a Wubi error-prone radical library to form the corresponding glyph confusion set, where replacing one radical gives a coding distance of 1 and replacing N radicals gives a coding distance of N. Wubi coding splits a character into several independent parts, which effectively reduces dimensionality compared with stroke-level coding and noticeably improves the model's effect and performance.
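The glyph confusion rule can be sketched as follows. The radical decompositions and the error-prone substitution table below are hypothetical placeholders, since real Wubi codes and the error-prone radical library would come from a full input-method dictionary:

```python
import random

# Hypothetical radical decompositions standing in for real Wubi codes.
WUBI = {
    "休": ("W", "S"),       # placeholder radicals, not true Wubi output
    "体": ("W", "S", "G"),
}
# Hypothetical error-prone radical substitutions (the "error-prone library").
ERROR_PRONE = {"W": ["T", "R"], "S": ["A", "D"], "G": ["F"]}

def glyph_confusion(char, n_replace=1):
    """Replace n_replace radicals at random; the coding distance equals
    the number of radicals replaced, as the rule specifies."""
    radicals = list(WUBI[char])
    positions = random.sample(range(len(radicals)),
                              k=min(n_replace, len(radicals)))
    for p in positions:
        radicals[p] = random.choice(ERROR_PRONE.get(radicals[p], [radicals[p]]))
    return tuple(radicals), len(positions)  # (confused glyph code, distance)
```

Replacing one radical yields distance 1; replacing N radicals yields distance N, matching the coding-distance convention in the text.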
The pronunciation confusion rule obtains each character's pinyin from a pinyin dictionary, splits it further into initial, final, and tone, and writes the result into the pronunciation vector. The following rules are defined over the pinyin dictionary, illustrated with the idiom 三心二意 (sān xīn èr yì):
(1) same pronunciation, same tone: edit distance equal to 0, recorded as 0000; for example 三 (sān);
(2) same pronunciation, different tone: edit distance equal to 1, recorded as 0100; for example 散 (sǎn, "scattered");
(3) flat versus rolled tongue, or front versus back nasal: edit distance equal to 1, recorded as 0010; for example 善 (shàn, "good");
(4) either the initial or the final changed: edit distance equal to 1, recorded as 0001; for example 森 (sēn);
(5) both the initial and the final changed: edit distance greater than 1; for example 伤 (shāng, "injury").
Pronunciation confusion sets are generated by selecting edit distances of different sizes.
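The five pronunciation rules can be expressed as a small distance function over (initial, final, tone) triples. The flat/rolled and nasal pair lists below are the standard Mandarin pairs, and rule (5) is represented by the value 2 since the text only requires "greater than 1":

```python
# Mandarin flat-vs-rolled-tongue initials and front-vs-back nasal finals.
FLAT_ROLLED = {("z", "zh"), ("c", "ch"), ("s", "sh")}
NASAL_PAIRS = {("an", "ang"), ("en", "eng"), ("in", "ing")}

def pinyin_distance(a, b):
    """Edit distance between two pinyin triples (initial, final, tone)
    following rules (1)-(5) of the pronunciation confusion rule."""
    ai, af, at = a
    bi, bf, bt = b
    if ai == bi and af == bf:
        return 0 if at == bt else 1            # rules (1) and (2)
    if tuple(sorted((ai, bi))) in FLAT_ROLLED and af == bf:
        return 1                               # rule (3): flat vs. rolled tongue
    if tuple(sorted((af, bf))) in NASAL_PAIRS and ai == bi:
        return 1                               # rule (3): front vs. back nasal
    if ai == bi or af == bf:
        return 1                               # rule (4): one component changed
    return 2                                   # rule (5): both changed (> 1)
```

Applied to the 三心二意 examples: 三/三 gives 0, 三/散 gives 1 (tone), 三/善 gives 1 (s vs. sh), 三/森 gives 1 (final), and 三/伤 gives a distance greater than 1.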
The character confusion rule uses an edit distance of 1; the confusion consists of randomly selecting a replacement character from the character library.
(3) Obtaining a confusion set
Sample characters are randomly extracted from the corpus each time and preprocessed to obtain a sample set; the sample set is then corrupted, in fixed proportions, by the glyph confusion rule, the pronunciation confusion rule, and the character confusion rule to produce the corresponding confusion sets.
Preferably, 15% of the characters are randomly extracted from the corpus each time; of these, 60% undergo pronunciation confusion, 20% glyph confusion, and 20% character confusion.
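The 15% sampling rate and the 60/20/20 split can be sketched as a corruption plan over the characters of a corpus sample, with each selected position assigned one of the three confusion types:

```python
import random

def confusion_plan(chars, rate=0.15, seed=0):
    """Pick ~15% of the positions and assign each a confusion type with
    60/20/20 odds for pronunciation, glyph, and character confusion."""
    rng = random.Random(seed)
    k = round(len(chars) * rate)
    plan = {}
    for i in rng.sample(range(len(chars)), k):
        plan[i] = rng.choices(["pronunciation", "glyph", "character"],
                              weights=[60, 20, 20])[0]
    return plan  # position -> confusion type to apply
```

The returned plan would then drive the three confusion rules to produce the input (confusion) set while the untouched sample set serves as the reference.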
(4) Model training
The confusion set serves as the input set and the sample set as the reference set, forming one-to-one sentence pairs; each pair consists of the correct sentence and the corrupted sentence containing wrong characters, shuffled together. Model training is performed end to end, finally yielding the error correction model.
S2, data acquisition
The URL of the website to be checked is imported into the website data acquisition system, with the scope limited to the website's domain name. Links in all web pages are collected recursively (duplicate links are removed by URL hash). The collection process information (URL, IP, protocol, proxy, request method, request time, acquisition time, acquisition status, server, and so on) and the collection results (request header, response header, and corresponding content) are assembled into JSON and pushed to the KAFKA message system as the data acquisition result.
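The acquisition step can be sketched as a breadth-first crawl with URL-hash deduplication. The fetch_links and fetch_page callables are injected placeholders so the sketch stays network-free, and MD5 stands in for whatever hash function the system actually uses:

```python
import hashlib
import json
from collections import deque

def crawl(seed_url, domain, fetch_links, fetch_page):
    """Collect pages within `domain`, deduplicating by URL hash, and emit
    one JSON record per page as the data acquisition result."""
    seen, results = set(), []
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        if url_hash in seen or domain not in url:
            continue                      # duplicate link or out of scope
        seen.add(url_hash)
        results.append(json.dumps({"url": url, "urlHash": url_hash,
                                   "content": fetch_page(url)}))
        queue.extend(fetch_links(url))    # recurse into the page's links
    return results  # in practice these records are pushed to KAFKA
```

In the described system the results would be published to KAFKA rather than returned; the return value here keeps the sketch self-contained.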
S3, data cleaning
Subscribing to a KAFKA message, consuming a data collection result, performing the steps of:
1. subscribing to KAFKA messages, consuming data collection results, identifying content types, filtering non-HTML types.
2. Web page preprocessing: decode the characters of the page source using the charset attribute from the collected information to prevent garbled characters, then serialize the collected result and parse the page into a DOM tree.
3. Extracting the webpage label: extracting meta tag from DOM tree and storing by classification
(1) Website labeling: siteName, siteDomain, siteIDcode, columName
(2) Column label: columnDescription, columnKeywords, columnType
(3) Content page tag: articleTitle, pubDate, contentSource, keywords, author, description, image, url.
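The classified storage of meta tags amounts to routing each extracted name/content pair into one of the three groups listed above (names outside the lists are ignored):

```python
# Tag groups exactly as listed in the classification rules.
TAG_GROUPS = {
    "site":    {"siteName", "siteDomain", "siteIDcode", "columName"},
    "column":  {"columnDescription", "columnKeywords", "columnType"},
    "content": {"articleTitle", "pubDate", "contentSource", "keywords",
                "author", "description", "image", "url"},
}

def classify_meta(meta):
    """Route {name: content} pairs from <meta> tags into the three groups."""
    grouped = {group: {} for group in TAG_GROUPS}
    for name, content in meta.items():
        for group, names in TAG_GROUPS.items():
            if name in names:
                grouped[group][name] = content
    return grouped
```

The presence of any key in the "content" group can then serve as the content-page test used in the next step.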
4. Extracting the web page text: judge whether the meta tag extraction result contains a content page tag to determine whether the page is a content page with body text. Take the body out of the content page's HTML source and remove all tags inside it, including style sheets, JavaScript scripts, comment content, and the like, while keeping the original newline characters. After noise reduction, extract the page's text content using a text extraction algorithm based on the page's text density and symbol density.
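A minimal version of the body-text extraction using only the standard library: <style> and <script> contents are dropped, comments are discarded (HTMLParser routes them away from handle_data), and the remaining text, including its original newlines, is kept. The density-based refinement mentioned above is beyond this sketch:

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping <style> and <script> contents."""
    SKIP_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)  # newlines inside data are preserved

def extract_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    return "".join(parser.parts)
```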
5. And (3) outputting a processing result: and forming JSON by the tag extraction result and the text extraction result, taking the JSON as a data cleaning result, adding the JSON to the data acquisition result, and pushing the JSON to a KAFKA message system.
S4, text error correction
Subscribing to a KAFKA message, consuming the data cleansing result, performing the steps of:
1. Cutting the web page text into sentences by punctuation and paragraph; each sentence is first fed into the error correction model to judge whether it contains an error; if so, the corrected result is substituted back into the sentence and fed to the model again, correcting recursively; when two successive correction results are identical, the recursion stops and that result is taken as the model's correction result, which includes the error probabilities of the character, pronunciation, and glyph channels (the first, second, and third probabilities) and the corrected characters (the first, second, and third characteristics).
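The recursive correction loop stops when two successive results agree. Sketched with the model behind an injected correct_once callable; the max_rounds guard is an added safety net not stated in the text:

```python
def correct_to_fixpoint(sentence, correct_once, max_rounds=10):
    """Re-run the corrector on its own output until it stops changing."""
    previous = sentence
    for _ in range(max_rounds):
        current = correct_once(previous)
        if current == previous:       # two successive results identical
            return current
        previous = current            # feed the corrected sentence back in
    return previous
```

A corrector that fixes one error per pass still converges, which is exactly the case the recursion is designed for.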
2. Forming sentences containing wrongly written characters and error correction results into sentences JSON, and forming sentence arrays by a plurality of JSONs;
3. correcting the wrongly written character position, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic and the fusion characteristic of each sentence in the sentence array to form JSON, and adding a plurality of JSON forming arrays to the corresponding sentences JSON.
The first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second, and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum.
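The fusion rule keeps the channel with the highest probability; the dictionary shape below is an illustrative representation of the three per-channel results:

```python
def fuse(channels):
    """channels maps channel name -> (probability, corrected characteristic).
    The fusion probability is the maximum of the three probabilities and the
    fusion characteristic is the characteristic of that channel."""
    best = max(channels, key=lambda name: channels[name][0])
    probability, characteristic = channels[best]
    return {"fusionProbability": probability,
            "fusionFeature": characteristic,
            "channel": best}
```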
4. And (3) outputting a processing result: and adding the text extraction error correction result to the data cleaning result JSON and pushing the result to the KAFKA message system.
S5, data storage
Subscribing to KAFKA messages, storing the data acquisition result, the data cleaning result, and the text error correction result into an Elasticsearch storage system, keyed by URL hash as the primary key.
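Storage can be sketched as building one document per page, keyed by the URL hash. The field names and the MD5 choice are illustrative assumptions, and the actual indexing would go through an Elasticsearch client using doc_id as the document _id:

```python
import hashlib
import json

def build_storage_doc(url, acquisition, cleaning, correction):
    """Return (doc_id, body): doc_id is the URL hash used as the primary
    key; body bundles the three pipeline results for indexing."""
    doc_id = hashlib.md5(url.encode("utf-8")).hexdigest()
    body = json.dumps({"url": url,
                       "acquisitionResult": acquisition,
                       "cleaningResult": cleaning,
                       "correctionResult": correction})
    return doc_id, body
```

Using the URL hash as the key makes re-crawls of the same page overwrite the previous document instead of duplicating it.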
Claims (5)
1. A text error correction method applied to digital governments is characterized in that: the method comprises the following steps:
s1, model training
(1) First, a corpus is obtained
Integrating resources such as encyclopedia entries, news headlines, and Zhihu content as the resource library to obtain a corpus;
(2) Formulating confusion decision rules
the confusion judgment rules comprise a glyph confusion rule, a pronunciation confusion rule, and a character confusion rule; the glyph confusion rule decomposes each character into its Wubi (five-stroke) radicals by reverse Wubi coding and writes the decomposition into the glyph vector; one or more radicals are randomly replaced according to a Wubi error-prone radical library to form the corresponding glyph confusion set, wherein replacing one radical gives a coding distance of 1 and replacing N radicals gives a coding distance of N;
the word-sound confusion rule obtains the pinyin of the word through a pinyin dictionary, further obtains the initial consonant, the final sound and the tone of the pinyin of the word, and inputs the result into a word-sound vector; the following rules are formulated according to the pinyin dictionary;
(1) the pronunciation is the same, the tone is the same, the edit distance is equal to 0;
(2) the pronunciation is the same, the tone is different, the edit distance is equal to 1;
(3) flatly rolling the tongue, and making the front and back nasal sounds, wherein the editing distance is equal to 1;
(4) changing one of the initial consonant or the final sound, wherein the editing distance is equal to 1;
(5) the initial consonant and the final are changed, and the editing distance is larger than 1;
selecting editing distances with different lengths to generate a character and sound confusion set;
the editing distance of the character confusion rule is 1, and the confusion rule is that a character is randomly selected from a character library to be replaced;
(3) Obtaining a confusion set
Randomly extracting sample characters from a corpus each time, preprocessing the sample characters to obtain a sample set, replacing the sample set according to a font confusion rule, a character and sound confusion rule and a character confusion rule respectively according to a specific proportion, and replacing the corresponding confusion set;
(4) Model training
Taking the confusion set as an input set, taking the sample set as a comparison set, forming one-to-one corresponding sentence pairs, and performing model training in an end-to-end mode to finally obtain an error correction model;
s2, data acquisition
Receiving a website domain name, recursively acquiring links in all webpages, removing repeated links according to URL HASH, forming acquisition process information and acquisition results into a JSON format, and pushing the JSON format as a data acquisition result to a KAFKA message system;
s3, data cleaning
Subscribing to a KAFKA message, consuming the data collection results, and performing the steps of:
1. subscribing KAFKA information, consuming data acquisition results, identifying content types, and filtering non-HTML types;
2. webpage preprocessing: encoding characters of a page source code by using a charset attribute in the collected information, then serializing a collected result, and analyzing a webpage into a DOM tree;
3. extracting the webpage tags: extracting meta tags from DOM tree and performing classified storage
4. Extracting the webpage text: judging whether the meta tag extraction result contains a content page tag or not, and determining whether the webpage is a content page containing a text or not; taking out the body from the HTML source code of the content page, removing all tags in the body, wherein the tags comprise style styles, javaScript scripts and annotation contents, reserving original line feed characters, and extracting the text contents of the webpage after denoising;
5. and (3) outputting a processing result: forming JSON by the tag extraction result and the text extraction result, taking the JSON as a data cleaning result, adding the JSON to the data acquisition result, and pushing the JSON to a KAFKA message system;
s4, text error correction
Subscribing to the KAFKA message, consuming the data cleansing result, performing the steps of:
1. cutting the webpage text into sentences according to punctuation marks and paragraphs, inputting the sentences into an error correction model, judging whether errors exist, if so, putting the corrected result into the sentences, inputting the error correction model again, performing recursive error correction, and if the two error correction results are the same, stopping the recursion, and obtaining the correction result of the error correction model;
2. forming sentences containing wrongly written characters and error correction results into sentences JSON, and forming sentence arrays by a plurality of JSONs;
3. correcting the wrongly written character position, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic and the fusion characteristic of each sentence in the sentence array to form JSON, and adding a plurality of JSON forming arrays to the corresponding sentences JSON;
the first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second, and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum;
4. and (3) outputting a processing result: adding the extracted text error correction result to a data cleaning result JSON, and pushing to a KAFKA message system;
s5, storing data
Subscribing to KAFKA messages, storing the data acquisition result, the data cleaning result, and the text error correction result into an Elasticsearch storage system, keyed by URL hash as the primary key.
2. The text error correction method applied to the digital government according to claim 1, wherein: in the process of obtaining the confusion set in S1, 15% of characters are randomly extracted from the corpus each time, and 60% of the characters are subjected to word-sound confusion, 20% of the characters are subjected to font confusion, and 20% of the characters are subjected to character confusion.
3. The text error correction method applied to the digital government according to claim 1, wherein: in S2, the acquisition process information comprises URL, IP, protocol, proxy, request mode, request time, acquisition time consumption, acquisition state and server; the acquisition results include a request header, a response header, and corresponding content.
4. The text error correction method applied to the digital government according to claim 1, wherein: in S3, the rules for classified storage are:
(1) Website tags: siteName, siteDomain, siteIDcode, columName;
(2) Column tags: columnDescription, columnKeywords, columnType;
(3) Content-page tags: Articletile, pubDate, contentSource, keywords, author, description, image, url.
5. The text error correction method applied to the digital government according to claim 1, wherein: in S3, after noise reduction, text content is extracted from the web page using a text extraction algorithm based on web-page text density and symbol density.
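A minimal sketch of density-based extraction. The patent names the technique but gives no formula; the score below (visible text length per tag, damped by the count of separator symbols) is one common variant, with the symbol set chosen for illustration:

```python
import re

def density_score(html_block: str) -> float:
    """Score a DOM block: more visible text per tag suggests body text,
    while a high density of separator symbols suggests navigation
    boilerplate."""
    tags = re.findall(r"<[^>]+>", html_block)
    text = re.sub(r"<[^>]+>", "", html_block)
    symbols = sum(text.count(s) for s in "|·»")
    text_density = len(text.strip()) / (len(tags) + 1)
    return text_density / (symbols + 1)

nav = "<a>Home</a>|<a>News</a>|<a>About</a>"
body = "<p>The municipal bureau today announced a new public-service portal for residents.</p>"
assert density_score(body) > density_score(nav)
```

Extraction would then keep the highest-scoring block(s); a production version would walk the DOM tree rather than score raw strings.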
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111633076.4A CN114970502B (en) | 2021-12-29 | 2021-12-29 | Text error correction method applied to digital government |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114970502A (en) | 2022-08-30 |
CN114970502B (en) | 2023-03-28 |
Family
ID=82974441
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111633076.4A Active CN114970502B (en) | 2021-12-29 | 2021-12-29 | Text error correction method applied to digital government |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114970502B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115438650B (en) * | 2022-11-08 | 2023-04-07 | 深圳擎盾信息科技有限公司 | Contract text error correction method, system, equipment and medium fusing multi-source characteristics |
CN117236319B (en) * | 2023-09-25 | 2024-04-19 | 中国—东盟信息港股份有限公司 | Real scene Chinese text error correction method based on transducer generation model |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11328317A (en) * | 1998-05-11 | 1999-11-30 | Nippon Telegr & Teleph Corp <Ntt> | Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded |
CN1687877A (en) * | 2005-04-14 | 2005-10-26 | 刘伊翰 | Chinese character input method capable of using English |
CN104916169A (en) * | 2015-05-20 | 2015-09-16 | 江苏理工学院 | Card type German learning tool |
CN110489760A (en) * | 2019-09-17 | 2019-11-22 | 达而观信息科技(上海)有限公司 | Based on deep neural network text auto-collation and device |
CN110765740A (en) * | 2019-10-11 | 2020-02-07 | 深圳市比一比网络科技有限公司 | DOM tree-based full-type text replacement method, system, device and storage medium |
CN112199945A (en) * | 2020-08-19 | 2021-01-08 | 宿迁硅基智能科技有限公司 | Text error correction method and device |
CN112016310A (en) * | 2020-09-03 | 2020-12-01 | 平安科技(深圳)有限公司 | Text error correction method, system, device and readable storage medium |
WO2021189851A1 (en) * | 2020-09-03 | 2021-09-30 | 平安科技(深圳)有限公司 | Text error correction method, system and device, and readable storage medium |
CN112287670A (en) * | 2020-11-18 | 2021-01-29 | 北京明略软件系统有限公司 | Text error correction method, system, computer device and readable storage medium |
CN113361266A (en) * | 2021-06-25 | 2021-09-07 | 达闼机器人有限公司 | Text error correction method, electronic device and storage medium |
CN113642316A (en) * | 2021-07-28 | 2021-11-12 | 平安国际智慧城市科技股份有限公司 | Chinese text error correction method and device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
Junjie Yu et al. Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape. Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2014, 220-223. * |
Li Jianyi et al. A Method for Data Augmentation in Chinese Spelling Correction. Journal of North China Institute of Aerospace Engineering, 2021, Vol. 31, No. 31, 1-5. * |
Also Published As
Publication number | Publication date |
---|---|
CN114970502A (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109145260B (en) | Automatic text information extraction method | |
CN114970502B (en) | Text error correction method applied to digital government | |
CN110609983B (en) | Structured decomposition method for policy file | |
US20150100304A1 (en) | Incremental computation of repeats | |
CN111783394A (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN110413787B (en) | Text clustering method, device, terminal and storage medium | |
CN102253937A (en) | Method and related device for acquiring information of interest in webpages | |
CN107688630B (en) | Semantic-based weakly supervised microbo multi-emotion dictionary expansion method | |
CN113033185B (en) | Standard text error correction method and device, electronic equipment and storage medium | |
CN113268576B (en) | Deep learning-based department semantic information extraction method and device | |
CN113051356A (en) | Open relationship extraction method and device, electronic equipment and storage medium | |
CN111967267A (en) | XLNET-based news text region extraction method and system | |
CN112527981A (en) | Open type information extraction method and device, electronic equipment and storage medium | |
CN113255331B (en) | Text error correction method, device and storage medium | |
Sagcan et al. | Toponym recognition in social media for estimating the location of events | |
CN117034948B (en) | Paragraph identification method, system and storage medium based on multi-feature self-adaptive fusion | |
CN112818693A (en) | Automatic extraction method and system for electronic component model words | |
CN109472020A (en) | A kind of feature alignment Chinese word cutting method | |
CN111274354B (en) | Referee document structuring method and referee document structuring device | |
CN107451215B (en) | Feature text extraction method and device | |
CN114842982B (en) | Knowledge expression method, device and system for medical information system | |
CN110941703A (en) | Integrated resume information extraction method based on machine learning and fuzzy rules | |
CN107145947B (en) | Information processing method and device and electronic equipment | |
CN112966501B (en) | New word discovery method, system, terminal and medium | |
CN110941713A (en) | Self-optimization financial information plate classification method based on topic model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared |