CN114970502B - Text error correction method applied to digital government - Google Patents

Text error correction method applied to digital government

Info

Publication number
CN114970502B
CN114970502B
Authority
CN
China
Prior art keywords
error correction
character
confusion
result
probability
Prior art date
Legal status
Active
Application number
CN202111633076.4A
Other languages
Chinese (zh)
Other versions
CN114970502A (en)
Inventor
吴琼
常诚
王元卓
Current Assignee
China Science And Technology Big Data Research Institute
Original Assignee
China Science And Technology Big Data Research Institute
Priority date
Filing date
Publication date
Application filed by China Science And Technology Big Data Research Institute filed Critical China Science And Technology Big Data Research Institute
Priority to CN202111633076.4A
Publication of CN114970502A
Application granted
Publication of CN114970502B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F 16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/14 Tree-structured documents
    • G06F 40/146 Coding or compression of tree-structured data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/10 Services
    • G06Q 50/26 Government or public services

Abstract

The invention belongs to the technical field of computers and particularly relates to a text error correction method applied to digital government, comprising methods and flows for model training, data acquisition, data cleaning, text error correction, and data storage. Character pronunciation, glyph, and the character itself are added as features to a pre-training model for training, which improves the correction accuracy for errors involving similar pronunciations or similar glyphs and effectively reduces the workload of supervision and inspection personnel. The error correction accuracy of the base model is about 70%, and adding pronunciation and glyph features to the training raises the error correction accuracy to 83%.

Description

Text error correction method applied to digital government
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a text error correction method applied to a digital government.
Background
Website information content is typically checked by manual inspection, system monitoring, public media feedback, and the like. Because government affairs disclosure covers many aspects and a wide scope, involves a huge amount of information, and has strict timeliness requirements, manual inspection alone cannot meet the demand. Systematic detection is therefore the most important detection method, and the accuracy of its results is particularly important: wrong detections or missed detections increase the workload of inspection personnel.
Frequently used methods for detecting wrongly written characters include a wrongly-written-character dictionary, edit distance, and a language model. An error correction algorithm based on a wrongly-written-character dictionary has a high manual cost for constructing the dictionary and is only suitable for narrow vertical domains with a limited set of wrongly written characters. An error correction algorithm based on edit-distance matching, which resembles fuzzy string matching against correct samples, can correct some common wrongly written characters and grammatical errors, but its generality is insufficient. It is therefore necessary to study a text error correction method and system applied to digital government.
Disclosure of Invention
Aiming at the deficiencies and problems of existing approaches, the invention provides a text error correction method applied to digital government, which effectively solves the problems of high labor cost and poor generality of existing error correction models.
The technical solution adopted by the invention to solve the above technical problems is as follows: a text error correction method applied to digital government comprises the following steps:
s1, model training
(1) Obtaining a corpus
Integrating encyclopedia, headlines and Zhihe news as a resource library to obtain a corpus;
(2) Formulating confusion decision rules
The confusion judgment rules comprise a glyph confusion rule, a pronunciation confusion rule, and a character confusion rule; the glyph confusion rule adopts reverse Wubi (five-stroke) coding to decompose each character into its Wubi radicals and inputs the decomposition result into the glyph vector; one or more radicals are randomly replaced according to the Wubi error-prone radical library to form the corresponding glyph confusion set, wherein the coding distance for replacing one radical is recorded as 1, and the coding distance for replacing N radicals is recorded as N;
the pronunciation confusion rule obtains the pinyin of each character through a pinyin dictionary, further obtains the initial, the final, and the tone of the pinyin, and inputs the result into the pronunciation vector; the following rules are formulated according to the pinyin dictionary (a coding sketch of these rules follows the rule list below);
(1) same pronunciation and same tone: the edit distance is equal to 0;
(2) same pronunciation but different tone: the edit distance is equal to 1;
(3) flat versus retroflex initial, or front versus back nasal final: the edit distance is equal to 1;
(4) one of the initial or the final is changed: the edit distance is equal to 1;
(5) both the initial and the final are changed: the edit distance is larger than 1;
selecting edit distances of different sizes generates the pronunciation confusion set;
the edit distance of the character confusion rule is 1, and the confusion operation randomly selects a character from the character library as the replacement;
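As a concrete illustration of how rules (1) to (5) might be encoded, the following is a minimal sketch that assumes each syllable has already been decomposed into an (initial, final, tone) triple by a pinyin dictionary; the pair tables and function name are illustrative assumptions, not part of the patent.

```python
FLAT_RETROFLEX = {("z", "zh"), ("c", "ch"), ("s", "sh")}            # flat vs. retroflex initials
FRONT_BACK_NASAL = {("an", "ang"), ("en", "eng"), ("in", "ing")}    # front vs. back nasal finals

def pinyin_edit_distance(a, b):
    """a and b are (initial, final, tone) triples; returns the rule-based distance."""
    (ia, fa, ta), (ib, fb, tb) = a, b
    if (ia, fa) == (ib, fb):
        return 0 if ta == tb else 1                                 # rules (1) and (2)
    if fa == fb and ((ia, ib) in FLAT_RETROFLEX or (ib, ia) in FLAT_RETROFLEX):
        return 1                                                    # rule (3): flat/retroflex initial
    if ia == ib and ((fa, fb) in FRONT_BACK_NASAL or (fb, fa) in FRONT_BACK_NASAL):
        return 1                                                    # rule (3): front/back nasal final
    if ia == ib or fa == fb:
        return 1                                                    # rule (4): one of initial/final changed
    return 2                                                        # rule (5): both changed, distance > 1

# e.g. sān vs. sàn -> 1, sān vs. shān -> 1, sān vs. shāng -> 2
print(pinyin_edit_distance(("s", "an", 1), ("s", "an", 4)))
print(pinyin_edit_distance(("s", "an", 1), ("sh", "an", 1)))
print(pinyin_edit_distance(("s", "an", 1), ("sh", "ang", 1)))
```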
(3) Obtaining a confusion set
Randomly extracting sample text from the corpus each time and preprocessing it to obtain a sample set; characters in the sample set are then replaced in proportion according to the glyph confusion rule, the pronunciation confusion rule, and the character confusion rule to generate the corresponding confusion set;
(4) Model training
Taking the confusion set as an input set, taking the sample set as a comparison set, forming one-to-one corresponding sentence pairs, and performing model training in an end-to-end mode to finally obtain an error correction model;
s2, data acquisition
Receiving a website domain name, recursively acquiring links in all webpages, removing repeated links according to URL HASH, forming acquisition process information and acquisition results into a JSON format, and pushing the JSON format as a data acquisition result to a KAFKA message system;
s3, data cleaning
Subscribing to a KAFKA message, consuming the data collection results, and performing the steps of:
1. subscribing KAFKA information, consuming data acquisition results, identifying content types and filtering non-HTML types;
2. webpage preprocessing: decoding the page source code using the charset attribute in the collected information, then serializing the collected result and parsing the webpage into a DOM tree;
3. extracting the webpage tags: extracting the meta tags from the DOM tree and storing them by category;
4. extracting the webpage text: judging whether the meta tag extraction result contains a content page tag to determine whether the webpage is a content page containing body text; taking the body out of the HTML source code of the content page, removing all tags in the body, including CSS styles, JavaScript scripts, and HTML comments, while preserving the original line breaks, and extracting the text content of the webpage after denoising;
5. outputting the processing result: forming the tag extraction result and the text extraction result into JSON as the data cleaning result, adding it to the data acquisition result, and pushing it to the KAFKA message system;
s4, text error correction
Subscribing to a KAFKA message, consuming the data cleansing result, performing the steps of:
1. cutting the webpage text into sentences according to punctuation marks and paragraphs, inputting each sentence into the error correction model, and judging whether an error exists; if so, substituting the corrected result into the sentence and inputting it into the error correction model again for recursive correction; when two successive correction results are the same, stopping the recursion and obtaining the correction result of the error correction model;
2. forming each sentence containing wrongly written characters together with its error correction result into a sentence JSON object, a plurality of which form a sentence array;
3. for each sentence in the sentence array, forming the position of the corrected wrongly written character, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic, and the fusion characteristic into a JSON object, a plurality of which form an array added to the corresponding sentence JSON;
the first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second, and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum;
4. outputting the processing result: adding the text error correction result to the data cleaning result JSON and pushing it to the KAFKA message system;
s5, storing data
Subscribing to the KAFKA messages, storing the data acquisition result, the data cleaning result, and the text error correction result into an Elasticsearch storage system, with the URL HASH as the primary key.
Further, when the confusion set is obtained in S1, 15% of the characters are randomly extracted from the corpus each time; of these, 60% are subjected to pronunciation confusion, 20% to glyph confusion, and 20% to character confusion.
Further, in S2, the acquisition process information includes the URL, IP, protocol, proxy, request mode, request time, acquisition time, acquisition status, and server; the acquisition results include the request header, the response header, and the corresponding content.
Further, in S3, the rules for performing classification storage are:
(1) website labeling: siteName, siteDomain, siteIDcode, columName
(2) Column label: columnDescription, columnKeywords, columnType
(3) Content page tag: articleTitle, pubDate, contentSource, keywords, author, description, image, url.
Further, in S3, after denoising, the text content of the webpage is extracted using a text extraction algorithm based on the text density and symbol density of the webpage.
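By way of illustration only, the sketch below scores candidate DOM nodes with a simple text-density and symbol-density heuristic and keeps the best-scoring node as the body text; the scoring formula, tag list, and punctuation set are assumptions, not the patent's exact algorithm.

```python
import re
from bs4 import BeautifulSoup

PUNCT = re.compile(r"[，。！？；：,.!?;:]")   # sentence punctuation used for symbol density

def extract_main_text(html):
    """Pick the DOM node with the highest text-density * symbol-density score."""
    soup = BeautifulSoup(html, "html.parser")
    best_text, best_score = "", 0.0
    for node in soup.find_all(["article", "section", "div", "td"]):
        text = node.get_text(" ", strip=True)
        tags = len(node.find_all(True)) + 1          # descendant tag count
        density = len(text) / tags                   # text density
        symbols = len(PUNCT.findall(text)) + 1       # symbol (punctuation) count
        score = density * symbols
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```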
The invention has the following beneficial effects: the invention discloses a text error correction method and a text error correction system for publicly disclosed government affairs information.
To solve the problems described above, the invention adds character pronunciation, glyph, and the character itself as features to a pre-training model for training. This improves correction accuracy for errors involving similar pronunciations or similar glyphs and effectively reduces the workload of supervision and inspection personnel: the error correction accuracy of the base model is about 70%, while adding pronunciation and glyph features raises the accuracy to 83%.
Meanwhile, corresponding weights are assigned to pronunciation, glyph, and character in the error correction process. In particular, since most current input habits are pinyin-based, the pronunciation weight exceeds one half, so that wrong characters can be judged accurately and the accuracy is further improved.
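As a purely illustrative sketch of how such weights could enter the fusion step, the weights and function below are assumptions; the patent does not spell out the exact combination and elsewhere also describes taking the maximum of the three probabilities.

```python
WEIGHTS = {"sound": 0.6, "glyph": 0.25, "char": 0.15}   # pronunciation weight above one half (assumed values)

def weighted_fusion(p_char, p_sound, p_glyph):
    """Weighted combination of the three error probabilities."""
    return (WEIGHTS["char"] * p_char
            + WEIGHTS["sound"] * p_sound
            + WEIGHTS["glyph"] * p_glyph)

print(weighted_fusion(0.2, 0.7, 0.1))   # pronunciation evidence dominates the fused score
```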
Drawings
FIG. 1 is a schematic diagram of an error correction process according to the present invention.
FIG. 2 is a schematic diagram of fused error correction over characters, pronunciations, and glyphs.
Detailed description of the preferred embodiments
The invention is further illustrated by the following examples in conjunction with the drawings.
Example 1: this embodiment mainly considers character pronunciation, the characters themselves, and glyphs during error correction, and provides a text error correction method applied to digital government.
When the embodiment is implemented, the method comprises the following steps:
S1, model training
(1) First, a corpus is obtained
Integrating encyclopedia, headline, Zhihe news, and similar sources as resource libraries to obtain a corpus;
(2) Formulating confusion decision rules
The confusion judgment rules comprise a glyph confusion rule, a pronunciation confusion rule, and a character confusion rule; the glyph confusion rule adopts reverse Wubi (five-stroke) coding to decompose each character into its Wubi radicals and inputs the decomposition result into the glyph vector; one or more radicals are randomly replaced according to the Wubi error-prone radical library to form the corresponding glyph confusion set, wherein the coding distance for replacing one radical is recorded as 1 and the coding distance for replacing N radicals is recorded as N (a sketch of this substitution follows this set of rules). Wubi coding splits a character into several independent components, which effectively reduces the dimensionality compared with stroke-level coding and noticeably improves the effect and performance of the model.
The pronunciation confusion rule obtains the pinyin of each character through a pinyin dictionary, further obtains the initial, the final, and the tone of the pinyin, and inputs the result into the pronunciation vector; the following rules are formulated according to the pinyin dictionary, illustrated with the phrase sān xīn èr yì (三心二意):
(1) same pronunciation and same tone: the edit distance is equal to 0, recorded as 0000; for example, sān (三) replaced by another character also pronounced sān;
(2) same pronunciation but different tone: the edit distance is equal to 1, recorded as 0100; for example, sān replaced by a character pronounced sǎn or sàn;
(3) flat versus retroflex initial, or front versus back nasal final: the edit distance is equal to 1, recorded as 001; for example, sān replaced by a character pronounced shàn;
(4) one of the initial or the final is changed: the edit distance is equal to 1, recorded as 0; for example, sān replaced by a character pronounced sēn;
(5) both the initial and the final are changed: the edit distance is larger than 1, recorded as x1; for example, sān replaced by shāng as in 伤心;
edit distances of different sizes are selected to generate the pronunciation confusion set.
The edit distance of the character confusion rule is 1, and the confusion operation randomly selects a character from the character library as the replacement.
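As referenced above, a minimal sketch of the Wubi radical substitution used by the glyph confusion rule follows. The decomposition table, error-prone radical map, and function name are small illustrative placeholders, not real Wubi codes or the patent's actual library.

```python
import random

WUBI = {"未": ("二", "小"), "末": ("一", "木")}                           # character -> radical sequence (placeholder)
ERROR_PRONE = {"二": ["一"], "一": ["二"], "小": ["木"], "木": ["小"]}    # error-prone radical substitutions (placeholder)
REVERSE = {}                                                              # radical sequence -> characters
for ch, rads in WUBI.items():
    REVERSE.setdefault(rads, []).append(ch)

def glyph_confusions(char, n=1):
    """Replace n radicals of `char`; return (candidate, coding_distance) pairs."""
    radicals = WUBI.get(char)
    if not radicals:
        return []
    positions = random.sample(range(len(radicals)), min(n, len(radicals)))
    perturbed = list(radicals)
    for pos in positions:
        choices = ERROR_PRONE.get(perturbed[pos], [])
        if choices:
            perturbed[pos] = random.choice(choices)
    return [(cand, len(positions))                                        # distance = number of replaced radicals
            for cand in REVERSE.get(tuple(perturbed), []) if cand != char]

print(glyph_confusions("未", n=2))   # -> [('末', 2)] with these placeholder tables
```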
(3) Obtaining a confusion set
Randomly extracting sample text from the corpus each time and preprocessing it to obtain a sample set; characters in the sample set are then replaced in specific proportions according to the glyph confusion rule, the pronunciation confusion rule, and the character confusion rule to generate the corresponding confusion set;
Preferably, in implementation, 15% of the characters are randomly extracted from the corpus each time; of these, 60% are subjected to pronunciation confusion, 20% to glyph confusion, and 20% to character confusion.
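The sketch below illustrates this sampling scheme: 15% of the characters in a sentence are selected, and the selected positions are corrupted with pronunciation, glyph, or random-character confusions in a 60/20/20 split. The three confusion callables are assumed to be supplied elsewhere (see the earlier sketches); identity functions stand in for them in the demo call.

```python
import random

def corrupt_sentence(sentence, sound_conf, glyph_conf, char_conf, rate=0.15):
    """Corrupt ~rate of the characters: 60% sound, 20% glyph, 20% random character."""
    chars = list(sentence)
    k = max(1, round(len(chars) * rate))
    for pos in random.sample(range(len(chars)), min(k, len(chars))):
        r = random.random()
        if r < 0.60:
            chars[pos] = sound_conf(chars[pos])    # pronunciation confusion
        elif r < 0.80:
            chars[pos] = glyph_conf(chars[pos])    # glyph confusion
        else:
            chars[pos] = char_conf(chars[pos])     # random-character confusion
    return "".join(chars)

identity = lambda c: c                             # stand-ins for the real confusion functions
print(corrupt_sentence("政务公开信息量巨大，时效性要求高。", identity, identity, identity))
```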
(4) Model training
The confusion set is taken as the input set and the sample set as the comparison set, forming one-to-one corresponding sentence pairs, each consisting of a correct sentence and the corresponding error sentence containing the injected wrong characters; model training is performed in an end-to-end manner to finally obtain the error correction model.
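The pairing itself is straightforward; the sketch below shows one way the confusion set (inputs) and the sample set (references) could be assembled into shuffled one-to-one training pairs. The field names and the corruption callable are assumptions, and the end-to-end trainer (any sequence-correction model) is not shown.

```python
import json
import random

def build_training_pairs(sample_sentences, corrupt):
    """Pair each reference sentence with a corrupted copy and shuffle the pairs."""
    pairs = [{"input": corrupt(s), "target": s} for s in sample_sentences]
    random.shuffle(pairs)
    return pairs

pairs = build_training_pairs(["政务公开信息量巨大"], lambda s: s)   # identity corruption for the demo
print(json.dumps(pairs, ensure_ascii=False))
```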
S2, data acquisition
The URL of the website to be inspected is imported into the website data acquisition system, with the crawl range limited to the website's domain name. Links in all webpages are acquired recursively (duplicate links are removed according to the URL HASH), and the acquisition process information (URL, IP, protocol, proxy, request mode, request time, acquisition status, server, and the like) and the acquisition results (request header, response header, and corresponding content) are formed into JSON and pushed to the KAFKA message system as the data acquisition result.
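A minimal sketch of this acquisition step is shown below, assuming the kafka-python client; the topic name, JSON field names, and the choice of MD5 for the URL hash are illustrative assumptions.

```python
import hashlib
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from kafka import KafkaProducer   # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d, ensure_ascii=False).encode("utf-8"),
)

def crawl(start_url, topic="gov-collect"):
    """Recursively collect pages within one domain, deduplicated by URL hash."""
    domain = urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        if url_hash in seen or urlparse(url).netloc != domain:
            continue                                   # duplicate link or outside the site
        seen.add(url_hash)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        producer.send(topic, {                         # acquisition record pushed as JSON
            "url": url, "urlHash": url_hash, "status": resp.status_code,
            "responseHeaders": dict(resp.headers), "content": resp.text,
        })
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
    producer.flush()
```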
S3, data cleaning
Subscribing to a KAFKA message, consuming a data collection result, performing the steps of:
1. subscribing to KAFKA messages, consuming data collection results, identifying content types, filtering non-HTML types.
2. Webpage preprocessing: decoding the page source code using the charset attribute in the collected information to prevent garbled characters, then serializing the collected result and parsing the webpage into a DOM tree.
3. Extracting the webpage tags: extracting the meta tags from the DOM tree and storing them by category:
(1) Website labeling: siteName, siteDomain, siteIDcode, columName
(2) Column label: columnDescription, columnKeywords, columnType
(3) Content page tag: articleTitle, pubDate, contentSource, keywords, author, description, image, url.
4. Extracting the webpage text: judging whether the meta tag extraction result contains a content page tag to determine whether the webpage is a content page containing body text; taking the body out of the HTML source code of the content page, removing all tags in the body, including CSS styles, JavaScript scripts, and HTML comments, while preserving the original line breaks, and after denoising extracting the text content of the webpage using a text extraction algorithm based on the text density and symbol density of the webpage (a combined sketch of steps 2 to 4 follows step 5 below).
5. Outputting the processing result: forming the tag extraction result and the text extraction result into JSON as the data cleaning result, adding it to the data acquisition result, and pushing it to the KAFKA message system.
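The sketch below combines steps 2 to 4 under the assumption that BeautifulSoup is used for the DOM tree; the returned keys and the content-page test are assumptions.

```python
from bs4 import BeautifulSoup, Comment

def clean_page(raw_bytes, charset="utf-8"):
    """Decode with the page charset, extract meta tags, and denoise the body text."""
    soup = BeautifulSoup(raw_bytes.decode(charset, errors="replace"), "html.parser")

    # meta tag extraction, keyed by the name/property attribute
    meta = {m.get("name") or m.get("property"): m.get("content")
            for m in soup.find_all("meta") if m.get("content")}

    for node in soup(["script", "style"]):
        node.decompose()                              # remove JavaScript and CSS styles
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()                             # remove HTML comments

    body = soup.body or soup
    text = body.get_text("\n")                        # keep the original line breaks
    is_content_page = "articleTitle" in meta or "ArticleTitle" in meta   # assumed content-page test
    return {"meta": meta, "text": text, "contentPage": is_content_page}
```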
S4, text error correction
Subscribing to a KAFKA message, consuming the data cleansing result, performing the steps of:
1. Cutting the webpage text into sentences according to punctuation marks and paragraphs; each sentence is first input into the error correction model to judge whether an error exists. If so, the corrected result is substituted into the sentence, which is input into the model again for recursive correction; when two successive correction results are the same, the recursion stops and the model correction result is obtained, including the error probabilities of the character, the pronunciation, and the glyph (i.e., the first, second, and third probabilities) and the corrected characters (i.e., the first, second, and third characteristics). A sketch of this loop follows step 4 below.
2. Forming each sentence containing wrongly written characters together with its error correction result into a sentence JSON object, a plurality of which form a sentence array;
3. For each sentence in the sentence array, forming the position of the corrected wrongly written character, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic, and the fusion characteristic into a JSON object, a plurality of which form an array added to the corresponding sentence JSON.
The first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second, and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum.
4. Outputting the processing result: adding the text error correction result to the data cleaning result JSON and pushing it to the KAFKA message system.
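A minimal sketch of the recursive correction loop and the per-sentence JSON assembly follows; the model interface (a callable returning the corrected sentence plus a list of error records) and all field names are assumptions.

```python
import re

def split_sentences(text):
    """Cut page text into sentences by punctuation and paragraph breaks."""
    return [s.strip() for s in re.split(r"[。！？!?\n]+", text) if s.strip()]

def correct_recursively(sentence, model, max_rounds=5):
    """Re-feed the corrected sentence until two successive results agree."""
    errors = []
    for _ in range(max_rounds):
        corrected, found = model(sentence)
        errors.extend(found)
        if corrected == sentence:                  # two successive results are the same: stop
            break
        sentence = corrected
    return sentence, errors

def sentence_record(original, corrected, errors):
    """One JSON-ready dict per sentence; key names are assumptions."""
    items = []
    for e in errors:
        probs = {"char": e["p_char"], "sound": e["p_sound"], "glyph": e["p_glyph"]}
        fused = max(probs, key=probs.get)          # fusion = maximum of the three probabilities
        items.append({
            "position": e["position"],
            **probs,
            "fusedProbability": probs[fused],
            "fusedCorrection": e["corrections"][fused],
        })
    return {"sentence": original, "corrected": corrected, "errors": items}

dummy_model = lambda s: (s, [])                    # placeholder model that reports no errors
print(correct_recursively("数字政府文本纠错", dummy_model))
```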
S5, data storage
Subscribing to the KAFKA messages, storing the data acquisition result, the data cleaning result, and the text error correction result into an Elasticsearch storage system, with the URL HASH as the primary key.
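A sketch of this storage step follows, assuming the kafka-python consumer and the elasticsearch-py 8.x client; topic, index, and field names are assumptions.

```python
import hashlib
import json

from elasticsearch import Elasticsearch  # elasticsearch-py, 8.x-style API
from kafka import KafkaConsumer          # kafka-python

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "gov-corrected",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # URL hash as the primary key, so a re-collected page overwrites its old record
    doc_id = record.get("urlHash") or hashlib.md5(record["url"].encode("utf-8")).hexdigest()
    es.index(index="gov-text-correction", id=doc_id, document=record)
```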

Claims (5)

1. A text error correction method applied to digital governments is characterized in that: the method comprises the following steps:
s1, model training
(1) First, a corpus is obtained
Integrating encyclopedia, headlines and Zhihe news as a resource library to obtain a corpus;
(2) Formulating confusion decision rules
The confusion judgment rules comprise a glyph confusion rule, a pronunciation confusion rule, and a character confusion rule; the glyph confusion rule adopts reverse Wubi (five-stroke) coding to decompose each character into its Wubi radicals and inputs the decomposition result into the glyph vector; one or more radicals are randomly replaced according to the Wubi error-prone radical library to form the corresponding glyph confusion set, wherein the coding distance for replacing one radical is recorded as 1, and the coding distance for replacing N radicals is recorded as N;
the pronunciation confusion rule obtains the pinyin of each character through a pinyin dictionary, further obtains the initial, the final, and the tone of the pinyin, and inputs the result into the pronunciation vector; the following rules are formulated according to the pinyin dictionary;
(1) same pronunciation and same tone: the edit distance is equal to 0;
(2) same pronunciation but different tone: the edit distance is equal to 1;
(3) flat versus retroflex initial, or front versus back nasal final: the edit distance is equal to 1;
(4) one of the initial or the final is changed: the edit distance is equal to 1;
(5) both the initial and the final are changed: the edit distance is larger than 1;
selecting edit distances of different sizes generates the pronunciation confusion set;
the edit distance of the character confusion rule is 1, and the confusion operation randomly selects a character from the character library as the replacement;
(3) Obtaining a confusion set
Randomly extracting sample text from the corpus each time and preprocessing it to obtain a sample set; characters in the sample set are then replaced in specific proportions according to the glyph confusion rule, the pronunciation confusion rule, and the character confusion rule to generate the corresponding confusion set;
(4) Model training
Taking the confusion set as an input set, taking the sample set as a comparison set, forming one-to-one corresponding sentence pairs, and performing model training in an end-to-end mode to finally obtain an error correction model;
s2, data acquisition
Receiving a website domain name, recursively acquiring links in all webpages, removing repeated links according to URL HASH, forming acquisition process information and acquisition results into a JSON format, and pushing the JSON format as a data acquisition result to a KAFKA message system;
s3, data cleaning
Subscribing to a KAFKA message, consuming the data collection results, and performing the steps of:
1. subscribing KAFKA information, consuming data acquisition results, identifying content types, and filtering non-HTML types;
2. webpage preprocessing: decoding the page source code using the charset attribute in the collected information, then serializing the collected result and parsing the webpage into a DOM tree;
3. extracting the webpage tags: extracting the meta tags from the DOM tree and storing them by category;
4. extracting the webpage text: judging whether the meta tag extraction result contains a content page tag to determine whether the webpage is a content page containing body text; taking the body out of the HTML source code of the content page, removing all tags in the body, including CSS styles, JavaScript scripts, and HTML comments, while preserving the original line breaks, and extracting the text content of the webpage after denoising;
5. outputting the processing result: forming the tag extraction result and the text extraction result into JSON as the data cleaning result, adding it to the data acquisition result, and pushing it to the KAFKA message system;
s4, text error correction
Subscribing to the KAFKA message, consuming the data cleansing result, performing the steps of:
1. cutting the webpage text into sentences according to punctuation marks and paragraphs, inputting each sentence into the error correction model, and judging whether an error exists; if so, substituting the corrected result into the sentence and inputting it into the error correction model again for recursive correction; when two successive correction results are the same, stopping the recursion and obtaining the correction result of the error correction model;
2. forming each sentence containing wrongly written characters together with its error correction result into a sentence JSON object, a plurality of which form a sentence array;
3. for each sentence in the sentence array, forming the position of the corrected wrongly written character, the first probability, the second probability, the third probability, the fusion probability, the first characteristic, the second characteristic, the third characteristic, and the fusion characteristic into a JSON object, a plurality of which form an array added to the corresponding sentence JSON;
the first probability is the probability of a character error and the first characteristic is the corresponding character error characteristic; the second probability is the probability of a pronunciation error and the second characteristic is the corresponding pronunciation error characteristic; the third probability is the probability of a glyph error and the third characteristic is the corresponding glyph error characteristic; the fusion probability is the maximum of the first, second, and third probabilities, and the fusion characteristic is the error characteristic corresponding to that maximum;
4. outputting the processing result: adding the text error correction result to the data cleaning result JSON and pushing it to the KAFKA message system;
s5, storing data
Subscribing to the KAFKA messages, storing the data acquisition result, the data cleaning result, and the text error correction result into an Elasticsearch storage system, with the URL HASH as the primary key.
2. The text error correction method applied to the digital government according to claim 1, wherein: in the process of obtaining the confusion set in S1, 15% of the characters are randomly extracted from the corpus each time; of these, 60% are subjected to pronunciation confusion, 20% to glyph confusion, and 20% to character confusion.
3. The text error correction method applied to the digital government according to claim 1, wherein: in S2, the acquisition process information comprises URL, IP, protocol, proxy, request mode, request time, acquisition time consumption, acquisition state and server; the acquisition results include a request header, a response header, and corresponding content.
4. The text error correction method applied to the digital government according to claim 1, wherein: in S3, the rules for classified storage are:
(1) website labeling: siteName, siteDomain, siteIDcode, columName
(2) Column label: columnDescription, columnKeywords, columnType
(3) Content page tag: articletile, pubDate, contentSource, keywords, author, description, image, url.
5. The text error correction method applied to the digital government according to claim 1, wherein: in S3, after denoising, the text content of the webpage is extracted using a text extraction algorithm based on the text density and symbol density of the webpage.
CN202111633076.4A 2021-12-29 2021-12-29 Text error correction method applied to digital government Active CN114970502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111633076.4A CN114970502B (en) 2021-12-29 2021-12-29 Text error correction method applied to digital government

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111633076.4A CN114970502B (en) 2021-12-29 2021-12-29 Text error correction method applied to digital government

Publications (2)

Publication Number Publication Date
CN114970502A (en) 2022-08-30
CN114970502B (en) 2023-03-28

Family

ID=82974441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111633076.4A Active CN114970502B (en) 2021-12-29 2021-12-29 Text error correction method applied to digital government

Country Status (1)

Country Link
CN (1) CN114970502B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438650B (en) * 2022-11-08 2023-04-07 深圳擎盾信息科技有限公司 Contract text error correction method, system, equipment and medium fusing multi-source characteristics
CN117236319B (en) * 2023-09-25 2024-04-19 中国—东盟信息港股份有限公司 Real scene Chinese text error correction method based on transducer generation model

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11328317A (en) * 1998-05-11 1999-11-30 Nippon Telegr & Teleph Corp <Ntt> Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded
CN1687877A (en) * 2005-04-14 2005-10-26 刘伊翰 Chinese character input method capable of using English
CN104916169A (en) * 2015-05-20 2015-09-16 江苏理工学院 Card type German learning tool
CN110489760A (en) * 2019-09-17 2019-11-22 达而观信息科技(上海)有限公司 Based on deep neural network text auto-collation and device
CN110765740A (en) * 2019-10-11 2020-02-07 深圳市比一比网络科技有限公司 DOM tree-based full-type text replacement method, system, device and storage medium
CN112016310A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium
CN112199945A (en) * 2020-08-19 2021-01-08 宿迁硅基智能科技有限公司 Text error correction method and device
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium
CN113361266A (en) * 2021-06-25 2021-09-07 达闼机器人有限公司 Text error correction method, electronic device and storage medium
CN113642316A (en) * 2021-07-28 2021-11-12 平安国际智慧城市科技股份有限公司 Chinese text error correction method and device, electronic equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11328317A (en) * 1998-05-11 1999-11-30 Nippon Telegr & Teleph Corp <Ntt> Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded
CN1687877A (en) * 2005-04-14 2005-10-26 刘伊翰 Chinese character input method capable of using English
CN104916169A (en) * 2015-05-20 2015-09-16 江苏理工学院 Card type German learning tool
CN110489760A (en) * 2019-09-17 2019-11-22 达而观信息科技(上海)有限公司 Based on deep neural network text auto-collation and device
CN110765740A (en) * 2019-10-11 2020-02-07 深圳市比一比网络科技有限公司 DOM tree-based full-type text replacement method, system, device and storage medium
CN112199945A (en) * 2020-08-19 2021-01-08 宿迁硅基智能科技有限公司 Text error correction method and device
CN112016310A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium
WO2021189851A1 (en) * 2020-09-03 2021-09-30 平安科技(深圳)有限公司 Text error correction method, system and device, and readable storage medium
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium
CN113361266A (en) * 2021-06-25 2021-09-07 达闼机器人有限公司 Text error correction method, electronic device and storage medium
CN113642316A (en) * 2021-07-28 2021-11-12 平安国际智慧城市科技股份有限公司 Chinese text error correction method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Junjie Yu et al. Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape. Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2014, pp. 220-223. *
Li Jianyi et al. Methods for data augmentation in Chinese spelling error correction (关于中文拼写纠错数据增强的方法). Journal of North China Institute of Aerospace Engineering (北华航天工业学院学报), 2021, vol. 31, no. 31, pp. 1-5. *

Also Published As

Publication number Publication date
CN114970502A (en) 2022-08-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared