CN112560451B - Wrongly written character proofreading method and device for automatically generating training data - Google Patents

Info

Publication number
CN112560451B
Authority
CN
China
Prior art keywords
phrase
word
corpus
proofreading
word set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110190708.8A
Other languages
Chinese (zh)
Other versions
CN112560451A (en)
Inventor
蓝建敏
池沐霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Excellence Information Technology Co ltd
Original Assignee
Excellence Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Excellence Information Technology Co ltd
Priority to CN202110190708.8A
Publication of CN112560451A
Application granted
Publication of CN112560451B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a wrongly written character proofreading method and device for automatically generating training data. The method comprises: performing word segmentation on a given corpus to obtain a plurality of first phrases; generating a plurality of confusable word sets according to the first phrases; selecting a first phrase to be replaced from the first phrases of the given corpus, and taking the confusable word set whose core phrase is the same as the first phrase to be replaced as a selected word set; replacing the first phrase to be replaced in the given corpus with a similar phrase from the selected word set to generate an error corpus; taking the given corpus and the error corpus as a training data set and training a wrongly written character proofreading model on it; and proofreading the text to be proofread with the wrongly written character proofreading model. The method and device address the long time consumption and low efficiency of manually collecting error corpora in the prior art.

Description

Wrongly written character proofreading method and device for automatically generating training data
Technical Field
The invention relates to the field of computer technology, and in particular to a wrongly written character proofreading method and device for automatically generating training data.
Background
Wrongly written character proofreading is one of the tasks of text proofreading. With the development of science and technology, building models automatically through machine learning and using them for error correction is becoming popular. Training a model requires a large amount of training data, and existing approaches obtain it by manually collecting users' error corpora and then labeling them to generate training samples. Manually collecting corpora that contain wrongly written characters is time-consuming, labor-intensive, and inefficient.
Disclosure of Invention
The embodiments of the invention provide a wrongly written character proofreading method and device for automatically generating training data, which can automatically generate error corpora containing wrongly written characters, train a wrongly written character proofreading model on the generated error corpora, and finally proofread text with that model, reducing labor and improving efficiency.
An embodiment of the present invention provides a wrongly written character proofreading method for automatically generating training data, including:
obtaining a given corpus and performing word segmentation on the given corpus to obtain a plurality of first phrases;
generating a plurality of confusable word sets according to the first phrases, wherein each confusable word set comprises a core phrase and a plurality of similar phrases corresponding to the core phrase, each similar phrase has the same input operation as the core phrase under the same input method, and the core phrase is one of the first phrases;
selecting a first phrase to be replaced from the plurality of first phrases of the given corpus, and taking the confusable word set whose core phrase is the same as the first phrase to be replaced as a selected word set; replacing the first phrase to be replaced in the given corpus with a similar phrase from the selected word set to generate an error corpus;
taking the given corpus and the error corpus as a training data set, and training a wrongly written character proofreading model on the training data set;
and proofreading the text to be proofread with the wrongly written character proofreading model.
Further, after generating the plurality of confusable word sets according to the first phrases, the method further includes:
calculating the word vector cosine distance between the core phrase in each confusable word set and each of its similar phrases, and eliminating the similar phrases whose word vector cosine distance exceeds a preset threshold.
Further, the input method comprises any one or a combination of the following: the Wubi (five-stroke) input method, the pinyin input method, and the stroke input method.
Further, when the text to be proofread is proofread with the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is taken as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.
On the basis of the above method embodiment, the invention correspondingly provides a device embodiment.
An embodiment of the invention provides a wrongly written character proofreading device for automatically generating training data, which comprises a word segmentation module, a confusable word set generation module, an error corpus generation module, a model training module, and a proofreading module;
the word segmentation module is used for obtaining a given corpus and performing word segmentation on the given corpus to obtain a plurality of first phrases;
the confusable word set generation module is used for generating a plurality of confusable word sets according to the first phrases, wherein each confusable word set comprises a core phrase and a plurality of similar phrases corresponding to the core phrase, each similar phrase has the same input operation as the core phrase under the same input method, and the core phrase is one of the first phrases;
the error corpus generation module is used for selecting a first phrase to be replaced from the plurality of first phrases of the given corpus, taking the confusable word set whose core phrase is the same as the first phrase to be replaced as a selected word set, and replacing the first phrase to be replaced in the given corpus with a similar phrase from the selected word set to generate an error corpus;
the model training module is used for taking the given corpus and the error corpus as a training data set and training a wrongly written character proofreading model on the training data set;
and the proofreading module is used for proofreading the text to be proofread with the wrongly written character proofreading model.
Further, the device also comprises a phrase elimination module, which is used for calculating the word vector cosine distance between the core phrase in each confusable word set and each of its similar phrases, and eliminating the similar phrases whose word vector cosine distance exceeds a preset threshold.
Further, when the proofreading module proofreads the text to be proofread with the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is taken as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.
The invention has the following beneficial effects:
The method segments a given corpus to obtain a plurality of first phrases, generates a confusable word set for each first phrase, and replaces selected first phrases in the given corpus using the confusable word sets, so that error corpora are generated automatically. The error corpora and the given corpus are then used as the training data set for model training, which yields a wrongly written character proofreading model, and the text to be proofread is finally proofread with that model.
Drawings
Fig. 1 is a flowchart of a wrongly written character proofreading method for automatically generating training data according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a wrongly written character proofreading device for automatically generating training data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a wrongly written character proofreading method for automatically generating training data, including:
Step S101: obtaining a given corpus and performing word segmentation on the given corpus to obtain a plurality of first phrases.
Step S102: generating a plurality of confusable word sets according to the first phrases; each confusable word set comprises a core phrase and a plurality of similar phrases corresponding to the core phrase, each similar phrase has the same input operation as the core phrase under the same input method, and the core phrase is one of the first phrases.
Step S103: selecting a first phrase to be replaced from the plurality of first phrases of the given corpus, and taking the confusable word set whose core phrase is the same as the first phrase to be replaced as a selected word set; replacing the first phrase to be replaced in the given corpus with a similar phrase from the selected word set to generate an error corpus.
Step S104: taking the given corpus and the error corpus as a training data set, and training a wrongly written character proofreading model on the training data set.
Step S105: proofreading the text to be proofread with the wrongly written character proofreading model.
For step S101, the given corpus may include text produced by users in daily life, such as Word documents written in the course of everyday work, as well as data crawled from the web, such as forums, microblogs, and comments. Generally speaking, corpora obtained from text that users produce in daily life use fairly standard wording and are mostly written language, whereas corpora crawled from forums and online comments tend toward spoken language. When the original corpus is collected, both written and spoken forms are taken into account, so that the wrongly written character proofreading model trained later applies to a wider range of text.
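The patent does not name a particular word segmentation algorithm for step S101. As an illustration only, the sketch below uses forward maximum matching against a small vocabulary, which is a swapped-in technique chosen here for simplicity; the function name, the vocabulary, and the sample sentence are all assumptions.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Segment `text` greedily, always taking the longest vocabulary match.

    Characters that start no known phrase are emitted as single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary and input sentence.
VOCABULARY = {"提出", "建议"}
first_phrases = forward_max_match("我提出建议", VOCABULARY)
```

Each element of `first_phrases` then plays the role of a "first phrase" in the later steps. A production system would more likely rely on an established segmenter.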
For step S102, the invention simulates the wrongly written characters that users produce in order to construct the confusable word sets. Specifically, a first phrase is selected as the core phrase; phrases that have the same input operation as the core phrase under the same input method are then extracted as similar phrases, based on existing input methods; and the confusable word set corresponding to each first phrase is built from the core phrase and the extracted similar phrases. For example (the quoted words are English glosses of the Chinese phrases), suppose word segmentation of a given corpus yields the first phrase "propose", which is taken as the core phrase. Under the pinyin input method, the words with the same input operation as "propose" include "remove", "kick out", and "shave off"; under the Wubi input method, they include "pinch out" and "pull out". The confusable word set corresponding to "propose" is then { propose | remove, kick out, shave off, pinch out, pull out }, in which "propose" is the core phrase and "remove", "kick out", "shave off", "pinch out", and "pull out" are the similar phrases.
It should be noted that "the same input operation" means that the order and type of the entered letters are the same. In the example above, under the pinyin input method the letters entered for "propose" are "tichu", and the letters entered for "remove", "kick out", and "shave off" are also "tichu"; this is what is meant by the same input operation. In addition, the input method may include any one or a combination of the following: the Wubi (five-stroke) input method, the pinyin input method, and the stroke input method. That is, the confusable word set of each first phrase may be constructed with a single input method or with a combination of several. For example, if it is constructed with the pinyin input method only, the resulting confusable word set is { propose | remove, kick out, shave off }; if it is constructed with both the pinyin and Wubi input methods, the resulting set is { propose | remove, kick out, shave off, pinch out, pull out }. Whichever construction is used, each similar phrase in the final confusable word set must have the same input operation as the core phrase under the same input method. The confusable word set of every first phrase in the given corpus is constructed in this way to produce the plurality of confusable word sets.
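One way to picture this construction is to index a lexicon by the key sequence typed under each input method and then collect, for every core phrase, the other phrases that share its sequence. The sketch below is a minimal illustration under stated assumptions: the lexicon is a hand-written toy, the Chinese phrases are assumed originals of the translated example, and `build_confusable_sets` is a hypothetical helper, not something the patent defines.

```python
from collections import defaultdict

# Hypothetical toy lexicon: phrase -> {input method: key sequence}.
# The first four phrases are assumed to be the Chinese originals of
# "propose", "remove", "kick out" and "shave off"; the last one is an
# unrelated phrase added for contrast.
LEXICON = {
    "提出": {"pinyin": "tichu"},
    "剔除": {"pinyin": "tichu"},
    "踢出": {"pinyin": "tichu"},
    "剃除": {"pinyin": "tichu"},
    "建议": {"pinyin": "jianyi"},
}


def build_confusable_sets(first_phrases, lexicon, methods=("pinyin",)):
    """Map each core phrase to the phrases typed with the same key sequence."""
    # Index the lexicon by (input method, key sequence).
    by_code = defaultdict(list)
    for phrase, codes in lexicon.items():
        for method, code in codes.items():
            by_code[(method, code)].append(phrase)

    sets = {}
    for core in first_phrases:
        similar = []
        for method in methods:
            code = lexicon.get(core, {}).get(method)
            if code is None:
                continue  # no key sequence known for this method
            for candidate in by_code[(method, code)]:
                if candidate != core and candidate not in similar:
                    similar.append(candidate)
        sets[core] = similar
    return sets


confusable_sets = build_confusable_sets(["提出", "建议"], LEXICON)
```

Passing several names in `methods` merges the candidates from each input method into one set, which mirrors the combined pinyin and Wubi case, provided the lexicon carries key sequences for those methods.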
For step S103, a first phrase to be replaced is selected from the given corpus, and any similar phrase is then taken from the confusable word set corresponding to that first phrase to replace it. For example, if the given corpus is "propose a suggestion" and the selected first phrase to be replaced is "propose", the confusable word set whose core phrase is "propose" is looked up first, giving { propose | remove, kick out, shave off }. Any phrase among "remove", "kick out", and "shave off" is then chosen to replace "propose", which yields the error corpora "remove a suggestion", "kick out a suggestion", and "shave off a suggestion". It will be understood that "propose" may occur many times in a given corpus. All occurrences may be replaced, or only a proportion of them, for example at a ratio of 2:1: if "propose" occurs 10 times in the given corpus, 5 of those occurrences are replaced. In addition, the same similar phrase may be used for every replacement, for example replacing "propose" with "remove" all 5 times, or a similar phrase may be chosen at random each time, for example replacing "propose" with "remove" the first time and with "kick out" the second time.
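The replacement step can be sketched as follows. This is an illustrative reading rather than the patent's implementation: the function name, the `ratio` parameter (0.5 for the 2:1 example), and the choice to pick the replaced occurrences at random are assumptions.

```python
import random


def generate_error_corpus(tokens, target, similar_phrases, ratio=0.5, rng=None):
    """Replace a share of the occurrences of `target` with similar phrases.

    `ratio` is the fraction of occurrences to replace; 0.5 corresponds to
    the 2:1 example (10 occurrences, 5 replaced). Returns the corrupted
    token list and the positions that were changed.
    """
    if not similar_phrases:
        return list(tokens), []
    rng = rng or random.Random()
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    n_replace = int(len(positions) * ratio)
    chosen = sorted(rng.sample(positions, n_replace))
    corrupted = list(tokens)
    for i in chosen:
        # A similar phrase is drawn at random for each replacement.
        corrupted[i] = rng.choice(similar_phrases)
    return corrupted, chosen


SIMILAR = ["remove", "kick out", "shave off"]
tokens = ["propose", "suggestion"] * 10  # "propose" occurs 10 times
corrupted, changed = generate_error_corpus(
    tokens, "propose", SIMILAR, ratio=0.5, rng=random.Random(0)
)
```

Passing a one-element list as `similar_phrases` gives the other variant described above, in which every replacement uses the same similar phrase.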
In an optional embodiment, after generating the plurality of confusable word sets according to the first phrases, the method further includes: calculating the word vector cosine distance between the core phrase in each confusable word set and each of its similar phrases, and eliminating the similar phrases whose word vector cosine distance exceeds a preset threshold.
Word vector distance calculation is used to remove similar phrases whose meaning is too close to that of the core phrase in each confusable word set. The preset threshold may be set to 0.9: the word vector cosine distance between each similar phrase and the core phrase is calculated one by one, and if it is greater than 0.9, the similar phrase is very close in meaning to the core phrase and is removed. This step mainly eliminates near-synonyms. Chinese expression is flexible, and the same meaning can be expressed by different words, which are called near-synonyms. If a near-synonym of the core phrase remained in the confusable word set, replacing the core phrase with it would produce a sentence that is still correct yet would be treated as an error corpus; the word vector distance calculation in this step avoids that by eliminating such phrases.
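The culling rule can be sketched as below. Because a value above 0.9 is described as meaning that two phrases are very similar, the sketch computes cosine similarity, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$, and treats it as the quantity the text calls cosine distance; that reading is an assumption. The three-dimensional vectors are made up for illustration, whereas real word vectors would come from a trained embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cull_near_synonyms(core, similar_phrases, vectors, threshold=0.9):
    """Drop similar phrases whose vector is too close to the core phrase."""
    kept = []
    for phrase in similar_phrases:
        if cosine_similarity(vectors[core], vectors[phrase]) > threshold:
            continue  # near-synonym: replacing with it would not be an error
        kept.append(phrase)
    return kept


# Made-up vectors: "put forward" is deliberately close to "propose".
VECTORS = {
    "propose": [1.0, 0.0, 0.0],
    "put forward": [0.99, 0.1, 0.0],
    "kick out": [0.0, 1.0, 0.0],
}
kept = cull_near_synonyms("propose", ["put forward", "kick out"], VECTORS)
```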
For step S104, the given corpus is used as positive samples and the automatically generated error corpus as negative samples; model training is then performed with a sequence labeling algorithm, such as BiLSTM + CRF, to obtain the wrongly written character proofreading model.
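The patent does not specify how the samples are encoded for the sequence labeling model. One plausible encoding, shown here purely as an assumption, pairs each token sequence with per-token labels: "O" for an unchanged token and "E" for a token that was replaced. The label names and the helper function are hypothetical.

```python
def make_training_samples(original_tokens, corrupted_tokens):
    """Build one positive and one negative sample for sequence labeling.

    Each sample is a (tokens, labels) pair. The positive sample is the
    given corpus with every token labeled "O"; the negative sample is the
    error corpus with replaced tokens labeled "E".
    """
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("token sequences must have the same length")
    positive = (list(original_tokens), ["O"] * len(original_tokens))
    labels = [
        "E" if original != corrupted else "O"
        for original, corrupted in zip(original_tokens, corrupted_tokens)
    ]
    negative = (list(corrupted_tokens), labels)
    return [positive, negative]


samples = make_training_samples(
    ["propose", "suggestion"],
    ["remove", "suggestion"],
)
```

Pairs of this shape could then be fed to whichever sequence labeling implementation is chosen for training.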
For step S105, in a preferred embodiment, when the text to be proofread is proofread with the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is taken as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.
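The replace-and-recheck loop can be sketched with a stand-in detector. Everything named here is hypothetical: `toy_detector` is a hard-coded rule that takes the place of the trained model, and the sketch assumes the detector returns the positions of the tokens it considers wrong.

```python
def correct_text(tokens, detect_errors, confusable_sets):
    """Try similar phrases for each flagged token until the detector accepts."""
    tokens = list(tokens)
    for index in detect_errors(tokens):
        original = tokens[index]
        for candidate in confusable_sets.get(original, []):
            tokens[index] = candidate
            if index not in detect_errors(tokens):
                break  # the detector no longer flags this position
        else:
            tokens[index] = original  # no candidate helped; keep the input
    return tokens


def toy_detector(tokens):
    """Stand-in for the trained model: flags a few phrases before 'suggestion'."""
    bad = {"remove", "kick out", "shave off"}
    return [
        i for i, tok in enumerate(tokens[:-1])
        if tok in bad and tokens[i + 1] == "suggestion"
    ]


# Confusable word set whose core phrase is the wrong phrase "remove".
SECOND_SETS = {"remove": ["kick out", "propose"]}
fixed = correct_text(["remove", "suggestion"], toy_detector, SECOND_SETS)
```

In this run the first candidate, "kick out", is still flagged, so the loop moves on to "propose", which the detector accepts.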
Of course, in practice, several models can be trained with different algorithms; each model judges whether a sentence is correct, and the weighted result is used for error correction.
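The patent does not say how the weighted result is computed. A minimal sketch, assuming each model returns a probability that the sentence contains an error and that the scores are combined as a weighted average, might look like this; the stand-in models, the weights, and the 0.5 cutoff are illustrative assumptions.

```python
def weighted_error_score(tokens, models, weights):
    """Weighted average of the error probabilities reported by each model."""
    if len(models) != len(weights):
        raise ValueError("one weight is required per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * model(tokens) for model, w in zip(models, weights)) / total


def is_erroneous(tokens, models, weights, cutoff=0.5):
    """Flag the sentence when the combined score exceeds the cutoff."""
    return weighted_error_score(tokens, models, weights) > cutoff


# Stand-in models that return fixed probabilities, for illustration only.
model_a = lambda tokens: 0.9
model_b = lambda tokens: 0.2
score = weighted_error_score(["remove", "suggestion"], [model_a, model_b], [3, 1])
```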
By implementing the embodiments of the invention, the training data required for model training can be generated automatically without manual collection, which improves the efficiency of model training and reduces the cost in labor and materials. In addition, when a user provides their own work documents as the given corpus, a training corpus that matches the user's habits can be generated, so that the trained model can correct the wrongly written characters that frequently appear in that user's work.
As shown in Fig. 2, on the basis of the above method embodiment, the invention correspondingly provides a device embodiment.
An embodiment of the invention provides a wrongly written character proofreading device for automatically generating training data, which comprises a word segmentation module, a confusable word set generation module, an error corpus generation module, a model training module, and a proofreading module;
the word segmentation module is used for obtaining a given corpus and performing word segmentation on the given corpus to obtain a plurality of first phrases;
the confusable word set generation module is used for generating a plurality of confusable word sets according to the first phrases, wherein each confusable word set comprises a core phrase and a plurality of similar phrases corresponding to the core phrase, each similar phrase has the same input operation as the core phrase under the same input method, and the core phrase is one of the first phrases;
the error corpus generation module is used for selecting a first phrase to be replaced from the plurality of first phrases of the given corpus, taking the confusable word set whose core phrase is the same as the first phrase to be replaced as a selected word set, and replacing the first phrase to be replaced in the given corpus with a similar phrase from the selected word set to generate an error corpus;
the model training module is used for taking the given corpus and the error corpus as a training data set and training a wrongly written character proofreading model on the training data set;
and the proofreading module is used for proofreading the text to be proofread with the wrongly written character proofreading model.
In a preferred embodiment, the device further comprises a phrase elimination module;
the phrase elimination module is used for calculating the word vector cosine distance between the core phrase in each confusable word set and each of its similar phrases, and eliminating the similar phrases whose word vector cosine distance exceeds a preset threshold.
In a preferred embodiment, when the proofreading module proofreads the text to be proofread with the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is taken as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.
It should be noted that the above-mentioned embodiment of the apparatus of the present invention corresponds to the embodiment of the method of the present invention, and the method for correcting wrongly written characters of automatically generated training data of the present invention can be implemented. In addition, the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (7)

1. A wrongly written character proofreading method for automatically generating training data, characterized by comprising the following steps:
obtaining given linguistic data and performing word segmentation processing on the given linguistic data to obtain a plurality of first phrases;
generating a plurality of confusable word sets according to the first phrases; each confusable word set comprises a core phrase and a plurality of similar phrases corresponding to the core phrase, and the input operation of each similar phrase and the core phrase under the same input method is the same; the core phrase is the first phrase;
selecting a first phrase to be replaced from a plurality of first phrases of the given corpus, and then taking a confusable word set with the same core phrase as the first phrase to be replaced as a selected word set; replacing the first phrase to be replaced in the given corpus with a similar phrase in the selected word set to generate an error corpus;
taking the given corpus and the error corpus as training data sets, and training an error word proofreading model according to the training data sets;
and checking the text to be checked according to the wrongly written character checking model.
2. The wrongly written character proofreading method for automatically generating training data according to claim 1, further comprising, after generating the plurality of confusable word sets according to the first phrases:
calculating the word vector cosine distance between the core phrase in each confusable word set and each of its similar phrases, and eliminating the similar phrases whose word vector cosine distance exceeds a preset threshold.
3. The wrongly written character proofreading method for automatically generating training data according to claim 1, wherein the input method includes any one or a combination of the following: the Wubi (five-stroke) input method, the pinyin input method, and the stroke input method.
4. The method according to claim 1, wherein when the text to be proofread is proofread with the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is taken as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.
5. A wrongly written character proofreading device for automatically generating training data, characterized by comprising a word segmentation module, a confusable word set generation module, an error corpus generation module, a model training module, and a proofreading module;
the word segmentation module is used for acquiring given linguistic data and performing word segmentation processing on the given linguistic data to acquire a plurality of first phrases;
the confusable word set generating module is used for generating a plurality of confusable word sets according to the first phrases; each confusable word set comprises a core phrase and a plurality of similar phrases corresponding to the core phrase, and the input operation of each similar phrase and the core phrase under the same input method is the same; the core phrase is the first phrase;
the error corpus generation module is used for selecting a first phrase to be replaced from the plurality of first phrases of the given corpus, taking the confusable word set whose core phrase is the same as the first phrase to be replaced as a selected word set, and replacing the first phrase to be replaced in the given corpus with a similar phrase from the selected word set to generate an error corpus;
the model training module is used for taking the given corpus and the error corpus as a training data set and training an error word proofreading model according to the training data set;
and the proofreading module is used for proofreading the text to be proofread according to the wrongly written character proofreading model.
6. The device according to claim 5, further comprising a phrase elimination module;
the phrase elimination module is used for calculating the word vector cosine distance between the core phrase in each confusable word set and each of its similar phrases, and eliminating the similar phrases whose word vector cosine distance exceeds a preset threshold.
7. The device according to claim 5, wherein when the proofreading module proofreads the text to be proofread with the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is taken as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110190708.8A CN112560451B (en) 2021-02-20 2021-02-20 Wrongly written character proofreading method and device for automatically generating training data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110190708.8A CN112560451B (en) 2021-02-20 2021-02-20 Wrongly written character proofreading method and device for automatically generating training data

Publications (2)

Publication Number Publication Date
CN112560451A CN112560451A (en) 2021-03-26
CN112560451B CN112560451B (en) 2021-05-14

Family

ID=75036020

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202110190708.8A Active CN112560451B (en) 2021-02-20 2021-02-20 Wrongly written character proofreading method and device for automatically generating training data

Country Status (1)

Country Link
CN (1) CN112560451B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204966B (en) * 2021-06-08 2023-03-28 重庆度小满优扬科技有限公司 Corpus augmentation method, apparatus, device and storage medium
CN113627191A (en) * 2021-07-05 2021-11-09 中国气象局公共气象服务中心(国家预警信息发布中心) Automatic labeling method and system for meteorological early warning sample semantics
CN114936549B (en) * 2022-06-06 2024-02-13 湖南环境生物职业技术学院 Artificial intelligent text proofreading method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101847140A (en) * 2009-03-23 2010-09-29 中国科学院计算技术研究所 Wrongly-written or mispronounced character processing method and system
CN105654945A (en) * 2015-10-29 2016-06-08 乐视致新电子科技(天津)有限公司 Training method of language model, apparatus and equipment thereof
CN110704391A (en) * 2019-09-23 2020-01-17 车智互联(北京)科技有限公司 Word stock construction method and computing device
CN110705217A (en) * 2019-09-09 2020-01-17 上海凯京信达科技集团有限公司 Wrongly-written character detection method and device, computer storage medium and electronic equipment
US10671182B2 (en) * 2014-10-16 2020-06-02 Touchtype Limited Text prediction integration
US10762298B2 (en) * 2018-02-10 2020-09-01 Wipro Limited Method and device for automatic data correction using context and semantic aware learning techniques
WO2020213842A1 (en) * 2019-04-19 2020-10-22 Samsung Electronics Co., Ltd. Multi-model structures for classification and intent determination
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium
CN112329447A (en) * 2020-10-29 2021-02-05 语联网(武汉)信息技术有限公司 Training method of Chinese error correction model, and Chinese error correction method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201321927D0 (en) * 2013-12-11 2014-01-22 Touchtype Ltd System and method for inputting text into electronic devices
US9678664B2 (en) * 2015-04-10 2017-06-13 Google Inc. Neural network for keyboard input decoding
CN108052937B (en) * 2017-12-28 2019-05-31 百度在线网络技术(北京)有限公司 Character detector training method, device, system and medium based on weak supervision
CN109858023B (en) * 2019-01-04 2020-07-03 北京车慧科技有限公司 Statement error correction device
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
US11222622B2 (en) * 2019-05-05 2022-01-11 Microsoft Technology Licensing, Llc Wake word selection assistance architectures and methods

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
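Research on methods for character- and word-level automatic proofreading of Chinese text; Zhuo Liyan; China Master's Theses Full-text Database; 2019-01-15 (No. 12); I138-1766 *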

Also Published As

Publication number Publication date
CN112560451A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN112560451B (en) Wrongly written character proofreading method and device for automatically generating training data
CN110110585B (en) Intelligent paper reading implementation method and system based on deep learning and computer program
CN107133220B (en) Geographic science field named entity identification method
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
CN104809103B (en) Interactive semantic analysis method and system
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN108121702B (en) Method and system for evaluating and reading mathematical subjective questions
CN105975454A (en) Chinese word segmentation method and device of webpage text
CN110851599A (en) Automatic scoring method and teaching and assisting system for Chinese composition
CN103761975A (en) Method and device for oral evaluation
CN108090099B (en) Text processing method and device
CN107688630B (en) Semantic-based weakly supervised microblog multi-emotion dictionary expansion method
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN105261246A (en) Spoken English error correcting system based on big data mining technology
CN110175246A (en) Method for extracting notional words from video captions
CN110751234B (en) OCR (optical character recognition) error correction method, device and equipment
CN111104513A (en) Short text classification method for game platform user question-answer service
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN115438154A (en) Chinese automatic speech recognition text restoration method and system based on representation learning
CN115640200A (en) Method and device for evaluating dialog system, electronic equipment and storage medium
CN110705217A (en) Wrongly-written character detection method and device, computer storage medium and electronic equipment
CN112151019A (en) Text processing method and device and computing equipment
CN107783958B (en) Target statement identification method and device
CN111046663A (en) Intelligent correction method for Chinese form

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant