CN116306598B - Customized error correction method, system, equipment and medium for words in different fields


Info

Publication number
CN116306598B
CN116306598B (application CN202310573154.9A)
Authority
CN
China
Prior art keywords
words
word
error correction
extracted
correction reference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310573154.9A
Other languages
Chinese (zh)
Other versions
CN116306598A
Inventor
季婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mido Technology Co ltd
Original Assignee
Shanghai Mdata Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mdata Information Technology Co ltd filed Critical Shanghai Mdata Information Technology Co ltd
Priority to CN202310573154.9A
Publication of CN116306598A
Application granted
Publication of CN116306598B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition
    • G06F 40/30 - Semantic analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a method, a system, equipment and a medium for customized error correction of words in different fields, wherein the method comprises the following steps: acquiring an input text to be corrected; extracting words with preset attributes from the text to be corrected by using a word recognition model, the preset attributes comprising a preset field and a preset length; matching error correction reference words similar to the extracted words; and comparing the degree of difference between the extracted words and the error correction reference words, and performing error correction according to the degree of difference. The application can perform efficient error correction for different error types in a specific field.

Description

Customized error correction method, system, equipment and medium for words in different fields
Technical Field
The application belongs to the technical field of text detection and relates to an error correction method, in particular to a customized error correction method, system, equipment and medium for words in different fields.
Background
With the continuous development of natural language processing technology, text error correction technology has made tremendous progress. From the original rule-based error correction systems to today's machine-learning-based error correction systems, advances in technology have helped solve many text error correction problems.
Rule-based error correction systems are the earliest text error correction technique. They detect errors in text through predefined grammatical rules. However, this approach has some drawbacks, such as the inability to recognize misspellings and new words. With the development of machine learning technology, text error correction technology has also changed significantly. Error correction systems based on machine learning learn language models by analyzing large amounts of text data in order to identify and correct errors in text. However, many trained language models require the construction of error sets corresponding to the correct vocabulary, so both the training data and the constructed models are complex; moreover, the matched fragments may not be real words, and the large amount of text often does not actually contain the words to be detected, which is time-consuming.
Disclosure of Invention
The application aims to provide a method, a system, equipment and a medium for customized error correction of words in different fields, which are used for solving the problem of correcting various error types of words in specific fields.
The first aspect of the embodiment of the application provides a customized error correction method for words in different fields, which comprises the following steps: acquiring an input text to be corrected; extracting words with preset attributes from the text to be corrected by using a word recognition model, the preset attributes comprising a preset field and a preset length; matching error correction reference words similar to the extracted words; and comparing the degree of difference between the extracted words and the error correction reference words, and performing error correction according to the degree of difference.
In an implementation manner of the first aspect, the training data of the word recognition model is constructed based on correct words of preset attributes.
In an implementation manner of the first aspect, the training data construction process includes: setting the number of sentences for different error types, the error types including wrong words, multiple words, few words and out-of-order; randomly replacing one word in the correct word with another word homophonic or near-homophonic with it, according to the set number of wrong-word sentences; randomly inserting a word into the correct word according to the set number of multi-word sentences; randomly deleting a word from the correct word according to the set number of few-word sentences; randomly exchanging the positions of two words in the correct word according to the set number of out-of-order sentences; placing the original sentence containing the correct word into the training data; and marking, in the sentences participating in the construction, the positions of the correct words, whether error-processed or original.
In an implementation manner of the first aspect, the step of matching the error correction reference word similar to the extracted word includes: obtaining alternative correct words similar to the extracted words on the text and the pinyin by using a similarity algorithm; and carrying out similarity calculation again on the correct alternative words, and determining the final error correction reference words.
In an implementation manner of the first aspect, the step of comparing the degree of difference between the extracted word and the error correction reference word and performing error correction according to the degree of difference includes: in response to there being a difference between the extracted word and the error correction reference word, determining the difference and analyzing its difference type; and determining an error correction manner according to the difference type.
In an implementation manner of the first aspect, the step of determining an error correction manner according to the difference type includes: for the different error correction reference words, determining the total number of differing words between the extracted word and each error correction reference word, and selecting for error correction the error correction reference word reachable with the fewest word replacements; in response to the numbers of differing words being the same, selecting for error correction the error correction reference word whose error position is later in the sentence; and in response to multiple difference types existing between the extracted word and different error correction reference words, selecting for error correction the error correction reference word closest to the extracted word in pinyin.
In an implementation manner of the first aspect, after the step of correcting errors according to the degree of difference, the method further includes: combining the accumulated false alarm cases in the preset field to determine a post-processing filtering rule; and carrying out false alarm analysis on the corrected words by utilizing the post-processing filtering rules, and eliminating false alarm conditions.
A second aspect of an embodiment of the present application provides a customized error correction system for words in different domains, the system including: the text acquisition module is configured to acquire an input text to be corrected; the word extraction module is configured to extract words with preset attributes in the text to be corrected by using a word recognition model; the preset attributes comprise a preset field and a preset length; the word matching module is configured to match error correction reference words similar to the extracted words; and the contrast correction module is configured to compare the difference degree between the extracted words and the correction reference words, and correct the errors according to the difference degree.
A third aspect of an embodiment of the present application provides an electronic device, including: a processor and a memory; the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory so as to enable the electronic device to execute the method.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method.
As described above, the method, system, equipment and medium for customizing error correction for words in different fields have the following beneficial effects:
According to the application, words in the text are extracted through the word recognition model, a plurality of error correction reference words similar to the extracted words are then matched, the error types of the extracted words are determined by comparing the degrees of difference between the extracted words and the plurality of error correction reference words, and error correction is then carried out. The application does not need to perform multi-word, few-word and wrong-word substitution on the correct words to construct an error set corresponding to the correct vocabulary, so training data is greatly reduced, and by matching the nearest reference words through word extraction, the trained word recognition model is more streamlined.
Drawings
Fig. 1 shows an application scenario schematic diagram of a method for customizing error correction for words in different fields according to an embodiment of the present application.
FIG. 2 shows a schematic flow diagram of a method for customizing correction of different domain words according to an embodiment of the present application.
Fig. 3 shows a matching flow chart of a method for customizing error correction for words in different fields according to an embodiment of the present application.
FIG. 4 is a flow chart showing a comparative error correction method for customizing error correction for different domain words according to an embodiment of the present application.
Fig. 5 shows a schematic structural diagram of a customized correction system for words in different fields according to an embodiment of the present application.
Fig. 6 is a schematic diagram showing structural connection of an electronic device according to an embodiment of the application.
Description of element reference numerals
5-a customized error correction system for different domain words; 51—text acquisition module; 52—word extraction module; 53-word matching module; 54—a contrast error correction module; 6-electronic device; 61-a processor; 62—memory; 63—a communication interface; 64—a system bus; S21-S24; S231-S232; S241-S242.
Description of the embodiments
Other advantages and effects of the present application will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present application with reference to specific examples. The application may also be practiced or carried out in other embodiments with different specific details, and the details of the present description may be modified or varied without departing from the spirit and scope of the present application. It should be noted that the following embodiments and the features in the embodiments may be combined with each other without conflict.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present application. Only the components related to the present application are shown in the drawings, which are not drawn according to the number, shape and size of the components in actual implementation; the form, quantity and proportion of the components in actual implementation may vary arbitrarily, and the component layout may be more complicated.
The following embodiments of the present application provide a method, system, device and medium for customizing error correction for words in different fields, including but not limited to application to various electronic devices including processors and memories, and will be described below by taking a hardware application scenario of the present application as an example.
As shown in fig. 1, the present embodiment provides an application scenario of the method for customized error correction of words in different fields; the method is applied to an electronic device, such as a mobile phone terminal. For example, a text in the Tang poetry field contains the line "spring sleep, unaware of dawn" in which one character is wrongly written; through the application, the line is restored according to the correct error correction reference word.
The electronic device may be, for example, a computer including all or part of the components of a memory, a memory controller, one or more processing units (CPUs), a peripheral interface, RF circuitry, audio circuitry, speakers, a microphone, an input/output (I/O) subsystem, a display screen, other output or control devices, and an external port; the computer includes, but is not limited to, a personal computer such as a desktop computer, a notebook computer, a tablet computer, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA for short), and the like. In other embodiments, the electronic device may also be a server, where the server may be disposed on one or more entity servers according to multiple factors such as functions, loads, and the like, and may also be a cloud server formed by a distributed or centralized server cluster, which is not limited in this embodiment.
The following describes the technical solution in the embodiment of the present application in detail with reference to the drawings in the embodiment of the present application.
Referring to fig. 2, a schematic flow chart of a method for customizing correction of words in different fields according to an embodiment of the present application is shown. As shown in fig. 2, the method for customizing error correction for words in different fields provided in this embodiment specifically includes the following steps:
s21, acquiring an input text to be corrected.
In practical application, the text to be corrected refers to text content in a specific field, such as law names in the legal field, lines of ancient poems in the classical Chinese field, and disease names, drug names and other proper nouns in the medical field; preferably, the words in the application refer to long proper words of a certain character length in different fields.
S22, extracting words with preset attributes in the text to be corrected by using a word recognition model; the preset attributes include a preset field and a preset length. For example, the preset field is a medical field, and the preset length is 5 words or more. The word recognition model is also a recognition model used in the specific field corresponding to the preset attribute, thereby realizing customized error correction.
Specifically, the word recognition model refers to an NER model; the NER model is trained, and long-word extraction is performed on the input text to be corrected using the NER model. In this way, NER extraction is performed on the text, and possibly present long words are extracted.
Here, NER (Named Entity Recognition) refers to recognizing text fragments belonging to predefined categories from free text. The NER task was originally proposed at the Sixth Message Understanding Conference (MUC-6), when only a few general entity categories were defined, such as places, institutions and people. The named entity recognition task has since advanced into various vertical fields such as medicine and finance. Common models for NER include BERT+CRF, BERT+GlobalPointer, etc.
In other embodiments, the model used here may be replaced with other NER models such as BERT+CRF.
In one embodiment, the training data of the word recognition model is constructed based on correct words of a predetermined attribute.
Specifically, the construction process of the training data comprises the following steps:
(1) Setting the number of sentences with different error types; the error types include miscord, multiword, few words, and out-of-order.
Specifically, the names of all proper words in a certain field are collected from websites related to the field or other text data collection channels to form a correct long-word set. The numbers of wrong-word, multi-word, few-word and out-of-order sentences are set. First, 10 erroneous sentences are generated from each text containing a long word, comprising 5 wrong-word sentences, 2 multi-word sentences, 2 few-word sentences and 1 out-of-order sentence.
(2) And randomly replacing one word in the correct word with another word homophonic or near-homophonic with it, according to the set number of wrong-word sentences.
Specifically, according to the set 5 wrong-word sentences, one word is randomly made wrong. The construction method is as follows: one word is randomly replaced with another word, where the replacing word is homophonic or near-homophonic; each of the two cases has a 50% chance. For example, from the Tang poem line "spring sleep, unaware of dawn" in the classical-literature field, a wrong-word sentence is constructed in which one character of the line is wrongly written.
(3) And randomly inserting a word into the correct word according to the set sentence quantity of the multiple words.
Specifically, according to the set 2 multi-word sentences, one word is randomly added. The construction method is as follows: a word is randomly inserted into the long word; in one third of cases a random word is inserted, and in one third a homophonic word is inserted. For example, from the Tang poem line "spring sleep, unaware of dawn" in the classical-literature field, a multi-word sentence is constructed by inserting an extra character into the line.
(4) And randomly deleting one word from the correct words according to the set sentence quantity of the few words.
Specifically, according to the set 2 few-word sentences, one word is randomly removed. The construction method is as follows: a word is randomly deleted from the long word. For example, from the Tang poem line "spring sleep, unaware of dawn" in the classical-literature field, a few-word sentence is constructed by deleting one character from the line.
(5) According to the set number of out-of-order sentences, randomly exchanging the positions of two words in the correct words.
Specifically, according to the set 1 out-of-order sentence, the positions of 2 words are randomly exchanged. The construction method is as follows: in 50% of cases two adjacent words are exchanged, and in 50% two non-adjacent words are exchanged. For example, from the Tang poem line "spring sleep, unaware of dawn" in the classical-literature field, an out-of-order sentence is constructed by swapping two characters of the line.
(6) And placing the original sentence containing the correct word into the training data.
Specifically, the original Tang poem line containing the correct word, "spring sleep, unaware of dawn", is put into the training data.
(7) The positions of the correct words which are processed in error or are original are marked in sentences which participate in the construction.
Specifically, for each constructed sentence, the corresponding long-word position is marked. The model is then trained with the constructed data to obtain the word recognition model. In practical applications, the BERT+GlobalPointer model may be selected.
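The four error types above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper names and the tiny homophone table passed in are assumptions, and a real system would draw homophones from a pinyin dictionary.

```python
import random

def make_wrong_word(word, homophones):
    # Replace one random character with a homophone/near-homophone
    # from a (hypothetical) lookup table; fall back to a placeholder.
    i = random.randrange(len(word))
    repl = random.choice(homophones.get(word[i], ["X"]))
    return word[:i] + repl + word[i + 1:]

def make_multi_word(word):
    # Insert one extra character at a random position.
    i = random.randrange(len(word) + 1)
    return word[:i] + random.choice(word) + word[i:]

def make_few_word(word):
    # Delete one character at a random position.
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

def make_out_of_order(word):
    # Swap two adjacent characters at a random position.
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def build_samples(word, homophones):
    # Per the embodiment: 5 wrong-word, 2 multi-word, 2 few-word and
    # 1 out-of-order sentence, plus the original correct sentence.
    samples = [make_wrong_word(word, homophones) for _ in range(5)]
    samples += [make_multi_word(word) for _ in range(2)]
    samples += [make_few_word(word) for _ in range(2)]
    samples += [make_out_of_order(word), word]
    return samples
```

Each generated sentence would then be labeled with the position of the (corrupted or original) long word before training.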
S23, matching error correction reference words similar to the extracted words.
Specifically, BM25 is used to recall (match) correct words close to the extracted long word in text similarity, i.e., the error correction reference words.
BM25 is currently the most mainstream algorithm in the field of information retrieval for calculating similarity scores between a user input and existing documents. The BM25 computation consists essentially of three parts: (1) performing morpheme analysis on the Query to generate morphemes qi; (2) for each search result d, calculating the relevance score of each morpheme qi with d; (3) weighting and summing the relevance scores of the qi relative to d to obtain the relevance score of the Query and d. The algorithm formula is as follows:

Score(Q, d) = Σi Wi · R(qi, d)

where Q represents the Query, qi represents a morpheme obtained by parsing Q, d represents a search result document, Wi represents the weight of morpheme qi, and R(qi, d) represents the relevance score of morpheme qi with document d.
In other embodiments, the BM25 algorithm used in the correct-word recall may be replaced with other mature search libraries.
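A minimal pure-Python BM25 scorer over character tokens gives the flavor of the recall step. This is a sketch under stated assumptions: the character-level tokenization, the parameter values k1 and b, and the smoothed-IDF variant are all common choices, not details taken from the patent.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Tokenize at character level (a reasonable assumption for Chinese
    # long words; the patent does not specify the tokenizer).
    toks = [list(d) for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    n = len(docs)
    # Document frequency per character.
    df = Counter()
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in query:
            # Smoothed IDF as in the common Okapi BM25 formulation.
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1)
            f = tf[q]
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores
```

Recall then amounts to taking the top-N candidates by score, as the embodiment does with the top 20 per channel.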
Referring to fig. 3, a matching flow chart of a method for customizing error correction for words in different fields according to an embodiment of the present application is shown. As shown in fig. 3, step S23 specifically includes:
s231, obtaining alternative correct words similar to the extracted words on the text and the pinyin by using a similarity algorithm.
Specifically, the BM25 algorithm is used to obtain the first 20 correct words most similar to the extracted word in text and the first 20 most similar in pinyin, for a total of 40 alternatives.
In practical application, the extracted word is: Zhejiang Province household garbage classification management regulations.
The 20 words recalled using text are:
'Zhejiang Province household garbage management regulations', 'Jincheng City household garbage classification management regulations', 'Chaoyang City household garbage classification management regulations', 'Shenzhen City household garbage classification management regulations', 'Anhui Province household garbage classification management regulations', 'Panjin City household garbage classification management regulations', 'Ma'anshan City household garbage classification management regulations', 'Maoming City household garbage classification management regulations', 'Jinzhou City household garbage classification management regulations', 'Xi'an City household garbage classification management regulations', 'Liaoyang City household garbage classification management regulations', 'Fuzhou City household garbage classification management regulations', 'Fushun City household garbage classification management regulations', 'Suzhou City household garbage classification management regulations', 'Changzhi City household garbage classification management regulations', 'Tongling City household garbage classification management regulations', 'Jiaozuo City household garbage classification management regulations', 'Huanren Manchu Autonomous County household garbage classification management regulations', 'Shantou Special Economic Zone household garbage classification management regulations', 'Huludao City household garbage classification management regulations'.
The pinyin of the 20 words recalled using pinyin (query pinyin: 'zhejiangshengshenghuolajifenleiguanlitiaoli') is:
'zhejiangshengshenghuolajiguanlitiaoli'、'jinchengshishenghuolajifenleiguanlitiaoli'、'chaoyangshishenghuolajifenleiguanlitiaoli'、'shenzhenshishenghuolajifenleiguanlitiaoli'、'anhuishengshenghuolajifenleiguanlitiaoli'、'panjinshishenghuolajifenleiguanlitiaoli'、'maanshanshishenghuolajifenleiguanlitiaoli'、'maomingshishenghuolajifenleiguanlitiaoli'、'jinzhoushishenghuolajifenleiguanlitiaoli'、'xianshishenghuolajifenleiguanlitiaoli'、'liaoyangshishenghuolajifenleiguanlitiaoli'、'fuzhoushishenghuolajifenleiguanlitiaoli'、'fushunshishenghuolajifenleiguanlitiaoli'、'suzhoushishenghuolajifenleiguanlitiaoli'、'changzhishishenghuolajifenleiguanlitiaoli'、'tonglingshishenghuolajifenleiguanlitiaoli'、'jiaozuoshishenghuolajifenleiguanlitiaoli'、'huanrenmanzuzizhixianshenghuolajifenleiguanlitiaoli'、'shantoujingjitequshenghuolajifenleiguanlitiaoli'、'huludaoshishenghuolajifenleiguanlitiaoli'。
When recalling with pinyin, the recalled words can be obtained through the one-to-one correspondence between each recalled pinyin string and its word.
S232, carrying out similarity calculation again on the alternative correct words, and determining the final error correction reference words.
Specifically, similarity comparison is conducted again on the words recalled by text and by pinyin, and the 5 words with the highest similarity are selected for specific comparison. In practical application, the similarity between each of the 40 alternative words and the extracted word is calculated again using a similarity method, the candidates are reduced to 5, and the 5 words with similarity greater than 0.6 and highest similarity are selected. Here, similarity = 2 × (total length of the matched common substrings of the two strings) / (sum of the lengths of the two compared strings).
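The similarity defined here, 2 × matched length / total length, is exactly what Python's difflib.SequenceMatcher.ratio() computes, so the re-ranking step can be sketched as follows (the function names are illustrative):

```python
import difflib

def similarity(a, b):
    # ratio() = 2 * M / T, where M is the total length of the matched
    # blocks and T is the combined length of both strings.
    return difflib.SequenceMatcher(None, a, b).ratio()

def rerank(extracted, candidates, top_k=5, threshold=0.6):
    # Keep candidates above the similarity threshold, best first.
    scored = [(similarity(extracted, c), c) for c in candidates]
    scored = [sc for sc in scored if sc[0] > threshold]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

The survivors of this step are the error correction reference words used in S24.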
S24, comparing the difference degree between the extracted words and the error correction reference words, and correcting errors according to the difference degree.
Specifically, the 5 words with the highest similarity finally determined are used as 5 error correction reference words and are compared with the extracted long word; if an error is judged, the error is reported and the corresponding correct word is returned.
Referring to fig. 4, a comparative error correction flowchart of a method for customizing error correction for words in different fields according to an embodiment of the present application is shown. As shown in fig. 4, step S24 specifically includes:
s241, determining the difference between the extracted word and the error correction reference word in response to the difference between the extracted word and the error correction reference word, and analyzing the difference type of the difference.
Specifically, the difference between the two compared character strings, the extracted word and the error correction reference word, is obtained using the difflib library, yielding comparison results such as wrong words, multiple words and few words. Here, difflib is a standard library module of Python whose role is to compare differences between texts.
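A sketch of this diff-type analysis with difflib (the label names are illustrative; the mapping from opcodes to the patent's error types is an assumption):

```python
import difflib

def diff_types(extracted, reference):
    # Classify differences between the extracted word and an error
    # correction reference word using SequenceMatcher opcodes.
    sm = difflib.SequenceMatcher(None, extracted, reference)
    types = set()
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "replace":
            types.add("wrong-word")   # a character was substituted
        elif tag == "delete":
            types.add("multi-word")   # extracted has an extra character
        elif tag == "insert":
            types.add("few-word")     # extracted is missing a character
    return types
```

Out-of-order errors typically surface as a pair of replace (or delete plus insert) opcodes, so they can be detected by checking whether the two strings contain the same multiset of characters.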
S242, determining an error correction mode according to the difference type.
In one embodiment, step S242 includes:
(1) For the different error correction reference words, the total number of differing words between the extracted word and each error correction reference word is determined, and the error correction reference word reachable with the fewest word replacements is selected for error correction.
Specifically, the sum of all erroneous words is found, and the candidate requiring the fewest replacement words is selected.
For example, for the Tang poem line "spring sleep, unaware of dawn" extracted in the classical-literature field with one wrong character, error correction reference word 1 is "spring sleep, unaware of dawn" and error correction reference word 2 is "in spring, how many branches sprout". On comparison, the extracted line has 1 wrong word compared with error correction reference word 1 and 4 wrong words compared with error correction reference word 2, so error correction reference word 1, with fewer corrections, is taken as the final error correction reference word.
(2) And selecting error correction reference words with error positions behind the sentence for error correction in response to the fact that the number of the extracted words is the same as that of the error correction reference words.
Specifically, if the numbers of erroneous words are equal, the error correction reference word whose error position is later is preferred.
For example, the extracted word is: office administration regulations; the possible matches are: advertising management regulations and office affairs regulations.
Both "advertising management regulations" and "office affairs regulations" have 2 words wrong compared with "office administration regulations". However, the "advertising" in "advertising management regulations" differs from "office", with wrong-word indices 0 to 1, while the "affairs" in "office affairs regulations" differs from "administration", with wrong-word indices 2 to 3. Since its error position is later, "office affairs regulations" is prioritized and taken as the final error correction reference word.
(3) And responding to the existence of multiple difference types between the extracted words and different correction reference words, and selecting the correction reference word closest to the extracted words according to the pinyin for correction.
Specifically, if multiple candidates could be reported as errors, the closest one is selected based on pinyin.
In practical application, the extracted word is "XXX Inverted-Dragon Law", which may match more than one candidate, including "XXX Anti-Monopoly Law"; because "Monopoly" and the "Dragon" ending are consistent in pinyin, "XXX Anti-Monopoly Law" is selected as the final error correction reference word.
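The pinyin-closeness rule can be sketched as follows. The pinyin table here is a tiny illustrative stand-in covering only the characters of this example; a real system would use a full pinyin dictionary (for instance the pypinyin library):

```python
# Toy pinyin table for illustration only: 垄 and 龙 share the pinyin
# "long", while 恐 ("kong") does not; characters absent from the table
# are treated as having no known pinyin.
PINYIN = {"垄": "long", "龙": "long", "断": "duan", "恐": "kong"}

def pinyin_overlap(extracted: str, reference: str) -> int:
    """Count aligned positions whose characters share the same pinyin."""
    return sum(
        1 for a, b in zip(extracted, reference)
        if PINYIN.get(a) is not None and PINYIN.get(a) == PINYIN.get(b)
    )

def pick_by_pinyin(extracted: str, references: list[str]) -> str:
    """Prefer the candidate sharing the most pinyin with the extracted word."""
    return max(references, key=lambda ref: pinyin_overlap(extracted, ref))
```

Given an extracted "反龙法", the candidate "反垄法" outranks "反恐法" because 龙 and 垄 are pronounced identically.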
In one embodiment, after step S24, the method further comprises:
determining a post-processing filtering rule in combination with false alarm cases accumulated in the preset field;
and performing false alarm analysis on the corrected words by using the post-processing filtering rule to eliminate false alarms.
In practical application, "coronary atherosclerotic heart disease" is commonly abbreviated as "coronary heart disease", and the abbreviation appears in medical texts, for example as "coronary atherosclerotic heart disease (hereinafter referred to as coronary heart disease)". Although "coronary heart disease" has fewer characters, it should not be reported as a missing-word error.
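A minimal post-processing filter for this case can be sketched as follows. The rule base here is assumed to be a mapping from known abbreviations to their full forms; real rules would be distilled from the false-alarm cases accumulated in the target field, and the function names are illustrative:

```python
# Assumed rule base: abbreviations that must never be flagged as errors.
KNOWN_ABBREVIATIONS = {
    "coronary heart disease": "coronary atherosclerotic heart disease",
}

def filter_false_alarms(corrections: list[dict]) -> list[dict]:
    """Drop any correction whose flagged word is a legitimate
    abbreviation of the suggested reference word."""
    return [
        c for c in corrections
        if KNOWN_ABBREVIATIONS.get(c["extracted"]) != c["reference"]
    ]
```

Applied to a batch of proposed corrections, the abbreviation case is suppressed while genuine misspellings pass through.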
The protection scope of the customized error correction method for words in different fields according to the embodiments of the present application is not limited to the execution order of the steps listed in the embodiments; any scheme implemented by adding, removing, or replacing steps according to the prior art based on the principles of the present application falls within the protection scope of the present application.
The embodiments of the present application further provide a customized error correction system for words in different fields, which can implement the above customized error correction method. The apparatus implementing the method includes, but is not limited to, the structure of the system listed in this embodiment; all structural modifications and substitutions of the prior art made according to the principles of the present application fall within the protection scope of the present application.
Referring to fig. 5, a schematic structural diagram of a customized error correction system for words in different fields according to an embodiment of the present application is shown. As shown in fig. 5, the customized error correction system 5 for words in different fields provided in this embodiment includes: a text acquisition module 51, a word extraction module 52, a word matching module 53, and a contrast correction module 54.
The text obtaining module 51 is configured to obtain the inputted text to be corrected.
The word extracting module 52 is configured to extract words with preset attributes in the text to be corrected by using a word recognition model; the preset attributes include a preset field and a preset length.
In one embodiment, the training data of the word recognition model is constructed based on correct words of a predetermined attribute.
In one embodiment, the training data construction process includes: setting the number of sentences for each error type, the error types including wrong words, multiple words, few words, and disorder; randomly replacing, according to the set number of wrong-word sentences, one word in the correct word with another word that is homophonic or near-homophonic with it; randomly inserting a word into the correct word according to the set number of multiple-word sentences; randomly deleting a word from the correct word according to the set number of few-word sentences; randomly exchanging the positions of two words in the correct word according to the set number of disordered sentences; placing the original sentence containing the correct word into the training data; and marking, in each sentence participating in the construction, the position of the error-processed or original correct word.
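The error-injection steps above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the homophone table is an assumed input, and each generated sentence is labelled with the word's span as the description requires:

```python
import random

def corrupt(word: str, error_type: str, homophones: dict[str, list[str]]) -> str:
    """Apply one synthetic error of the given type to a correct word.
    `homophones` maps a character to same/similar-sounding characters
    and is an assumption of this sketch."""
    chars = list(word)
    if error_type == "wrong":       # replace one char with a homophone
        i = random.randrange(len(chars))
        chars[i] = random.choice(homophones.get(chars[i], ["?"]))
    elif error_type == "extra":     # insert one random char of the word
        chars.insert(random.randrange(len(chars) + 1), random.choice(word))
    elif error_type == "missing":   # delete one char
        del chars[random.randrange(len(chars))]
    elif error_type == "shuffled":  # swap two adjacent chars
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def build_training_data(correct_word, template, counts, homophones):
    """Generate labelled sentences: `counts` maps error type -> number of
    sentences; the original (error-free) sentence is kept and every word
    position is marked via its character span."""
    sentence = template.format(correct_word)
    start = sentence.find(correct_word)
    data = [{"sentence": sentence, "word": correct_word,
             "span": (start, start + len(correct_word)), "error": None}]
    for error_type, n in counts.items():
        for _ in range(n):
            bad = corrupt(correct_word, error_type, homophones)
            sentence = template.format(bad)
            start = sentence.find(bad)
            data.append({"sentence": sentence, "word": bad,
                         "span": (start, start + len(bad)),
                         "error": error_type})
    return data
```

Each error type changes the word's length in a predictable way (substitution and swap keep it, insertion grows it by one, deletion shrinks it by one), which is a quick sanity check on the generator.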
The word matching module 53 is configured to match error correction reference words similar to the extracted words.
In one embodiment, the word matching module 53 is specifically configured to obtain, by using a similarity algorithm, alternative correct words similar to the extracted words in both text and pinyin, and to perform similarity calculation again on the alternative correct words to determine the final error correction reference word.
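The two-stage matching can be sketched with the standard library's `difflib`. The thresholds and function names are illustrative assumptions, and a production system would also score pinyin similarity in the coarse pass, as the module description requires:

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Surface-form similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()

def match_reference(extracted: str, lexicon: list[str],
                    coarse: float = 0.5) -> str:
    """Two-stage matching: a coarse pass keeps roughly similar lexicon
    entries, then the shortlist is re-scored and the best entry is
    returned (empty string if nothing passes the coarse pass)."""
    shortlist = [w for w in lexicon if text_similarity(extracted, w) >= coarse]
    shortlist.sort(key=lambda w: text_similarity(extracted, w), reverse=True)
    return shortlist[0] if shortlist else ""
```

Filtering first keeps the expensive re-scoring pass restricted to a handful of plausible candidates from a potentially large domain lexicon.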
The contrast correction module 54 is configured to compare the degree of difference between the extracted words and the error correction reference words, and to perform error correction according to the degree of difference.
In one embodiment, the contrast correction module 54 is specifically configured to, in response to a difference existing between the extracted word and the error correction reference word, determine the difference and analyze the difference type, and to determine an error correction mode according to the difference type.
In one embodiment, the contrast correction module 54 is more specifically configured to: determine, for different error correction reference words, the number of erroneous words in the extracted word compared with each error correction reference word, and select for error correction the error correction reference word into which the extracted word can be converted with the fewest word replacements; in response to the numbers of erroneous words being the same, select for error correction the error correction reference word whose error position is later in the sentence; and in response to multiple difference types existing between the extracted word and different error correction reference words, select for error correction the error correction reference word closest to the extracted word in pinyin.
In an embodiment, the customized error correction system for words in different fields further includes a false alarm analysis module configured to determine a post-processing filtering rule in combination with false alarm cases accumulated in the preset field, and to perform false alarm analysis on the corrected words by using the post-processing filtering rule to eliminate false alarms.
In the several embodiments provided by the present application, it should be understood that the disclosed system or method may be implemented in other manners. For example, the system embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the objectives of the embodiments of the present application. For example, functional modules/units in various embodiments of the application may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether such functions are implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation should not be considered beyond the scope of the present application.
Fig. 6 is a schematic diagram showing the structural connection of an electronic device according to an embodiment of the application. As shown in fig. 6, the electronic device 6 of the present application includes: a processor 61, a memory 62, a communication interface 63, and/or a system bus 64. The memory 62 and the communication interface 63 are connected to the processor 61 via the system bus 64 and communicate with each other; the memory 62 is used for storing a computer program, the communication interface 63 is used for communicating with other devices, and the processor 61 is used for running the computer program to make the electronic device 6 execute the steps of the customized error correction method for words in different fields.
The processor 61 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field programmable gate arrays (Field Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
The memory 62 may include a random access memory (Random Access Memory, simply referred to as RAM), and may further include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory.
The system bus 64 mentioned above may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The system bus 64 may be divided into an address bus, a data bus, a control bus, and the like. The communication interface is used for realizing communication between the database access device and other devices (such as a client, a read-write library and a read-only library).
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, realizes the customized error correction method for words in different fields. Those of ordinary skill in the art will appreciate that all or part of the steps in the method implementing the above embodiments may be implemented by a program to instruct a processor, where the program may be stored in a computer readable storage medium, where the storage medium is a non-transitory (non-transitory) medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof. The storage media may be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
The description of each process or structure corresponding to the drawings has its own emphasis; for parts of a certain process or structure that are not described in detail, reference may be made to the related descriptions of other processes or structures.
The above embodiments merely illustrate the principles and effects of the present application and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the application. Accordingly, all equivalent modifications and variations completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present application.

Claims (6)

1. A method for customizing error correction for words in different fields, the method comprising:
acquiring an input text to be corrected;
extracting words with preset attributes in the text to be corrected by using a word recognition model; the preset attributes comprise a preset field and a preset length;
matching error correction reference words similar to the extracted words;
comparing the difference degree between the extracted words and the error correction reference words, and correcting errors according to the difference degree;
the training data of the word recognition model is constructed based on correct words with preset attributes; the construction process of the training data comprises the following steps:
setting the number of sentences with different error types; the error types include wrong words, multiple words, few words and disorder;
randomly replacing, according to the set number of wrong-word sentences, one word in the correct words with another word that is homophonic or near-homophonic with it;
randomly inserting a word into the correct word according to the set sentence quantity of multiple words;
randomly deleting a word from the correct words according to the set sentence quantity of the few words;
randomly exchanging the positions of two words in the correct words according to the set disordered sentence quantity;
placing the original sentence containing the correct word into the training data;
marking, in each sentence participating in the construction, the position of the error-processed or original correct word;
the step of comparing the difference degree between the extracted words and the error correction reference words and correcting errors according to the difference degree comprises the following steps:
in response to a difference existing between the extracted word and the error correction reference word, determining the difference and analyzing the difference type;
determining an error correction mode according to the difference type;
the step of determining the error correction mode according to the difference type comprises the following steps:
determining, for different error correction reference words, the number of erroneous words in the extracted word compared with each error correction reference word, and selecting for error correction the error correction reference word into which the extracted word can be converted with the fewest word replacements;
in response to the numbers of erroneous words relative to different error correction reference words being the same, selecting for error correction the error correction reference word whose error position is later in the sentence;
and in response to multiple difference types existing between the extracted word and different error correction reference words, selecting for error correction the error correction reference word closest to the extracted word in pinyin.
2. The method of claim 1, wherein the step of matching the error correction reference word that is similar to the extracted word comprises:
obtaining alternative correct words similar to the extracted words on the text and the pinyin by using a similarity algorithm;
and performing similarity calculation again on the alternative correct words to determine the final error correction reference word.
3. The method of claim 1, wherein after the step of correcting errors based on the degree of difference, the method further comprises:
determining a post-processing filtering rule in combination with false alarm cases accumulated in the preset field;
and performing false alarm analysis on the corrected words by using the post-processing filtering rule to eliminate false alarms.
4. A customized error correction system for different domain words, the system comprising:
the text acquisition module is configured to acquire an input text to be corrected;
the word extraction module is configured to extract words with preset attributes in the text to be corrected by using a word recognition model; the preset attributes comprise a preset field and a preset length; the training data of the word recognition model is constructed based on correct words with preset attributes; the construction process of the training data comprises the following steps:
setting the number of sentences with different error types; the error types include wrong words, multiple words, few words and disorder;
randomly replacing, according to the set number of wrong-word sentences, one word in the correct words with another word that is homophonic or near-homophonic with it;
randomly inserting a word into the correct word according to the set sentence quantity of multiple words;
randomly deleting a word from the correct words according to the set sentence quantity of the few words;
randomly exchanging the positions of two words in the correct words according to the set disordered sentence quantity;
placing the original sentence containing the correct word into the training data;
marking, in each sentence participating in the construction, the position of the error-processed or original correct word;
the word matching module is configured to match error correction reference words similar to the extracted words;
the comparison error correction module is configured to compare the difference degree between the extracted words and the error correction reference words and correct errors according to the difference degree; comparing the difference degree between the extracted words and the error correction reference words, and correcting errors according to the difference degree comprises:
in response to a difference existing between the extracted word and the error correction reference word, determining the difference and analyzing the difference type;
determining an error correction mode according to the difference type;
the determining the error correction mode according to the difference type comprises the following steps:
determining, for different error correction reference words, the number of erroneous words in the extracted word compared with each error correction reference word, and selecting for error correction the error correction reference word into which the extracted word can be converted with the fewest word replacements;
in response to the numbers of erroneous words relative to different error correction reference words being the same, selecting for error correction the error correction reference word whose error position is later in the sentence;
and in response to multiple difference types existing between the extracted word and different error correction reference words, selecting for error correction the error correction reference word closest to the extracted word in pinyin.
5. An electronic device, comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, to cause the electronic device to perform the method according to any one of claims 1 to 3.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1 to 3.
CN202310573154.9A 2023-05-22 2023-05-22 Customized error correction method, system, equipment and medium for words in different fields Active CN116306598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310573154.9A CN116306598B (en) 2023-05-22 2023-05-22 Customized error correction method, system, equipment and medium for words in different fields

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310573154.9A CN116306598B (en) 2023-05-22 2023-05-22 Customized error correction method, system, equipment and medium for words in different fields

Publications (2)

Publication Number Publication Date
CN116306598A CN116306598A (en) 2023-06-23
CN116306598B true CN116306598B (en) 2023-09-08

Family

ID=86785391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310573154.9A Active CN116306598B (en) 2023-05-22 2023-05-22 Customized error correction method, system, equipment and medium for words in different fields

Country Status (1)

Country Link
CN (1) CN116306598B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2013219188A1 (en) * 2007-01-04 2013-09-12 Thinking Solutions Pty Ltd Linguistic Analysis
WO2014117549A1 (en) * 2013-01-29 2014-08-07 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction
WO2019024050A1 (en) * 2017-08-03 2019-02-07 Lingochamp Information Technology (Shanghai) Co., Ltd. Deep context-based grammatical error correction using artificial neural networks
CN111062217A (en) * 2019-12-19 2020-04-24 江苏满运软件科技有限公司 Language information processing method and device, storage medium and electronic equipment
CN112597753A (en) * 2020-12-22 2021-04-02 北京百度网讯科技有限公司 Text error correction processing method and device, electronic equipment and storage medium
CN112949290A (en) * 2021-02-03 2021-06-11 深圳市优必选科技股份有限公司 Text error correction method and device and communication equipment
CN114238370A (en) * 2021-12-08 2022-03-25 中信银行股份有限公司 Method and system for applying NER entity recognition algorithm in report query
WO2022105083A1 (en) * 2020-11-19 2022-05-27 平安科技(深圳)有限公司 Text error correction method and apparatus, device, and medium
CN114861637A (en) * 2022-05-18 2022-08-05 北京百度网讯科技有限公司 Method and device for generating spelling error correction model and method and device for spelling error correction
CN114861636A (en) * 2022-05-10 2022-08-05 网易(杭州)网络有限公司 Training method and device of text error correction model and text error correction method and device
CN115130463A (en) * 2022-04-19 2022-09-30 腾讯科技(深圳)有限公司 Error correction method, model training method, computer medium, and apparatus
WO2022267353A1 (en) * 2021-06-25 2022-12-29 北京市商汤科技开发有限公司 Text error correction method and apparatus, and electronic device and storage medium
WO2023005293A1 (en) * 2021-07-30 2023-02-02 平安科技(深圳)有限公司 Text error correction method, apparatus, and device, and storage medium
CN115965009A (en) * 2022-12-23 2023-04-14 中国联合网络通信集团有限公司 Training and text error correction method and device for text error correction model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597754B (en) * 2020-12-23 2023-11-21 北京百度网讯科技有限公司 Text error correction method, apparatus, electronic device and readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hao Yanan; Qiao Gangzhu; Tan Ying. Research on an automatic proofreading method for word errors in OCR text recognition. Computer Simulation. 2020, (Issue 09), full text. *

Also Published As

Publication number Publication date
CN116306598A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
US20200081899A1 (en) Automated database schema matching
US8452772B1 (en) Methods, systems, and articles of manufacture for addressing popular topics in a socials sphere
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
Goyal et al. A context-based word indexing model for document summarization
US20130060769A1 (en) System and method for identifying social media interactions
US20200134398A1 (en) Determining intent from multimodal content embedded in a common geometric space
Yan et al. Named entity recognition by using XLNet-BiLSTM-CRF
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN111324771B (en) Video tag determination method and device, electronic equipment and storage medium
CN101689198A (en) Phonetic search using normalized string
CN111126067A (en) Entity relationship extraction method and device
Wang et al. Data set and evaluation of automated construction of financial knowledge graph
Wu et al. Efficient reuse of natural language processing models for phenotype-mention identification in free-text electronic medical records: a phenotype embedding approach
CN112559895B (en) Data processing method and device, electronic equipment and storage medium
CN111444712B (en) Keyword extraction method, terminal and computer readable storage medium
CN113392205A (en) User portrait construction method, device and equipment and storage medium
Nasim et al. Evaluation of clustering techniques on Urdu News head-lines: A case of short length text
CN116306598B (en) Customized error correction method, system, equipment and medium for words in different fields
CN111460808A (en) Synonymous text recognition and content recommendation method and device and electronic equipment
Chen et al. Distant supervision for relation extraction with sentence selection and interaction representation
KR102559849B1 (en) Malicious comment filter device and method
Chou et al. On the Construction of Web NER Model Training Tool based on Distant Supervision
Ling et al. Chinese organization name recognition based on multiple features
Zhang et al. NewsQuote: A Dataset Built on Quote Extraction and Attribution for Expert Recommendation in Fact-Checking
CN112307070A (en) Mask data query method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP01 Change in the name or title of a patent holder
CP01 Change in the name or title of a patent holder

Address after: Room 301ab, No.10, Lane 198, zhangheng Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai 201204

Patentee after: Shanghai Mido Technology Co.,Ltd.

Address before: Room 301ab, No.10, Lane 198, zhangheng Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai 201204

Patentee before: SHANGHAI MDATA INFORMATION TECHNOLOGY Co.,Ltd.