CN111626048A - Text error correction method, device, equipment and storage medium - Google Patents

Text error correction method, device, equipment and storage medium Download PDF

Info

Publication number
CN111626048A
CN111626048A (application CN202010442510.XA)
Authority
CN
China
Prior art keywords
text
character
confusion
corrected
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010442510.XA
Other languages
Chinese (zh)
Inventor
洪科元
李斌
章秦
苏晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010442510.XA priority Critical patent/CN111626048A/en
Publication of CN111626048A publication Critical patent/CN111626048A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries

Abstract

The embodiments of this application provide a text error correction method, apparatus, device, and storage medium. The method includes: replacing at least one confusing character in a text to be corrected using a preset confusion word library to obtain a first text set; determining, from the first text set, a candidate text that meets a preset condition; replacing at least one confusing character in the candidate text using the preset confusion word library to obtain a second text set; and, according to the second text set, traversing a domain word library that stores at least two words belonging to the same domain as the text to be corrected, to obtain a target text matching a text in the second text set. Because the text to be corrected is corrected with both the confusion word library and the domain word library, domain-specific proper nouns can be corrected, which improves the accuracy of text error correction.

Description

Text error correction method, device, equipment and storage medium
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for text error correction.
Background
In the process of character recognition, the candidate set for character error correction is generated from a full dictionary; when candidate characters are retrieved, this full-scale retrieval makes the search space too large and time-consuming. In addition, in scenarios involving the correction of similar-shaped characters, the word vectors of words composed of different similar-shaped characters may be close to one another, so the accuracy of distinguishing them cannot be guaranteed.
Disclosure of Invention
The embodiments of this application provide a text error correction method, apparatus, device, and storage medium, in which the text to be corrected is corrected using a confusion word library and a domain word library; domain-specific proper nouns can thereby be corrected, which improves the accuracy of text error correction.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a text error correction method, including:
replacing at least one confusion character in a text to be corrected by adopting a preset confusion word library to obtain a first text set;
in the first text set, determining candidate texts meeting preset conditions;
replacing at least one confusion character in the candidate text by adopting the preset confusion word library to obtain a second text set;
according to the second text set, traversing a domain word library that stores at least two words belonging to the same domain as the text to be corrected, to obtain a target text matching a text in the second text set.
In a second aspect, an embodiment of the present application provides a text error correction apparatus, where the apparatus includes:
a first replacement module, configured to replace at least one confusing character in a text to be corrected using a preset confusion word library to obtain a first text set;
the first determining module is used for determining candidate texts meeting preset conditions in the first text set;
the second replacement module is used for replacing at least one confusion character in the candidate text by adopting the preset confusion word library to obtain a second text set;
and a first traversal module, configured to traverse, according to the second text set, a domain word library that stores at least two words belonging to the same domain as the text to be corrected, to obtain a target text matching a text in the second text set.
In a third aspect, an embodiment of the present application provides an apparatus for text error correction, including: a memory for storing executable instructions; and the processor is used for realizing the text error correction method when executing the executable instructions stored in the memory.
In a fourth aspect, an embodiment of the present application provides a storage medium storing executable instructions for causing a processor to implement a text error correction method provided in an embodiment of the present application when executed.
The embodiments of this application have the following beneficial effects. For the acquired text to be corrected, a plurality of first texts are first constructed using the confusion word library, and the first texts are then corrected to determine candidate texts that meet the preset condition; thus, when correcting the text to be corrected, the candidate characters are selected only from the glyph-similar confusing characters in the confusion set to form the first texts, which greatly reduces the computation involved in judging sentence validity. Next, the confusion words in the candidate text are replaced using the confusion word library, and the domain word libraries of the same domain are traversed according to the second text set to obtain the target text. In this way, a domain word library built from domain-specific proper nouns corrects those proper nouns, improving the accuracy of distinguishing the same word across different domains.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of a text correction system according to an embodiment of the present application;
FIG. 2A is a schematic diagram of an alternative architecture of a text correction system according to an embodiment of the present application;
FIG. 2B is a schematic structural diagram of a text correction system according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of an implementation process of a text error correction method provided in an embodiment of the present application;
fig. 4A is a schematic flowchart of another implementation of the text error correction method according to the embodiment of the present application;
fig. 4B is a schematic flowchart of another implementation of the text error correction method according to the embodiment of the present application;
fig. 5 is a schematic flowchart of an implementation flow of a text error correction method provided in an embodiment of the present application;
FIG. 6 is a diagram of an application scenario of the text error correction method according to the embodiment of the present application;
FIG. 7 is a diagram of another application scenario of the text error correction method according to the embodiment of the present application;
FIG. 8 is a schematic flowchart of another implementation of a text error correction method provided in an embodiment of the present application;
FIG. 9A is an architectural diagram of a process for performing OCR recognition according to an embodiment of the present application;
FIG. 9B is a diagram of an application scenario of the text error correction method according to the embodiment of the present application;
fig. 10 is a schematic diagram of a composition structure of a domain dictionary tree according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first \ second \ third" are used only to distinguish similar objects and do not denote a particular order; where permissible, a specific order or sequence may be interchanged, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Similar-character error correction: correcting errors in a text in which a character has been replaced by a visually similar character.
2) Optical Character Recognition (OCR): the process by which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into machine-encoded text using a character recognition method.
3) Natural Language Processing (NLP): is an important direction in the fields of computer science and artificial intelligence. Natural language processing studies various theories and methods that enable efficient communication between humans and computers using natural language.
4) Confusion set: a set of easily confused, visually similar characters compiled, manually or automatically, from Chinese corpora according to character similarity.
5) Prefix (Trie) tree: a tree structure and a variant of the hash tree. Typical applications are counting, sorting, and storing large numbers of strings (though not limited to strings), so it is often used by search-engine systems for text word-frequency statistics. Its advantage is that the common prefixes of strings are shared to reduce query time, minimizing unnecessary string comparisons, so its query efficiency is higher than that of a hash tree.
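As a concrete illustration of how a domain lexicon might be held in such a structure, the following is a minimal prefix-tree sketch in Python. This is an illustrative reconstruction, not the implementation described in this application; the class and method names are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.is_word = False # True if a stored word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        # Walk down the tree, creating nodes as needed; shared prefixes
        # reuse existing nodes, which is the space/time advantage noted above.
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```

A lookup visits at most one node per character, so checking whether a candidate text is a known domain term costs time proportional to the term's length rather than the lexicon's size.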
6) Binary (bigram) language model: for a piece of text "W1W2W3" containing 3 characters, W1, W2 and W3 have candidate sequences (W11, W12, W13, W14, W15), (W21, W22, W23, W24, W25) and (W31, W32, W33, W34, W35), respectively. In language models, based on the Markov assumption, the probability of any word appearing is related only to the preceding n − 1 words; the corresponding language model is called an n-gram language model. A bigram language model is used here, i.e., the probability of any word appearing is related only to the word immediately before it.
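The bigram probabilities described above can be estimated by counting, as in this minimal sketch. It is an illustrative example only, assuming maximum-likelihood estimation without smoothing (a real system would smooth unseen bigrams):

```python
from collections import Counter

def train_bigram(corpus):
    # corpus: iterable of token sequences.
    # Returns a function giving the conditional probability P(w2 | w1).
    unigram = Counter()
    bigram = Counter()
    for sent in corpus:
        tokens = ["<s>"] + list(sent)          # sentence-start marker
        unigram.update(tokens[:-1])            # contexts
        bigram.update(zip(tokens[:-1], tokens[1:]))
    return lambda w1, w2: bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

def sentence_prob(prob, sent):
    # Under the Markov assumption, P(sentence) is the product of the
    # conditional probabilities of each token given its predecessor.
    p = 1.0
    tokens = ["<s>"] + list(sent)
    for w1, w2 in zip(tokens[:-1], tokens[1:]):
        p *= prob(w1, w2)
    return p
```

Scoring each candidate text with `sentence_prob` and keeping the highest-scoring one is how such a model can select the most plausible candidate from a text set.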
7) Blockchain (Blockchain): an encrypted, chained transactional memory structure formed of blocks (blocks).
8) Blockchain Network: the set of nodes that incorporate new blocks into a blockchain by way of consensus.
9) Cloud Technology: a general term for the network, information, integration, management-platform, and application technologies applied in the cloud-computing business model. It can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems, such as video websites, picture websites, and web portals, require large amounts of computing and storage resources. With the development of the internet industry, each item may carry its own identification mark that must be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data need strong system background support, which can only be realized through cloud computing.
10) Cloud Storage: a new concept extended and developed from cloud computing. A distributed cloud storage system (hereinafter, storage system) integrates a large number of storage devices (also called storage nodes) of various types in a network, through application software or application interfaces, to work cooperatively using cluster applications, grid technology, distributed storage file systems, and similar functions, and provides data storage and service-access functions externally.
In the related art, when the candidate set for text error correction is generated from a full dictionary and candidate words are retrieved, the full-scale retrieval makes the search space too large; because most candidate words are irrelevant to the current word, full retrieval wastes a great deal of time on useless work. In similar-character error-correction scenarios, the word vectors of words composed of different similar-shaped characters may be so close that they cannot be effectively distinguished, so some similar-character errors cannot be effectively detected.
Based on this, embodiments of this application provide a text error correction method, apparatus, device, and storage medium, in which a confusion word set is constructed from similar-shaped characters. When the text to be corrected is corrected, the first text set selects only a small number of the most similar-looking characters from the confusion set to form the first texts, which greatly reduces the computation involved in judging sentence validity. Meanwhile, in the embodiments of this application, different similar-shaped characters are represented clearly and distinctly, avoiding the problem that a neural language model cannot effectively distinguish similar character vectors. In addition, the embodiments of this application use a domain dictionary tree built from a domain dictionary to correct domain-specific proper nouns, which effectively solves the problem that the same word cannot be effectively distinguished across different domains because of ambiguity.
An exemplary application of the text error correction device provided in the embodiments of this application is described below. The terminal provided in the embodiments of this application may be implemented as various types of user equipment, or as a server; exemplary applications for both cases are described. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, big data, and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, or a smart watch. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
Referring to fig. 1, fig. 1 is an optional architecture diagram of the text error correction system provided in an embodiment of this application, supporting an exemplary application. First, when the recognized text 101 to be corrected is acquired, a confusion word library 102 is used to replace at least one confusing character 103 in the text 101 to be corrected, obtaining a first text set; the most reasonable candidate text 104 is then determined from the first text set. In this way, when the text to be corrected is corrected, candidate characters are selected from the confusion set only among glyph-similar confusing characters to form the first texts, which greatly reduces the computation involved in judging sentence validity. Next, at least one confusing character 105 in the candidate text 104 is replaced using the confusion word library 102 to obtain a second text set 106. Finally, the domain dictionary tree 107 is traversed according to the second text set 106 to correct proper names in the second texts again, obtaining a target text 108, which is then output; the domain dictionary tree stores, in a tree structure, at least two words belonging to the same domain as the text to be corrected. In this way, a domain dictionary tree model built from a domain dictionary corrects the proper nouns of the domain, which effectively solves the problem that the same word cannot be effectively distinguished across different domains because of ambiguity.
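The two-pass flow of fig. 1 (confusion-set substitution, candidate selection, second substitution, domain-lexicon lookup) might be sketched as follows. This is a hypothetical simplification: `score` stands in for the language model, and the domain dictionary tree is reduced to a plain set of domain terms.

```python
from itertools import product

def generate_candidates(text, confusion):
    # For each character, the options are the character itself plus
    # its confusion set (characters with similar glyphs).
    options = [[ch] + confusion.get(ch, []) for ch in text]
    return ["".join(combo) for combo in product(*options)]

def correct_text(text, confusion, score, domain_lexicon):
    # Pass 1: keep the candidate the language model scores highest.
    best = max(generate_candidates(text, confusion), key=score)
    # Pass 2: substitute again around the surviving candidate and prefer
    # a variant that matches a proper noun in the domain lexicon.
    for cand in generate_candidates(best, confusion):
        if cand in domain_lexicon:
            return cand
    return best
```

The second pass is what lets a domain-specific proper noun override the general language model when the two disagree.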
Referring to fig. 2A, fig. 2A is another alternative architecture diagram of the text error correction system provided in the embodiment of the present application, which includes a blockchain network 20 (exemplarily showing a server 200 as a native node), a monitoring system 30 (exemplarily showing a device 300 belonging to the monitoring system 30 and a graphical interface 301 thereof), and the following descriptions are separately provided.
The type of blockchain network 20 is flexible and may be, for example, any of a public chain, a private chain, or a federation chain. Taking a public link as an example, electronic devices such as user equipment and servers of any service entity can access the blockchain network 20 without authorization; taking the alliance chain as an example, after obtaining authorization, the electronic device (e.g., device/server) under the jurisdiction of the service entity may access the blockchain network 20, and at this time, the service entity becomes a special node, i.e., a terminal node, in the blockchain network 20.
It should be noted that an end node may provide only the function of supporting the business entity in initiating transactions (e.g., for uplink storage of data or for querying data on the chain), and may implement, by default or selectively (e.g., depending on the specific business requirements of the business entity), the functions of the native nodes of the blockchain network 20, such as the sorting function, consensus service, and ledger function described below. Therefore, the data and business-processing logic of the business entity can be migrated to the blockchain network 20 to the greatest extent, and the credibility and traceability of the data- and business-processing process are realized through the blockchain network 20.
Blockchain network 20 receives a transaction submitted from an end node (e.g., device 300 shown in fig. 2A belonging to monitoring system 30) of a business entity (e.g., monitoring system 30 shown in fig. 2A), executes the transaction to update or query the ledger, and displays various intermediate or final results of executing the transaction on a user interface (e.g., graphical interface 301 of device 300) of the device.
An exemplary application of the blockchain network is described below by taking monitoring system access to the blockchain network and taking uplink for implementing text error correction as an example.
The device 300 of the monitoring system 30 accesses the blockchain network 20 and becomes an end node of the blockchain network 20. The device 300 acquires a text to be corrected through a sensor; and, the final processed instruction and the target text are fed back to the server 200 in the blockchain network 20 or stored in the device 300; in the case where the upload logic has been deployed for the device 300 or the user has performed an operation, the device 300 generates a transaction corresponding to the update operation/query operation according to the to-be-processed task/synchronous time query request, specifies an intelligent contract to be called for implementing the update operation/query operation and parameters transferred to the intelligent contract in the transaction, and also carries a digital signature signed by the monitoring system 30 (for example, a digest of the transaction is encrypted by using a private key in a digital certificate of the monitoring system 30), and broadcasts the transaction to the blockchain network 20. The digital certificate can be obtained by registering the monitoring system 30 with the certificate authority 31.
When a native node in the blockchain network 20, for example the server 200, receives a transaction, it verifies the digital signature carried by the transaction; after the digital signature is verified successfully, it determines whether the monitoring system 30 has transaction authority according to the identity of the monitoring system 30 carried in the transaction. Failure of either the digital-signature check or the authority check causes the transaction to fail. After successful verification, the native node appends its own digital signature (e.g., by encrypting a digest of the transaction using the native node's private key) and continues to broadcast in the blockchain network 20.
After the node with the sorting function in the blockchain network 20 receives the transaction successfully verified, the transaction is filled into a new block and broadcasted to the node providing the consensus service in the blockchain network 20.
The nodes in the blockchain network 20 that provide the consensus service perform a consensus process on the new block to reach agreement; the nodes that provide the ledger function append the new block to the end of the blockchain and execute the transactions in the new block. For a text error correction request initiated by a terminal, the text to be corrected can be corrected multiple times through the preset confusion word library and the domain dictionary tree, so as to obtain a target text with high accuracy, and the target text is displayed in the graphical interface 301 of the device 300.
The native node in the blockchain network 20 may read the text to be corrected from the blockchain and present the text to be corrected on the monitoring page of the native node, and the native node may also process the text to be corrected by using the text to be corrected stored in the blockchain.
In practical applications, different functions may be set for different native nodes of the blockchain network 20, for example, the server 200 is set to have a text error correction function and a billing function. For this situation, in the transaction process, the server 200 receives the text to be corrected sent by the device 300, and in the server 200, the text to be corrected is corrected for multiple times, and the text to be corrected is corrected by using the confusion word bank and the domain dictionary tree, so that the domain proper nouns can be corrected, and the distinguishing accuracy of the same word in different domains is improved.
Referring to fig. 2B, fig. 2B is a schematic structural diagram of a text correction system according to an embodiment of the present application, and the apparatus 400 shown in fig. 2B includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in device 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in FIG. 2B.
The processor 410 may be an integrated circuit chip with signal-processing capabilities, such as a general-purpose processor, a digital signal processor, another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component; the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, in some examples, a keyboard, a mouse, a microphone, a touch screen display, a camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 450 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for communicating to other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: bluetooth, wireless compatibility authentication, and Universal Serial Bus (USB), etc.;
a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software, and fig. 2B illustrates a server 455 stored in the memory 450, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: a first replacement module 4551, a first determination module 4552, a second replacement module 4553 and a first traversal module 4554; these modules are logical and thus may be combined or further split according to the functionality implemented. The functions of the respective modules will be explained below.
In other embodiments, the apparatus provided in this embodiment may be implemented in hardware. As an example, the apparatus may be a processor in the form of a hardware decoding processor, programmed to execute the text error correction method provided in this embodiment; for example, the processor in the form of a hardware decoding processor may be one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
The text error correction method provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the apparatus provided by the embodiment of the present application.
Referring to fig. 3, fig. 3 is a schematic flow chart of an implementation of the text error correction method provided in the embodiment of the present application, and is described with reference to the steps shown in fig. 3.
Step S301, a preset confusion word bank is adopted to replace at least one confusion character in the text to be corrected, and a first text set is obtained.
In some embodiments, the text to be corrected is first obtained, for example, the text to be corrected may be a recognition result obtained by performing optical character recognition on a picture, or may be a recognition result obtained by performing text recognition on a voice. In some possible implementation manners, if the text to be corrected is from the picture to be recognized, the image to be recognized including the text information may be obtained first; for example, in the scene of an intelligent underwriting product, a physical examination report of a user is collected to obtain an image to be identified. Then, determining a text area occupied by the text information in the image to be recognized; for example, by preprocessing an image to be recognized, a picture is binarized and represented by using pixel points of the picture so as to be processed by a subsequent algorithm model, and then deformation anomalies such as inclination, bending and wrinkling which may exist in the picture are restored. And finally, extracting the characteristics of the text region to obtain the text to be corrected. For example, the character region in the picture is detected by using the algorithm model, and finally, the characters in the identified character region are classified and identified by using the classification and identification algorithm to obtain an identification result, and the identification result is used as the text to be corrected. And then, replacing at least one confusion character in the text to be corrected by adopting a preset confusion word library to obtain a first text set.
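The binarization step in the preprocessing described above can be illustrated with a toy sketch. The fixed threshold here is an assumption for illustration; real OCR pipelines typically use adaptive thresholding (e.g., Otsu's method), which this application does not specify.

```python
def binarize(gray, threshold=128):
    # Map a grayscale pixel grid (0-255) to a 0/1 grid, as in the
    # preprocessing step that represents the picture by its pixel points
    # for the subsequent algorithm models.
    return [[1 if px >= threshold else 0 for px in row] for row in gray]
```

The resulting binary grid is what later stages (deskewing, text-region detection, character classification) would operate on.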
In some embodiments, the preset confusion word library contains characters with similar glyphs; for example, the confusion set of "硝" (xiāo) includes the visually similar characters "消", "销", "绡", and "哨", and the confusion set of "人" (rén) includes glyph-similar characters such as "大". The confusion word library may be created by the following process: first, a character library containing at least two characters is obtained, for example, a Chinese character library containing a large number of Chinese characters; then, the glyph similarity between characters in the character library is determined, for example, the characters are grouped according to their pairwise glyph similarity; finally, the preset confusion word library is created from the characters whose glyph similarity is greater than a preset similarity threshold. For example, for a character C in the character library, the characters whose glyph similarity to C is 50% or more are found and used as the confusion words of C. In this way, only characters with high glyph similarity are stored as the confusion words of a character, which reduces the number of characters in the confusion word library, speeds up confusion word retrieval, and avoids the large amount of time wasted by retrieving candidate words from the full character set.
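The lexicon-building step above can be sketched as follows. This is a hypothetical illustration: `similarity` is a plain lookup table standing in for whatever glyph-similarity measure (stroke- or component-based) a real implementation would use, with Latin characters standing in for CJK glyphs.

```python
def build_confusion_lexicon(chars, similarity, threshold=0.5):
    """Map each character to the characters whose glyph similarity to it
    meets the threshold (the 50% cutoff from the example above)."""
    lexicon = {}
    for c in chars:
        confusions = [d for d in chars
                      if d != c and similarity.get(frozenset((c, d)), 0.0) >= threshold]
        if confusions:
            lexicon[c] = confusions
    return lexicon

# Toy pairwise similarity table (symmetric, keyed by unordered pairs).
sim = {
    frozenset(("O", "0")): 0.9,
    frozenset(("O", "Q")): 0.6,
    frozenset(("l", "1")): 0.8,
    frozenset(("O", "X")): 0.1,
}
lexicon = build_confusion_lexicon(["O", "0", "Q", "X", "l", "1"], sim)
# "O" confuses with "0" and "Q"; "X" falls below the threshold everywhere.
```

Storing only the above-threshold characters keeps the per-character candidate list short, which is what bounds the search space later.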
In some possible implementations, it is first determined which confusion characters from the confusion word library are contained in the text to be corrected, and each such character is then replaced by at least one of its confusion words. For example, suppose the text to be corrected contains two confusion characters A and B, and each has 3 confusion words in the library, namely (A1 A2 A3) and (B1 B2 B3). Replacing each confusion character with at least one of its 3 confusion words yields, after all replacements, 16 first texts: with B unreplaced, replacing A with each of (A1 A2 A3) gives 3 texts; with A unreplaced, replacing B with each of (B1 B2 B3) gives 3 texts; replacing A with each of (A1 A2 A3) while replacing B with each of (B1 B2 B3) gives 9 texts; together with the text in which neither A nor B is replaced, 16 first texts are finally obtained, which form the first text set.
In some embodiments, in order to reduce the amount of computation, only a part of the confusion words may be used for replacement. Continuing the above example, with B unreplaced, replacing A with any two of (A1 A2 A3) gives 2 texts; with A unreplaced, replacing B with any two of (B1 B2 B3) gives 2 texts; replacing A with any two of (A1 A2 A3) while replacing B with any two of (B1 B2 B3) gives 4 texts; together with the text in which neither A nor B is replaced, 9 first texts are finally obtained. To preserve the richness of the first text set, however, the number of first texts is kept at no less than half of the maximum number; that is, the size of the first text set may be greater than or equal to half of the maximum number of texts and less than or equal to the maximum number of texts. Thus, when the text to be corrected is corrected, the candidate characters are drawn only from the glyph-similar characters in the confusion set to form the first texts, which greatly reduces the amount of computation for judging sentence validity while preserving the richness of the first text set.
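The counting in the two examples above can be reproduced with a small sketch: each character's options are itself plus its confusion words (optionally capped), and the first text set is the cross product, so its size is the product of (k_i + 1) over the characters. The lexicon entries below are illustrative stand-ins.

```python
from itertools import product

def generate_first_texts(text, lexicon, per_char_limit=None):
    """Replace each character by itself or one of its confusion words;
    capping the confusion words per character shrinks the cross product."""
    options = []
    for ch in text:
        conf = lexicon.get(ch, [])
        if per_char_limit is not None:
            conf = conf[:per_char_limit]
        options.append([ch] + conf)  # the character itself stays an option
    return ["".join(combo) for combo in product(*options)]

lexicon = {"A": ["a", "d", "4"], "B": ["8", "b", "6"]}
full = generate_first_texts("AB", lexicon)        # (3+1)*(3+1) = 16 texts
reduced = generate_first_texts("AB", lexicon, 2)  # (2+1)*(2+1) = 9 texts
```

Note that 9 is still at least half of 16, matching the richness constraint stated above.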
Step S302, in the first text set, candidate texts meeting preset conditions are determined.
A candidate text that satisfies a preset condition is selected from the first text set. In some possible implementations, a language model trained on a general corpus for grammar detection may be used to evaluate the grammatical and semantic plausibility of each first text, and the text with the highest plausibility is taken as the candidate text. In this way, a more plausible candidate text is obtained by judging the grammar and semantics of the first texts with the language model, achieving a first round of error correction on the text to be corrected.
Step S303, replacing at least one confusion character in the candidate text by adopting a preset confusion word library to obtain a second text set.
In some possible implementations, it is first determined which confusion characters from the confusion word library are contained in the candidate text, and each such character is then replaced by at least one of its confusion words to obtain the second text set. As in the above example, if the candidate text contains two confusion characters A and B, and each has 3 confusion words in the library, namely (A1 A2 A3) and (B1 B2 B3), then replacing each confusion character with at least one of its 3 confusion words yields 16 second texts after all replacements. In some embodiments, in order to reduce the amount of computation, only a part of the confusion words may be used: with B unreplaced, replacing A with any two of (A1 A2 A3) gives 2 texts; with A unreplaced, replacing B with any two of (B1 B2 B3) gives 2 texts; replacing A with any two of (A1 A2 A3) while replacing B with any two of (B1 B2 B3) gives 4 texts; together with the text in which neither A nor B is replaced, 9 second texts are finally obtained. To preserve the richness of the second text set, however, the number of second texts is kept at no less than half of the maximum number; that is, the size of the second text set may be greater than or equal to half of the maximum number of second texts and less than or equal to that maximum.
Step S304, a domain word library that belongs to the same domain as the text to be corrected and stores at least two words is traversed according to the second text set, to obtain a target text matching a second text.
In some embodiments, a domain dictionary tree that belongs to the same domain as the second texts and stores at least two words is determined first, and the dictionary tree is then traversed according to the second text set to obtain the target text. The domain dictionary tree may be a prefix tree model built from a domain dictionary; the domain dictionary may consist of domain proper nouns manually compiled by experts based on expert knowledge, or may be a noun library obtained by automatically classifying the domain to which a text belongs through semantic recognition and scene recognition.
In some possible implementations, the domain to which the text to be corrected belongs is determined first, and the domain dictionary tree of that domain is obtained. Each second text in the second text set is then traversed against the dictionary tree; if a text sequence identical to the second text can be traversed in the domain dictionary tree, that second text can be used as the target text.
In the embodiment of the application, when the text to be corrected is corrected, the candidate characters are drawn only from the glyph-similar characters in the confusion set to form the first texts, which greatly reduces the amount of computation for judging sentence validity; and the domain proper nouns in the second texts are corrected with a domain dictionary tree built from the domain dictionary, which effectively solves the problem that the same word cannot be distinguished across different domains due to ambiguity, thereby improving the accuracy of text error correction.
In some embodiments, a confusion word in the preset confusion word library whose glyph similarity to a character in the text to be corrected is greater than a preset threshold is used to replace that character, which greatly reduces the amount of computation for judging sentence validity during error correction. That is, step S301 may be implemented by the following steps. Referring to fig. 4A, fig. 4A is a further implementation flow diagram of the text error correction method provided in the embodiment of the present application, described below with reference to fig. 3:
Step S401, in the preset confusion word library, a first confusion word set is determined whose glyph similarity to the characters in the text to be corrected is greater than or equal to a first preset similarity threshold.
In some embodiments, the confusion word library contains confusion words for a plurality of characters; for each character in the text to be corrected, the library is searched for the confusion words with high glyph similarity to that character, and the confusion words found for the individual characters are combined into the first confusion word set.
Step S402, at least one confusion word in the first confusion word set is adopted to replace the corresponding character in the text to be corrected, so as to obtain a first text set.
In some embodiments, characters in the text to be corrected are replaced by at least one glyph-similar confusion word to obtain the first texts. For example, if the text to be corrected contains two confusion characters A and B, and each has 2 confusion words in the confusion word library, namely (A1 A2) and (B1 B2), then replacing each confusion character with at least one of its 2 confusion words yields 9 first texts after all replacements. Thus, when the text to be corrected is corrected, the candidate characters are drawn only from the glyph-similar characters in the confusion set to form the first texts, which greatly reduces the amount of computation for judging sentence validity.
In some embodiments, the trained language model may be used to score the first texts and select the most plausible candidate text, thereby achieving error correction; that is, step S302 may be implemented by the following steps:
step S331, a first probability of occurrence of each character of the first text in the first text set is determined.
For example, a neural network is first trained on a general corpus to obtain a language model that can detect whether an input text is grammatically correct and semantically plausible; the language model is then used to determine the first probability of each character occurring in the first text. For example, a bigram language model is used to determine the probability of each character occurring in the first text.
Step S332, a second probability of the first text is determined according to the first probabilities of the occurrence of the characters.
For example, the first probabilities of the characters in a first text are multiplied to obtain the second probability of that text sequence. In one specific example, for a text "W1W2W3" containing 3 characters, the characters W1, W2, and W3 have the candidate sequences 921 (W11, W12, W13, W14, W15), 922 (W21, W22, W23, W24, W25), and 923 (W31, W32, W33, W34, W35), respectively. Where a bigram language model is used, the probability of any character occurring depends only on its predecessor; thus, for the sequence W11W21W31, the probability of the character W31 occurring depends on W21, that is, the occurrence probability P(W31) can be expressed as P(W31) = P(W31|W21) = P(W21W31)/P(W21). The probability of the sequence W11W21W31 is then the product of the bigram probabilities over the sequence, as shown in equation (1):

P(W11W21W31) = P(W11) * P(W21|W11) * P(W31|W21)    (1)
step S333, determining the first text with the second probability being greater than or equal to the preset probability threshold as a candidate text.
In a specific example, a first text with the highest second probability in the first text set can be used as a candidate text; the second probability is the largest, which indicates that the grammar of the first text is the most correct and the semantics are the most reasonable, i.e. the text sequence is the most reasonable text sequence in the first text set.
In the embodiment of the application, a neural network is trained on a general corpus to obtain a language model that can detect whether an input text is grammatically correct and semantically plausible, and each first text is input into the language model to obtain the probability of that first text occurring. A higher probability indicates a more plausible first text, so a grammatically correct and semantically plausible candidate text can be determined from the first text set comprising a plurality of first texts, which improves error correction accuracy.
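As a sketch of steps S331 to S333, the bigram scoring above can be implemented by counting over a corpus. The tiny corpus, additive smoothing, and log-domain product below are illustrative assumptions, not the patent's trained model.

```python
import math

class BigramModel:
    """Count-based bigram model with additive smoothing; a character's
    probability depends only on its predecessor, as in equation (1)."""
    def __init__(self, corpus, alpha=0.1):
        self.alpha = alpha
        self.vocab = set("".join(corpus)) | {"<s>"}
        self.prev_counts, self.pair_counts = {}, {}
        for sentence in corpus:
            prev = "<s>"
            for ch in sentence:
                self.prev_counts[prev] = self.prev_counts.get(prev, 0) + 1
                self.pair_counts[(prev, ch)] = self.pair_counts.get((prev, ch), 0) + 1
                prev = ch

    def log_prob(self, text):
        """Sum of log P(ch | prev) over the sequence -- the log of the
        product of bigram probabilities in equation (1)."""
        lp, prev = 0.0, "<s>"
        for ch in text:
            num = self.pair_counts.get((prev, ch), 0) + self.alpha
            den = self.prev_counts.get(prev, 0) + self.alpha * len(self.vocab)
            lp += math.log(num / den)
            prev = ch
        return lp

model = BigramModel(["abc", "abd", "abc"])
candidate = max(["abx", "abc"], key=model.log_prob)  # the attested sequence wins
```

Taking the argmax over the first text set corresponds to step S333 with the threshold set so that only the highest-probability text survives.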
In some embodiments, a confusion word in the preset confusion word library whose glyph similarity to a character in the candidate text is greater than a preset threshold is used to replace that character, which greatly reduces the amount of computation for judging sentence validity during error correction. That is, step S303 may be implemented by the following steps. Referring to fig. 4B, fig. 4B is a further implementation flow diagram of the text error correction method provided in the embodiment of the present application, described below with reference to fig. 3:
Step S421, in the preset confusion word library, a second confusion word set is determined whose glyph similarity to the characters in the candidate text is greater than or equal to a second preset similarity threshold.
In some embodiments, the confusion word library contains confusion words for a plurality of characters; for each character in the candidate text, the library is searched for the confusion words with high glyph similarity to that character, and the confusion words found for the individual characters are combined into the second confusion word set.
Step S422, at least one confusion word in the second confusion word set is adopted to replace the corresponding character in the candidate text, so as to obtain a second text set.
In some embodiments, characters in the candidate text are replaced by at least one glyph-similar confusion word to obtain the second texts. For example, if the candidate text contains two confusion characters A and B, and each has 3 confusion words in the confusion word library, namely (A1 A2 A3) and (B1 B2 B3), then replacing each confusion character with at least one of its 3 confusion words yields 16 second texts after all replacements. Thus, when the candidate text is corrected, the candidate characters are drawn only from the glyph-similar characters in the confusion set to form the second texts, which reduces problems such as wrong corrections and missed corrections caused by character ambiguity.
In some embodiments, a prefix tree containing the proper nouns of one domain is built from a noun library containing proper nouns of multiple domains, and the second texts are traversed against this prefix tree to obtain a more accurate target text. The domain dictionary tree is constructed as follows:
first, M proper nouns belonging to the same domain are determined from a dictionary base including at least two kinds of domain proper nouns.
In some possible implementations, M is an integer greater than 0, and a plurality of proper nouns of the same domain are found from a dictionary library that stores proper nouns by domain.
Secondly, the first character of the ith proper noun of that domain is assigned to a parent node of the domain dictionary tree.
In some possible implementations, the domain dictionary tree may have a prefix tree structure. For each proper noun of the same domain found in the dictionary library, the first character of the proper noun is assigned to a parent node of the domain dictionary tree. That is, for any proper noun, its first character is stored in a parent node of the dictionary tree, the character following the first character is stored in a child node of that parent node, and so on, so that the proper noun can be found along the path starting from the parent node during lookup. Here, i is an integer greater than 0 and less than or equal to M.
And thirdly, the character adjacent to the first character is assigned to a child node of the parent node.
Within a proper noun, the character adjacent to the first character, that is, the character following it, is assigned to a child node of the node storing the first character; the third character is then assigned to a child node of the node storing the second character, and so on, until all characters of the proper noun have been assigned to nodes of the domain dictionary tree.
And finally, the last character of the proper noun is assigned to a leaf node, completing the construction of the domain dictionary tree.
For example, for a three-character proper noun, the first character is assigned to a parent node, the second character to a child node of that parent node, and the third character to a leaf node of that child node, thereby constructing a domain dictionary tree containing the proper nouns of one domain.
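The node-assignment steps above amount to standard trie insertion. A minimal sketch, with illustrative English terms standing in for Chinese domain proper nouns:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.is_end = False  # True on the node holding a term's last character

def build_domain_trie(proper_nouns):
    """Insert each proper noun character by character: the first character
    hangs off the root, each later character off its predecessor's node,
    and the node of the last character is marked terminal."""
    root = TrieNode()
    for term in proper_nouns:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root

chem_trie = build_domain_trie(["nitrite", "nitrate", "urea nitrogen"])
```

Terms sharing a prefix share the corresponding nodes, which is what keeps lookup proportional to the length of the query rather than the size of the dictionary.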
In some embodiments, the domain-specific nouns are corrected by a domain dictionary tree constructed by the domain dictionary to improve the distinguishing accuracy of the same word in different domains, i.e. step S304 can be implemented by:
step S351, determining a target domain to which the text to be corrected belongs.
For example, the domain of the content of the text to be corrected is determined by semantic analysis and scene analysis. Suppose the text to be corrected is "树莓派"; if scene analysis shows that it appears in a passage describing the internet, its target domain is determined to be the computer field ("树莓派" meaning Raspberry Pi). If instead the application scene of the text to be corrected is a passage describing food, its target domain is determined to be the cooking field ("树莓派" meaning raspberry pie).
In step S352, the first character of the jth second text in the second text set is determined.
After the target domain of the text to be corrected is determined in step S351, the domain of any second text in the second text set is obtained. For any second text, its first character is determined first, so that the node storing that character can be looked up in the domain dictionary tree.
And step S353, traversing nodes in the domain dictionary tree belonging to the target domain according to the first character of the jth second text.
In some embodiments, the domain of the second texts is taken to be the domain of the text to be corrected, and the domain dictionary tree of that domain is determined accordingly. Then, for any second text, for example the jth second text, its first character is determined, and the nodes of the domain dictionary tree are searched for that character. If it is not stored, the jth second text may still contain errors and is preferably not used as the target text. If it is stored, the search continues for the next character among the child nodes of that node; likewise, if that character is found, the search continues for the character after it among the child nodes of the child node. If all characters of the jth second text can be found along one path, the jth second text can be used as the target text.
In some possible implementations, the parent nodes of the domain dictionary tree are first traversed according to the first character of the jth second text. For example, if the text to be corrected is "亚消酸盐" and its domain is the chemical field, the jth second text is "亚硝酸盐" (nitrite) and also belongs to the chemical field.
Secondly, if the kth father node in the domain dictionary tree stores the first character of the jth second text, the next character of the first character is determined.
For example, if a parent node storing "亚" is found, the next character "硝" after the first character "亚" is determined.
And thirdly, traversing the child nodes of the kth parent node according to the next character of the first character.
For example, "硝" is searched for among the child nodes of that node.
And finally, if characters except the first character in the jth second text are stored in a path from the child node of the kth parent node to the leaf node of the child node, determining the jth second text as the target text.
For example, if "硝" is not found among the child nodes of that node, the traversal terminates and the jth second text is not the target text; that is, it is a grammatically or semantically incorrect text that still contains errors, and it is discarded. If "硝" is found in a child node, "酸" is then searched for among the child nodes of that child node; likewise, if "酸" can be found, "盐" is searched for next, and if "盐" can also be found, the jth second text "亚硝酸盐" can be used as a target text.
In step S354, if the character stored in the path from the kth parent node to the leaf node of the kth parent node in the domain dictionary tree matches the jth second text, the jth second text is determined as the target text.
In some embodiments, for the jth second text, if all the characters it contains can be found along the path from the kth parent node to a leaf node under the kth parent node, the jth second text may be used as the target text. In this way, correcting the domain proper nouns with a domain dictionary tree built from the domain dictionary improves the accuracy with which the same word is distinguished across different domains.
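Steps S352 to S354 can be sketched as a traversal that either walks a second text to a terminal node or discards it. The nested-dict trie and the English stand-ins for the chemical-field terms are illustrative assumptions.

```python
def build_trie(terms):
    """Nested-dict trie; the "$" key marks the end of a stored term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def is_target(trie, second_text):
    """Walk the text character by character; fail as soon as a character
    has no node, and require a terminal marker at the last character."""
    node = trie
    for ch in second_text:
        if ch not in node:
            return False  # traversal terminates: candidate still erroneous
        node = node[ch]
    return "$" in node

domain_trie = build_trie(["nitrite", "nitrate"])
second_texts = ["nitrite", "nitrlte", "nitrate", "nitr"]
targets = [t for t in second_texts if is_target(domain_trie, t)]
```

The terminal check matters: a bare prefix such as "nitr" traverses successfully but is not a stored proper noun, so it is not a target text.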
In some embodiments, if all the characters of several second texts can be found in the domain dictionary tree, a plurality of target texts are obtained; to further determine the most accurate text among them, the following process is performed:
firstly, if the number of the target texts is more than or equal to 2, determining the similarity between each target text and the text to be corrected.
Here, the similarity between each target text and the text to be corrected is determined separately.
Then, the target text with the maximum similarity is determined as the final text.
For example, the target text with the maximum similarity is taken as the final text, and the final text is output. In this way, when there are multiple target texts, the one most similar to the text to be corrected is determined by computing the similarities and taken as the final text, which reduces inaccurate replacements made during confusion word substitution.
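A minimal sketch of this tie-breaking step follows. Using `difflib.SequenceMatcher` as the similarity measure is an assumption, since the embodiment does not fix a particular measure; any edit-distance-style score would serve the same role.

```python
from difflib import SequenceMatcher

def pick_final_text(target_texts, text_to_correct):
    """Of several surviving target texts, keep the one most similar to
    the original text to be corrected."""
    return max(target_texts,
               key=lambda t: SequenceMatcher(None, text_to_correct, t).ratio())

# OCR read "nitrlte"; both candidates are valid domain terms, but the
# closer one to the original reading is kept.
final = pick_final_text(["urea nitrogen", "nitrite"], "nitrlte")
```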
Next, an exemplary application of the embodiment of the present application in an actual application scenario will be described, taking an example of performing text error correction on an image to be recognized including text information by obtaining a recognition result through optical character recognition.
In some embodiments, Chinese contains a large number of glyph-similar characters, so OCR character recognition results contain many glyph-confusion errors caused by picture quality; for example, "nitrite" (亚硝酸盐) may be recognized as "亚消酸盐", "project name" may be recognized as "project day name", and so on. Meanwhile, similar entity words composed of different glyph-similar characters may all be plausible entity words, and whether such an entity word is erroneous can only be judged within a specific domain, for example "urea nitrogen" versus "urea helium". The embodiment of the application was applied to an insurance system in the insurance field, where three difficulties were found:
1) the data volume is large, a large number of similar characters are involved, a professional linguist is needed for proofreading, and the marking difficulty is high.
2) With strong domain dependency properties, a large number of proper nouns require professional knowledge to determine whether a word error exists.
3) If the picture quality is poor, there may be a case where a plurality of characters are consecutively erroneous.
In some embodiments, error correction for text may be implemented in two ways:
mode one, error correction is performed based on a neural language model, and the process is as follows:
firstly, the word vector is used for vectorizing and expressing words, then the language model is trained according to the neural network, and finally the language model is used for correcting the text. The scheme has the problems that the candidate word searching space is large and the similar words cannot be effectively distinguished due to the fact that word vectors are likely to be similar under the scene.
Mode two, error correction based on homophones is performed as follows:
based on correction of homophones, firstly, a homophone candidate set is constructed based on the same Chinese pinyin of characters, and then the correction of the text is carried out based on a language model. This scheme is not suitable for situations where recognition errors are caused by proximity of glyphs.
Therefore, when the candidate set is generated from the full dictionary, the search space during candidate word retrieval is too large: most candidate words are irrelevant to the current word, so full retrieval wastes a large amount of time on useless work. Moreover, in the glyph-similar error correction scenario, the word vectors of words composed of different glyph-similar characters may be close to each other and cannot be effectively distinguished, so some glyph-confusion errors cannot be effectively detected.
Based on this, the embodiment of the present application provides a text error correction method in which a confusion set is built from glyph-similar characters; when the language model performs error correction, the candidates for each character are only the 5 glyph-closest characters selected from the confusion set, which greatly reduces the amount of computation for judging sentence validity in the language model. Meanwhile, a statistical language model is adopted, in which different glyph-similar characters have clearly distinct representations, avoiding the problem that a neural language model cannot distinguish them because their vectors are similar. Moreover, the embodiment of the application integrates a prefix tree model built from a domain dictionary to correct domain proper nouns, which effectively solves the problem that the same word cannot be distinguished across different domains due to ambiguity. As shown in fig. 5, fig. 5 is a schematic view of an implementation flow of a text error correction method provided in an embodiment of the present application, described below with reference to fig. 5:
step S501, obtaining the general corpus.
The universal corpus may be understood as a universal library of words and phrases.
And step S502, training the universal language model by adopting the corpus.
In step S503, the domain proper noun is determined to obtain a domain dictionary.
For example, based on expert knowledge, the domain proper nouns are manually sorted out by experts to obtain a domain dictionary.
And step S504, constructing a search tree by adopting the domain dictionary and the prefix tree to obtain a domain dictionary tree.
In this embodiment, the glyph-confusion set is compiled based on the similarity of Chinese character glyphs.
In step S505, a character recognition result of OCR is acquired for the image.
And S506, correcting the character recognition result by adopting the trained language model to obtain an error correction result.
And step S507, re-correcting the error correction result by using the domain dictionary to obtain a final error correction result.
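Steps S505 to S507 can be glued together as below. Each stage is passed in as a callable, and the tiny stand-ins for the candidate generator, language-model scorer, and domain-tree check are purely illustrative, not the components of the actual system.

```python
def correct_ocr_text(ocr_text, gen_candidates, lm_score, in_domain_tree):
    # S506: first-pass correction -- score confusion-set candidates with
    # the trained language model and keep the most plausible one.
    candidate = max(gen_candidates(ocr_text), key=lm_score)
    # S507: second-pass correction -- regenerate candidates and keep
    # those whose full sequence appears in the domain dictionary tree.
    targets = [t for t in gen_candidates(candidate) if in_domain_tree(t)]
    return targets[0] if targets else candidate

# Illustrative stand-ins for the real stages.
gen = lambda t: [t, t.replace("l", "i")]   # confusion set: l <-> i
score = lambda t: t.count("i")             # toy "plausibility" score
in_tree = lambda t: t == "nitrite"         # one-term domain dictionary tree
result = correct_ocr_text("nitrlte", gen, score, in_tree)
```

Keeping the two passes separate mirrors the flow in fig. 5: the language model fixes general glyph-confusion errors, and the domain tree then vets domain proper nouns.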
The following describes application scenarios of the glyph-similar short-text error correction technique in intelligent underwriting products and disease prediction:
in the intelligent underwriting product, in order to examine an applicant's eligibility for insurance, the applicant's health condition is evaluated based on his or her recent physical examination report. For a health examination report of an applicant, all character information is first recognized from a photographed or scanned copy of the report by OCR character recognition; possible character errors are then corrected by the glyph-similar short-text error correction technique; structured information is next extracted from the corrected text by natural language processing techniques such as text classification, entity recognition, and relation extraction; and finally, based on an underwriting model, underwriting evaluation is performed on the extracted physical examination features.
Fig. 6 is a diagram of an application scenario of the text error correction method according to the embodiment of the present application, and the following description is made with reference to fig. 6:
step S601, a physical examination report is photographed or scanned.
Step S602, performing OCR character recognition on the physical examination report image to obtain an OCR recognition result.
And step S603, correcting the OCR recognition result by adopting the shape-near character short text correction to obtain a correction result.
Step S604, text classification is carried out on the error correction result to obtain a classification result.
The text classification of the error correction result includes two processes, namely, entity word recognition 41 and entity relationship extraction 42, on the error correction result, so that the classification result processed by the entity word recognition 41 and the entity relationship extraction 42 is subjected to structured information extraction.
And step S605, performing structured information extraction on the classification result to obtain an extraction result.
And step S606, inputting the extraction result into an underwriting model, and underwriting to obtain an underwriting result.
In step S607, the underwriting result is output.
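As an illustration only, the S601-S607 flow might be wired together as in the sketch below. Every function name and the toy OCR output are hypothetical stand-ins; the patent does not specify any API, and each stage here is a placeholder for the corresponding real component.

```python
# Hypothetical sketch of the underwriting pipeline (steps S601-S607).
# All functions are placeholder stubs, not the patent's implementation.

def ocr_recognize(image):
    # S602: stand-in for an OCR engine; returns raw recognized text.
    return "亚消酸盐 0.2mg/L"  # toy result containing a shape-similar glyph error

def correct_near_glyph_errors(text):
    # S603: stand-in for the confusion-set + language-model corrector.
    return text.replace("消", "硝")

def classify_and_extract(text):
    # S604-S605: entity recognition / relation extraction stand-in.
    return {"indicator": "亚硝酸盐", "value": "0.2mg/L"}

def underwrite(features):
    # S606: toy underwriting model: accept if the report parsed cleanly.
    return "pass" if features else "refer"

def pipeline(image):
    text = correct_near_glyph_errors(ocr_recognize(image))
    return underwrite(classify_and_extract(text))

print(pipeline(b"fake-image-bytes"))  # S607: prints the underwriting result: pass
```

In a real system each stub would be replaced by the corresponding model; the sketch only shows how the stages chain together.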
In the embodiment of the application, disease prediction works similarly: useful physical examination feature information is extracted from a physical examination report using OCR character recognition and NLP natural language processing; shape-similar-character short-text error correction then fixes likely character errors; the corrected physical examination feature information is input into a disease prediction model, yielding a prediction of the examinee's possible disease conditions.
Fig. 7 is a diagram of another application scenario of the text error correction method according to the embodiment of the present application, and the following description is made in conjunction with fig. 7:
in step S701, a physical examination report is acquired.
Step S702, performing OCR character recognition on the physical examination report to obtain an OCR recognition result.
Step S703, extracting NLP information from the OCR recognition result to obtain useful physical examination feature information.
Step S704, performing text error correction on the physical examination feature information to obtain an error correction result.
Step S705, inputting the error correction result into the disease prediction model to obtain a disease prediction result.
In the embodiment of the present application, the flow of data through the algorithm during text correction is shown in fig. 8, described below in combination with the steps shown in fig. 8:
step S801, acquiring a picture to be identified.
And S802, performing OCR recognition on the picture to be recognized to obtain an OCR recognition result.
In some embodiments, for pictures of documents such as physical examination reports, bank documents, and invoices, the characters in the picture are first recognized using OCR character recognition. An OCR character recognition algorithm generally includes preprocessing, layout processing, feature extraction and model training, and recognition post-processing. As shown in fig. 9A, fig. 9A is an architecture diagram of the OCR recognition process according to an embodiment of the present application, including:
the image input module 901 is configured to input an image to be identified.
Here, the picture to be recognized may be a picture containing text information.
The preprocessing module 902 is configured to preprocess the picture, binarizing it from its pixel values so that it can be processed by subsequent algorithm models.
The layout processing module 903 is configured to correct deformations that may exist in the picture, such as tilt, curvature, and wrinkles.
And the feature extraction and model training module 904 is configured to detect a text region in the picture by using an algorithm model.
And the recognition post-processing module 905 is configured to perform classification recognition on the characters in the recognized character region by using a classification recognition algorithm to obtain characters.
And a character output module 906, configured to output the recognized characters.
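As a toy illustration of the preprocessing stage (module 902), binarization might look like the sketch below. The fixed threshold is an assumption for illustration; real systems typically derive the threshold from the image, for example with Otsu's method.

```python
# Toy sketch of preprocessing module 902: binarize a grayscale image so
# downstream models see a clean black/white matrix.
# THRESHOLD is an assumed constant, not a value from the patent.

THRESHOLD = 128

def binarize(gray):
    """gray: list of rows of 0-255 pixel values -> rows of 0/1."""
    return [[1 if px >= THRESHOLD else 0 for px in row] for row in gray]

image = [[250, 12], [130, 127]]
print(binarize(image))  # [[1, 0], [1, 0]]
```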
For the recognized text fields, candidate sequences are first constructed from the shape-similar-character confusion set; the bigram product probability of each candidate sequence is then computed with a binary (bigram) language model, and the sequence with the highest probability is selected as the most reasonable, corrected sequence. For example, as shown in FIG. 9B, for a piece of text "W1W2W3" containing 3 characters, the confusion sets of W1, W2, and W3 are sequence 921 (W11, W12, W13, W14, W15), sequence 922 (W21, W22, W23, W24, W25), and sequence 923 (W31, W32, W33, W34, W35), respectively. Among language models, under the Markov assumption the probability of any word appearing is related only to the preceding n-1 words, and the corresponding language model is called an n-gram language model. A bigram language model is used here, i.e., the probability of any word appearing is related only to its predecessor. Then, for the candidate sequence W11W21W31, the probability of the word W31 appearing is related to W21, i.e., the probability P(W31) of W31 appearing can be expressed as:
P(W31) = P(W31 | W21) = P(W21W31) / P(W21);
the probability of the whole sequence W11W21W31 is then the bigram product probability P(W11W21W31) = P(W11) · P(W21 | W11) · P(W31 | W21), as shown in equation (1).
The binary language model calculation formula is shown as formula (2):
P(W1W2...Wn) = P(W1) · ∏_{i=2..n} P(Wi | Wi-1)    (2)
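A minimal Python sketch of bigram scoring in the spirit of formula (2). The tiny corpus and the add-alpha smoothing are assumptions made for illustration; a production model would be trained on a large general corpus, as in step S804 below.

```python
from collections import Counter

# Count unigrams and bigrams over a toy character corpus (assumption).
corpus = ["亚硝酸盐", "硝酸盐", "中国人民"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def bigram_prob(prev, cur, alpha=0.01):
    # P(cur | prev) with add-alpha smoothing so unseen pairs get small mass.
    vocab = len(unigrams)
    return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab)

def sequence_prob(seq):
    # Formula (2): P(W1..Wn) = P(W1) * product of P(Wi | Wi-1).
    p = unigrams[seq[0]] / sum(unigrams.values())
    for prev, cur in zip(seq, seq[1:]):
        p *= bigram_prob(prev, cur)
    return p

# The correctly spelled sequence outscores the shape-confused one:
print(sequence_prob("亚硝酸盐") > sequence_prob("亚消酸盐"))  # True
```

Selecting the candidate sequence that maximizes this product is exactly the "highest probability wins" rule described above.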
in step S803, general language knowledge is acquired.
And step S804, training the language model according to the language knowledge to obtain a trained language model.
Step S805, domain knowledge is acquired.
Here, the domain dictionary tree 81 and the confusing word set 82 are constructed based on the domain knowledge.
And step 806, the trained language model, the dictionary tree and the confusion word set are used to correct and rank the OCR recognition results to obtain correct text.
In the embodiment of the application, for the result of the language model after error correction, the Trie tree model is used for performing domain noun error correction. As shown in fig. 10: "sub 1001" as a parent node includes child nodes "nitre 1011", "state 1012", "fortune 1013"; "Zhong 1002" as a parent node includes child nodes "nation 1021", "Hua1022" and "mediate 1023"; "Nitro 1011" includes the child node "acid 1031"; "fortune 1013" includes child nodes: "village 1032" and "party 1033"; "nation 1021" includes child nodes "people 1034" and "heart 1035"; "acid 1031" includes child nodes: "salt 1041" and "sodium 1042".
As can be seen from fig. 10, for each character in the text to be corrected, the corresponding character in the original text is replaced by a candidate from the shape-similar-character confusion set, and the Trie tree is then searched to check whether the replaced text is in the tree; if so, the candidate replaces the original character as the correction result. For example, for the misrecognized text rendered here as "inferior xiao chlorate", the confusion set of its second character ("vanish") contains the shape-similar character "nitre"; replacing "vanish" with "nitre" yields "nitrite", which is found in the Trie tree, so "vanish" is replaced by "nitre" and "nitrite" is taken as the corrected result. The words that can be traversed from fig. 10 (nitrite, sodium nitrite, Asia, subvillage, subfortune, Chinese heart, China, and intermediary) constitute the dictionary Trie structure shown in fig. 10.
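The replace-then-look-up procedure described above can be sketched as follows. The confusion set and dictionary contents are toy assumptions, and a plain Python set stands in for the Trie lookup (a real system would traverse the dictionary tree, as sketched further below).

```python
# Sketch of domain-noun correction: substitute a suspect character with each
# of its shape-similar confusions and keep the variant found in the domain
# dictionary. Contents are toy assumptions, not the patent's data.

confusions = {"消": ["硝", "销"]}        # shape-similar confusion set
domain_dict = {"亚硝酸盐", "亚硝酸钠"}    # a set stands in for the Trie here

def correct(text):
    for i, ch in enumerate(text):
        for cand in confusions.get(ch, []):
            variant = text[:i] + cand + text[i + 1:]
            if variant in domain_dict:
                return variant           # replacement confirmed by dictionary
    return text                          # no match: leave the text unchanged

print(correct("亚消酸盐"))  # 亚硝酸盐
```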
In the medical insurance industry, a large number of paper documents have accumulated due to equipment limitations, historical legacy, and the like, and digitizing these documents helps improve information-processing efficiency and is a future trend. Because current technology leaves some errors when recognizing characters in pictures, the text error correction scheme based on the shape-similar-character confusion set helps reduce recognition errors and improve the accuracy of document digitization. Correcting with a shape-similar-character confusion set effectively narrows the correction search space and improves correction precision; building a tree model from the domain dictionary for correction improves the model's adaptability in specialized domains and reduces mis-corrections and missed corrections caused by character ambiguity.
Continuing with the exemplary structure of the text error correction server 455 provided by the embodiments of the present application as software modules, in some embodiments, as shown in fig. 2B, the software modules stored in the text error correction server 455 of the memory 450 may include:
a first replacing module 4551, configured to replace at least one confusing character in a text to be corrected by using a preset confusing word library, so as to obtain a first text set;
a first determining module 4552, configured to determine candidate texts meeting preset conditions in the first text set;
a second replacement module 4553, configured to replace at least one confusing character in the candidate text with the preset confusing word library, so as to obtain a second text set;
and the first traversal module 4554 is configured to traverse, according to the second text set, a domain lexicon storing at least two words belonging to the same domain as the text to be corrected, to obtain a target text matching the second text.
In some embodiments, the first replacement module 4551 is configured to: determine, in the preset confusion word library, a first confusion word set whose glyph similarity to the characters in the text to be corrected is greater than or equal to a first preset similarity threshold; and replace the corresponding characters in the text to be corrected with at least one confusion word in the first confusion word set to obtain the first text set.
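A minimal sketch of the first replacement module's behavior. The confusion library and its similarity scores are made up for illustration; only confusions clearing the first preset similarity threshold generate candidate texts.

```python
# Sketch of first replacement module 4551: build the first text set by
# substituting each character with confusions whose glyph similarity meets
# the first preset threshold. All scores below are illustrative assumptions.

confusion_lib = {                       # char -> [(confusable, similarity)]
    "消": [("硝", 0.9), ("销", 0.85), ("俏", 0.4)],
}
FIRST_THRESHOLD = 0.8                   # the "first preset similarity threshold"

def first_text_set(text):
    variants = []
    for i, ch in enumerate(text):
        for cand, sim in confusion_lib.get(ch, []):
            if sim >= FIRST_THRESHOLD:  # low-similarity confusions are skipped
                variants.append(text[:i] + cand + text[i + 1:])
    return variants

print(first_text_set("亚消酸盐"))  # ['亚硝酸盐', '亚销酸盐']
```

Restricting candidates to high-similarity confusions is what keeps the first text set, and hence the language-model scoring work, small.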
In some embodiments, the first determining module 4552 is configured to: determining a first probability of occurrence of each character of a first text in the first set of texts; determining a second probability of the first text to which each character belongs according to the first probability of the occurrence of each character; and determining the first text with the second probability being greater than or equal to a preset probability threshold as the candidate text.
In some embodiments, the second replacement module 4553 is further configured to: determine, in the preset confusion character library, a second confusion character set whose glyph similarity to the characters in the candidate text is greater than or equal to a second preset similarity threshold; and replace the corresponding characters in the candidate text with at least one confusion word in the second confusion word set to obtain the second text set.
In some embodiments, before the first replacement module 4551 operates, the method further comprises: acquiring a character library comprising at least two characters; determining the similarity between the glyphs of the characters in the character library; and creating the preset confusion word library from the characters whose glyph similarity is greater than a preset similarity threshold.
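A toy sketch of building the preset confusion library from glyph similarity. The component-overlap similarity function and the component table are assumptions made for illustration; real systems compare stroke sequences, four-corner codes, or rendered-glyph pixels.

```python
# Toy construction of the preset confusion library: compare every character
# pair with a glyph-similarity function and keep pairs above a threshold.
# The component decomposition and threshold are illustrative assumptions.

charset = {"消": "氵肖", "硝": "石肖", "销": "钅肖", "人": "人"}
THRESHOLD = 0.3

def similarity(a, b):
    # Jaccard overlap of (assumed) glyph components.
    ca, cb = set(charset[a]), set(charset[b])
    return len(ca & cb) / len(ca | cb)

confusion_lib = {}
chars = list(charset)
for i, a in enumerate(chars):
    for b in chars[i + 1:]:
        if similarity(a, b) >= THRESHOLD:
            confusion_lib.setdefault(a, []).append(b)
            confusion_lib.setdefault(b, []).append(a)

print(sorted(confusion_lib["消"]))  # ['硝', '销']
```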
In some embodiments, the first traversal module 4554 is configured to: determining a domain dictionary tree which is the same as the domain to which the second text belongs and stores at least two words; and traversing the dictionary tree according to the second text set to obtain the target text.
In some embodiments, the method further comprises: determining M proper nouns belonging to the same domain from a dictionary library comprising at least two domain proper nouns, wherein M is an integer greater than 0; assigning the first character of the i-th proper noun of the domain to a parent node of the domain dictionary tree, wherein i is an integer greater than 0 and less than or equal to M; assigning the character adjacent to the first character to a child node of the parent node; and assigning the last character of the proper noun to a leaf node under the child nodes, thereby constructing the domain dictionary tree.
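The construction just described can be sketched with nested dicts standing in for tree nodes; this representation is an illustrative assumption, not the patent's data structure.

```python
# Sketch of domain dictionary-tree construction: the first character of each
# proper noun hangs under the root, each following character under the
# previous node, and a terminal marker flags the last character.

END = "$"  # terminal marker on the node of a word's last character

def build_trie(terms):
    root = {}
    for term in terms:                 # each domain proper noun
        node = root
        for ch in term:                # descend, creating child nodes as needed
            node = node.setdefault(ch, {})
        node[END] = True               # last character: mark a complete word
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node                 # prefixes without the marker don't count

trie = build_trie(["亚硝酸盐", "亚硝酸钠", "中国", "中华"])
print(contains(trie, "亚硝酸盐"), contains(trie, "亚硝"))  # True False
```

Shared prefixes ("亚硝酸…") are stored once, which is what makes the traversal in the following paragraphs cheap.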
In some embodiments, the method further comprises: determining a target domain to which the text to be corrected belongs; determining the first character of the j-th second text in the second text set, wherein j is an integer greater than 0; traversing the nodes in the domain dictionary tree belonging to the target domain according to the first character of the j-th second text; and, if the characters stored on the path from the k-th parent node of the domain dictionary tree to that parent node's leaf node match the j-th second text, determining the j-th second text as the target text.
In some embodiments, the method further comprises: traversing the parent nodes in the domain dictionary tree according to the first character of the j-th second text; if the k-th parent node in the domain dictionary tree stores the first character of the j-th second text, determining the next character after the first character; traversing the child nodes of the k-th parent node according to that next character; and, if the characters of the j-th second text other than the first character are stored on the path from a child node of the k-th parent node to that child node's leaf node, determining the j-th second text as the target text.
In some embodiments, the method further comprises: if the number of target texts is greater than or equal to 2, determining the similarity between each target text and the text to be corrected, and determining the target text with the greatest similarity as the final text.
Embodiments of the present application provide a storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the method provided by the embodiments of the present application. In some embodiments, the storage medium may be a memory such as a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM, or various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages) as programs, software modules, scripts, or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). By way of example, executable instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiment of the present application, for an obtained text to be corrected, a confusion word library is first used to construct a plurality of first texts of the text to be corrected, and the plurality of first texts are then screened to determine candidate texts meeting preset conditions. Because the candidate characters forming the first texts are drawn only from shape-similar confusions in the confusion set, the amount of computation needed to judge sentence validity is greatly reduced. The confusion word library is then used to replace the confusion words in the candidate texts, and the dictionary trees of the same domain are traversed according to the second text set to obtain the target text. Correcting domain proper nouns with a domain word library built from the domain dictionary in this way improves the accuracy of distinguishing the same word across different domains.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A method for correcting text, the method comprising:
replacing at least one confusion character in a text to be corrected by adopting a preset confusion word library to obtain a first text set;
in the first text set, determining candidate texts meeting preset conditions;
replacing at least one confusion character in the candidate text by adopting the preset confusion word library to obtain a second text set;
according to the second text set, traversing a domain word bank in which at least two words which are the same as the domain to which the text to be corrected belongs are stored, and obtaining a target text matched with the second text.
2. The method according to claim 1, wherein the replacing at least one confusing character in the text to be corrected by using a preset confusing word library to obtain a first text set comprises:
determining, in the preset confusion word library, a first confusion word set whose glyph similarity to the characters in the text to be corrected is greater than or equal to a first preset similarity threshold;
and replacing corresponding characters in the text to be corrected by adopting at least one confusion word in the first confusion word set to obtain the first text set.
3. The method according to claim 1, wherein the determining candidate texts satisfying a preset condition in the first text set comprises:
determining a first probability of occurrence of each character of a first text in the first set of texts;
determining a second probability of the first text to which each character belongs according to the first probability of the occurrence of each character;
and determining the first text with the second probability being greater than or equal to a preset probability threshold as the candidate text.
4. The method according to claim 1, wherein the replacing at least one confusing character in the candidate text with the preset confusing word library to obtain a second text set comprises:
determining, in the preset confusion character library, a second confusion character set whose glyph similarity to the characters in the candidate text is greater than or equal to a second preset similarity threshold;
and replacing the corresponding characters in the candidate text by adopting at least one confusion word in the second confusion word set to obtain the second text set.
5. The method according to any one of claims 1 to 4, wherein before the replacing of at least one confusion character in the text to be corrected by using the preset confusion word library to obtain the first text set, the method further comprises:
acquiring a character library at least comprising two characters;
determining similarity between glyphs of characters in the character library;
and creating the preset confusion word stock according to the characters with the similarity between the fonts larger than a preset similarity threshold.
6. The method according to claim 1, wherein traversing a domain thesaurus storing at least two words belonging to the same domain as the text to be corrected according to the second text set to obtain a target text matching the second text comprises:
determining a domain dictionary tree which is the same as the domain to which the second text belongs and stores at least two words;
and traversing the dictionary tree according to the second text set to obtain the target text.
7. The method of claim 6, further comprising:
determining M proper nouns belonging to the same field from a dictionary library comprising at least two field proper nouns; wherein M is an integer greater than 0;
assigning the first character of the ith proper noun of the same field to a father node of the field dictionary tree; wherein i is an integer greater than 0 and less than or equal to M;
giving characters adjacent to the first character to child nodes of the father node;
and assigning the last character of the proper noun to the leaf nodes of the child nodes to construct the domain dictionary tree.
8. A text correction apparatus, characterized in that the apparatus comprises:
the system comprises a first replacement module, a second replacement module and a third replacement module, wherein the first replacement module is used for replacing at least one confusion character in a text to be corrected by adopting a preset confusion word library to obtain a first text set;
the first determining module is used for determining candidate texts meeting preset conditions in the first text set;
the second replacement module is used for replacing at least one confusion character in the candidate text by adopting the preset confusion word library to obtain a second text set;
and the first traversal module is used for traversing a domain word bank which stores at least two words in the same domain as the domain to which the text to be corrected belongs according to the second text set to obtain a target text matched with the second text.
9. An apparatus for correcting text, comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 7 when executing executable instructions stored in the memory.
10. A storage medium having stored thereon executable instructions for causing a processor to perform the method of any one of claims 1 to 7 when executed.
CN202010442510.XA 2020-05-22 2020-05-22 Text error correction method, device, equipment and storage medium Pending CN111626048A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010442510.XA CN111626048A (en) 2020-05-22 2020-05-22 Text error correction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111626048A true CN111626048A (en) 2020-09-04

Family

ID=72272517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010442510.XA Pending CN111626048A (en) 2020-05-22 2020-05-22 Text error correction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111626048A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016305A (en) * 2020-09-09 2020-12-01 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN112560450A (en) * 2020-12-11 2021-03-26 科大讯飞股份有限公司 Text error correction method and device
CN112597771A (en) * 2020-12-29 2021-04-02 重庆邮电大学 Chinese text error correction method based on prefix tree combination
CN112597768A (en) * 2020-12-08 2021-04-02 北京百度网讯科技有限公司 Text auditing method and device, electronic equipment, storage medium and program product
CN113051894A (en) * 2021-03-16 2021-06-29 京东数字科技控股股份有限公司 Text error correction method and device
CN113361266A (en) * 2021-06-25 2021-09-07 达闼机器人有限公司 Text error correction method, electronic device and storage medium
CN113468871A (en) * 2021-08-16 2021-10-01 北京北大方正电子有限公司 Text error correction method, device and storage medium
CN113591440A (en) * 2021-07-29 2021-11-02 百度在线网络技术(北京)有限公司 Text processing method and device and electronic equipment
CN113761881A (en) * 2021-09-06 2021-12-07 北京字跳网络技术有限公司 Wrong-word recognition method and device
CN114822532A (en) * 2022-04-12 2022-07-29 广州小鹏汽车科技有限公司 Voice interaction method, electronic device and storage medium
CN115659078A (en) * 2022-10-12 2023-01-31 湖北盈隆腾辉科技有限公司 Network information security monitoring method and system based on artificial intelligence
CN117349071A (en) * 2023-10-26 2024-01-05 易康(广州)数字科技有限公司 Error correction mechanism online evaluation method, system and storage medium based on big data
CN117371445A (en) * 2023-12-07 2024-01-09 深圳市慧动创想科技有限公司 Information error correction method, device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210029A (en) * 2019-05-30 2019-09-06 浙江远传信息技术股份有限公司 Speech text error correction method, system, equipment and medium based on vertical field
CN110489760A (en) * 2019-09-17 2019-11-22 达而观信息科技(上海)有限公司 Based on deep neural network text auto-collation and device
CN110807319A (en) * 2019-10-31 2020-02-18 北京奇艺世纪科技有限公司 Text content detection method and device, electronic equipment and storage medium
CN110969012A (en) * 2019-11-29 2020-04-07 北京字节跳动网络技术有限公司 Text error correction method and device, storage medium and electronic equipment
CN111062376A (en) * 2019-12-18 2020-04-24 厦门商集网络科技有限责任公司 Text recognition method based on optical character recognition and error correction tight coupling processing
CN111079768A (en) * 2019-12-23 2020-04-28 北京爱医生智慧医疗科技有限公司 Character and image recognition method and device based on OCR



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200904