CN111428494A - Intelligent error correction method, device and equipment for proper nouns and storage medium - Google Patents

Intelligent error correction method, device and equipment for proper nouns and storage medium

Info

Publication number
CN111428494A
Authority
CN
China
Prior art keywords
word
retrieval
error correction
word segmentation
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010164805.5A
Other languages
Chinese (zh)
Inventor
曾增烽
刘东煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010164805.5A
Publication of CN111428494A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Abstract

The invention relates to the technical field of big data, and discloses an intelligent error correction method of proper nouns, which comprises the following steps: acquiring proper nouns to be corrected, performing word segmentation processing on the proper nouns to be corrected to obtain a plurality of word segmentation segments of a text to be corrected, outputting the word segmentation segments in a pinyin format, respectively taking pinyin of each word segmentation segment as a keyword, and retrieving candidate words corresponding to the word segmentation segments from a preset homophone dictionary to obtain a retrieval result; if the retrieval result is not empty, determining each retrieval candidate word based on the retrieval result, respectively calculating the scores of the retrieval candidate words, sorting, and outputting a sorting result; and taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation based on the sorting result. The invention also discloses an intelligent proper noun error correction device, equipment and a computer readable storage medium. The invention provides more accurate intelligent error correction service of proper nouns for users and improves the error correction efficiency.

Description

Intelligent error correction method, device and equipment for proper nouns and storage medium
Technical Field
The invention relates to the technical field of big data, in particular to an intelligent error correction method, an intelligent error correction device, intelligent error correction equipment and a computer readable storage medium for proper nouns.
Background
In recent years, with economic and social development, customers who consult about a problem in a vertical field usually have a specific professional area in mind, and such questions often contain terms that are specific to that field. Users often get a proper noun wrong, or some of its characters come out wrong because of language conversion, so that subsequent modules have difficulty determining what the user actually means.
In current input methods and language recognition, errors in proper nouns caused by wrong characters or by language conversion are mostly corrected by means of a sequence labeling model, such as NER or LSTM+CRF.
Disclosure of Invention
The invention mainly aims to provide a proper noun intelligent error correction method, a device, equipment and a computer readable storage medium, and aims to solve the technical problem that the existing error correction method is low in operation efficiency.
In order to achieve the above object, the present invention provides a proper noun intelligent error correction method, which includes the following steps:
acquiring proper nouns to be corrected;
performing word segmentation processing on the proper nouns to be corrected to obtain a plurality of word segmentation segments of the text to be corrected, and outputting the word segmentation segments in a pinyin format;
based on the word segmentation fragments in the Pinyin format, taking the Pinyin of each word segmentation fragment as a keyword, and retrieving candidate words corresponding to the word segmentation fragments from a preset homophone dictionary to obtain a retrieval result;
if the retrieval result is not empty, determining each retrieval candidate word based on the retrieval result;
calculating the scores of the retrieval candidate words based on the retrieval candidate words, sequencing the scores, and outputting a sequencing result;
and based on the sorting result, taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation.
Optionally, before the step of obtaining the proper noun to be corrected, the method further includes:
acquiring a first original corpus;
performing word segmentation processing on the first original corpus to obtain a plurality of word fragments of the original corpus;
inputting the word segments in a pinyin format, and counting the pinyins of the word segments;
and determining the word segments with the same pinyin based on the pinyins of the word segments, and constructing a homophone dictionary, wherein the homophone dictionary comprises the corresponding relation between the same pinyin and different characters.
Optionally, before the step of obtaining proper nouns to be corrected, the method further includes:
acquiring a second original corpus;
performing word segmentation processing on the second original corpus to obtain a plurality of word fragments of the second original corpus;
based on the word segments, respectively carrying out word segmentation on the word segments to obtain a single word set;
and constructing an inverted index dictionary based on the single character set.
Optionally, after the step of taking the pinyin of each word segmentation segment as a keyword, based on the word segmentation segments in the pinyin format, and retrieving candidate words corresponding to the word segmentation segments from the preset homophone dictionary to obtain a retrieval result, the method further includes:
if the retrieval result is empty, traversing the word segmentation segment and eliminating characters from it to obtain a plurality of phrases;
respectively taking the phrases as keywords, calling a preset inverted index dictionary, and retrieving a plurality of candidate words corresponding to the phrases to obtain retrieval results;
and outputting the retrieval candidate words corresponding to each phrase based on the retrieval result.
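The fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes that "traversing and eliminating" means dropping one character at a time, and that the inverted index dictionary maps each single character to the set of words containing it. The index contents and the function names `leave_one_out` and `search` are hypothetical.

```python
# Hypothetical character -> words inverted index (illustrative data only).
INVERTED_INDEX = {
    "输": {"输卵管", "输液"},
    "卵": {"输卵管"},
    "管": {"输卵管", "血管"},
    "液": {"输液"},
    "血": {"血管"},
}


def leave_one_out(segment):
    """Drop each character in turn and return the remaining phrases."""
    return [segment[:i] + segment[i + 1:] for i in range(len(segment))]


def search(phrase, inverted_index):
    """Return the words that contain every character of the phrase."""
    postings = [inverted_index.get(ch, set()) for ch in phrase]
    if not postings:
        return set()
    return set.intersection(*postings)


# The segment "输暖管" has no homophone entry, so the fallback is used.
phrases = leave_one_out("输暖管")
results = {phrase: search(phrase, INVERTED_INDEX) for phrase in phrases}
```

Under these assumptions, only the phrase that omits the wrong character retrieves a candidate, which is what makes the fallback useful when the wrong character is not a homophone of the right one.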
Optionally, after the step of outputting the search candidate word corresponding to each word group based on the search result, the method further includes:
calculating the scores of the retrieval candidate words based on the retrieval candidate words, sequencing the scores, and outputting a sequencing result;
and based on the sorting result, taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation.
Optionally, the calculating and sorting scores of the search candidate words based on the search candidate words, and outputting a sorting result includes:
determining word frequency information of the corresponding search candidate words based on the search candidate words;
calculating a score corresponding to the retrieval candidate word based on the word frequency information of the retrieval candidate word, wherein the word frequency information is in direct proportion to the score;
and sorting the search candidate words based on the scores.
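A minimal sketch of the scoring and sorting steps, assuming the score is simply each candidate's share of the total word frequency (one possible choice that keeps the score directly proportional to the word frequency). The frequency counts and the name `rank_candidates` are hypothetical.

```python
# Hypothetical word-frequency counts (illustrative data only).
WORD_FREQ = {"投保": 120, "头孢": 30, "偷包": 2}


def rank_candidates(candidates, word_freq):
    """Score each candidate in proportion to its frequency, highest first."""
    total = sum(word_freq.get(word, 0) for word in candidates)
    scored = [
        (word, word_freq.get(word, 0) / total if total else 0.0)
        for word in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates(["偷包", "投保", "头孢"], WORD_FREQ)
best_word = ranking[0][0]
```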
Optionally, the step of taking, based on the sorting result, the candidate word with the highest score as a replacement item to replace the corresponding word segmentation segment includes:
obtaining the scores of the retrieval candidate words based on the sorting result;
and taking the retrieval candidate word with the highest score as a replacement item to replace the corresponding participle segment based on the score of the retrieval candidate word.
Further, the present invention also provides a proper noun intelligent error correction device, which includes:
the acquisition module is used for acquiring proper nouns to be corrected;
the word segmentation module is used for carrying out word segmentation processing on the proper nouns to be corrected to obtain a plurality of word segmentation segments of the text to be corrected and outputting the word segmentation segments in a pinyin format;
the candidate word determining module is used for respectively taking the pinyin of each word segmentation segment as a keyword based on the word segmentation segments in the pinyin format, and retrieving candidate words corresponding to the word segmentation segments from a preset homophone dictionary to obtain a retrieval result; if the retrieval result is not empty, determining each retrieval candidate word based on the retrieval result;
the score calculating module is used for calculating the scores of the retrieval candidate words based on the retrieval candidate words, sequencing the scores and outputting a sequencing result;
and the error correction module is used for taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation based on the sorting result.
Further, the obtaining module is further configured to: acquiring a first original corpus;
the word segmentation module is further configured to: performing word segmentation processing on the first original corpus to obtain a plurality of word fragments of the original corpus;
the intelligent noun error correction device further comprises:
the statistic module is used for inputting the word segments in a pinyin format and counting the pinyins of the word segments;
and the first construction module is used for determining the word segments with the same pinyin based on the pinyins of the word segments and constructing a homophone dictionary.
Further, the obtaining module is configured to obtain a second original corpus;
the intelligent noun error correction device further comprises:
the word segmentation module is used for performing word segmentation processing on the second original corpus to obtain a plurality of word segments of the second original corpus, and performing single word segmentation on the word segments respectively to obtain a single word set;
and the second construction module is used for constructing the inverted index dictionary based on the single character set.
Optionally, the candidate word determination module is further specifically configured to:
when the retrieval result is empty, traversing and eliminating the characters in the word segmentation segment to obtain a plurality of word groups;
respectively taking the phrases as keywords, calling a preset inverted index dictionary, and retrieving a plurality of candidate words corresponding to the phrases to obtain retrieval results; and outputting the retrieval candidate words corresponding to each phrase based on the retrieval result.
Optionally, the score calculation module is configured to: on the basis of the retrieval candidate words, calculating scores of the retrieval candidate words of the inverted index dictionary, sequencing the scores, and outputting a sequencing result;
the error correction module is further configured to: and based on the sorting result, taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation.
Optionally, the score calculating module is further specifically configured to:
determining word frequency information of the corresponding search candidate words based on the search candidate words; calculating a score corresponding to the retrieval candidate word based on the word frequency information of the retrieval candidate word, wherein the word frequency information is in direct proportion to the score; and sorting the search candidate words based on the scores.
Optionally, the error correction module is specifically configured to:
obtaining the scores of the retrieval candidate words based on the sorting result; and taking the retrieval candidate word with the highest score as a replacement item to replace the corresponding participle segment based on the score of the retrieval candidate word.
Further, to achieve the above object, the present invention also provides a proper noun intelligent error correction device, which includes a memory, a processor and a proper noun intelligent error correction program stored in the memory and executable on the processor, wherein the proper noun intelligent error correction program, when executed by the processor, implements the steps of the proper noun intelligent error correction method as described in any one of the above items.
Further, to achieve the above object, the present invention also provides a computer readable storage medium on which a proper noun intelligent error correction program is stored, and the proper noun intelligent error correction program, when executed by a processor, implements the steps of the proper noun intelligent error correction method as described in any one of the above.
The method comprises the steps of firstly carrying out word segmentation on the proper nouns to be corrected, then converting a plurality of word segmentation fragments after word segmentation into a pinyin format for output, taking the pinyin of each word segmentation fragment as a keyword during retrieval, retrieving candidate words corresponding to each word segmentation fragment from a preset homophone dictionary, finally sequencing the retrieved candidate words, and taking the candidate word with the highest score as a replacement to replace the corresponding word segmentation fragment, thereby realizing the correction of wrongly-written characters in the proper nouns and ensuring the accuracy of the proper nouns. The invention provides more accurate intelligent error correction service of proper nouns for users, and the realization process does not depend on large batches of linguistic data, thereby greatly improving the error correction efficiency of the proper nouns.
Drawings
FIG. 1 is a schematic structural diagram of the hardware operating environment of the device according to an embodiment of the proper noun intelligent error correction device of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of the proper noun intelligent error correction method of the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of the proper noun intelligent error correction method of the present invention;
FIG. 4 is a flowchart illustrating a third embodiment of the proper noun intelligent error correction method of the present invention;
FIG. 5 is a flowchart illustrating a fourth embodiment of the proper noun intelligent error correction method of the present invention;
FIG. 6 is a schematic view of a detailed process of step S440 in FIG. 5;
FIG. 7 is a schematic view of a detailed process of step S160 in FIG. 2;
FIG. 8 is a functional block diagram of an embodiment of the proper noun intelligent error correction apparatus of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides intelligent error correction equipment for proper nouns.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a hardware operating environment of an apparatus according to an embodiment of the intelligent error correction apparatus of the proper noun of the present invention.
As shown in FIG. 1, the proper noun intelligent error correction device may include a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 enables connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). Alternatively, the memory 1005 may be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the hardware configuration shown in FIG. 1 does not limit the proper noun intelligent error correction device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
As shown in FIG. 1, the memory 1005, which is a kind of computer-readable storage medium, may include an operating system, a network communication module, a user interface module and a proper noun intelligent error correction program. The operating system is a program that manages and controls the hardware and software resources of the proper noun intelligent error correction device, and supports the operation of the network communication module, the user interface module, the proper noun intelligent error correction program and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the hardware structure of the proper noun intelligent error correction device shown in FIG. 1, the network interface 1004 is mainly used for connecting to the system background and performing data communication with it; the user interface 1003 is mainly used for connecting to a client (user side) and performing data communication with it; the proper noun intelligent error correction device invokes, via the processor 1001, the proper noun intelligent error correction program stored in the memory 1005 and performs the operations of the following embodiments of the proper noun intelligent error correction method.
Based on the hardware structure of the intelligent error correction device of proper nouns, various embodiments of the intelligent error correction method of proper nouns are provided.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the intelligent error correction method of proper nouns of the present invention. In this embodiment, the intelligent error correction method for proper nouns includes the following steps:
step S110, acquiring a text to be corrected, and determining a proper noun to be corrected based on the text to be corrected;
The proper nouns to be corrected in this embodiment may be obtained from text data in a preset database, such as electronic medical records, or from data information input by a user.
In this embodiment, a proper noun is a term specific to a certain field. When consulting about a problem, a customer usually has a specific professional area in mind; in life insurance, for example, questions are usually about diseases, such as "Can systemic lupus erythematosus be insured?" Such questions contain proper nouns of the field, but with current input methods and language recognition the user often gets one of the characters wrong, or language conversion introduces a wrong character. Examples of proper nouns that need correction are "systemic lupus erythematosus" written with a wrong character that is a homophone of the character for "macula", and "fallopian tube" written as "heating tube" in "can insurance be bought for the heating tube operation" (the character for "egg" is mispronounced as the character for "warm").
Step S120, performing word segmentation processing on the proper nouns to obtain a plurality of word segmentation segments of the text to be corrected, and outputting the word segmentation segments in a pinyin format;
In this embodiment, word segmentation is performed on the proper nouns to be corrected to obtain a plurality of word segment combinations. For example, segmenting a text such as "systemic lupus erythematosus / post-treatment" usually yields word segments of 1 to 3 characters each. Word segmentation is performed on all the texts to be corrected to obtain a plurality of word segmentation segments.
In this embodiment, word segmentation, also called word splitting, means cutting a sequence of Chinese characters into individual words.
In this embodiment, a word segmentation segment is a word fragment obtained after word segmentation is performed on the text to be corrected. For example, "systemic lupus erythematosus / post-treatment" is segmented into "systemic / erythema / macula / lupus / post-treatment", where "systemic", "lupus", "post-treatment" and so on are the resulting word segmentation segments.
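The conversion of word segments into pinyin keys can be sketched as below. This is only an illustration: the tiny toneless character-to-pinyin table `TOY_PINYIN`, the example segments and the function names are assumptions made here, and a real system would more likely rely on a full pinyin conversion library and a proper word segmenter.

```python
# Hypothetical toy table of toneless pinyin (illustrative data only).
TOY_PINYIN = {
    "投": "tou", "保": "bao",
    "红": "hong", "班": "ban", "斑": "ban",
    "狼": "lang", "疮": "chuang",
}


def to_pinyin(segment):
    """Concatenate the toneless pinyin of every character in a segment."""
    return "".join(TOY_PINYIN[ch] for ch in segment)


def segments_to_pinyin(segments):
    """Pair each word segment with the pinyin key used for retrieval."""
    return [(segment, to_pinyin(segment)) for segment in segments]


# "红班" is a misspelling of "红斑"; both map to the same pinyin key.
pairs = segments_to_pinyin(["红班", "狼疮", "投保"])
```

Because a wrong character that sounds the same produces the same key, the pinyin form is what lets the later dictionary lookup find the intended word.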
Step S130, based on the word segmentation segments in the Pinyin format, taking the Pinyin of each word segmentation segment as a keyword, and retrieving candidate words corresponding to the word segmentation segments from a preset homophone dictionary to obtain a retrieval result;
In this embodiment, all the word segmentation segments are input in pinyin format. Because the volume of proper nouns to be corrected is large, the number of resulting word segmentation segments is also large. Each word segmentation segment in pinyin format is used as a keyword to retrieve the candidate words corresponding to that segment from the homophone dictionary. For example, a homophone search with the pinyin (toubao) of the word segment "insuring" as the keyword returns candidate words such as "insuring", "cephalosporin" and "bag stealing". The retrieval result may also be empty: a homophone search with the pinyin (shunuanguan) of the word segment "heating pipe" as the keyword returns nothing, because no other word is a homophone of "heating pipe".
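A minimal sketch of this lookup, assuming the homophone dictionary is a plain mapping from a pinyin key to a list of words. The dictionary contents (the assumed Chinese forms of "insuring", "cephalosporin" and "bag stealing") and the name `retrieve_candidates` are illustrative, not taken from the source.

```python
# Hypothetical homophone dictionary: pinyin key -> words with that pinyin.
HOMOPHONE_DICT = {
    "toubao": ["投保", "头孢", "偷包"],
}


def retrieve_candidates(pinyin_key, homophone_dict):
    """Return the candidate words for a pinyin key, or an empty list."""
    return list(homophone_dict.get(pinyin_key, []))


hits = retrieve_candidates("toubao", HOMOPHONE_DICT)
misses = retrieve_candidates("shunuanguan", HOMOPHONE_DICT)

# An empty result signals that a different retrieval path is needed.
needs_fallback = not misses
```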
Step S140, if the retrieval result is not empty, determining each retrieval candidate word based on the retrieval result;
In this embodiment, the retrieval candidate words corresponding to each word segmentation segment are output according to the candidate words retrieved for all the word segmentation segments.
In this embodiment, the score of each candidate word is calculated with a preset scoring and ranking model according to the retrieval candidate words corresponding to the word segmentation segment.
In this embodiment, a candidate word is a word retrieved from the preset homophone dictionary with the pinyin of a word segmentation segment as the keyword. For example, a homophone search with the pinyin (toubao) of the word segment "insuring" as the keyword returns "insuring", "cephalosporin" and "bag stealing", and these three words are the candidate words for that segment.
Step S150, calculating the scores of the retrieval candidate words based on the retrieval candidate words, sorting the scores, and outputting a sorting result;
In this embodiment, the score of each retrieved candidate word is calculated, the candidate words are sorted by score, and the sorting result is output.
Step S160, based on the sorting result, taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation segment.
In this embodiment, the candidate words are ranked according to their scores, and the candidate word with the highest score is used as the replacement item to replace the corresponding word segmentation segment.
In the embodiment, word segmentation is performed on the proper nouns to be corrected, then a plurality of word segmentation segments after word segmentation are converted into a pinyin format to be output, when retrieval is performed, the pinyin of each word segmentation segment is used as a keyword, candidate words corresponding to the word segmentation segments are retrieved from a preset homophone dictionary, finally, the retrieved candidate words are sequenced, the candidate word with the highest score is used as a replacement item to replace the corresponding word segmentation segment, error correction of wrongly-written characters in the proper nouns is further realized, and the accuracy of the proper nouns is ensured. The invention provides more accurate intelligent error correction service of proper nouns for users, and the realization process does not depend on large batches of linguistic data, thereby greatly improving the error correction efficiency of the proper nouns.
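The flow of this first embodiment can be put together in a short end-to-end sketch. Everything concrete here is an assumption for illustration: the toy pinyin table, the dictionary and frequency data, the pre-segmented input, and the choice to leave a segment unchanged when the homophone lookup is empty.

```python
# Hypothetical toy data (illustrative only).
TOY_PINYIN = {"红": "hong", "班": "ban", "狼": "lang", "疮": "chuang"}
HOMOPHONE_DICT = {"hongban": ["红斑"], "langchuang": ["狼疮"]}
WORD_FREQ = {"红斑": 50, "狼疮": 40}


def correct(segments):
    """Replace each segment by its highest-frequency homophone, if any."""
    corrected_segments = []
    for segment in segments:
        # Convert the segment to its pinyin retrieval key.
        key = "".join(TOY_PINYIN[ch] for ch in segment)
        # Retrieve homophone candidates.
        candidates = HOMOPHONE_DICT.get(key, [])
        if candidates:
            # Score by word frequency and keep the top candidate.
            best = max(candidates, key=lambda word: WORD_FREQ.get(word, 0))
            corrected_segments.append(best)
        else:
            # Empty result: keep the segment as-is in this sketch.
            corrected_segments.append(segment)
    return "".join(corrected_segments)


corrected = correct(["红班", "狼疮"])
```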
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the intelligent term error correction method according to the present invention. In this embodiment, before the step S110, the method further includes:
step S210, acquiring a first original corpus;
in this embodiment, a large number of original corpora are obtained, and these corpora contain a large number of proper nouns (e.g., the disease word "systemic lupus erythematosus" or the like).
Step S220, performing word segmentation processing on the first original corpus to obtain a plurality of word segments of the original corpus;
in this embodiment, the original corpora are subjected to word segmentation processing to obtain a plurality of word fragments.
Step S230, inputting the word segments in the pinyin format, and counting the pinyins of the word segments.
In this embodiment, the word segments are input and stored in the pinyin format, and the pinyin of the word segments is counted.
Step S240, determining the word segments with the same pinyin based on the pinyins of the word segments, and constructing a homophone dictionary, wherein the homophone dictionary comprises the corresponding relation between the same pinyin and different characters.
In this embodiment, word segments with the same pinyin are identified from the pinyin of these word segments, such as "insuring", "cephalosporin" and "bag stealing", or "jackpot", "grandma", "great general", "great jiang", "great river" and "fermented soybean paste".
In this embodiment, the homophone dictionary is constructed by performing word segmentation on a large amount of corpus, counting the pinyin of the 1-3 grams obtained after word segmentation, and grouping identical pinyin together. The homophone dictionary can be created by finding homophones with different written forms according to dictionary pinyin order and establishing the correspondence between the same pinyin and the different written forms. This correspondence includes the correspondence between a pinyin and characters, and may also include the correspondence between a character and the words that contain it. For example, the characters that share the pinyin "tou" but are written differently include: throw, head, steal, …; these characters are homophones. As another example, the words corresponding to the character "throw" include: insuring, throwing, descending, throwing and wind throwing.
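Grouping word segments by identical pinyin can be sketched as follows. This is a minimal illustration that assumes the corpus has already been segmented; the toy pinyin table, the sample segments and the name `build_homophone_dict` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy table of toneless pinyin (illustrative data only).
TOY_PINYIN = {
    "投": "tou", "保": "bao", "头": "tou", "孢": "bao",
    "偷": "tou", "包": "bao", "狼": "lang", "疮": "chuang",
}


def to_pinyin(segment):
    """Concatenate the toneless pinyin of every character in a segment."""
    return "".join(TOY_PINYIN[ch] for ch in segment)


def build_homophone_dict(segments, pinyin_of):
    """Group distinct word segments under their shared pinyin key."""
    groups = defaultdict(list)
    for segment in segments:
        key = pinyin_of(segment)
        if segment not in groups[key]:  # keep each word once per key
            groups[key].append(segment)
    return dict(groups)


corpus_segments = ["投保", "头孢", "狼疮", "偷包", "投保"]
homophones = build_homophone_dict(corpus_segments, to_pinyin)
```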
Referring to fig. 4, fig. 4 is a flowchart illustrating a third embodiment of the intelligent error correction method according to proper nouns of the present invention. In this embodiment, before the step S110, the method further includes:
step S310, acquiring a second original corpus;
in this embodiment, a large number of original corpora are obtained, and these corpora contain a large number of proper nouns (e.g., the disease word "systemic lupus erythematosus" or the like).
Step S320, performing word segmentation processing on the second original corpus to obtain a plurality of word segments of the second original corpus;
In this embodiment, the corpora are subjected to word segmentation processing to obtain a plurality of word fragments. For example, after word segmentation "can the heating pipe operation be insured" becomes "heat/warm/pipe/operation/can/insure/[question particle]".
Step S330, based on the plurality of word segments, respectively carrying out single word segmentation on the word segments to obtain a single word set;
In this embodiment, each of the plurality of word segments is split into single characters, and the single characters are used as keywords for retrieving related content. For example, splitting the word segment "Chinese people" (中国人民) yields the four characters "中", "国", "人" and "民". The collection of all single characters obtained by splitting all the word segments is called the single character set.
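A minimal sketch of building the single character set. The example assumes that the four-character segment in the text is 中国人民; that form, the second sample segment and the function name are assumptions made for illustration.

```python
def single_character_set(segments):
    """Split every word segment into characters and collect them in a set."""
    return {ch for segment in segments for ch in segment}


# Hypothetical word segments; duplicates across segments collapse in the set.
characters = single_character_set(["中国人民", "人民"])
ordered = sorted(characters)  # a stable order is convenient for indexing
```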
Step S340, constructing an inverted index based on the single character set.
In this embodiment, the corpus is converted into "term-document" pairs according to each character in the single character set; the terms and their corresponding documents are sorted; and documents sharing the same term are merged into the inverted record table corresponding to that term. The generated inverted index is then written to disk as an intermediate file, and finally all intermediate files are merged to construct the final inverted index. For example:
(1) converting the document set into "term-document" pairs through a series of processing steps, e.g. "AAA → clothes A", "blue → clothes A", "M code → clothes A", "monkey → clothes A";
(2) sorting the terms and the documents, and merging the documents with the same term into the inverted record table corresponding to that term, so that each entry has the form "term → document 1, document 2, ...";
(3) writing the inverted index generated in the previous step to disk to generate an intermediate file;
(4) merging all the intermediate files into the final inverted index.
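The construction steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patent's implementation: the function names `build_inverted_index` and `merge_indexes` and the second item "clothes B" are assumptions introduced here, and the intermediate files are represented by in-memory dictionaries instead of being written to disk.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Build a mapping: term -> sorted list of ids of documents containing it."""
    # Step (1): convert the document set into (term, document) pairs.
    pairs = [(term, doc_id)
             for doc_id, terms in documents.items()
             for term in terms]
    # Step (2): sort the pairs, then merge documents that share a term
    # into the inverted record table (postings list) of that term.
    index = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        index[term].append(doc_id)
    return dict(index)


def merge_indexes(partial_indexes):
    """Steps (3)-(4): merge several partial ("intermediate") indexes into one."""
    merged = defaultdict(set)
    for partial in partial_indexes:
        for term, postings in partial.items():
            merged[term].update(postings)
    return {term: sorted(postings) for term, postings in sorted(merged.items())}


# "clothes A" follows the example in the text; "clothes B" is hypothetical.
documents = {
    "clothes A": ["AAA", "blue", "M code", "monkey"],
    "clothes B": ["BBB", "blue", "L code"],
}
index = build_inverted_index(documents)
print(index["blue"])  # every document whose attributes include "blue"
```

A lookup is then a single dictionary access per term, which is what makes retrieval by any attribute value fast.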
In this embodiment, an index is a storage structure created in advance from the target content in order to speed up information retrieval. For example, a book without a table of contents can in theory still be read, but once the reader closes it, finding the same place again the next time the book is opened is time consuming. If a few pages of table of contents are added, the reader can quickly learn the overall layout of the book and the page position of each chapter, so the efficiency of looking up content naturally improves. The table of contents is a simple index of the book's content.
In this embodiment, the inverted index is an indexing method that is commonly used as the word-to-document mapping structure in full-text retrieval systems. It is built mainly from the key attribute values of the indexed items. For example, assume that there is only one item in the retrieval system: clothes A, whose brand is "AAA", color is "blue", size is "M code" and pattern is "monkey". Constructing an inverted index from this item produces an index structure in which each attribute value points to the item, so that the user can find the item by searching for "AAA", "blue", "M code" or "monkey", which speeds up retrieval and widens the retrieval range.
Referring to FIG. 5, FIG. 5 is a flowchart illustrating a fourth embodiment of the intelligent error correction method for proper nouns according to the present invention. Based on the foregoing embodiment, in this embodiment, after step S130, the method further includes:
step S410, if the retrieval result is empty, traversing and eliminating the characters in the word segmentation segment to obtain a plurality of word groups;
In this embodiment, if an error occurs that cannot be handled by homophone retrieval, for example because of an abnormal pronunciation, the result of retrieving candidate words for the word segmentation segment from the preset homophone dictionary is empty. In this case, the characters in the proper noun to be corrected, such as "warm input pipe insurance", can be removed one at a time in a traversal manner. For example, if the character "warm" is removed, the combination formed by the remaining two characters, "(input, pipe)", is a phrase (word group).
In this embodiment, the number of characters before and after the removed character is not limited; that is, any phrase composed of the characters remaining after a certain character is removed can be used as a keyword for retrieval.
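The traversal-and-elimination step can be illustrated with a short Python sketch. The function name `leave_one_out_phrases` is an assumption made for this illustration, and the English glosses "warm", "input" and "pipe" stand in for the individual characters of the example.

```python
def leave_one_out_phrases(characters):
    """For each position, drop that character and keep the rest as a phrase.

    Returns a list of (removed_index, remaining_characters) tuples, one per
    character of the segment.
    """
    phrases = []
    for i in range(len(characters)):
        remaining = tuple(characters[:i] + characters[i + 1:])
        phrases.append((i, remaining))
    return phrases


# English glosses stand in for the single characters of the segment.
segment = ["warm", "input", "pipe"]
for removed_index, phrase in leave_one_out_phrases(segment):
    print(segment[removed_index], "removed ->", phrase)
```

Keeping the index of the removed character alongside the phrase makes it possible to know later which position a retrieved candidate is supposed to repair.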
Step S420, taking each phrase as a keyword, calling a preset inverted index, and retrieving a plurality of candidate words corresponding to the phrase to obtain a retrieval result;
in this embodiment, each of the obtained phrases is used as a keyword, the preset inverted index is called, and a plurality of candidate words corresponding to each phrase are retrieved.
Generally, from receiving a user query request and entering the inverted index for retrieval until a retrieval result is returned, the process mainly comprises the following steps:
(1) analyzing the user request in the word segmentation system to generate the corresponding terms;
(2) looking up each term in the term list of the inverted index to find a plurality of candidate words corresponding to the term;
(3) performing the scoring operation on each of the plurality of candidate words respectively;
(4) comprehensively ranking the candidate words based on the operation scores, and finally returning the result to the user.
The above process is a relatively simple retrieval process.
For example, after the character "warm" is removed and the two remaining characters "(lose, pipe)" are used as the keyword, the word segmentation system first analyzes the keyword "(lose, pipe)" and generates the corresponding term "lose XX"; a plurality of corresponding candidate words, such as "fallopian tube", "seminal duct" and "blood vessel", are then found in the term list of the inverted index. If the score of a candidate word is greater than a preset threshold, the candidate word is taken as a candidate replacement item and output, and the result is returned to the user as the retrieval result.
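As a rough illustration of this retrieval flow, the sketch below looks up the characters of a phrase in a character-level inverted index, scores each candidate, and keeps only candidates above a threshold. Several things here are assumptions rather than details from the text: the function names, the scoring rule (fraction of the phrase's characters that the candidate contains), the threshold value 0.5, and the Chinese strings, which are assumed renderings of the translated examples ("fallopian tube", "seminal duct", "blood vessel", "blood transfusion").

```python
from collections import defaultdict


def build_char_index(vocabulary):
    """Character-level inverted index: character -> set of words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for ch in word:
            index[ch].add(word)
    return index


def retrieve(phrase, index, threshold=0.5):
    """Return (candidate, score) pairs whose score exceeds the threshold.

    Score (an assumption for this sketch): fraction of the distinct
    characters of the phrase that also occur in the candidate word.
    """
    query_chars = set(phrase)
    hits = defaultdict(int)
    for ch in query_chars:
        for word in index.get(ch, ()):
            hits[word] += 1
    scored = [(word, count / len(query_chars)) for word, count in hits.items()]
    kept = [(word, s) for word, s in scored if s > threshold]
    # Highest score first; ties broken by the word itself for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


vocabulary = ["输卵管", "输精管", "血管", "输血"]
char_index = build_char_index(vocabulary)
print(retrieve("输管", char_index))
```

With this scoring rule only the candidates that contain both remaining characters survive the threshold; a production system would presumably use a richer score.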
Step S430, outputting the search candidate words corresponding to each phrase based on the search result.
In this embodiment, the search candidate words corresponding to each phrase are output according to the plurality of candidate words retrieved for each phrase of the word segmentation segment. For example, for "protection of the warming tube", if the character "warm" is removed, the remaining phrase can be used as the keyword, and the inverted index is used to retrieve candidate words such as "fallopian tube" and "vas deferens".
Step S440, calculating the scores of the retrieval candidate words based on the retrieval candidate words, sorting the scores, and outputting a sorting result;
in this embodiment, feature extraction is performed on the search candidate words respectively, combination features of the search candidate words are obtained, and each search candidate word is scored according to the combination features.
In this embodiment, the combination features of the search candidate words include features of word frequency variation, features of word segmentation variation, and features of prediction probability values of the neural network language model.
Step S450, based on the sorting result, replacing the corresponding participle segment by taking the candidate word with the highest score as a replacement item.
In this embodiment, according to the sorting result, the candidate word with the highest score is used as a replacement item to replace the corresponding participle segment, thereby completing error correction of the proper noun to be corrected.
Referring to FIG. 6, FIG. 6 is a detailed flowchart of step S440 of the fourth embodiment shown in FIG. 5. Based on the foregoing embodiment, in this embodiment, step S440 further includes:
step S4401, determining word frequency information of the corresponding search candidate words based on the search candidate words;
In this embodiment, the search candidate words include matching entries and/or associated entries. For example, when the phrase "(fallopian tube, tubal)" is used as a keyword and the inverted index is used to retrieve candidate words such as "fallopian tube", "seminal duct", "blood transfusion", "money", "input" and "transportation", the candidates "fallopian tube", "seminal duct" and "blood transfusion" are matching entries, while "blood transfusion", "input" and "transportation" are associated entries. The word frequency information of each search candidate word is obtained by counting the number of times the entry occurs in the original corpus.
Step S4402, calculating a score corresponding to the search candidate word based on the word frequency information of the search candidate word, wherein the word frequency information is in direct proportion to the score;
In this embodiment, the word frequency information of a search candidate word may be used directly as its score, or the word frequency may be divided into intervals by magnitude, with each interval corresponding to a different score. For example, when the word frequency information is used directly as the score, the two are in direct proportion, and the search candidate words can be sorted by word frequency so that candidates with higher word frequency are placed first and displayed preferentially; alternatively, the candidates can be sorted in the reverse order. It can be understood that, in practical applications, a person skilled in the art may flexibly choose the sorting manner as needed, and the embodiment of the present invention does not limit the specific manner of sorting candidate words according to word frequency information.
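The two scoring options described here (word frequency used directly, or word frequency mapped to intervals) might look as follows in Python. This is a sketch under stated assumptions: the function names and the interval boundaries (10, 100, 1000) are hypothetical and do not come from the text.

```python
import bisect
from collections import Counter


def word_frequencies(corpus_segments, candidates):
    """Count how often each candidate occurs among the corpus word segments."""
    counts = Counter(corpus_segments)
    return {word: counts[word] for word in candidates}


def interval_score(frequency, boundaries=(10, 100, 1000)):
    """Map a frequency to the index of the interval it falls into.

    The boundaries are hypothetical; a higher interval means a higher score.
    """
    return bisect.bisect_right(boundaries, frequency)


def rank_by_frequency(frequencies):
    """Sort candidates so that the most frequent candidate comes first."""
    return sorted(frequencies, key=lambda word: (-frequencies[word], word))


corpus = ["a"] * 3 + ["b"] * 5 + ["c"]
freqs = word_frequencies(corpus, ["a", "b", "d"])
print(rank_by_frequency(freqs))
```

Using the raw count keeps the score strictly proportional to frequency, while the interval variant deliberately treats candidates of similar frequency as equal.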
In this embodiment, the score of each candidate word may also be calculated from the word frequency information and the edit distance, and the candidate words may be sorted by score. Here the edit distance refers to the number of positions at which the original word and the candidate word have different characters.
Specifically, the score of a candidate word may be calculated by the following formula:
$$\text{score} = \log_{10}(\text{word frequency}) - \text{edit distance}$$
Step S4403, sorting the search candidate words based on the scores.
In this embodiment, the candidate words may be sorted according to the score. For example, in "can the water beans be reimbursed?", the candidate words for the erroneous word "water beans" are "chickenpox" (varicella) and "water city". The edit distance between the erroneous word and each of the two candidates is 1, so the candidates can only be distinguished by word frequency, and "chickenpox", which has the higher score, is selected: for "chickenpox", log10(100,000) - 1 = 4; for "water city", log10(426) - 1 ≈ 1.63.
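The scoring rule can be checked with a small Python sketch. The function names are assumptions, the Chinese strings are assumed renderings of the translated example words ("water beans", "chickenpox", "water city"), and counting a length difference as extra differing positions is a choice made here, since the text only defines the distance for characters at the same position.

```python
import math


def position_edit_distance(original, candidate):
    """Number of positions at which the two words differ.

    Assumption for this sketch: any difference in length is added to the
    count, since the text only covers same-position comparisons.
    """
    differing = sum(a != b for a, b in zip(original, candidate))
    return differing + abs(len(original) - len(candidate))


def score(word_frequency, edit_distance):
    """score = log10(word frequency) - edit distance (frequency must be > 0)."""
    return math.log10(word_frequency) - edit_distance


def rank_candidates(original, frequencies):
    """Return (candidate, score) pairs sorted from highest to lowest score."""
    scored = [
        (cand, score(freq, position_edit_distance(original, cand)))
        for cand, freq in frequencies.items()
    ]
    return sorted(scored, key=lambda item: -item[1])


# Frequencies follow the example in the text: 100,000 and 426.
ranking = rank_candidates("水豆", {"水痘": 100_000, "水城": 426})
for candidate, value in ranking:
    print(candidate, round(value, 2))
```

Taking the logarithm compresses the very wide range of word frequencies so that a single differing character is not drowned out by a frequent candidate.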
And finally, the search candidate words are sorted according to the scores, so that the user can quickly complete error correction, and the problem of low operation efficiency of the conventional error correction method is solved.
Referring to FIG. 7, FIG. 7 is a detailed flowchart of step S160 in FIG. 2. Based on the foregoing embodiment, in this embodiment, step S160 further includes:
step S1601, acquiring scores of the search candidate words based on the sorting result;
In this embodiment, the scores of the search candidate words are obtained from the sorting result, in which the candidates are arranged from the highest score to the lowest. For example, when the pinyin "dajiang" is used as a keyword, the retrieved candidate words include "big prize", "grand general", "great jiang", "great river" and "fermented soybean paste"; the sorting result of the candidate words includes "big prize", "great general" and "fermented soybean paste", and the scores are 0.85, 0.6, 0.55, 0.3, 0.15 and 0.1 respectively.
Step S1602, based on the scores of the search candidate words, using the search candidate word with the highest score as a replacement item to replace the corresponding participle segment.
In this embodiment, the search candidate word with the highest score is used as a replacement item to replace the corresponding participle segment. For example, taking "cephalo with moderate anemia" as the text to be corrected, a keyword search returns the corresponding candidate words "application", "cephalo", "stealing package", …, and the scores of the three candidate words "application", "cephalo" and "stealing package" are 0.95, 0.6 and 0.2 respectively. The scores are compared and the candidate word with the highest score, namely "application", is used as the replacement item, which completes error correction of the text to be corrected.
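Selecting the top-scoring candidate and substituting it for the erroneous segment is a small operation; a minimal sketch follows. The function name `replace_with_best` and the representation of the text as a list of segments are assumptions, and the English glosses from the example are used in place of the original words.

```python
def replace_with_best(segments, position, scored_candidates):
    """Replace segments[position] with the highest-scoring candidate.

    scored_candidates is a list of (candidate, score) pairs; if it is
    empty, the segments are returned unchanged.
    """
    corrected = list(segments)
    if scored_candidates:
        best, _ = max(scored_candidates, key=lambda item: item[1])
        corrected[position] = best
    return corrected


segments = ["cephalo", "with", "moderate", "anemia"]
candidates = [("application", 0.95), ("cephalo", 0.6), ("stealing package", 0.2)]
print(replace_with_best(segments, 0, candidates))
```

Returning a new list instead of modifying the input keeps the original segmentation available, for example for logging the correction.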
Referring to FIG. 8, FIG. 8 is a functional module diagram of an embodiment of the intelligent error correction device for proper nouns according to the present invention. In this embodiment, the intelligent error correction device for proper nouns includes:
an obtaining module 10, configured to obtain a proper noun to be corrected;
a word segmentation module 20, configured to perform word segmentation processing on the proper nouns to be corrected to obtain multiple word segmentation segments of the text to be corrected, and output the word segmentation segments in a pinyin format;
a candidate word determining module 30, configured to, based on the word segmentation segment in the pinyin format, respectively use the pinyin of each word segmentation segment as a keyword, and retrieve a candidate word corresponding to the word segmentation segment from a preset homophone dictionary to obtain a retrieval result; if the retrieval result is not empty, determining each retrieval candidate word based on the retrieval result;
the score calculating module 40 is configured to calculate scores of the search candidate words based on the search candidate words, sort the scores, and output a sorting result;
and the error correction module 50 is configured to take the candidate word with the highest score as a replacement item to replace the corresponding participle segment based on the sorting result.
Optionally, in a specific embodiment, the intelligent noun error correction device further includes:
the acquisition module is used for acquiring a first original corpus;
the word segmentation module is used for carrying out word segmentation processing on the first original corpus to obtain a plurality of word fragments of the original corpus;
the statistic module is used for inputting the word segments in a pinyin format and counting the pinyins of the word segments;
and the first construction module is used for determining the word segments with the same pinyin based on the pinyins of the word segments and constructing a homophone dictionary.
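What the statistic module and the first construction module do can be sketched as grouping word segments by their pinyin. This is an illustrative sketch only: the function name, the tiny `PINYIN` lookup table (a stand-in for a real pinyin conversion tool) and the Chinese strings, which are assumed renderings of the translated "dajiang" examples, are all assumptions.

```python
from collections import defaultdict

# Hypothetical toneless pinyin lookup standing in for a real conversion tool.
PINYIN = {
    "大奖": "dajiang",
    "大将": "dajiang",
    "大江": "dajiang",
    "大酱": "dajiang",
    "水痘": "shuidou",
}


def build_homophone_dictionary(segments, to_pinyin):
    """Group word segments that share the same pinyin.

    Returns a mapping: pinyin -> sorted list of distinct segments.
    """
    groups = defaultdict(set)
    for segment in segments:
        groups[to_pinyin(segment)].add(segment)
    return {pinyin: sorted(words) for pinyin, words in groups.items()}


homophones = build_homophone_dictionary(list(PINYIN), PINYIN.__getitem__)
print(homophones["dajiang"])
```

At correction time, converting an erroneous segment to pinyin and looking it up in this dictionary yields all same-sounding candidate words in one step.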
Optionally, in a specific embodiment, the obtaining module is further configured to obtain a second original corpus;
optionally, in a specific embodiment, the intelligent noun error correction device further includes:
the word segmentation module is used for performing word segmentation processing on the second original corpus to obtain a plurality of word segments of the second original corpus, and performing single word segmentation on the word segments respectively to obtain a single word set;
and the second construction module is used for constructing the inverted index dictionary based on the single character set.
Optionally, in a specific embodiment, the candidate word determining module is specifically configured to:
when the retrieval result is empty, traversing and eliminating the characters in the word segmentation segment to obtain a plurality of word groups;
respectively taking the phrases as keywords, calling a preset inverted index dictionary, and retrieving a plurality of candidate words corresponding to the phrases to obtain retrieval results; and outputting the retrieval candidate words corresponding to each phrase based on the retrieval result.
Optionally, the score calculating module is specifically configured to:
on the basis of the retrieval candidate words, calculating scores of the retrieval candidate words of the inverted index dictionary, sequencing the scores, and outputting a sequencing result;
the error correction module is specifically configured to: and based on the sorting result, taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation.
Optionally, in a specific embodiment, the score calculating module is further specifically configured to:
determining word frequency information of the corresponding search candidate words based on the search candidate words; calculating a score corresponding to the retrieval candidate word based on the word frequency information of the retrieval candidate word, wherein the word frequency information is in direct proportion to the score; and sorting the search candidate words based on the scores.
Optionally, in a specific embodiment, the error correction module is further specifically configured to:
obtaining the scores of the retrieval candidate words based on the sorting result; and taking the retrieval candidate word with the highest score as a replacement item to replace the corresponding participle segment based on the score of the retrieval candidate word.
In the embodiment, word segmentation is performed on the proper nouns to be corrected, then a plurality of word segmentation segments after word segmentation are converted into a pinyin format to be output, when retrieval is performed, the pinyin of each word segmentation segment is used as a keyword, candidate words corresponding to the word segmentation segments are retrieved from a preset homophone dictionary, finally, the retrieved candidate words are sequenced, the candidate word with the highest score is used as a replacement item to replace the corresponding word segmentation segment, error correction of wrongly-written characters in the proper nouns is further realized, and the accuracy of the proper nouns is ensured. The invention provides more accurate intelligent error correction service of proper nouns for users, and the realization process does not depend on large batches of linguistic data, thereby greatly improving the error correction efficiency of the proper nouns.
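The overall control flow summarized above, including the fallback to the inverted index when the homophone lookup returns nothing, can be sketched as follows. Everything named here is hypothetical: `correct_segment`, the callables passed to it, and the small lookup tables are assumptions chosen to show the flow, not the patent's implementation.

```python
def correct_segment(segment, to_pinyin, homophones, fallback_retrieve, score):
    """Correct one word segment.

    1. Look the segment's pinyin up in the homophone dictionary.
    2. If that yields nothing, fall back to inverted-index retrieval.
    3. Return the highest-scoring candidate, or the segment unchanged
       when no candidate is found at all.
    """
    candidates = homophones.get(to_pinyin(segment), [])
    if not candidates:
        candidates = fallback_retrieve(segment)
    if not candidates:
        return segment
    return max(candidates, key=score)


# Hypothetical lookup tables for illustration only.
pinyin_table = {"水豆": "shuidou", "报销": "baoxiao"}
homophone_dict = {"shuidou": ["水痘", "水豆"], "baoxiao": ["报销"]}
frequency = {"水痘": 100_000, "水豆": 3, "报销": 5_000}

corrected = [
    correct_segment(
        seg,
        to_pinyin=pinyin_table.get,
        homophones=homophone_dict,
        fallback_retrieve=lambda s: [],   # no inverted index in this demo
        score=lambda w: frequency.get(w, 0),
    )
    for seg in ["水豆", "报销"]
]
print(corrected)
```

Passing the lookup, fallback and scoring steps in as callables keeps the flow independent of how each step is actually implemented.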
Based on the same embodiment as the above-mentioned intelligent error correction method of proper nouns of the present invention, the contents of the embodiment of the intelligent error correction device of proper nouns are not described in detail in this embodiment.
The invention also provides a computer readable storage medium.
In this embodiment, the computer readable storage medium stores an intelligent noun error correction program, and the intelligent noun error correction program, when executed by the processor, implements the steps of the intelligent noun error correction method described in any one of the above embodiments. The method implemented when the intelligent error correction program of proper nouns is executed by the processor may refer to various embodiments of the intelligent error correction method of proper nouns of the present invention, and thus, redundant description is not repeated.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.

Claims (10)

1. An intelligent proper noun error correction method is characterized by comprising the following steps:
acquiring a text to be corrected, and determining a proper noun to be corrected based on the text to be corrected;
performing word segmentation processing on the proper noun to be corrected to obtain a plurality of word segmentation segments of the proper noun to be corrected, and outputting the pinyin of each word segmentation segment;
taking the pinyin of each word segmentation segment as a keyword, and retrieving candidate words corresponding to the word segmentation segment from a preset homophone dictionary to obtain a retrieval result;
if the retrieval result is not empty, determining each retrieval candidate word based on the retrieval result;
calculating the scores of the retrieval candidate words based on the retrieval candidate words, sequencing the scores, and outputting a sequencing result;
and based on the sorting result, taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation.
2. The intelligent noun correction method of claim 1, wherein before the step of obtaining proper nouns to be corrected, further comprising:
acquiring a first original corpus;
performing word segmentation processing on the first original corpus to obtain a plurality of word fragments of the original corpus;
inputting the word segments in a pinyin format, and counting the pinyins of the word segments;
and determining the word segments with the same pinyin based on the pinyins of the word segments, and constructing a homophone dictionary, wherein the homophone dictionary comprises the corresponding relation between the same pinyin and different characters.
3. The intelligent noun correction method of claim 1, wherein before the step of obtaining proper nouns to be corrected, further comprising:
acquiring a second original corpus;
performing word segmentation processing on the second original corpus to obtain a plurality of word fragments of the second original corpus;
based on the word segments, respectively carrying out word segmentation on the word segments to obtain a single word set;
and constructing an inverted index dictionary based on the single character set.
4. The intelligent noun error correction method of claim 1, wherein after the step of retrieving candidate words corresponding to the participle fragments from a preset homophone dictionary by using the pinyin of each participle fragment as a keyword based on the participle fragments in the pinyin format to obtain a retrieval result, the method further comprises:
if the retrieval result is empty, traversing and eliminating the characters in the word segmentation segment to obtain a plurality of word groups;
respectively taking the phrases as keywords, calling a preset inverted index dictionary, and retrieving a plurality of candidate words corresponding to the phrases to obtain retrieval results;
and outputting the retrieval candidate words corresponding to each phrase based on the retrieval result.
5. The intelligent noun error correction method of claim 4, wherein after the step of outputting the search candidate word corresponding to each word group based on the search result, further comprising:
calculating the scores of the retrieval candidate words based on the retrieval candidate words, sequencing the scores, and outputting a sequencing result;
and based on the sorting result, taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation.
6. The intelligent noun correction method of claim 1 or 5, wherein the calculating and ranking scores of the search candidate words based on the search candidate words, and outputting the ranking result comprises:
determining word frequency information of the corresponding search candidate words based on the search candidate words;
calculating a score corresponding to the retrieval candidate word based on the word frequency information of the retrieval candidate word, wherein the word frequency information is in direct proportion to the score;
and sorting the search candidate words based on the scores.
7. The intelligent noun error correction method of claim 6, wherein the candidate word with the highest score is used as a replacement item based on the sorting result, and replacing the corresponding participle segment comprises:
obtaining the scores of the retrieval candidate words based on the sorting result;
and taking the retrieval candidate word with the highest score as a replacement item to replace the corresponding participle segment based on the score of the retrieval candidate word.
8. An intelligent noun error correction device, characterized in that the intelligent noun error correction device comprises:
the acquisition module is used for acquiring proper nouns to be corrected;
the word segmentation module is used for carrying out word segmentation processing on the proper nouns to be corrected to obtain a plurality of word segmentation segments of the text to be corrected and outputting the word segmentation segments in a pinyin format;
the candidate word determining module is used for respectively taking the pinyin of each word segmentation segment as a keyword based on the word segmentation segments in the pinyin format, and retrieving candidate words corresponding to the word segmentation segments from a preset homophone dictionary to obtain a retrieval result; if the retrieval result is not empty, determining each retrieval candidate word based on the retrieval result;
the score calculating module is used for calculating the scores of the retrieval candidate words based on the retrieval candidate words, sequencing the scores and outputting a sequencing result;
and the error correction module is used for taking the candidate word with the highest score as a replacement item to replace the corresponding word segmentation based on the sorting result.
9. An intelligent noun error correction device, characterized in that the intelligent noun error correction device comprises a memory, a processor and an intelligent noun error correction program stored on the memory and executable on the processor, and when the intelligent noun error correction program is executed by the processor, the steps of the intelligent noun error correction method according to any one of claims 1-7 are implemented.
10. A computer-readable storage medium, wherein the computer-readable storage medium has stored thereon a smart error correction program of proper nouns, which when executed by a processor implements the steps of the smart error correction method of proper nouns according to any one of claims 1-7.
CN202010164805.5A 2020-03-11 2020-03-11 Intelligent error correction method, device and equipment for proper nouns and storage medium Pending CN111428494A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010164805.5A CN111428494A (en) 2020-03-11 2020-03-11 Intelligent error correction method, device and equipment for proper nouns and storage medium

Publications (1)

Publication Number Publication Date
CN111428494A true CN111428494A (en) 2020-07-17

Family

ID=71547714

Country Status (1)

Country Link
CN (1) CN111428494A (en)

Legal Events

Date Code Title Description
PB01 Publication
CB03 Change of inventor or designer information
Inventor after: Chen Leqing; Zeng Zengfeng; Liu Dongyu
Inventor before: Zeng Zengfeng; Liu Dongyu
SE01 Entry into force of request for substantive examination