CN111444315A - Method, device and equipment for screening error correction phrase candidate items and storage medium - Google Patents

Info

Publication number
CN111444315A
Authority
CN
China
Prior art keywords
candidate
candidates
phrase
items
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010164836.0A
Other languages
Chinese (zh)
Inventor
曾增烽
刘东煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010164836.0A
Publication of CN111444315A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/3332 - Query translation
    • G06F 16/3335 - Syntactic pre-processing, e.g. stopword elimination, stemming
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2148 - Generating training patterns; Bootstrap methods characterised by the process organisation or structure, e.g. boosting cascade
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of big data and discloses a method for screening error correction phrase candidates, which comprises the following steps: reading a plurality of candidates corresponding to the phrase to be corrected, calculating a primary weight value for each candidate, sorting the candidates to determine a first sorting result, and obtaining first candidates corresponding to the phrase to be corrected according to the first sorting result; calling a preset secondary scoring and sorting model, calculating secondary weight values for the first candidates and sorting them to obtain a second sorting result, obtaining second candidates corresponding to the phrase to be corrected, screening the second candidate with the highest secondary weight value among the second candidates, and marking that candidate as the target candidate. The invention also discloses a device and equipment for screening error correction phrase candidates and a computer-readable storage medium. The invention provides users with a more accurate screening service for error correction phrase candidates and improves the accuracy of risk monitoring.

Description

Method, device and equipment for screening error correction phrase candidate items and storage medium
Technical Field
The present invention relates to the field of big data technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for screening candidate items of error-correcting phrases.
Background
With the development of technology, artificial intelligence is becoming increasingly common. In a human-computer conversation scene, for example, a machine needs to accurately determine the user's intention from the conversation content, which usually requires correcting errors in the text caused by miswritten words or language conversion. Existing error correction technology mainly goes through the stages of error detection, candidate recall, and candidate sorting.
In the prior art, existing error correction models mainly correct words that users have miswritten or proper nouns garbled by language conversion, and a single model must be used to screen and match all candidates of the phrase to be corrected.
Disclosure of Invention
The invention mainly aims to provide a method, a device and equipment for screening candidate items of error correction phrases and a computer-readable storage medium, and aims to solve the technical problem of low implementation efficiency of the existing error correction technology.
In order to achieve the above object, the present invention provides a method for screening error correction phrase candidate items, which comprises the following steps:
reading a plurality of candidates corresponding to the phrase to be corrected;
calculating a primary weight value for each candidate and sorting the candidates based on the attribute values of each candidate, and determining a first sorting result of the candidates, wherein the attribute values of the candidates comprise word frequency, edit distance, and pinyin Jaccard distance;
acquiring a plurality of first candidates corresponding to the phrase to be corrected based on the first sorting result;
calling a preset secondary scoring and sorting model, respectively obtaining the secondary weight values of the first candidates, and sorting them to obtain a second sorting result of the first candidates;
acquiring a plurality of second candidates corresponding to the phrase to be corrected based on the second sorting result;
and screening the second candidate with the highest secondary weight value among the second candidates, and marking the corresponding second candidate as the target candidate.
Optionally, before reading the multiple candidates corresponding to the phrase to be error-corrected, the method further includes:
obtaining corpus data and taking the corpus data as a training sample set;
extracting first parameter characteristics of the training sample set based on the training sample set, wherein the first parameter characteristics comprise word frequency variation characteristics, word segmentation variation characteristics and language model characteristics;
and training the training sample set with the XGBoost algorithm based on the first parameter characteristics of the training sample set, so as to construct the secondary scoring and sorting model.
Optionally, the calculating a primary weight value for each candidate based on its attribute values, sorting the candidates, and determining a first sorting result of the candidates comprises:
respectively acquiring the attribute values of each candidate, wherein the attribute values comprise word frequency, edit distance, and pinyin Jaccard distance;
respectively calculating the primary weight value of each candidate based on its word frequency, edit distance, and pinyin Jaccard distance;
wherein the primary weight value of a candidate is calculated using the following formula:
M = log10(T) - P - Q;
wherein M represents the primary weight value of the candidate, T represents the word frequency of the candidate, P represents the edit distance, and Q represents the pinyin Jaccard distance;
and sorting the candidates by their primary weight values to obtain the first sorting result.
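The primary weight formula above can be sketched in Python. This is an illustrative implementation, not the patent's own code: the helper functions, the candidate tuple layout, and the use of pinyin syllable sets for the Jaccard term are assumptions.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Levenshtein edit distance (the P term), computed with a rolling row."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # delete a[i-1]
                        dp[j - 1] + 1,                      # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))      # substitute
            prev = cur
    return dp[n]

def pinyin_jaccard_distance(p1: set, p2: set) -> float:
    """Jaccard distance (the Q term) between two pinyin syllable sets."""
    return 1.0 - len(p1 & p2) / len(p1 | p2)

def primary_weight(word_freq: float, p: float, q: float) -> float:
    """M = log10(T) - P - Q, as in the formula above."""
    return math.log10(word_freq) - p - q

def first_sorting_result(candidates):
    """candidates: list of (term, word_freq, edit_dist, pinyin_jaccard_dist).
    Returns (term, M) pairs sorted by primary weight, highest first."""
    scored = [(term, primary_weight(t, p, q)) for term, t, p, q in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A higher word frequency raises M while larger edit and pinyin distances lower it, matching the description that a larger weight value means a more accurate candidate.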
Optionally, the calling a preset secondary scoring and sorting model, respectively obtaining the secondary weight values of the first candidates, and sorting them to obtain a second sorting result of the first candidates comprises:
respectively extracting second parameter characteristics of the first candidates, wherein the second parameter characteristics comprise word frequency variation characteristics, word segmentation variation characteristics, and language model characteristics;
calling the preset secondary scoring and sorting model based on the second parameter characteristics, and respectively calculating the secondary weight value of each first candidate;
and sorting the first candidates in a preset order by their secondary weight values to obtain the second sorting result of the first candidates.
Optionally, the obtaining, based on the second sorting result, a plurality of second candidates corresponding to the phrase to be corrected comprises:
screening the first candidates based on the second sorting result;
if the secondary weight value of a first candidate is smaller than a preset threshold value, rejecting the first candidate;
if the secondary weight value of a first candidate is larger than the preset threshold value, retaining the first candidate;
and marking the first candidates retained after screening as second candidates.
Optionally, after the step of screening the second candidate with the highest secondary weight value among the second candidates and marking the corresponding second candidate as the target candidate, the method further includes:
and taking the target candidate item as a replacement item to replace the phrase to be corrected.
Further, the present invention also provides a device for screening error-correcting phrase candidate items, where the device for screening error-correcting phrase candidate items includes:
the reading module is used for reading a plurality of candidate items corresponding to the phrases to be corrected;
the measuring and calculating module is used for measuring and calculating the primary weight value of each candidate item based on the attribute value of each candidate item, sorting the candidate items and determining a first sorting result of the candidate items;
a first obtaining module, configured to obtain, based on the first sorting result, a plurality of first candidates corresponding to the phrase to be corrected;
the secondary measuring and calculating module is used for calling a preset secondary scoring and sorting model, respectively calculating the secondary weight value of each first candidate, and sorting to obtain a second sorting result of the first candidates;
a second obtaining module, configured to obtain, based on the second sorting result, a plurality of second candidates corresponding to the phrase to be corrected;
and the marking module is used for screening the second candidate with the highest secondary weight value among the second candidates and marking the corresponding second candidate as the target candidate.
Optionally, the apparatus for screening error-correcting phrase candidates further includes:
the third acquisition module is used for acquiring the corpus data and taking the corpus data as a training sample set;
the extraction module is used for extracting a first parameter characteristic of the training sample set based on the training sample set;
and the construction module is used for training the training sample set with the XGBoost algorithm based on the first parameter characteristics of the training sample set, so as to construct the secondary scoring and sorting model.
Optionally, the calculation module is specifically configured to:
respectively acquiring the attribute values of each candidate, wherein the attribute values comprise word frequency, edit distance, and pinyin Jaccard distance; respectively calculating the primary weight value of each candidate based on its word frequency, edit distance, and pinyin Jaccard distance; and sorting the candidates by their primary weight values to obtain a first sorting result;
wherein the primary weight value of a candidate is calculated using the following formula:
M = log10(T) - P - Q;
wherein M represents the primary weight value of the candidate, T represents the word frequency of the candidate, P represents the edit distance, and Q represents the pinyin Jaccard distance.
Optionally, the secondary measuring and calculating module is specifically configured to:
respectively extracting second parameter characteristics of the first candidates, wherein the second parameter characteristics comprise word frequency variation characteristics, word segmentation variation characteristics, and language model characteristics; calling the preset secondary scoring and sorting model based on the second parameter characteristics, and respectively calculating the secondary weight value of each first candidate; and sorting the first candidates in a preset order by their secondary weight values to obtain a second sorting result of the first candidates.
Optionally, the second obtaining module is further specifically configured to:
screening the first candidates based on the second sorting result; if the secondary weight value of a first candidate is smaller than a preset threshold value, rejecting the first candidate; if the secondary weight value of a first candidate is larger than the preset threshold value, retaining the first candidate; and marking the first candidates retained after screening as second candidates.
Further, the apparatus for screening error-correcting phrase candidates further includes:
and the replacing module is used for replacing the phrase to be corrected by taking the target candidate item as a replacing item.
Further, in order to achieve the above object, the present invention also provides screening equipment for error correction phrase candidates, comprising a memory, a processor, and a screening program for error correction phrase candidates that is stored in the memory and executable on the processor, wherein the screening program, when executed by the processor, implements the steps of the screening method for error correction phrase candidates according to any one of the above.
Further, in order to achieve the above object, the present invention also provides a computer-readable storage medium on which a screening program for error correction phrase candidates is stored, and the screening program, when executed by a processor, implements the steps of the screening method for error correction phrase candidates according to any one of the above.
In this embodiment, phrases to be corrected are processed by a two-stage candidate error correction method. A plurality of candidates are obtained, the primary weight value of every candidate is calculated by a primary sorting rule, and a preset number of first candidates are obtained; the larger the weight value, the higher the accuracy of the candidate. Screening out some irrelevant candidates by primary weight value prevents all candidates from entering the secondary stage and keeps excessive irrelevant candidates from degrading the accuracy with which the secondary sorting model screens out the correct candidate. The secondary weight values of the first candidates are then obtained and sorted, and the second candidate with the largest weight value is marked as the target candidate to replace the phrase to be corrected and complete error correction, which solves the technical problem that conventional error correction technology has low implementation efficiency. The invention adopts a strategy of primary filtering plus secondary scoring, thereby accelerating system operation and improving error correction processing speed.
Drawings
Fig. 1 is a schematic structural diagram of the hardware operating environment of a device according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of the method for screening error correction phrase candidates according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of the method for screening error correction phrase candidates according to the present invention;
FIG. 4 is a schematic diagram illustrating the detailed flow of step S20 in FIG. 2;
FIG. 5 is a schematic diagram illustrating the detailed flow of an embodiment of step S40 in FIG. 2;
FIG. 6 is a schematic diagram illustrating the detailed flow of step S50 in FIG. 2;
fig. 7 is a functional block diagram of an embodiment of the device for screening error correction phrase candidates according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides screening equipment for candidate items of error correction phrases.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a hardware operating environment of a device according to an embodiment of the present invention.
As shown in fig. 1, the screening device for error correction phrase candidates may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 does not constitute a limitation of the screening device for error correction phrase candidates, which may include more or fewer components than those shown, a combination of some components, or a different arrangement of components.
As shown in fig. 1, the memory 1005, a kind of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a screening program for error correction phrase candidates. The operating system is a program that manages and controls the screening equipment and its software resources and supports the running of the network communication module, the user interface module, the screening program for error correction phrase candidates, and other programs or software; the network communication module is used to manage and control the network interface 1004; and the user interface module is used to manage and control the user interface 1003.
In the hardware structure of the screening device for error correction phrase candidates shown in fig. 1, the network interface 1004 is mainly used for connecting to the system background and exchanging data with it; the user interface 1003 is mainly used for connecting to a client (user side) and exchanging data with the client; and the screening device calls, via the processor 1001, the screening program for error correction phrase candidates stored in the memory 1005 and performs the operations of the following embodiments of the screening method for error correction phrase candidates.
Based on the hardware structure of the screening device for the error correction phrase candidate items, the embodiments of the screening method for the error correction phrase candidate items are provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of the method for screening error correction phrase candidates according to the present invention. In this embodiment, the method for screening error correction phrase candidates comprises the following steps:
step S10, reading a plurality of candidates corresponding to the phrase to be corrected;
In this embodiment, the phrase to be corrected refers to a wrong word that appears in the text because the user miswrote it or because of language conversion, for example, "juniper" in "juniper can be guaranteed", "warming tube" in "whether insurance can be bought for warming tube surgery", and "lupus erythematosus" in "systemic lupus erythematosus".
In this embodiment, the candidates of the phrase to be corrected refer to the candidates recalled from a preset dictionary with the phrase to be corrected as the keyword. For example, using "water city" in "water city can be guaranteed" as the keyword, the recalled candidates include "chicken pox", "water skip", "water bean", and the like. For another example, using "transfusion" as the keyword, the recalled candidates include "fallopian tube", "vas deferens", "blood transfusion", "transfusion", and the like.
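The recall step is not specified beyond "recalled from a preset dictionary with the phrase as keyword"; a minimal stdlib stand-in might look like the following, where the similarity criterion and cutoff are assumptions rather than the patent's method.

```python
import difflib

def recall_candidates(keyword: str, dictionary: list, n: int = 20,
                      cutoff: float = 0.4) -> list:
    """Recall dictionary entries that are superficially similar to the keyword.
    difflib's similarity ratio is only a placeholder for whatever recall
    criterion (pinyin, stroke, or n-gram based) a production dictionary uses."""
    return difflib.get_close_matches(keyword, dictionary, n=n, cutoff=cutoff)
```

For example, recalling with the keyword "transfusion" against a small dictionary keeps near matches such as "blood transfusion" while dropping unrelated entries.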
Step S20, based on the attribute values of each candidate, calculating the primary weight value of each candidate and sorting the candidates to determine a first sorting result of the candidates;
In this embodiment, the primary weight value of each candidate corresponding to the phrase to be corrected is calculated, and the candidates are then sorted by primary weight value to obtain the corresponding first sorting result. The candidates may be arranged from high to low or from low to high weight value; the specific order is not limited.
In this embodiment, the larger the primary weight value of a candidate, the higher the probability that it is the correct candidate, that is, the higher the accuracy with which it can serve as the target candidate. Sorting all candidates of the phrase to be corrected by weight value and retaining a preset number of higher-accuracy candidates prevents all candidates from entering the secondary stage, speeds up the system, and keeps excessive irrelevant candidates from affecting the secondary sorting model, so that more accurate candidates are obtained.
Step S30, based on the first sorting result, obtaining a plurality of first candidates corresponding to the phrase to be corrected;
In this embodiment, the first score of each candidate is obtained from the first sorting result, and a preset number of candidates are selected according to their scores. The aim is to prevent all candidates from entering the secondary stage, which speeds up the system and avoids the influence of excessive irrelevant candidates. For example, if a total of 50 candidates are recalled, many of which are irrelevant, the 50 candidates are first scored and sorted, and the candidates ranked in the TOP k (for example, the TOP 10) are screened out.
Further, the preset number of candidates are marked as the first candidates of the phrase and output for the second round of scoring and sorting, from which the final replacement is determined to complete error correction of the text.
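The TOP-k screening described above can be sketched with the standard library; k = 10 mirrors the TOP 10 example, and the (term, weight) pair shape is an assumption.

```python
import heapq

def screen_first_candidates(scored, k=10):
    """scored: list of (term, primary_weight) pairs for all recalled candidates.
    Keep only the TOP-k by primary weight so that irrelevant candidates do
    not reach the secondary scoring and sorting model."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

heapq.nlargest returns the k best pairs already ordered from highest to lowest weight, which doubles as the first sorting result for the retained candidates.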
Step S40, calling the preset secondary scoring and sorting model, respectively obtaining the secondary weight values of the first candidates, and sorting them to obtain a second sorting result of the first candidates;
In this embodiment, according to the first candidates corresponding to the phrase to be corrected, a preset secondary scoring and sorting model is invoked to obtain the secondary weight value of each first candidate and thus a second sorting result. Further, the secondary weight value of each first candidate is read from the second sorting result. If the secondary weight value of a first candidate is smaller than a preset threshold value, such as 0.5, the first candidate is rejected; if it is greater than the preset threshold value, the first candidate is accepted. Through this round of elimination, the second candidates corresponding to the phrase to be corrected are screened out.
Step S50, based on the second sorting result, obtaining a plurality of second candidates corresponding to the phrase to be corrected;
In this embodiment, according to the second sorting result of the first candidates corresponding to the phrase to be corrected, the secondary weight value of each first candidate is obtained, and the first candidates are screened by secondary weight value to obtain the second candidates corresponding to the phrase to be corrected. For example, the first candidates of the phrase "warming tube" include "fallopian tube", "vas deferens", "blood transfusion", "money transfer", "input", "transport", and the like. These candidates are put into the secondary scoring and sorting model, which yields secondary weight values of 0.95, 0.85, 0.7, 0.6, 0.2, 0.2, 0.1, and 0.05, respectively. The candidates with secondary weight values less than 0.5 are removed, and the four candidates "fallopian tube", "vas deferens", "blood transfusion", and "money transfer" are retained as the second candidates.
Step S60, screening the second candidate with the highest secondary weight value among the second candidates, and marking the corresponding second candidate as the target candidate.
In this embodiment, according to the weight values of the second candidates, the second candidate with the highest score is selected and marked as the target candidate. Continuing the example above, the four retained second candidates of the phrase "warming tube" ("fallopian tube" at 0.95, "vas deferens" at 0.85, "blood transfusion" at 0.7, and "money transfer" at 0.6) are sorted, and the candidate with the largest weight value, "fallopian tube" at 0.95, is marked as the target candidate. The order may be descending or ascending.
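The threshold screening of step S50 and the final selection of step S60 can be sketched together. The 0.5 threshold and the candidate weights reproduce the worked example; treating a weight exactly equal to the threshold as rejected is an assumption, since the text only covers strictly smaller and strictly larger values.

```python
def select_target_candidate(first_candidates, threshold=0.5):
    """first_candidates: (term, secondary_weight) pairs from the model.
    Rejects candidates at or below the threshold, then returns the retained
    second candidates and the one with the highest secondary weight."""
    second = [(t, w) for t, w in first_candidates if w > threshold]
    if not second:
        return [], None
    target = max(second, key=lambda pair: pair[1])
    return second, target

# Weights from the worked example for the phrase "warming tube".
scored = [("fallopian tube", 0.95), ("vas deferens", 0.85),
          ("blood transfusion", 0.7), ("money transfer", 0.6),
          ("input", 0.2), ("transport", 0.1)]
second, target = select_target_candidate(scored)
```

With these inputs, four second candidates survive the threshold and "fallopian tube" is selected as the target candidate, matching the example.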
In this embodiment, phrases to be corrected are processed by a two-stage candidate error correction method. A plurality of candidates are obtained, the primary weight value of every candidate is calculated by a primary sorting rule, and a preset number of first candidates are obtained; the larger the weight value, the higher the accuracy of the candidate. Screening out some irrelevant candidates by primary weight value prevents all candidates from entering the secondary stage and keeps excessive irrelevant candidates from degrading the accuracy with which the secondary sorting model screens out the correct candidate. The secondary weight values of the first candidates are then obtained and sorted, and the second candidate with the largest weight value is marked as the target candidate to replace the phrase to be corrected and complete error correction, which solves the technical problem that conventional error correction technology has low implementation efficiency. The invention adopts a strategy of primary filtering plus secondary scoring, thereby accelerating system operation and improving error correction processing speed.
Referring to fig. 3, fig. 3 is a flowchart illustrating a method for screening candidate words of error correcting phrases according to a second embodiment of the present invention. In this embodiment, before the step S10, the method further includes:
step S01, obtaining corpus data and using the corpus data as a training sample set;
in this embodiment, a large amount of original corpus data is collected; the corpus data contains a great deal of industry-specific data or information, such as the disease term "systemic lupus erythematosus".
In this embodiment, the original corpora are subjected to word segmentation, and the preprocessed data are used as the training sample set. A training sample set (training set) is the set of examples used for learning: a classifier is established by fitting its parameters, that is, a machine learning model is trained on the positive and negative samples in the training sample set to determine the parameters of the machine learning model.
Specifically, preprocessing the original corpus data and using it as the training sample set allows the parameters of the secondary scoring and ranking model to be trained effectively, prevents the training result from degenerating to extreme cases, and makes the scoring and ranking of the model trained on the original corpus data more accurate.
Step S02, extracting first parameter characteristics of the training sample set based on the training sample set, wherein the first parameter characteristics comprise word frequency variation characteristics, word segmentation variation characteristics and language model characteristics;
in this embodiment, the word frequency variation, word segmentation variation and language model characteristics of the training sample set are extracted, which further improves the accuracy of training the secondary scoring and ranking model.
And step S03, training the training sample set by adopting an XGboost algorithm based on the first parameter characteristics of the training sample set so as to construct a secondary scoring and ordering model.
In this embodiment, the XGboost algorithm is used to train the training sample set according to the characteristic parameters of the data in the training sample set to construct a secondary scoring ranking model.
In this embodiment, the XGBoost (eXtreme Gradient Boosting) algorithm is an improvement of the Boosting idea underlying the GBDT (Gradient Boosting Decision Tree) algorithm. Its internal decision trees are regression trees, and its output is a set containing multiple regression trees. The basic training procedure is to traverse all segmentation methods (i.e., node-splitting ways) of all features of the training samples, select the segmentation with the minimum loss, obtain two leaves (i.e., split the node to generate new nodes), and then continue traversing until:
(1) if the splitting stopping condition is met, outputting a regression tree;
(2) and if the iteration stopping condition is met, outputting a regression tree set.
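The traversal-and-split procedure above can be sketched in miniature. The following is an illustrative pure-Python toy, not the XGBoost implementation: depth-1 regression trees (stumps) are fitted to residuals round by round, each stump chosen by enumerating every split of every feature and keeping the one with minimal squared loss; training stops when no valid split remains (the split-stop condition) or after a fixed number of rounds (the iteration-stop condition). The three-column feature rows stand in for the assumed word-frequency, segmentation and language-model features.

```python
def best_stump(X, y):
    """Try every (feature, threshold) split; keep the one with minimal SSE loss."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:  # invalid split, skip
                continue
            ml = sum(left) / len(left)    # leaf value = mean of targets
            mr = sum(right) / len(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, ml, mr)
    return best

def fit_boosted(X, y, rounds=10, lr=0.5):
    """Return a set (list) of regression trees, each fitted to the residuals."""
    trees = []
    pred = [0.0] * len(y)
    for _ in range(rounds):                      # iteration-stop condition
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = best_stump(X, residual)
        if stump is None:                        # split-stop condition
            break
        _, f, t, ml, mr = stump
        trees.append((f, t, lr * ml, lr * mr))
        pred = [p + (lr * ml if row[f] <= t else lr * mr)
                for p, row in zip(pred, X)]
    return trees

def predict(trees, row):
    """Sum the leaf values of all regression trees in the set."""
    return sum(vl if row[f] <= t else vr for f, t, vl, vr in trees)

# Toy data: the first feature separates the two classes.
X = [[1, 0, 0], [2, 0, 0], [10, 0, 0], [12, 0, 0]]
y = [0.0, 0.0, 1.0, 1.0]
model = fit_boosted(X, y)
```

With each round fitting half the remaining residual, the prediction for a row with a large first feature converges toward 1.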
In this embodiment, the training samples used by the XGBoost algorithm form a training sample set built from the original corpus data; that is, every training sample in the set comes from the same data source, a preset corpus.
In the XGBoost algorithm, when all segmentation methods of all features in the training sample set are traversed, the quality of each segmentation is evaluated through its gain value, and the segmentation with the minimum loss is selected each time a node is split. The gain value of a split node can therefore serve as a measure of feature importance: the larger the gain value of a split node, the smaller the node division loss, and the more important the feature used at that split node.
In this embodiment, since the trained secondary scoring and ranking model contains a plurality of regression trees, and different regression trees may split nodes on the same feature, the average gain value of the split nodes that use the same feature is computed across all regression trees in the model, and this average gain value is used as the score of the corresponding feature.
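The per-feature importance statistic described here, averaging the gain (the "profit value" above) of split nodes that use the same feature across all regression trees, can be sketched as follows. The tree representation, a list of (feature, gain) pairs per tree, is a hypothetical stand-in rather than the model's actual internal structure.

```python
from collections import defaultdict

def average_gain(trees):
    """trees: list of trees; each tree is a list of (feature_name, gain) splits.
    Returns the average split gain per feature across all trees."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tree in trees:
        for feature, gain in tree:
            totals[feature] += gain
            counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}

# Illustrative model: three regression trees with assumed feature names.
model = [
    [("word_freq_change", 12.0), ("lm_score", 3.0)],
    [("word_freq_change", 8.0)],
    [("seg_change", 5.0)],
]
scores = average_gain(model)
# word_freq_change is used twice, so it averages (12 + 8) / 2 = 10.0
```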
Referring to fig. 4, fig. 4 is a schematic view of a detailed flow of the step S20 in fig. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S20 further includes:
step S201, respectively obtaining attribute values of each candidate item, wherein the attribute values of the candidate items comprise word frequency, editing distance and pinyin jaccard distance;
in this embodiment, according to a plurality of candidate items corresponding to a phrase to be corrected, a word frequency, an editing distance, and a pinyin jaccard distance of each candidate item are respectively obtained.
In this embodiment, the edit distance refers to the number of positions at which the characters of the original word and the candidate word differ, and the pinyin Jaccard distance refers to the Jaccard distance between the letter sets of the pinyin of the word segment and the pinyin of the candidate segment, i.e., one minus the ratio of the size of their intersection to the size of their union.
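A minimal sketch of the two distances as this embodiment appears to define them: a positional edit distance that counts differing characters at the same positions, and a Jaccard distance over the letter sets of two pinyin strings. The function names, and the reading of "jaccard" as the standard set Jaccard distance, are assumptions.

```python
def positional_edit_distance(a, b):
    """Count positions whose characters differ; extra length counts as differing."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return max(len(a), len(b)) - same

def pinyin_jaccard(p1, p2):
    """Jaccard distance between the letter sets of two pinyin strings."""
    s1, s2 = set(p1), set(p2)
    return 1 - len(s1 & s2) / len(s1 | s2)

d0 = pinyin_jaccard("dou", "dou")    # identical pinyin: distance 0.0
d1 = pinyin_jaccard("luan", "jing")  # only 'n' shared: 1 - 1/7
e1 = positional_edit_distance("abd", "abc")  # one differing position: 1
```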
In this embodiment, the candidate words include matching terms and/or associated terms. For example, with the phrase "heating tube" as the keyword, the inverted index recalls the candidates "fallopian tube", "vas deferens", "blood transfusion", "transfusing", "money transfer", "input" and "transport"; here "fallopian tube" and "vas deferens" are matching terms, while "blood transfusion", "transfusing", "money transfer", "input" and "transport" are associated terms. The word frequency of each recalled candidate word is obtained by counting how often the entry occurs in the original corpus.
Step S202, respectively calculating primary weight values of the candidate items based on the word frequency, the editing distance and the pinyin jaccard distance of the candidate items;
wherein the primary weight value of the candidate item is calculated by adopting the following formula:
M=log10(T)-P-Q;
wherein M represents the primary weight value of the candidate item, T represents the word frequency of the corresponding candidate item, P represents the editing distance of the corresponding candidate item, and Q represents the pinyin jaccard distance;
in this embodiment, the primary weight values of the candidate items may be calculated according to the word frequency information, the edit distance, and the pinyin jaccard distance, and the candidate items may be ranked according to the primary weight values.
In this embodiment, the word frequency of the candidate word is proportional to the weight value, the edit distance of the candidate word is inversely proportional to the weight value, and the pinyin jaccard distance of the candidate word is inversely proportional to the weight value.
In this embodiment, the pinyin Jaccard distance is the Jaccard distance between the letter set of the pinyin of the word segment and the letter set of the pinyin of the candidate segment.
In this embodiment, the word frequency and the edit distance of each candidate item are obtained, where the edit distance is the number of positions at which the original word and the candidate word differ. For example, in "Can water bean be reimbursed?", the candidate words for the error word "water bean" are "chickenpox" and "water city"; each differs from the original word by one character at the same position, so the edit distance of each is 1.
Further, since the word frequency of a candidate word is proportional to its weight value, when the primary weight values are calculated and sorted, the word frequency may be used directly as the weight value, or the frequencies may be divided into intervals by magnitude, with each interval mapped to a different weight value. For example, using word frequency directly as the weight value, the recalled candidate words can be sorted by frequency with the highest-frequency words first, so that frequent words are displayed preferentially; the order may also be reversed. In practice, a person skilled in the art can flexibly choose the sorting manner as needed, and this embodiment does not limit the specific manner of sorting candidate words by word frequency.
Specifically, the primary weight value M of a candidate may be calculated by the following formula:
primary weight value M = log10(word frequency) - edit distance - pinyin Jaccard distance
For example, in "Can water bean be reimbursed?", the candidate words for the error word "water bean" are "chickenpox" and "water city". Both have an edit distance of 1 and a pinyin Jaccard distance of 0, so they are distinguished by word frequency, and "chickenpox", with the larger weight value, is selected:
chickenpox: log10(100,000) - 1 - jaccard(dou, dou) = 5 - 1 - 0 = 4
water city: log10(426) - 1 - jaccard(dou, dou) = 2.63 - 1 - 0 = 1.63
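The worked numbers above can be reproduced directly from the formula M = log10(T) - P - Q, reading "10w" as a word frequency of 100,000:

```python
from math import log10

def primary_weight(freq, edit_dist, jaccard_dist):
    """Primary weight M = log10(word frequency) - edit distance - pinyin Jaccard distance."""
    return log10(freq) - edit_dist - jaccard_dist

m_chickenpox = primary_weight(100000, 1, 0.0)  # 5 - 1 - 0 = 4.0
m_water_city = primary_weight(426, 1, 0.0)     # about 2.63 - 1 = 1.63
```

Because 100,000 is far more frequent than 426, "chickenpox" wins by a wide margin even though both candidates tie on edit distance and pinyin Jaccard distance.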
Furthermore, introducing the pinyin Jaccard distance handles differences in initials and finals caused by non-standard local pronunciation.
For example, for the error word in "reimbursement process for fallopian tube surgery", the candidates "fallopian tube" and "vas deferens" both have a word frequency of about 100,000 and an edit distance of 1, but their pinyin Jaccard distances differ because the initials of their pinyin differ; therefore "fallopian tube", with the larger weight value, can be selected preferentially.
Further, as noted above, the word frequency information may also be used directly as the weight value when the primary weight values are calculated, with candidates sorted by frequency in either ascending or descending order; a person skilled in the art can choose the sorting manner as needed, and this embodiment does not limit the specific manner of sorting candidate words by word frequency.
Step S203, based on the primary weight value of the candidate item, sorting the candidate item according to the weight value, and obtaining a first sorting result.
In this embodiment, the candidate items are sorted according to their primary weight values, and a preset number of candidate items is retained, so that not all candidates enter the secondary candidate stage; this increases system operating efficiency and reduces the influence of irrelevant candidates. For example, if 50 candidates are recalled in total, many of them irrelevant, the 50 candidates are scored and sorted, the candidates ranked in the top k (e.g., top 10) are selected, and those 10 candidates are labeled as the first candidate items.
Further, the preset number of candidate items is labeled as the first candidate items and output; the first candidate items are then sorted by weight value to obtain the first sorting result.
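The primary filtering step can be sketched as follows, assuming the candidates arrive as (candidate, primary weight) pairs; the data below are illustrative, not from the embodiment:

```python
def top_k_candidates(weighted, k=10):
    """weighted: list of (candidate, primary_weight) pairs.
    Sort by weight descending and keep only the top-k candidates."""
    ranked = sorted(weighted, key=lambda cw: cw[1], reverse=True)
    return [c for c, _ in ranked[:k]]

# 50 recalled candidates with decreasing illustrative weights.
recalled = [("cand%d" % i, 50.0 - i) for i in range(50)]
first = top_k_candidates(recalled)  # the 10 first candidate items
```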
Referring to fig. 5, fig. 5 is a schematic view of a detailed flow of the step S40 in fig. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S40 further includes:
step S401, respectively extracting second parameter characteristics of the first candidate candidates;
in this embodiment, according to a first candidate item corresponding to a phrase to be corrected, parameter characteristics of the first candidate item are respectively extracted, where the parameter characteristics include a word frequency variation characteristic, a word segmentation variation characteristic, and a language model characteristic.
Step S402, calling a preset secondary scoring and sorting model based on the second parameter characteristics, and respectively measuring and calculating a secondary weight value of the first candidate item;
in this embodiment, since the word-frequency variation, word-segmentation variation and language-model characteristics of a candidate are all informative for error correction, the secondary scoring and ranking model can be invoked on these characteristics to compute the secondary weight value of each first candidate item to be predicted. For example, for "systemic lupus erythematosus": a miswritten candidate has a very low word frequency, below 30, while the correct candidate "lupus erythematosus" has a word frequency of 1688. After word segmentation, the miswritten candidate breaks into scattered pieces (a 1\1\2 split), whereas the correct candidate segments at a much coarser granularity (4), with all four characters forming a single word. Further, when the language model predicts the disputed character, the wrong character does not appear in the top 20, while the correct character is ranked first with a probability close to 1, namely 0.94.
In this embodiment, each first candidate item is put into the secondary scoring and ranking model. For example, the first candidate items of the phrase "heating tube" include "fallopian tube", "vas deferens", "blood transfusion", "transfusing", "money transfer", "input", "transport", and the like; these candidates are put into the model for prediction, yielding the secondary weight values 0.95, 0.6, 0.8, 0.65, 0.3, 0.3, 0.1 and 0.05 respectively.
Step S403, sorting the first candidate candidates according to a preset order according to the secondary weight values of the first candidate candidates, and obtaining a second sorting result of the first candidate candidates.
In this embodiment, all the first candidate items are ranked in the preset order according to their secondary weight values, so as to obtain the corresponding second sorting result. For example, the first candidate items of the phrase "heating tube" are "fallopian tube", "vas deferens", "blood transfusion", "transfusing", "money transfer", "input", "transport", and the like, with secondary weight values 0.95, 0.6, 0.8, 0.65, 0.3, 0.3, 0.1 and 0.05 respectively; sorting these values yields the corresponding second sorting result.
In this embodiment, the preset order may be either descending or ascending. For example, in descending order: "fallopian tube" (0.95) > "blood transfusion" (0.8) > "transfusing" (0.65) > "vas deferens" (0.6) > "money transfer" (0.3) = "input" (0.3) > "transport" (0.1) > the remaining candidate (0.05); the ascending order is simply the reverse.
Referring to fig. 6, fig. 6 is a schematic view of a detailed flow of the step S50 in fig. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S50 further includes:
step S501, screening the first candidate candidates based on the second sorting result;
in this embodiment, the secondary weight values of all the first candidate candidates are respectively obtained from the second sorting result, and all the first candidate candidates are screened according to the secondary weight values of all the first candidate candidates.
For example, the first candidate items of the phrase "heating tube" are "fallopian tube", "vas deferens", "blood transfusion", "transfusing", "money transfer", "input", "transport", and the like, and the corresponding second sorting result, in descending order, is "fallopian tube" (0.95) > "blood transfusion" (0.8) > "transfusing" (0.65) > "vas deferens" (0.6) > "money transfer" (0.3) = "input" (0.3) > "transport" (0.1) > the remaining candidate (0.05); the ascending order is the reverse. The scores of the first candidate items are thus 0.95, 0.6, 0.8, 0.65, 0.3, 0.3, 0.1 and 0.05 respectively.
In this embodiment, first candidate items are rejected according to their secondary weight values, and the remaining candidates are screened out as the plurality of second candidate items corresponding to the phrase to be corrected. For example, the first candidate items of the phrase "heating tube" are "fallopian tube", "vas deferens", "blood transfusion", "transfusing", "money transfer", "input", "transport", and the like, with secondary weight values 0.95, 0.6, 0.8, 0.65, 0.3, 0.3, 0.1 and 0.05 respectively, and the candidates are screened according to these secondary weight values.
Step S502, if the secondary weight value of the first candidate item is smaller than a preset threshold value, the first candidate item is rejected;
in this embodiment, each first candidate item is screened according to its secondary weight value, and if the secondary weight value of a first candidate item is smaller than the preset value, that candidate is rejected. For example, with the pinyin "da jiang" as the keyword, the corresponding first candidate items include "jackpot", "grand general", "great river", "Dajiang" and "fermented soybean paste". The weight value of each first candidate item is calculated; the weight values of "jackpot", "grand general" and "fermented soybean paste" are 0.1, 0.15 and 0.3 respectively, all below 0.5, so these three first candidate items are removed.
Step S503, if the secondary weight value of the first candidate item is greater than the preset threshold value, the first candidate item is reserved;
in this embodiment, if the corresponding secondary weight value is greater than the preset threshold, the first candidate item is retained and labeled as a second candidate item. Continuing the example, with the pinyin "da jiang" as the keyword the recalled candidate words are "jackpot", "grand general", "great river", "Dajiang" and "fermented soybean paste"; the scoring and ranking model gives "great river" and "Dajiang" weight values of 0.85 and 0.6, both greater than the preset value 0.5, so these two first candidate items are accepted. As another example, using the pinyin "tou bao" of "cephalosporin" in "moderate anemia cephalosporin" as the keyword recalls a plurality of candidate items; the first candidate items below the preset value 0.5 are rejected, while the two first candidate items "apply for insurance" and "cephalosporin" have secondary weight values of 0.95 and 0.7 respectively, both greater than 0.5, so both are accepted.
Step S504, based on the first candidate, labeling the corresponding first candidate as a second candidate.
In this embodiment, the first candidate items whose secondary weight values are greater than the preset value are labeled as the second candidate items. For example, using the pinyin "tou bao" of "cephalosporin" in "moderate anemia cephalosporin" as the keyword recalls a plurality of candidate items; those below the preset value 0.5 are removed, and the two first candidate items "apply for insurance" and "cephalosporin", with secondary weight values of 0.95 and 0.7, are accepted. Further, these two first candidate items are labeled as the second candidate items.
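The threshold screening of steps S501 to S504, together with the later selection of the target candidate, can be sketched as follows, using the "tou bao" weights quoted above and a preset threshold of 0.5; the dictionary representation of candidates is an assumption:

```python
THRESHOLD = 0.5  # preset value from the examples above

def screen(first_candidates, threshold=THRESHOLD):
    """first_candidates: dict of candidate -> secondary weight.
    Reject candidates below the threshold, keep the rest as second
    candidates, and label the maximum as the target candidate."""
    second = {c: w for c, w in first_candidates.items() if w > threshold}
    target = max(second, key=second.get) if second else None
    return second, target

second, target = screen({"apply for insurance": 0.95,
                         "cephalosporin": 0.7,
                         "jackpot": 0.1})
# "jackpot" (0.1) is rejected; the other two are retained, and
# "apply for insurance" (0.95) has the largest secondary weight.
```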
Further, after the step S60, the method further includes:
step S70, the target candidate item is used as a replacement item to replace the phrase to be corrected;
in this embodiment, the target candidate item is labeled as the replacement item. For example, the target candidate item of the phrase to be corrected "heating tube" in "can heating tube surgery buy insurance" is "fallopian tube", and "fallopian tube" is labeled as the replacement item.
In this embodiment, the replacement item is substituted for the corresponding phrase to be corrected, completing the correction of the text to be corrected. For example, the candidate items of the phrase "heating tube" in "can heating tube surgery buy insurance" are screened twice, the final target candidate item is "fallopian tube", and "fallopian tube" is labeled as the replacement item; replacing "heating tube" with "fallopian tube" yields "can fallopian tube surgery buy insurance", completing the correction of the phrase to be corrected.
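Step S70 amounts to a string substitution; a minimal sketch using the English stand-ins for the Chinese example:

```python
def correct(text, wrong_phrase, target_candidate):
    """Replace the phrase to be corrected with the target candidate item."""
    return text.replace(wrong_phrase, target_candidate)

fixed = correct("can heating tube surgery buy insurance",
                "heating tube", "fallopian tube")
# fixed == "can fallopian tube surgery buy insurance"
```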
Referring to fig. 7, fig. 7 is a functional module diagram of an embodiment of the apparatus for screening candidate word groups for error correction according to the present invention. In this embodiment, the apparatus for screening candidate words of error correction phrases includes:
the reading module 10 is configured to read a plurality of candidate items corresponding to a phrase to be corrected;
the calculating module 20 is configured to calculate and sort a primary weight value of each candidate item based on the attribute value of each candidate item, and determine a first sorting result of the candidate items;
a first obtaining module 30, configured to obtain a first candidate corresponding to the phrase to be error-corrected based on the first sorting result;
the secondary measuring and calculating module 40 is configured to invoke a preset secondary scoring and sorting model, measure and calculate a secondary weight value of the first candidate item and sort the secondary weight value of the first candidate item respectively to obtain a second sorting result of the first candidate item;
a second obtaining module 50, configured to obtain, based on the second sorting result, a plurality of second candidate candidates corresponding to the phrase to be corrected;
and a labeling module 60, configured to filter a second candidate with a highest secondary weight value among the second candidate candidates, and label a corresponding second candidate as a target candidate.
Optionally, in a specific embodiment, the apparatus for filtering the error correction phrase candidates further includes:
the third acquisition module is used for acquiring the corpus data and taking the corpus data as a training sample set;
the extraction module is used for extracting a first parameter characteristic of the training sample set based on the training sample set;
and the construction module is used for training the training sample set by adopting an XGboost algorithm based on the first parameter characteristic of the training sample set so as to construct a two-stage scoring and ordering model.
Optionally, in a specific embodiment, the calculating module is specifically configured to:
respectively acquiring the attribute value of each candidate item, wherein the attribute values of the candidate items comprise word frequency, editing distance and pinyin jaccard distance; respectively calculating primary weight values of the candidate items based on the word frequency, the editing distance and the pinyin jaccard distance of the candidate items; and sorting the candidate items according to the weight values based on the primary weight values of the candidate items to obtain a first sorting result.
Optionally, in a specific embodiment, the secondary calculating module is specifically configured to:
respectively extracting second parameter characteristics of the first candidate candidates, wherein the second parameter characteristics comprise word frequency variation characteristics, word segmentation variation characteristics and language model characteristics; calling a preset secondary scoring sorting model based on the second parameter characteristics, respectively measuring and calculating the first candidate items, and determining a secondary weight value of the first candidate items; and sorting the first candidate items according to a preset sequence according to the secondary weight values of the first candidate items to obtain a second sorting result of the first candidate items.
Optionally, in a specific embodiment, the second obtaining module is specifically configured to:
screening the first candidate items based on the second sorting result; when the secondary weight value of the first candidate item is smaller than a preset threshold value, rejecting the first candidate item; when the secondary weight value of the first candidate item is greater than a preset threshold value, reserving the first candidate item; and marking the first candidate item retained after screening as a second candidate item.
Further, the apparatus for screening error-correcting phrase candidates further includes:
and the replacing module is used for replacing the phrase to be corrected with the target candidate item as a replacing item.
In this embodiment, phrases to be corrected are corrected by a two-stage candidate method. A plurality of candidate items is obtained, a primary weight value is calculated for every candidate item using the primary sorting rule, and a preset number of first candidate items is kept; the larger the weight value, the more likely the candidate is correct. This primary screening removes a portion of the irrelevant candidates, so that not all candidates enter the secondary stage and the accuracy of the secondary sorting model is not degraded by too many irrelevant candidates. The secondary weight values of the first candidate items are then obtained and sorted, and the second candidate item with the largest weight value is labeled as the target candidate item, which replaces the phrase to be corrected to complete error correction. This solves the technical problem that conventional error-correction techniques are inefficient to implement. By adopting the strategy of primary filtering plus secondary scoring, the invention speeds up system operation and improves error-correction processing speed.
Since the embodiments of the apparatus for screening error-correction phrase candidate items are based on the same description as the embodiments of the method for screening error-correction phrase candidate items of the present invention, their contents are not described again in detail here.
The invention also provides a computer readable storage medium.
In this embodiment, the computer-readable storage medium stores thereon a filtering program of error correcting phrase candidates, and the filtering program of error correcting phrase candidates is executed by a processor to implement the steps of the method for filtering error correcting phrase candidates described in any one of the above embodiments. The method implemented by the filtering program of the error correction phrase candidates executed by the processor may refer to each embodiment of the filtering method of the error correction phrase candidates of the present invention, and thus, redundant description is not repeated.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The present invention has been described in connection with the accompanying drawings, but the invention is not limited to the above embodiments, which are illustrative rather than restrictive. Those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the description, drawings and claims are intended to be embraced therein.

Claims (10)

1. A method for screening candidate items of error-correcting phrases is characterized by comprising the following steps:
reading a plurality of candidate items corresponding to the phrase to be corrected;
measuring and calculating a primary weight value of each candidate item and sequencing the candidate items based on the attribute value of each candidate item, and determining a first sequencing result of the candidate items, wherein the attribute values of the candidate items comprise word frequency, editing distance and pinyin jaccard distance;
acquiring a plurality of first candidate candidates corresponding to the phrase to be corrected based on the first sequencing result;
calling a preset secondary grading sorting model, respectively measuring and calculating a secondary weight value of the first candidate item, and sorting to obtain a second sorting result of the first candidate item;
acquiring a plurality of second candidate candidates corresponding to the phrase to be corrected based on the second sorting result;
and screening a second candidate item with the highest secondary weight value in the second candidate items, and marking the corresponding second candidate item as a target candidate item.
2. The method for screening error-correction phrase candidates according to claim 1, wherein before the step of reading the candidates corresponding to the phrase to be corrected, the method further comprises:
obtaining corpus data and using the corpus data as a training sample set;
extracting first parameter features from the training sample set, wherein the first parameter features comprise word-frequency variation features, word-segmentation variation features, and language-model features;
and training on the training sample set with the XGBoost algorithm based on the first parameter features to construct the secondary scoring and ranking model.
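The first parameter features of claim 2 can be sketched as a feature-extraction step. The helper names, the frequency table, and the whitespace tokenizer below are illustrative assumptions rather than part of the patent text; in the described training flow, the resulting feature vectors would be assembled into a matrix and fit with an XGBoost model.

```python
def extract_first_features(original: str, candidate: str,
                           freq: dict, lm_score) -> list:
    """Build the three first-parameter features of claim 2 for one
    (original phrase, candidate) pair.  `freq` maps phrase -> corpus
    count and `lm_score` is any language-model scoring callable;
    both are illustrative stand-ins."""
    # word-frequency variation: change in corpus frequency
    freq_delta = freq.get(candidate, 0) - freq.get(original, 0)
    # word-segmentation variation: change in token count (whitespace
    # split used as a placeholder for a real Chinese segmenter)
    seg_delta = len(candidate.split()) - len(original.split())
    # language-model feature: score of the candidate in context
    lm = lm_score(candidate)
    return [freq_delta, seg_delta, lm]
```

In training, one such vector per labeled (phrase, candidate) pair would form the rows of X, with 0/1 correctness labels as y, fed to e.g. an XGBoost regressor or ranker.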
3. The method for screening error-correction phrase candidates according to claim 1, wherein computing a primary weight value for each candidate based on its attribute values, ranking the candidates, and determining the first ranking result comprises:
obtaining the attribute values of each candidate, wherein the attribute values comprise word frequency, edit distance, and pinyin Jaccard distance;
computing the primary weight value of each candidate from its word frequency, edit distance, and pinyin Jaccard distance;
wherein the primary weight value of a candidate is calculated with the following formula:
M = log10(T) - P - Q;
where M denotes the primary weight value of the candidate, T denotes its word frequency, P denotes its edit distance, and Q denotes its pinyin Jaccard distance;
and ranking the candidates by their primary weight values to obtain the first ranking result.
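The primary weight of claim 3 can be sketched in Python. The Levenshtein edit distance and the Jaccard distance over sets of pinyin syllables are standard definitions; the pinyin syllables are assumed to be supplied externally (in practice a library such as pypinyin would produce them from the Chinese text).

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (one rolling row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def pinyin_jaccard_distance(syllables_a, syllables_b) -> float:
    """1 - |A ∩ B| / |A ∪ B| over sets of pinyin syllables."""
    sa, sb = set(syllables_a), set(syllables_b)
    return 1.0 - len(sa & sb) / len(sa | sb)

def primary_weight(word_freq: int, p: int, q: float) -> float:
    """M = log10(T) - P - Q, as stated in claim 3."""
    return math.log10(word_freq) - p - q
```

Candidates would then be sorted by M in descending order to produce the first ranking result.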
4. The method for screening error-correction phrase candidates according to claim 1, wherein invoking a preset secondary scoring and ranking model to compute and rank the secondary weight values of the first candidates to obtain the second ranking result comprises:
extracting second parameter features from each first candidate, wherein the second parameter features comprise word-frequency variation features, word-segmentation variation features, and language-model features;
invoking the preset secondary scoring and ranking model on the second parameter features to compute a secondary weight value for each first candidate;
and ranking the first candidates in a preset order by their secondary weight values to obtain the second ranking result.
5. The method for screening error-correction phrase candidates according to claim 1, wherein obtaining a plurality of second candidates corresponding to the phrase to be corrected based on the second ranking result comprises:
filtering the first candidates based on the second ranking result;
rejecting a first candidate if its secondary weight value is smaller than a preset threshold;
retaining a first candidate if its secondary weight value is larger than the preset threshold;
and marking the first candidates retained after filtering as second candidates.
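The thresholding of claim 5 followed by the final selection of claim 1 reduces to a filter plus an argmax. A minimal sketch, with the threshold value and the example scores chosen arbitrarily for illustration:

```python
def screen_candidates(scored, threshold):
    """Keep the first candidates whose secondary weight exceeds the
    threshold (claim 5), then pick the highest-weighted survivor as
    the target candidate (claim 1).  `scored` is a list of
    (candidate, secondary_weight) pairs."""
    second = [(c, w) for c, w in scored if w > threshold]
    if not second:
        return second, None  # nothing survived the threshold
    target = max(second, key=lambda cw: cw[1])[0]
    return second, target
```

For example, with scores {0.2, 0.9, 0.6} and a threshold of 0.5, the 0.2 candidate is rejected and the 0.9 candidate becomes the target.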
6. The method for screening error-correction phrase candidates according to claim 1, wherein after the step of selecting the second candidate with the highest secondary weight value and marking it as the target candidate, the method further comprises:
replacing the phrase to be corrected with the target candidate.
7. A device for screening error-correction phrase candidates, the device comprising:
a reading module for reading a plurality of candidates corresponding to the phrase to be corrected;
a computing module for computing a primary weight value for each candidate based on its attribute values, ranking the candidates, and determining a first ranking result;
a first obtaining module for obtaining first candidates corresponding to the phrase to be corrected based on the first ranking result;
a secondary computing module for invoking a preset secondary scoring and ranking model, computing a secondary weight value for each first candidate, and ranking the first candidates to obtain a second ranking result;
a second obtaining module for obtaining a plurality of second candidates corresponding to the phrase to be corrected based on the second ranking result;
and a marking module for selecting the second candidate with the highest secondary weight value among the second candidates and marking it as the target candidate.
8. The device for screening error-correction phrase candidates according to claim 7, further comprising:
a corpus obtaining module for obtaining corpus data and using the corpus data as a training sample set;
an extraction module for extracting first parameter features from the training sample set;
and a construction module for training on the training sample set with the XGBoost algorithm based on the first parameter features to construct the secondary scoring and ranking model.
9. A screening apparatus for error-correction phrase candidates, comprising a memory, a processor, and a screening program for error-correction phrase candidates stored on the memory and executable on the processor, wherein the screening program, when executed by the processor, implements the steps of the method for screening error-correction phrase candidates according to any one of claims 1 to 6.
10. A computer-readable storage medium on which a screening program for error-correction phrase candidates is stored, wherein the screening program, when executed by a processor, implements the steps of the method for screening error-correction phrase candidates according to any one of claims 1 to 6.
CN202010164836.0A 2020-03-11 2020-03-11 Method, device and equipment for screening error correction phrase candidate items and storage medium Pending CN111444315A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010164836.0A CN111444315A (en) 2020-03-11 2020-03-11 Method, device and equipment for screening error correction phrase candidate items and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010164836.0A CN111444315A (en) 2020-03-11 2020-03-11 Method, device and equipment for screening error correction phrase candidate items and storage medium

Publications (1)

Publication Number Publication Date
CN111444315A true CN111444315A (en) 2020-07-24

Family

ID=71650525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010164836.0A Pending CN111444315A (en) 2020-03-11 2020-03-11 Method, device and equipment for screening error correction phrase candidate items and storage medium

Country Status (1)

Country Link
CN (1) CN111444315A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989806A (en) * 2021-04-07 2021-06-18 广州伟宏智能科技有限公司 Intelligent text error correction model training method

Similar Documents

Publication Publication Date Title
CN111222305B (en) Information structuring method and device
US20140351228A1 (en) Dialog system, redundant message removal method and redundant message removal program
CN112016304A (en) Text error correction method and device, electronic equipment and storage medium
CN106202380B (en) Method and system for constructing classified corpus and server with system
CN109033244B (en) Search result ordering method and device
CN103744889B (en) A kind of method and apparatus for problem progress clustering processing
WO2024131111A1 (en) Intelligent writing method and apparatus, device, and nonvolatile readable storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
JP7430820B2 (en) Sorting model training method and device, electronic equipment, computer readable storage medium, computer program
WO2017091985A1 (en) Method and device for recognizing stop word
CN111563384A (en) Evaluation object identification method and device for E-commerce products and storage medium
CN109446393B (en) Network community topic classification method and device
CN108304509A (en) A kind of comment spam filter method for indicating mutually to learn based on the multidirectional amount of text
CN112417848A (en) Corpus generation method and device and computer equipment
CN110704719A (en) Enterprise search text word segmentation method and device
CN109213998A (en) Chinese wrongly written character detection method and system
CN111369294B (en) Software cost estimation method and device
CN107797981B (en) Target text recognition method and device
CN114969387A (en) Document author information disambiguation method and device and electronic equipment
CN110969005A (en) Method and device for determining similarity between entity corpora
CN111444315A (en) Method, device and equipment for screening error correction phrase candidate items and storage medium
CN111046627A (en) Chinese character display method and system
JP2019148933A (en) Summary evaluation device, method, program, and storage medium
CN111492364A (en) Data labeling method and device and storage medium
CN111782789A (en) Intelligent question and answer method and system

Legal Events

Date Code Title Description
PB01 Publication
CB03 Change of inventor or designer information

Inventor after: Chen Leqing

Inventor after: Zeng Zengfeng

Inventor after: Liu Dongyu

Inventor before: Zeng Zengfeng

Inventor before: Liu Dongyu

SE01 Entry into force of request for substantive examination