CN110633463A

CN110633463A - Professional vocabulary error correction method and system applied to vertical field

Info

Publication number: CN110633463A
Application number: CN201810651482.5A
Authority: CN
Inventors: 赵鹏; 吴雪军
Original assignee: Dingfu Data Technology (beijing) Co Ltd
Current assignee: Dingfu Data Technology (beijing) Co Ltd
Priority date: 2018-06-22
Filing date: 2018-06-22
Publication date: 2019-12-31

Abstract

The invention discloses a professional vocabulary error correction method and system for an intelligent dialogue robot applied to the vertical field. The implementation process is as follows: constructing a confusion set, wherein the confusion set comprises correctly spelled professional vocabularies and the confusable words corresponding to them; performing word segmentation on a sentence typed by a user; and loading the confusion set, traversing each word obtained by the word segmentation, identifying any confusable words, and replacing them with the correct professional vocabulary to complete error correction. The method and system can correct errors in real time with very little time cost, effectively improve the customer service robot's recognition of user intent, and improve the results of single-round and multi-round conversations.

Description

Professional vocabulary error correction method and system applied to vertical field

Technical Field

The invention belongs to the technical field of information, and relates to an error correction method and system; in particular to a professional vocabulary error correction method and a professional vocabulary error correction system for an intelligent dialogue robot applied to the vertical field.

Background

An intelligent chat robot has several advantages: it can be online 24 hours a day, it responds quickly, and no waiting is needed; because data can be stored, answers to recurring questions need not be edited repeatedly; efficiency is high and cost is low. However, when interacting with the robot, users often type wrong words, and these are frequently professional vocabularies of the vertical field. The robot is very sensitive to such vocabulary, so the errors directly affect subsequent intention recognition and ultimately the functions of the robot. Correcting professional vocabulary can improve the performance of the intelligent dialogue robot, for example the pertinence of its responses. However, unlike general vocabulary, professional vocabulary lacks a corresponding error correction corpus, so a language model (such as N-Gram) cannot be used as in ordinary error correction.

Meanwhile, the calibration of professional vocabulary in the vertical field is currently performed word by word rather than sentence by sentence. For example, in the automobile field, an intelligent customer service robot calibrates the brand and car-series vocabulary in a sentence: it recognizes the everyday word "mark" and converts it into the homophonous car-series word "Biaozhi". Because the context of the word is not considered, this causes unnecessary conversions; for example, the "mark" in the query sentence "what is the mark of this car like" does not need to be converted. Other vertical fields such as "electronic digital", "sports brand" and "diet recipe" also suffer from misinterpreted semantics caused by simple word recognition.

Based on the above problems, there is a pressing need for a professional vocabulary error correction method or system suited to the vertical field that accurately, quickly and comprehensively calibrates the professional vocabularies in dialogue sentences that the intelligent dialogue robot cannot recognize because of spelling errors, thereby improving the service performance of the intelligent dialogue robot.

Disclosure of Invention

To overcome the above problems, the inventors carried out intensive research and provide a professional vocabulary error correction method and system for an intelligent dialogue robot applied to the vertical field. The method corrects professional vocabulary in the vertical field with the sentence as the unit: a confusion set is constructed on the basis of professional vocabularies and optimized with accuracy, recall rate and timeliness in mind, and the words of a sentence are then traversed against it to realize error correction, thereby completing the invention.

The invention aims to provide the following technical scheme:

(1) a professional vocabulary error correction method applied to the vertical field comprises the following steps:

step 100), constructing a confusion set, wherein the confusion set comprises professional vocabularies with correct spelling and confusable words corresponding to the professional vocabularies;

step 200), performing word segmentation on the sentence typed by the user;

step 300), loading the confusion set, traversing each word obtained by the word segmentation, identifying any confusable words, replacing them with the correct professional vocabulary, and completing error correction.
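Steps 200) and 300) can be illustrated with a minimal Python sketch. It assumes that word segmentation has already been performed by some external segmenter, and that the confusion set is a mapping from confusable word to correct professional vocabulary. The function name `correct_tokens` and the sample entries are hypothetical illustrations, not part of the disclosure.

```python
from typing import Dict, List


def correct_tokens(tokens: List[str], confusion_set: Dict[str, str]) -> List[str]:
    """Replace every token that is a key of the confusion set by its value."""
    return [confusion_set.get(token, token) for token in tokens]


# Hypothetical confusion set: key = confusable word, value = professional vocabulary.
confusion_set = {"丰甜": "丰田"}
# Tokens as they might come out of a word segmenter (step 200).
tokens = ["丰甜", "多少", "钱"]
corrected_sentence = "".join(correct_tokens(tokens, confusion_set))
print(corrected_sentence)
```

Because each token needs only one dictionary lookup, the cost of step 300) grows linearly with the number of tokens in the sentence, which is in line with the real-time goal stated in the text.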

(2) A system for implementing the method of (1) above, the system comprising:

a confusion set construction module, used for constructing a confusion set, wherein the confusion set comprises correctly spelled professional vocabularies and the confusable words corresponding to them;

a word segmentation module, used for performing word segmentation on the sentence typed by the user;

an error correction module, used for loading the confusion set, traversing each word obtained by the word segmentation, identifying any confusable words, replacing them with the correct professional vocabulary, and completing error correction.

According to the professional vocabulary error correction method and system applied to the intelligent dialogue robot in the vertical field, the invention has the following beneficial effects:

In the invention, a confusion set is constructed and optimized. After the sentence typed by the user is segmented, the optimized confusion set is loaded, each word obtained by segmentation is checked against the confusion set, and misspelled words are replaced to obtain the corrected sentence. The method corrects professional vocabulary in the vertical field with the sentence as the unit, constructs the confusion set on the basis of professional vocabularies, and optimizes it with accuracy, recall rate and timeliness in mind. The scheme achieves an accuracy above 98 percent and a recall rate above 80 percent, corrects errors in real time with very little time cost, and effectively improves both the customer service robot's recognition of user intent and the results of single-round and multi-round conversations.

Drawings

Fig. 1 is a flowchart illustrating a professional vocabulary error correction method applied to the vertical domain according to a preferred embodiment of the present invention.

Detailed Description

The invention is illustrated in the following detailed description by means of the figures and examples. The features and advantages of the present invention will become more apparent from the description.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

As shown in fig. 1, the present invention aims to provide a professional vocabulary error correction method for an intelligent dialogue robot applied in the vertical domain; specifically, the method comprises the following steps:

step 100), constructing a confusion set, wherein the confusion set comprises professional vocabularies with correct spelling and easily confused words corresponding to the professional vocabularies;

According to the professional vocabulary error correction method of the intelligent dialogue robot applied to the vertical field, the vertical field refers to a special, fine and deep operation range subdivision industry, is concentrated in a certain industry, such as chemical industry > petrochemical industry > liquefied gas chemical industry, and the liquefied gas chemical industry is a subdivided vertical field.

Professional vocabulary refers to a uniform industry name of specific things in the vertical field, for example, in the automobile field, the professional vocabulary comprises a license plate, a vehicle system and the like; in the field of games, the professional vocabulary comprises game names, game types and the like.

Step 100), a confusion set is constructed, wherein the confusion set comprises professional vocabularies with correct spelling and confusable words corresponding to the professional vocabularies.

In the present invention, step 100) comprises the following substeps:

substep 110) constructing a professional vocabulary dictionary according to professional vocabularies in the vertical field;

substep 120) constructing a confusable dictionary, wherein the confusable dictionary comprises reference Chinese characters and, for each reference Chinese character, a plurality of corresponding confusable characters;

substep 130) performing single-character replacement and double-character replacement on the professional vocabularies in the professional vocabulary dictionary by using the confusable characters in the confusable dictionary to form a preliminary confusion set; the preliminary confusion set comprises correctly spelled professional vocabularies and the confusable words formed by replacing reference Chinese characters in those vocabularies with confusable characters.

In substep 110) of the present invention, the professional vocabularies in a vertical field can generally be subdivided into categories. For example, in the automotive field they can be classified into "performance" vocabularies (such as automatic transmission, fuel consumption, etc.) and "brand and car-series" vocabularies (such as popular maiden, popular treasures, etc.). The professional vocabulary dictionary can be constructed by collecting and organizing all professional vocabularies in the vertical field; preferably, it is constructed by collecting and organizing the professional vocabularies of a set category, namely a category in which the rate of human spelling errors is high.

Taking an intelligent customer service robot in the automobile field as an example, more than 80% of users' spelling errors in this scenario are errors in brand and car-series names, so the professional vocabulary dictionary is constructed for brand and car-series vocabularies. Building the dictionary for a set category reflects the actual distribution of spelling errors in the vertical field and makes error correction more targeted without affecting semantic recognition. Reducing the number of entries in the professional vocabulary dictionary correspondingly reduces the size of the confusion set, which further improves the real-time performance of subsequent error correction.

In the prior art, complete professional vocabularies are not generally collected aiming at a specific vertical field, and a professional vocabulary dictionary needs to be manually arranged. Taking the automotive field as an example, the professional vocabulary dictionary is shown in table 1 below.

TABLE 1 professional vocabulary dictionary example

In substep 120) of the present invention, a confusable dictionary is constructed, which comprises reference Chinese characters and a plurality of confusable characters corresponding to each reference Chinese character.

In the invention, the data set format of the confusable dictionary is a key-value format, the key is a reference Chinese character, and the value is a possible misspelling form of the reference Chinese character, namely a plurality of confusable characters.

In a preferred embodiment, the misspelling forms of a reference Chinese character include four types: same-sound same-tone, same-sound different-tone, near-sound same-tone, and near-sound different-tone. A same-sound same-tone form has the same syllable and the same tone as the reference Chinese character; a same-sound different-tone form has the same syllable but a different tone; a near-sound same-tone form has a similar syllable and the same tone; and a near-sound different-tone form has a similar syllable and a different tone. The confusable dictionary is shown in table 2 below.

TABLE 2 confusing dictionary example

In a preferred embodiment, a Taiwan confusable dictionary (obtained from the wrongly written character data of the Taiwan SIGHAN Bake-off 2013) is used, and a confusable dictionary meeting the requirements is obtained by converting its traditional characters into simplified characters. The reference Chinese characters in the Taiwan confusable dictionary include the commonly used Chinese characters, and their misspelling forms include the four forms of same-sound same-tone, same-sound different-tone, near-sound same-tone and near-sound different-tone, so the coverage is wide and the requirements for constructing the confusable dictionary and the subsequent confusable words are met.

In substep 130) of the present invention, single-character replacement and double-character replacement are performed on the professional vocabularies in the professional vocabulary dictionary by using the confusable characters in the confusable dictionary to form confusable words; each professional vocabulary and a corresponding confusable word form a confusable word pair, and these pairs make up the preliminary confusion set. Preferably, the preliminary confusion set comprises a single-character preliminary confusion set and a double-character preliminary confusion set: the pairs formed by a professional vocabulary and a confusable word obtained through single-character replacement belong to the former, and the pairs formed by a professional vocabulary and a confusable word obtained through double-character replacement belong to the latter.

Single-character replacement means that each Chinese character in a professional vocabulary is replaced in turn by each of its confusable characters to form the corresponding confusable words; the number of confusable words is the sum of the numbers of confusable characters of the Chinese characters in the professional vocabulary. For example, the confusable words of the professional vocabulary "Toyota" (fengtian) include "fengtian", "feng sweet", "feng yu", "feng sui bao", "feng lie", and "feng be full of"; if "feng" has 40 confusable characters in the confusable dictionary and "tian" has 30, single-character replacement yields 70 confusable words.

Double-character replacement means that every pair of Chinese characters in a professional vocabulary is replaced in turn by the confusable characters of those two characters to form the corresponding confusable words. For each two-character combination, the number of confusable words is the product of the numbers of confusable characters of the two characters, and the total is the sum of these products over all two-character combinations in the professional vocabulary. For example, the professional vocabulary "Toyota" contains 1 two-character combination, whose product is 40 × 30 = 1200, so double-character replacement yields 1200 confusable words. As another example, the four-character professional vocabulary "guang yu feng tian" contains 6 two-character combinations, whose products are 760, 600, 570, 840, 880 and 1200, so double-character replacement yields 760 + 600 + 570 + 840 + 880 + 1200 = 4850 confusable words.
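The two replacement schemes and the counts quoted for "Toyota" can be checked with a short Python sketch. The function names are hypothetical, and the 40 and 30 confusable characters are arbitrary placeholder characters generated from code points purely to reproduce the counts; they are not real confusable characters from any dictionary.

```python
from itertools import combinations, product
from typing import Dict, List, Set


def single_replacements(word: str, confusable: Dict[str, List[str]]) -> Set[str]:
    """Replace one character at a time by each of its confusable characters."""
    variants = set()
    for i, ch in enumerate(word):
        for alt in confusable.get(ch, []):
            variants.add(word[:i] + alt + word[i + 1:])
    return variants


def double_replacements(word: str, confusable: Dict[str, List[str]]) -> Set[str]:
    """Replace every pair of positions by every pair of confusable characters."""
    variants = set()
    for i, j in combinations(range(len(word)), 2):
        alts_i = confusable.get(word[i], [])
        alts_j = confusable.get(word[j], [])
        for a, b in product(alts_i, alts_j):
            chars = list(word)
            chars[i], chars[j] = a, b
            variants.add("".join(chars))
    return variants


# Placeholder data: 40 and 30 arbitrary characters stand in for real confusable characters.
word = "丰田"
confusable = {
    "丰": [chr(0x4E00 + k) for k in range(40)],
    "田": [chr(0x4E64 + k) for k in range(30)],
}
single = single_replacements(word, confusable)
double = double_replacements(word, confusable)
print(len(single), len(double))  # 70 1200
```

Pairing each variant with the original word (variant as key, professional vocabulary as value) would then give the single-character and double-character parts of a preliminary confusion set.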

A professional vocabulary usually contains at least 2 characters. Only single-character replacement and double-character replacement are used, based on the research finding that professional vocabularies are more distinctive than general vocabulary and that a misspelled professional vocabulary rarely contains more than two wrong characters.

A professional vocabulary may contain digits or letters, such as "Dongfeng Biaozhi 3008" or "Dongfeng mx 6". When the preliminary confusion set is constructed, digits and letters are left unchanged, and only Chinese characters are replaced to form confusable words.
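One way to leave digits and letters untouched is to compute, for each professional vocabulary, the positions that hold Chinese characters and to apply replacement only at those positions. The sketch below is an assumption about how this could be done: it treats a character as Chinese when it lies in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the helper names and sample strings are hypothetical.

```python
from typing import List


def is_chinese(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block (assumed definition)."""
    return "\u4e00" <= ch <= "\u9fff"


def replaceable_positions(word: str) -> List[int]:
    """Indices of the characters that may be replaced; digits and letters are skipped."""
    return [i for i, ch in enumerate(word) if is_chinese(ch)]


print(replaceable_positions("东风标致3008"))  # [0, 1, 2, 3]
```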

The data set format of the preliminary confusion set is a key-value format: each key is a confusable word, and the keys are collectively labeled the "confusable phrase"; each value is the professional vocabulary corresponding to that confusable word, and the values are collectively labeled the "professional vocabulary group". An example of a preliminary confusion set is shown in table 3 below.

TABLE 3 preliminary confusion set example

Key (confusable word)	Value (professional vocabulary)
Dongfeng mark 3008	Dongfeng Biaozhi 3008
Fairy tale of Toyota	Toyota leiling
Toyota direct dressing	Dazzling of Toyota
Buick Jiang	Buick Ying Lang
Long assault	Changan European style
Honda siborui	Honda Siborui
Tyrant 560	Baojun 560
From great middle school	Popular treasure
Common Chinese magnoliavine fruit	Popular bird's nest

Substep 130) shows that the preliminary confusion set covers almost every misspelling form a professional vocabulary may have, so misspellings in sentences can be identified effectively and the recall rate is high; however, the very large size of the preliminary confusion set leads to poor timeliness. Two methods can be used to optimize the preliminary confusion set: one is to screen the professional vocabularies or confusable words directly in the preliminary confusion set that has been formed; the other is to optimize the professional vocabulary dictionary and the confusable dictionary and then rebuild an optimized confusion set from them. The former requires a huge amount of manual work and is difficult to carry out, so the optimized confusion set is preferably obtained by the latter method.

In a preferred embodiment of the present invention, the method further comprises a substep 140): optimizing the professional vocabulary dictionary and the confusable dictionary, and performing single-character replacement and double-character replacement on the professional vocabularies in the optimized professional vocabulary dictionary by using the optimized confusable dictionary to generate confusable words, wherein each professional vocabulary and a corresponding confusable word form a confusable word pair, and these pairs make up the optimized confusion set.

Preferably, similar to the preliminary confusion set, the optimized confusion set includes a single-word replacement confusion set and a double-word replacement confusion set, i.e., the professional vocabulary and the confusable words formed after single-word replacement constitute confusable word pairs contained in the single-word replacement confusion set, and the professional vocabulary and the confusable words formed after double-word replacement constitute confusable word pairs contained in the double-word replacement confusion set. The confusion set comprises a single-word replacement confusion set and a double-word replacement confusion set, so that the single-word replacement confusion set and the double-word replacement confusion set can be used for error correction in sequence in the subsequent professional vocabulary error correction process, and the error correction efficiency and accuracy can be further improved.
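Using the two confusion sets in sequence might look like the following sketch, which consults the single-replacement set first and falls back to the double-replacement set only when no match is found. The lookup order, the function name and the sample entries are assumptions made for illustration; the text states only that the two sets can be used in sequence.

```python
from typing import Dict


def correct_word(word: str, single_set: Dict[str, str], double_set: Dict[str, str]) -> str:
    """Look the word up in the single-replacement set first, then in the double-replacement set."""
    if word in single_set:
        return single_set[word]
    return double_set.get(word, word)


# Hypothetical entries: one wrong character versus two wrong characters.
single_set = {"丰甜": "丰田"}
double_set = {"风甜": "丰田"}
print(correct_word("风甜", single_set, double_set))
```

Checking the smaller single-replacement set first is one plausible reading of the efficiency claim, since most words are resolved without touching the much larger double-replacement set.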

In the invention, the optimization of the professional vocabulary dictionary comprises: filtering the professional vocabulary dictionary with the preliminary confusion set to find confusable word pairs that exist within the professional vocabulary dictionary itself, and correcting the misspelled professional vocabulary in each such pair.

It is known that in the prior art, there is generally no complete collection of professional vocabularies for a specific vertical domain, and the professional vocabulary dictionary is obtained by manual arrangement, which may result in spelling errors in the professional vocabulary dictionary. In this way, a misspelling may also exist in the "professional vocabulary group" of the preliminary confusion set established based on the professional vocabulary dictionary, and instead, a professional vocabulary with a correct spelling may exist in the "confusable phrase" corresponding to the "professional vocabulary group".

The specific steps for optimizing the professional vocabulary dictionary are as follows. Each professional vocabulary in the professional vocabulary dictionary is taken in turn as the reference (digits and English letters are ignored), and the other professional vocabularies in the dictionary are checked against the "confusable phrase" and/or the "professional vocabulary group" of the preliminary confusion set. If none of the other professional vocabularies falls into the "confusable phrase" and/or "professional vocabulary group" of the preliminary confusion set, the reference is a correctly spelled professional vocabulary. If one or more of the other professional vocabularies do fall into it (these are defined as screened words), the reference and the screened words are marked and checked, and the misspelled entries are modified or deleted, yielding the optimized professional vocabulary dictionary.

A specific example: as shown in table 1, the professional vocabulary dictionary includes the professional vocabularies "Harvard 1" and "Haval 6". Taking "Haval" as the reference and traversing the other professional vocabularies with the "professional vocabulary group" of the preliminary confusion set, "Harvard" is found to fall into the "professional vocabulary group"; that is, "Haval" and "Harvard" are both treated as correctly spelled professional vocabularies and coexist in the professional vocabulary dictionary. "Harvard 1" and "Haval 6" are marked, checking shows that "Harvard" is misspelled, and "Harvard 1" is deleted or modified to "Haval 1"; or

taking "Haval" as the reference and traversing the other professional vocabularies with the "confusable phrase" of the preliminary confusion set: because the final characters of "Haval" and "Harvard" are very similar near-homophonous confusable characters, "Harvard" is found to fall into the "confusable phrase" of the preliminary confusion set. "Harvard 1" and "Haval 6" are marked, checking shows that "Harvard" is misspelled, and "Harvard 1" is deleted or modified to "Haval 1"; or

taking "Haval" as the reference and traversing the other professional vocabularies with the whole preliminary confusion set: "Harvard" is found to fall into the preliminary confusion set. "Harvard 1" and "Haval 6" are marked, checking shows that "Harvard" is misspelled, and "Harvard 1" is deleted or modified to "Haval 1".

For the professional vocabulary dictionary, the primary confusion set is utilized to traverse and match the professional vocabulary, easily confused word pairs are obtained, whether spelling errors exist or not is manually checked, and the accuracy of the professional dictionary is effectively ensured.
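The self-check of the professional vocabulary dictionary can be sketched as follows. This is a simplified reading of the procedure, not the disclosed implementation: digits and letters are stripped, and two dictionary entries are flagged for manual review when the preliminary confusion set lists one as a confusable word of the other. The function names and the sample strings (a pair of near-homophonous brand spellings) are illustrative assumptions.

```python
import re
from typing import Dict, Iterable, Set, Tuple


def chinese_part(word: str) -> str:
    """Drop digits, letters and anything else outside the basic CJK block."""
    return re.sub(r"[^\u4e00-\u9fff]", "", word)


def find_suspect_pairs(dictionary: Iterable[str],
                       preliminary: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Return pairs of dictionary entries that the preliminary confusion set
    links as (confusable word, professional vocabulary); these need manual review."""
    stems = {chinese_part(word) for word in dictionary}
    suspects = set()
    for stem in stems:
        target = preliminary.get(stem)
        if target is not None and target != stem and target in stems:
            suspects.add(tuple(sorted((stem, target))))
    return suspects


# Hypothetical data: two spellings of one brand coexist in the dictionary.
dictionary = ["哈佛1", "哈弗6", "丰田"]
preliminary = {"哈佛": "哈弗", "丰甜": "丰田"}
print(find_suspect_pairs(dictionary, preliminary))
```

The function only flags candidates; deciding which spelling of a flagged pair is wrong remains the manual checking step described in the text.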

In the present invention, optimizing the professional vocabulary dictionary further comprises: screening the professional vocabulary dictionary (before or after calibration) to remove single-character words and ultra-long words (words of five or more characters).

Single-character words need to be removed because correcting a single character easily causes wrong replacements, that is, a character that was originally correct is replaced by an incorrect one. Ultra-long words need to be removed for two reasons. First, sentences must be segmented before error correction, and ultra-long words are generally split by segmentation and therefore cannot be detected. Second, single-character and double-character replacement on such words generates a very large number of confusable words through permutation and combination; keeping them in the confusion set greatly reduces error correction efficiency and, because there are so many of them, greatly increases the likelihood of false detection.

Taking brand and car-series error correction as an example, "Tang", "Song" and "Yuan" are BYD car-series names that are single-character words and need to be removed; "Mazdaon Czochralski" is an ultra-long word and needs to be removed.

The screening of the single words and the ultra-long words further reduces the capacity of the professional vocabulary dictionary on the premise of ensuring the recall rate, thereby being beneficial to reducing the capacity of the confusion set and improving the error correction efficiency.
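The length screening can be sketched in a few lines of Python. The sketch assumes that only Chinese characters count toward the length, so that digits and letters in a name do not make it "ultra-long"; that counting rule, the function names and the sample entries are assumptions rather than statements from the disclosure.

```python
from typing import Iterable, List


def chinese_length(word: str) -> int:
    """Number of characters in the basic CJK Unified Ideographs block."""
    return sum(1 for ch in word if "\u4e00" <= ch <= "\u9fff")


def filter_by_length(words: Iterable[str], min_chars: int = 2, max_chars: int = 4) -> List[str]:
    """Keep words with 2 to 4 Chinese characters; drop single-character and ultra-long words."""
    return [w for w in words if min_chars <= chinese_length(w) <= max_chars]


# Hypothetical entries: a one-character name, two ordinary names, one seven-character name.
words = ["唐", "丰田", "东风标致3008", "马自达昂克赛拉"]
print(filter_by_length(words))  # ['丰田', '东风标致3008']
```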

In the invention, the optimization of the confusable dictionary comprises: determining the number of misspelling-form Chinese characters (value) to be retained according to the character frequency of the reference Chinese character (key); the higher the character frequency, the fewer misspelling-form characters are retained. This is because the more commonly a Chinese character is used, the lower the probability that it is misspelled; for example, the probability of misspelling high-frequency Chinese characters such as "one" and "being" is extremely low.

In one implementation, the Sogou internet word frequency table is obtained, each word is split into individual characters, each character is assigned the frequency of the word it came from, and the frequencies of identical characters are summed to obtain the character frequency statistics. The Sogou internet word frequency statistics (http://www.sogou.com/labs/resource/w.php) are free data published by Sogou and are word frequency statistics computed over internet data.
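Deriving character frequencies from a word frequency table can be sketched as below. The input is assumed to be a plain mapping from word to frequency (the actual file format of the published table is not described here), a character occurring twice in one word is counted twice, and the function name and sample numbers are hypothetical.

```python
from collections import Counter
from typing import Dict


def char_frequencies(word_freq: Dict[str, int]) -> Counter:
    """Split each word into characters and add the word's frequency to each character."""
    counts: Counter = Counter()
    for word, freq in word_freq.items():
        for ch in word:
            counts[ch] += freq
    return counts


# Hypothetical word frequencies.
word_freq = {"丰田": 10, "田地": 5}
freq = char_frequencies(word_freq)
print(freq["田"], freq["丰"], freq["地"])  # 15 10 5
```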

In one implementation, character frequencies are divided into levels, and the number of misspelling-form Chinese characters to be retained is determined separately for the reference characters of each level. For example, character frequency is divided into three levels: above two hundred million is high frequency, between two million and two hundred million is medium frequency, and below two million is low frequency. High-frequency reference Chinese characters retain 5 misspelling-form Chinese characters, medium-frequency reference Chinese characters retain 10, and low-frequency reference Chinese characters retain 20.

Preferably, the misspelling-form characters in the confusable dictionary are sorted by character frequency, and those with higher character frequency are retained preferentially. For example, the reference Chinese character "ice" is a medium-frequency reference Chinese character, so 10 misspelling-form Chinese characters can be retained; its misspelling-form characters, sorted by character frequency, are force > C > disease > and > cake > inherit > underlying > handle > screen > formerly > char > colorful > Bin > edge > Bing > funeral > Bin > temple > temporal, and the retained misspelling-form Chinese characters are 'force, disease, parallel, cake, inherit, original, handle, screen and abandon'.
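The tiered pruning of the confusable dictionary can be sketched as follows, using the example thresholds (two hundred million and two million) and retention counts (5, 10, 20) from the text. How values exactly equal to a threshold are classified is not stated, so the boundary handling here is an assumption, as are the function names and the sample data.

```python
from typing import Dict, List

HIGH_THRESHOLD = 200_000_000  # above this: high frequency
LOW_THRESHOLD = 2_000_000     # below this: low frequency


def keep_count(freq: int) -> int:
    """Number of misspelling-form characters to retain for a reference character."""
    if freq > HIGH_THRESHOLD:
        return 5
    if freq >= LOW_THRESHOLD:
        return 10
    return 20


def prune_confusable(confusable: Dict[str, List[str]],
                     char_freq: Dict[str, int]) -> Dict[str, List[str]]:
    """Keep, for each reference character, its most frequent misspelling-form characters."""
    pruned = {}
    for key, alternatives in confusable.items():
        limit = keep_count(char_freq.get(key, 0))
        ranked = sorted(alternatives, key=lambda ch: char_freq.get(ch, 0), reverse=True)
        pruned[key] = ranked[:limit]
    return pruned


# Hypothetical data: a high-frequency key with seven candidates keeps only the top five.
confusable = {"一": list("abcdefg")}
char_freq = {"一": 300_000_000, "a": 1, "b": 7, "c": 3, "d": 6, "e": 2, "f": 5, "g": 4}
print(prune_confusable(confusable, char_freq))  # {'一': ['b', 'd', 'f', 'g', 'c']}
```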

In the invention, the misspelling forms in the confusable dictionary are screened in order to optimize the confusion set, reducing its size while covering as many errors as possible.

In the invention, after the professional vocabulary dictionary and the confusable dictionary are optimized, the optimized confusable dictionary is utilized to perform single character replacement and double character replacement on the professional vocabulary in the optimized professional vocabulary dictionary to generate confusable words, and the professional vocabulary and the corresponding confusable words form confusable word pairs, thus forming an optimized confusion set.

Preferably, the optimized confusion set is further optimized, including: removing confusable words whose word frequency is higher than a set threshold, according to the word frequency of the confusable words in the confusion set. The reason for this optimization is that, apart from the professional vocabulary of the vertical field, the sentences entered by customers consist almost entirely of common words (high-frequency words), and the "confusable phrase" contains many common words, such as "blue" corresponding to the car series "sky"; as a result, many common words entered by users would be replaced with professional vocabulary, lowering timeliness and accuracy. Removing the confusable words in the "confusable phrase" whose frequency is above the set threshold effectively reduces the false detection rate and improves detection accuracy. The set threshold differs between vertical fields; for example, in the automobile field, the set threshold is 5 million.

More preferably, the removed confusable words are screened: confusable words that commonly appear in customers' input sentences and are used there to express another valid meaning are identified and added back into the optimized confusion set. For example, "mark", the confusable word of the homophonous professional vocabulary, is a high-frequency word and in theory should be removed, but users misuse "mark" extremely often, as in the query sentence "how to mark the vehicle"; it is therefore not appropriate to remove the confusable word "mark". Which confusable words with a word frequency above the set threshold are retained is determined for each vertical field according to actual conditions. For example, in brand and car-series error correction, the confusable words whose word frequency is above the set threshold and that are retained after screening are "legend", "mark", "hard", and "harvard".
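The frequency-threshold filter together with the re-added exceptions can be sketched as a single pass over the confusion set. The default threshold of 5 million is the automobile-field example from the text; the function name, the `keep` parameter and the sample entries and frequencies are hypothetical.

```python
from typing import Dict, FrozenSet


def filter_common_words(confusion_set: Dict[str, str],
                        word_freq: Dict[str, int],
                        threshold: int = 5_000_000,
                        keep: FrozenSet[str] = frozenset()) -> Dict[str, str]:
    """Drop confusable words whose frequency exceeds the threshold,
    except those explicitly listed in `keep`."""
    return {
        confusable: professional
        for confusable, professional in confusion_set.items()
        if word_freq.get(confusable, 0) <= threshold or confusable in keep
    }


# Hypothetical entries and frequencies.
confusion_set = {"标志": "标致", "天蓝": "天籁", "丰甜": "丰田"}
word_freq = {"标志": 9_000_000, "天蓝": 8_000_000, "丰甜": 10}
filtered = filter_common_words(confusion_set, word_freq, keep=frozenset({"标志"}))
print(filtered)  # {'标志': '标致', '丰甜': '丰田'}
```

Keeping the exceptions in a separate, manually maintained set means the threshold can be tuned per vertical field without losing the hand-picked high-frequency entries.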

In the invention, further optimizing the optimized confusion set also includes adding word-order errors: the order of the Chinese characters in an optimized professional vocabulary item is shuffled to form confusable word pairs, which are supplemented to the confusion set. Taking error correction of license plate and vehicle series as an example, the characters in the professional vocabulary item "Oncalara" are reordered to form the confusable word "Oncalara", which is then supplemented to the confusion set.

In one embodiment, all professional vocabulary items in the optimized professional vocabulary dictionary are selected; each item is paired with the confusable words formed by every possible arrangement of its characters, and the resulting confusable word pairs are supplemented to the confusion set.

In a preferred embodiment, only professional vocabulary items of 3 to 4 characters in the optimized professional vocabulary dictionary are selected; each item is paired with the confusable words formed by every possible arrangement of its characters, and the resulting confusable word pairs are supplemented to the confusion set.

In a further preferred embodiment, professional vocabulary items of 3 to 4 characters in the optimized professional vocabulary dictionary are selected. Each 3-character item is paired with the confusable words formed by every possible arrangement of its characters, and the resulting confusable word pairs are supplemented to the confusion set. Each 4-character item is paired with the confusable word formed by exchanging its two middle characters, and the resulting confusable word pair is supplemented to the confusion set.
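The further preferred embodiment can be sketched as below. This is a hedged illustration: the function name `word_order_pairs` and the (confusable, professional) tuple layout are assumptions, and Latin letters stand in for Chinese characters.

```python
from itertools import permutations


def word_order_pairs(term):
    """Return (confusable, professional) pairs produced by word-order errors.

    3-character terms: every arrangement of the characters.
    4-character terms: only the two middle characters are exchanged.
    Other lengths:     no pairs are generated.
    """
    if len(term) == 3:
        variants = {"".join(p) for p in permutations(term)}
    elif len(term) == 4:
        variants = {term[0] + term[2] + term[1] + term[3]}
    else:
        return []
    variants.discard(term)  # the correctly ordered term is not a confusable word
    return sorted((variant, term) for variant in variants)


print(word_order_pairs("ABC"))
print(word_order_pairs("ABCD"))
```

A 3-character term yields at most five confusable words, while a 4-character term yields at most one, which keeps the confusion set from growing with all 23 non-identity arrangements of four characters.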

In the invention, further optimizing the optimized confusion set also includes performing professional vocabulary error correction (steps 100-300) on a test corpus and removing from the confusion set the confusable word pairs that are falsely detected because of word segmentation errors. The test corpus is a corpus generated by conversation between a user and a client or between the user and a robot.

The reason for this is that a professional vocabulary item, or a confusable word corresponding to one, does not occur in isolation in a sentence: it is adjacent to the Chinese characters before and after it, and after segmentation it may be split into other words, which may themselves be confusable words of some professional vocabulary item. Alternatively, segmenting common words in the sentence (words that are neither professional vocabulary nor confusable words) may produce other words that happen to be identical to a confusable word. Both cases can cause false detection.

In step 200), word segmentation is performed on the sentence entered by the user via spelling input; the segmentation relies on a word segmentation dictionary.

In the present invention, the word segmentation dictionary is a database of common or fixed terms that serves as the reference for segmentation: by referring to it, an input query sentence is split into independent terms of maximum character length. The words in the word segmentation dictionary are closely tied to the application field and need to be screened for each application field, which reduces the space occupied by the data and speeds up lookup during segmentation.

In the prior art, the word segmentation dictionary is generally stored as a list and ordered by a set rule (such as alphabetical order a-z). This arrangement is simple, and words can be located accurately by following the ordering rule. However, a dictionary usually holds a large amount of data; the list format requires a large amount of storage space, and the target word can only be found after checking many words, which is inefficient.

In the invention, the list-form word segmentation dictionary is converted into a dictionary tree structure that starts at a root node and extends through child nodes. The root node contains no character, and every other node contains exactly one character. Concatenating the characters along the path from the root node to a given node yields the character string corresponding to that node. All children of a given node contain different characters. Here, one letter is one character for English, one Chinese character is one character for Chinese, and one numeral or one punctuation mark also counts as one character.

Using the dictionary tree structure to represent the word segmentation dictionary exploits the common prefixes of character strings to reduce query time and thus improve efficiency; word lookup is fast, and the benefit is particularly evident on large-scale data.
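A dictionary tree with the properties listed above (empty root, one character per node, distinct characters among siblings) might look like the following minimal sketch. The class and method names are illustrative assumptions, and English words stand in for dictionary entries.

```python
class TrieNode:
    """One node of the dictionary tree; the character itself is the key in the parent's dict."""

    def __init__(self):
        self.children = {}    # character -> TrieNode; dict keys keep sibling characters distinct
        self.is_word = False  # True if the path from the root to this node spells a dictionary word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()  # the root node holds no character
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def longest_prefix_len(self, text, start=0):
        """Length of the longest dictionary word beginning at text[start], or 0 if none."""
        node, best = self.root, 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


trie = Trie(["car", "card", "care"])
print(trie.contains("card"), trie.contains("ca"), trie.longest_prefix_len("cards"))
```

Because "car", "card", and "care" share the path c-a-r, a lookup walks each shared character once instead of comparing against every entry in a list.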

In the invention, word segmentation refers to the process of dividing a character string into a sequence of words. The word segmentation method can be the forward maximum matching method, the reverse maximum matching method, a conditional random field model, or a hidden Markov model. The forward maximum matching method offers high segmentation efficiency and linear time complexity, is easy to implement, and does not require the maximum word length to be specified. The reverse maximum matching method has linear time complexity but requires the maximum word length maxLen to be specified. The hidden Markov model recognizes unknown words better than the maximum matching methods, but its overall effect depends on the training corpus. The conditional random field model considers not only word frequency but also context and has better learning capacity, so it performs well on ambiguous words and unknown words. Through extensive experimental verification, the inventors found that two segmentation modes are preferable: the forward maximum matching method and the conditional random field model. Maximum matching segmentation is recommended for commonly used sentences and scenarios that demand high segmentation speed; conditional random field segmentation is recommended for rare corpora or scenarios with many new words.
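As a rough illustration of forward maximum matching, the sketch below scans left to right and always takes the longest dictionary word available at the current position. It is an assumption-laden simplification: the dictionary is a plain Python set rather than a dictionary tree, the maximum word length is derived from the dictionary instead of being supplied by the caller, and Latin letters stand in for Chinese characters.

```python
def forward_max_match(sentence, dictionary):
    """Segment `sentence` greedily, longest dictionary match first, scanning forward."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


dictionary = {"ab", "abc", "cd"}
print(forward_max_match("abcd", dictionary))
```

With this toy dictionary the greedy scan takes "abc" and leaves "d" on its own, even though "ab" + "cd" would use dictionary words only; this is the kind of overlap the greedy strategy cannot resolve by itself.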

Chinese is relatively complex, and sentences can contain intersection-type ambiguity: during segmentation, a given character can form a word either with the preceding character (or characters) or with the following character (or characters). The invention scans the input sentence forward using the forward maximum matching method, so segmentation errors are likely when intersection-type ambiguity is present.

To handle this situation, a backtracking mechanism is added to correct the segmentation result of the forward maximum matching method. Backtracking is a heuristic that uses a fallback strategy during segmentation to correct the current segmentation result. For example, suppose the input query sentence is "dispatch Ying Lang sender to station" and the forward scanning result is "dispatch/Ying Lang/sender/man/go/station". Looking up the segmentation dictionary shows that "man" is not in the dictionary, so backtracking is performed: the last character of "sender" is taken and joined with the following "man" to form a new word, and the dictionary is then checked for both "send" and the new word. If both are present, the segmentation result is adjusted to "dispatch/Ying Lang/sender/man/go/station". Adding the backtracking mechanism improves segmentation accuracy and effectively alleviates the intersection-type ambiguity problem.
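One way to read the backtracking step is sketched below. It is only an interpretation under stated assumptions: the correction is applied as a pass over an already segmented token list (the text describes it as happening during segmentation), the function name `backtrack_fix` is hypothetical, and Latin letters stand in for Chinese characters.

```python
def backtrack_fix(tokens, dictionary):
    """Repair tokens that are not dictionary words by borrowing from the previous token.

    When a token is not in the dictionary, the last character of the previous token
    is moved to the front of it. The change is kept only if both the shortened
    previous token and the newly formed token are dictionary words.
    """
    fixed = []
    for token in tokens:
        if fixed and token not in dictionary and len(fixed[-1]) > 1:
            previous = fixed[-1]
            head, merged = previous[:-1], previous[-1] + token
            if head in dictionary and merged in dictionary:
                fixed[-1] = head
                fixed.append(merged)
                continue
        fixed.append(token)
    return fixed


dictionary = {"ab", "abc", "cd"}
print(backtrack_fix(["abc", "d"], dictionary))
```

Here the leftover "d" is not a dictionary word, so the final "c" of "abc" is moved over, giving "ab" and "cd", both of which are in the dictionary.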

In a preferred embodiment, the optimized professional vocabularies in the confusion set are recorded into a segmentation dictionary to improve the accuracy of segmentation.

In the step 300), the confusion set is loaded, each word after the word segmentation is traversed, existing confusable words are identified and replaced by correct professional words, and error correction is completed.

In a preferred embodiment, when the optimized confusion set comprises a single-character replacement confusion set and a double-character replacement confusion set, each word after segmentation is first looked up in the single-character replacement confusion set; if it is found, it is replaced, and if it is not found, it is then looked up in the double-character replacement confusion set. In practice, the probability of a single-character error in a professional vocabulary item is far higher than that of a double-character error, so traversing the single-character replacement confusion set first and the double-character replacement confusion set second further improves error correction efficiency and accuracy.

After step 300), the error correction result may contain a very small number of false detections: some words are confusable words of a professional vocabulary item but are used correctly in the sentence, not as misspellings. In this case the correctly used word is likely to be corrected to the professional vocabulary item, resulting in a false detection.

Taking license plate and vehicle series as an example, when a user asks "mark how this car", the "mark" here stands for the professional word and should be replaced; when a user asks "the logo of this car is what-like", the "logo" here is correct and should not be replaced. For this case, feedback can be added to ask the user, for example by returning the option "Did you mean: mark how this car is" to the user, to improve discrimination accuracy.

By constructing and optimizing the confusion set, the invention helps an intelligent dialogue robot in a vertical field correct professional vocabulary errors in user sentences. The scheme achieves an accuracy of 98% and a recall rate above 80%, corrects errors in real time with very little time consumption, effectively improves the customer service robot's recognition of user intent, and improves the effect of both single-round and multi-round conversations.
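The two-stage lookup can be sketched as follows. This is a minimal illustration in which each confusion set is assumed to be a dict from confusable word to professional word; the function name `correct_tokens` and the sample entries are placeholders.

```python
def correct_tokens(tokens, single_set, double_set):
    """Replace confusable words, consulting the single-character set before the double one."""
    corrected = []
    for token in tokens:
        if token in single_set:      # the more likely kind of error is checked first
            corrected.append(single_set[token])
        elif token in double_set:    # consulted only when the first lookup misses
            corrected.append(double_set[token])
        else:
            corrected.append(token)  # not a known confusable word: leave unchanged
    return corrected


# Placeholder confusion sets for illustration only.
single_set = {"Hanlangda": "Hanlanda"}
double_set = {"Mikay wheel": "Mikaran"}

print(correct_tokens(["how", "Hanlangda", "Mikay wheel"], single_set, double_set))
```

With dict-based sets each lookup takes constant time on average, so the ordering mainly matters when the sets are traversed or loaded in stages.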

Another object of the present invention is to provide a system for implementing the above method, specifically, the system includes:

In the invention, the confusion set building module comprises the following sub-modules:

the professional vocabulary dictionary constructing submodule is used for constructing a professional vocabulary dictionary according to professional vocabularies in the vertical field;

the confusing dictionary constructing submodule is used for constructing a confusing dictionary, and comprises a reference Chinese character and a plurality of confusing characters corresponding to the reference Chinese character;

the primary confusion set constructing submodule is used for carrying out single-word replacement and double-word replacement on the professional vocabulary in the professional vocabulary dictionary by using the confusable words in the confusable dictionary to form a primary confusion set; the preliminary confusion set comprises professional vocabularies with correct spelling and easy confusion words formed after the easy confusion words replace the reference Chinese characters in the professional vocabularies;

and the confusion set optimization submodule is used for optimizing the professional vocabulary dictionary and the easy confusion dictionary, and performing single-word replacement and double-word replacement on the professional vocabularies in the optimized professional vocabulary dictionary by using the optimized easy confusion dictionary to form an optimized confusion set.

In the invention, the confusion set optimization submodule comprises a professional vocabulary dictionary optimization submodule, an easy confusion dictionary optimization submodule and a confusion set optimization submodule:

the professional vocabulary dictionary optimization submodule is used for filtering the professional vocabulary dictionary with the preliminary confusion set to obtain confusable word pairs existing within the professional vocabulary dictionary, and correcting the misspelled professional vocabulary in those pairs; and for screening the professional vocabulary dictionary (before or after calibration) to remove single-character words and ultra-long words (words of five or more characters);

the confusing dictionary optimization submodule is used for determining the number of misspelled Chinese characters (value) to be retained according to the character frequency of the reference Chinese character (key); the higher the character frequency of the reference Chinese character, the fewer misspelled Chinese characters are retained;

the confusion set optimization submodule is used for removing the confusing words with the word frequency higher than a set threshold value according to the word frequency of the confusing words in the confusion set, screening the removed confusing words, determining the confusing words which commonly appear in the input sentences of the clients and are used for representing other effective meanings, and adding the confusing words into the optimized confusion set again;

preferably, the confusion set optimization submodule is further configured to increase the situation of word order errors, that is, words in the optimized professional vocabulary are disorderly ordered to form confusable word pairs, and the confusable word pairs are supplemented into the confusion set.

More preferably, the confusion set optimization submodule is further configured to perform professional vocabulary error correction (steps 100-300) on a test corpus and remove from the confusion set the confusable word pairs that are falsely detected because of word segmentation errors, where the test corpus is a corpus generated by conversation between a user and a client or between the user and a robot.

Examples

Example 1

The effect of the error correction method is determined using a large test corpus as the statistical sample. During error correction, each word after segmentation is first looked up in the single-word replacement confusion set; if found, it is replaced, and if not, the double-word replacement confusion set is traversed. The test corpus consists of five days of conversation logs between users and the dialogue robot, from 2018.05.05 to 2018.05.09.

The user dialogue logs total 16,996 entries, in which 388 license plate and vehicle series errors occur. These include cases where a user repeats the same input several times in succession, e.g.

User: jumping

Customer service robot: I cannot answer this; I do not know how to answer. Please ask again, thank you.

User: jumping

User: pleases

The customer service robot is a family and pleasure to sell for 7-12 ten thousand, and is a joint-venture car under modern flags. Its main advantage is: the cost performance is high (12% higher than the similar vehicles), the driving is smooth, the power is high (18% higher than the similar vehicles), and the daily use is enough.

User: pleases

User: jumping

After adjusting the data to remove such repeated cases, the total number of errors is 359, an error rate of about 2%. The system detects 286 of these errors, i.e., the recall rate is about 80%. There are 12 false detections, giving an accuracy of about 96%.

Of the false detections, 6 (50%) are extra-long sentences that are advertisements, for example:

the user comes back, we can be seen in the vast sea of people at the sunset area, so that the reason is to make a new media huge head, and the quantity and the quality of vermicelli are critical. The specialty of China (Hangzhou cloud vermicelli bar) adds vermicelli to the public number, and ensures 100% of the vermicelli to be alive. Meanwhile, real vermicelli can be added in a mode of WeChat connection WIFI according to different scenes such as gender, time, city, hotel, airport, station, school, hospital and the like, the retention rate can reach 80%, and if the vermicelli adding demand exists at the place, 1508870XXXX (Hangzhou cloud vermicelli customer manager Huang Chi) WeChat hzy 520 can be directly connected through telephone.

Such cases have no impact on the business; with them removed, the accuracy is 98%.

Of the remaining 6, 3 are Kaixiao -> Kaixian and 3 are marked -> Biaozhi, all of which involve high-frequency words. On the business side, it is only necessary to recognize cases of the "flag 4008" type. Both cases have been addressed, and the overall accuracy approaches 100%.

Recall related cases:

Undetected errors account for 20%; they are summarized as follows:

Hanlangda (Hanlanda), Mikay wheel (Mikaran), Loulan (Loulan), Wuling red light (Wuling macro light), etc., which are basically homophone or near-homophone confusion errors.

High-frequency (non-consecutive) occurrences are: lingk (neckk) 7 times, Sheffilan (Chevran) 2 times, Scodan (Scodan) 3 times, etc.

After targeted adjustment for these error cases, and taking the prevalence of the high-frequency cases into account, the recall rate is 84%.

In summary, the test accuracy is 96% and the recall rate is 80%; after optimization based on the test summary, the accuracy is about 100% and the recall rate is about 84%.
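The reported percentages can be reproduced from the counts given in this example. The sketch below assumes that "accuracy" means correct detections divided by all detections (correct plus false); the text does not state the formula, but this reading matches its figures. Variable names are illustrative.

```python
total_logs = 16996       # dialogue log entries in the test corpus
total_errors = 359       # errors remaining after removing repeated inputs
detected = 286           # errors the system detected
false_detections = 12    # words replaced although they were used correctly
ad_false_detections = 6  # false detections that came from advertisement sentences

error_rate = total_errors / total_logs
recall = detected / total_errors
# Assumed definition: correct detections over all detections.
accuracy = detected / (detected + false_detections)
accuracy_without_ads = detected / (detected + false_detections - ad_false_detections)

print(f"error rate: {error_rate:.1%}")
print(f"recall: {recall:.1%}")
print(f"accuracy: {accuracy:.1%}")
print(f"accuracy without advertisements: {accuracy_without_ads:.1%}")
```

Under this reading, recall is 286/359 and accuracy is 286/298, rising to 286/292 once the six advertisement cases are excluded.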

The present invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and merely illustrative. On the basis of the above, the invention can be subjected to various substitutions and modifications, and the substitutions and the modifications are all within the protection scope of the invention.

Claims

1. A professional vocabulary error correction method applied to the vertical field is characterized by comprising the following steps:

2. Method according to claim 1, characterized in that step 100) comprises the following sub-steps:

3. The method according to claim 2, wherein in the sub-step 110), the specialized vocabulary dictionary is constructed by sorting and summarizing all specialized vocabularies in the vertical domain;

preferably, the professional vocabulary dictionary is constructed by sorting and collecting professional vocabularies of a set category, the set category being one in which the human spelling error rate of professional vocabularies is high.

4. The method as claimed in claim 2, wherein in the sub-step 120), the data format of the confusing dictionary is a key-value format, in which the key is a reference Chinese character and the value is the possible misspelled forms of the reference Chinese character, i.e., a plurality of confusable characters;

the misspelled forms include homophonic, near-homophonic, and similar-shape misspelled forms of the reference Chinese character.

5. The method as claimed in claim 1, further comprising a substep 140) in step 100), optimizing the specialized vocabulary dictionary and the confusable dictionary, and performing single-word replacement and double-word replacement on the specialized vocabularies in the optimized specialized vocabulary dictionary by using the optimized confusable dictionary to form an optimized confusable set;

preferably, the optimized confusion set comprises a single-word replacement confusion set and a double-word replacement confusion set, namely, the professional vocabulary and the confusable words formed after single-word replacement form confusable word pairs which are contained in the single-word replacement confusion set, and the professional vocabulary and the confusable words formed after double-word replacement form confusable word pairs which are contained in the double-word replacement confusion set.

6. The method of claim 5, wherein in sub-step 140), optimizing the specialized vocabulary dictionary comprises: filtering the professional vocabulary dictionary by using the preliminary confusion set to obtain easily confused word pairs existing in the professional vocabulary dictionary, and modifying the professional vocabulary with misspelling in the easily confused word pairs; and/or

Optimizing the confusing dictionary comprises: determining the number of misspelled Chinese characters, i.e., confusable characters, to be retained according to the character frequency of the reference Chinese character; the higher the character frequency, the fewer misspelled Chinese characters are retained;

preferably, the misspelled-form characters in the confusing dictionary are sorted by character frequency, and the misspelled-form characters with high character frequency are preferentially retained.

7. The method of claim 5, wherein in sub-step 140), the optimized confusion set is further optimized, comprising: removing the confusable words with the word frequency higher than a set threshold value according to the word frequency of the confusable words in the confusion set;

further, the removed confusable words are screened, confusable words that commonly appear in user input sentences as misspellings rather than to represent other valid meanings are identified, and these confusable words are added back into the optimized confusion set.

8. The method of claim 5, wherein in sub-step 140), the optimized confusion set is further optimized, further comprising: adding word-order errors, namely shuffling the order of the Chinese characters in an optimized professional vocabulary item to form confusable word pairs, which are supplemented to the confusion set;

preferably, professional vocabulary items of 3 to 4 characters in the optimized professional vocabulary dictionary are selected; each 3-character item is paired with the confusable words formed by every possible arrangement of its characters, and the resulting confusable word pairs are supplemented to the confusion set; each 4-character item is paired with the confusable word formed by exchanging its two middle characters, and the resulting confusable word pair is supplemented to the confusion set.

9. The method according to claim 2, characterized in that in step 200) a segmentation process is performed in conjunction with a segmentation dictionary, wherein,

the optimized professional vocabulary in the confusion set is recorded in the word segmentation dictionary.

10. A system for implementing the method of any one of claims 1 to 9, the system comprising: