CN110399607B

CN110399607B - Pinyin-based dialog system text error correction system and method

Info

Publication number: CN110399607B
Application number: CN201910480818.0A
Authority: CN
Inventors: 杨志明
Original assignee: Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Current assignee: Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Priority date: 2019-06-04
Filing date: 2019-06-04
Publication date: 2023-04-07
Anticipated expiration: 2039-06-04
Also published as: CN110399607A

Abstract

The invention discloses a pinyin-based dialog system text error correction system and method that provide a Pinyin-Based Text Fault-Tolerant Model (PTFM). Given a domain problem set and domain entities, the PTFM provides fault tolerance for abnormal domain entities and abnormal domain words in pinyin-based dialog system text. The embodiments of the invention can therefore correct pinyin-based dialog system text in various domains while reducing the cost of error correction.

Description

Pinyin-based dialog system text error correction system and method

Technical Field

The invention relates to a language processing technology in the field of computers, in particular to a system and a method for correcting a text error of a dialogue system based on pinyin.

Background

Correcting errors in speech recognition results is an important task in speech understanding. Because the accuracy of speech recognition is limited, recognition results are often wrong, which hinders the subsequent speech understanding work and makes it more difficult.

Correcting pinyin-based dialog text is usually divided into two steps: error detection, followed by error correction. Common error detection methods include maximum entropy models and n-gram language models; error correction then uses a confusion set or a language model to selectively replace the detected errors.

With the development of language processing technology, end-to-end deep learning models have been applied to the correction of pinyin-based dialog text. The deep learning model extracts features from the dialog text in place of manual feature extraction, which reduces manual workload, and it has a strong capacity to fit text. A seq2seq model uses one RNN to encode a sentence of the text as a vector and another RNN to decode that vector into the output. To capture the semantic information of the context more fully, bidirectional (forward and backward) encoding and an attention mechanism are added to the seq2seq model. Text is fed into the model and, if it contains erroneous pinyin sentences, the corrected sentences are output directly, which is simple and convenient.

However, correcting pinyin-based dialog system text with a deep learning model requires a large amount of training data, and training is time-consuming and difficult to carry out, which increases the cost of error correction. In addition, a trained deep learning model corrects text only in the domain it was trained on; when the model is migrated to pinyin-based dialog system text in another domain, correction accuracy drops sharply. That is, the deep learning model transfers poorly across domains when correcting pinyin-based dialog system text.

Disclosure of Invention

In view of this, embodiments of the present invention provide a pinyin-based dialog system text error correction system, which can implement error correction of a pinyin-based dialog system text in various fields on the basis of reducing error correction cost.

The embodiment of the invention also provides a pinyin-based dialog system text error correction method, which can realize the error correction of the pinyin-based dialog system text in each field on the basis of reducing the error correction cost.

The embodiment of the invention is realized as follows:

a pinyin-based dialog system text error correction system comprises a parameter extraction (PE) module and a text fault-tolerance (TF) module, wherein the PE module comprises an entity parameter extraction (EPE) sub-module and a word parameter extraction (WPE) sub-module, and the TF module comprises an entity text fault-tolerance (ETF) sub-module and a word text fault-tolerance (WTF) sub-module, wherein,

the EPE sub-module is used for extracting an entity list from the domain entity data, and for extracting entity context parameters based on the entity list and the domain problem data;

the WPE sub-module is used for extracting a word list and word frequency parameters from the domain problem data, and for extracting word context parameters based on the word list and the domain problem data;

the ETF sub-module is used for receiving the pinyin-based dialog system text, performing abnormal word extraction (AWE) using the word context parameters and the word frequency parameters, then performing abnormal entity error correction (AEC) using the entity list and the entity context parameters, and outputting the entity fault-tolerant pinyin-based dialog system text;

and the WTF sub-module is used for performing AWE on the entity fault-tolerant pinyin-based dialog system text using the word context parameters and the word frequency parameters, performing abnormal word correction (AWC) on the abnormal words using the word list and the word context parameters, and outputting the word fault-tolerant pinyin-based dialog system text.

In the EPE sub-module, the entity list extraction is a deduplication process, and the extracted entity context parameters are the mapping from each entity to its corresponding left character list and the mapping from each entity to its corresponding right character list.

The WPE sub-module is further configured to obtain a word list including: traversing the word segmentation result of the field problem data, and then performing duplicate removal processing;

the obtaining of the word frequency parameters comprises: traversing the word segmentation result of the field problem data, and mapping the obtained words to the frequency;

obtaining the word context parameters includes: words are mapped to the corresponding left character list and words are mapped to the corresponding right character list.

The AWE comprises:

performing word segmentation processing on a dialogue system text based on pinyin; carrying out abnormity judgment on each word; adding the words determined to be abnormal into an abnormal word list; a list of anomalous words of the pinyin-based dialog system text is returned.

The performing an abnormality determination includes:

setting a threshold value T1 and a threshold value T2, wherein 1 is smaller than the threshold value T1, and the threshold value T1 is smaller than the threshold value T2;

if the frequency of the word in the word-frequency mapping table is less than T1 but greater than 0, determining the word as a candidate abnormal word; if the frequency of the word in the word-frequency mapping table is not less than T1 but less than T2, further judging whether the word conflicts with the first set context in the pinyin-based dialog system text, and if so, determining the word as a candidate abnormal word;

if the frequency number of the word in the mapping table from the word to the frequency number is equal to 0, determining the word as an abnormal word;

and if the word is a candidate abnormal word, performing second set context conflict judgment on the word to determine whether the word is an abnormal word.

The AEC comprises: fuzzy matching the dialog system text containing the abnormal words against the entities in the entity list, and outputting the matching entity and the corresponding abnormal entity according to the similarity calculated by a set similarity algorithm; and replacing the abnormal entity in the pinyin-based dialog system text with the obtained matching entity;

the AWC comprises: fuzzy matching the dialog system text containing the abnormal words against the words in the word list, and outputting the matching word and the corresponding abnormal word according to the similarity calculated by the set similarity algorithm; and replacing the abnormal word with the obtained matching word.

A pinyin-based dialog system text error correction method comprises the following steps:

extracting an entity list from the domain entity data to obtain an entity list, and extracting entity context parameters based on the entity list and the domain problem data to obtain entity context parameters;

extracting word lists and word frequency parameters from the domain problem data respectively to obtain word lists and word frequency parameters respectively, and extracting word context parameters based on the word lists and the domain problem data to obtain word context parameters;

after abnormal words are extracted from the dialogue system text based on the pinyin by adopting word context parameters and word frequency parameters, abnormal entity error correction is carried out by adopting an entity list and the entity context parameters, and the dialogue system text based on the pinyin after the entity error correction is output;

and after abnormal words are extracted from the entity fault-tolerant pinyin-based dialog system text by adopting the word context parameters and the word frequency parameters, abnormal word correction is carried out by adopting the word list and the word context parameters, and the word fault-tolerant pinyin-based dialog system text is output.

The entity list extraction is a deduplication process;

extracting entity context parameters into mapping from an entity to a corresponding left character list and mapping from the entity to a corresponding right character list;

the resulting word list includes: traversing the word segmentation result of the field problem data, and then performing duplicate removal processing;

the obtaining of the word frequency parameters comprises: traversing the word segmentation result of the field problem data, and mapping the obtained words to frequency;

the obtaining of the word context parameter comprises: words are mapped to the corresponding left character list and words are mapped to the corresponding right character list.

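The AWE comprises:

performing word segmentation processing on the pinyin-based dialog system text; carrying out abnormality judgment on each word; adding the words determined to be abnormal into an abnormal word list; and returning the abnormal word list of the pinyin-based dialog system text.

Carrying out the abnormality judgment includes:

setting a threshold value T1 and a threshold value T2, wherein 1 is smaller than T1, and T1 is smaller than T2;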

if the frequency of the word in the word-frequency mapping table is less than T1 but greater than 0, determining the word as a candidate abnormal word; if the frequency of the word in the word-frequency mapping table is not less than T1 but less than T2, further judging whether the word conflicts with a first context set in the pinyin-based dialog system text, and if so, determining the word as a candidate abnormal word;

if the frequency of the word in the mapping table from the word to the frequency is equal to 0, determining the word as an abnormal word;

and if the word is a candidate abnormal word, performing set second context conflict judgment on the word to determine whether the word is an abnormal word.

As can be seen from the above, the embodiment of the present invention provides a Pinyin-Based Text Fault-Tolerant Model (PTFM) which, given a domain problem set and domain entities, provides fault tolerance for abnormal domain entities and abnormal domain words in pinyin-based dialog system text. The embodiment of the invention can therefore correct pinyin-based dialog system text in various domains while reducing the cost of error correction.

Drawings

FIG. 1 is a schematic structural diagram of a text error correction system of a dialogue system based on Pinyin according to an embodiment of the present invention;

FIG. 2 is a flowchart of a text error correction method for a dialogue system based on pinyin according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.

In order to correct pinyin-based dialog system text in each domain while reducing the cost of error correction, the embodiment of the invention provides a Pinyin-Based Text Fault-Tolerant Model (PTFM). Given a domain problem set and domain entities, the PTFM provides fault tolerance for abnormal domain entities and abnormal domain words in pinyin-based dialog system text.

FIG. 1 is a schematic structural diagram of a pinyin-based dialog system text error correction system according to an embodiment of the present invention. The system includes a Parameter Extraction (PE) module and a Text Fault-Tolerance (TF) module. The PE module includes an Entity Parameter Extraction (EPE) sub-module and a Word Parameter Extraction (WPE) sub-module, and the TF module includes an Entity Text Fault-Tolerance (ETF) sub-module and a Word Text Fault-Tolerance (WTF) sub-module, wherein,

the EPE sub-module is used for extracting an entity list from the domain entity data, and for extracting entity context parameters based on the entity list and the domain problem data;

the WPE sub-module is used for extracting a word list and word frequency parameters from the domain problem data, and for extracting word context parameters based on the word list and the domain problem data;

the ETF sub-module is used for receiving the pinyin-based dialog system text, performing Abnormal Word Extraction (AWE) using the word context parameters and the word frequency parameters, then performing Abnormal Entity error Correction (AEC) using the entity list and the entity context parameters, and outputting the entity fault-tolerant pinyin-based dialog system text;

and the WTF sub-module is used for performing AWE on the entity fault-tolerant pinyin-based dialog system text using the word context parameters and the word frequency parameters, performing Abnormal Word Correction (AWC) using the word list and the word context parameters, and outputting the word fault-tolerant pinyin-based dialog system text.

In the system, AWE is reused by both the ETF sub-module and the WTF sub-module. AEC and AWC share the same algorithm logic and differ only in what they match against: AEC matches against the entity list and the entity context parameters, while AWC matches against the word list and the word context parameters.

In this system, the function of the PE module is to extract the model parameters used by the PTFM from the external domain problem data and domain entity data. Because the parameters come entirely from this data, the PTFM provided by the embodiment of the present invention can be migrated to other domains without any modification.

The EPE sub-module in the PE module completes two processes: entity list extraction and entity context parameter extraction. Extracting the Entity List (EL) is relatively simple: the external domain entity data only needs to be deduplicated. Entity context parameter extraction is based on the EL and the external domain problem data, and produces two items: the mapping from each entity to its corresponding left character list (E2LCL, Entity To Left Char List) and the mapping from each entity to its corresponding right character list (E2RCL, Entity To Right Char List). The extraction traverses the external domain problem data and, for each problem (denoted q), judges whether q contains an entity e from the EL; if so, it extracts the left character (lc) and the right character (rc) of e in q and adds lc and rc to E2LCL and E2RCL, respectively.

Here, the extraction of lc and rc needs to pay attention to the following three points:

q may contain several identical or different entities e, so lc and rc need to be extracted for every occurrence of e in q;

if q begins with e, lc is set to a fixed character that does not otherwise occur frequently; this embodiment uses '^';

if q ends with e, rc is set to a fixed, infrequently occurring character (distinct from '^'); this embodiment uses '$'.
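The extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_context`, the use of sets in place of "character lists", and the sample entities and question strings are all hypothetical and do not come from the patent.

```python
from collections import defaultdict

START, END = "^", "$"  # boundary markers for "q begins with e" / "q ends with e"


def extract_context(items, questions):
    """Map each item to its set of left characters and its set of right characters."""
    left, right = defaultdict(set), defaultdict(set)
    for q in questions:
        for e in items:
            pos = q.find(e)
            while pos != -1:  # visit every occurrence of e in q
                end = pos + len(e)
                left[e].add(q[pos - 1] if pos > 0 else START)
                right[e].add(q[end] if end < len(q) else END)
                pos = q.find(e, pos + 1)
    return dict(left), dict(right)


# EL: deduplicated domain entity data (made-up sample values)
entity_list = sorted(set(["T5", "展厅", "T5"]))
# Domain problem data, represented here as plain question strings
e2lcl, e2rcl = extract_context(entity_list, ["T5在展厅吗", "去展厅"])
```

Because the function only depends on a list of strings to look for, the same sketch could be called with a word list instead of an entity list.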

The WPE sub-module in the PE module completes three processes: word list extraction, word frequency parameter extraction, and word context parameter extraction. Word List (WL) extraction traverses the external domain problem data and, for each problem (denoted q), traverses the word segmentation result of q and adds each traversed word to the WL. The following points need to be noted:

the embodiment of the invention does not mandate a particular word segmenter, but the same word segmenter must be used throughout the PTFM; for example, this embodiment uses HanLP;

deduplication is performed when the WL is built.

The word frequency parameter extraction produces the mapping from each word to its occurrence frequency in the external domain problem data, referred to as the Word To Frequency mapping (W2F). The extraction traverses the external domain problem data and, for each problem (denoted q), traverses the word segmentation result of q and updates W2F with each traversed word. The WL and W2F extraction processes of the embodiments of the present invention can evidently be combined.
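A minimal sketch of the combined WL and W2F extraction follows. It assumes a whitespace tokenizer named `segment` as a stand-in for the real word segmenter (the embodiment names HanLP), and the function name `extract_word_params` and the sample questions are hypothetical.

```python
from collections import Counter


def segment(text):
    """Stand-in tokenizer (whitespace split); a real system would call its segmenter."""
    return text.split()


def extract_word_params(questions, tokenize=segment):
    """Return (WL, W2F) from a single pass over the domain problem data."""
    w2f = Counter()
    for q in questions:
        w2f.update(tokenize(q))  # count every word of the segmentation result
    word_list = list(w2f)  # the keys are already deduplicated
    return word_list, w2f


wl, w2f = extract_word_params(["去 智慧 展厅", "智慧 展厅 在 哪"])
```

Using a `Counter` means a word that never occurred in the data reports a frequency of 0, which is the case the abnormality judgment treats specially.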

The word context parameter extraction produces two items: the mapping from each word to its corresponding left character list (W2LCL, Word To Left Char List) and the mapping from each word to its corresponding right character list (W2RCL, Word To Right Char List). The procedure is the same as the entity context parameter extraction, except that the EL is replaced by the WL and the operations on the EL become operations on the WL.

In the embodiment of the invention, the TF module performs fault tolerance on the pinyin-based dialog system text using the parameters that the PE module extracted from the domain problem data and the domain entity data. For brevity, the pinyin-based dialog system text is referred to below as the target text.

The TF module contains both the ETF sub-module and the WTF sub-module for the following reason: in the target text, an abnormal word often appears as a fragment of an abnormal entity. For example, if "wisdom pause" is an abnormal entity in the target text and the corresponding normal entity is "wisdom exhibition hall", the result of abnormal word extraction would be only "pause" rather than "wisdom pause". An "abnormal segment" in the target text may therefore be an abnormal entity and not just an abnormal word, and problems arise if abnormal entities in the target text are corrected at the word level only. For example, when "slower T5" is identified as "slower T5", the result of fault-tolerant "slower T5" at the word level only would be "slower T5".

AWE is used in both the ETF sub-module and the WTF sub-module, and its results directly affect the performance of the PTFM. If AWE extracts too many abnormal words, the time complexity of the model becomes too high; if it extracts too few, some erroneous text may not be corrected. The AWE algorithm comprises:

firstly, performing word segmentation processing on a target text;

then, carrying out abnormity judgment on each word;

then, adding the determined abnormal words into an abnormal word list;

and finally, returning an abnormal word list of the target text.

AWE determines whether a word (denoted w) is abnormal from the word frequency parameters and the word context parameters, in the following two steps:

First step, candidate abnormal word determination

This step involves two thresholds, T1 and T2 (1 < T1 < T2). If the frequency of w in W2F is less than T1 but greater than 0, w is determined to be a candidate abnormal word. If the frequency of w in W2F is not less than T1 but less than T2, it is further determined whether w conflicts with its context in the target text; if so, w is determined to be a candidate abnormal word. The context conflict condition is: the left character of w in the target text does not appear in the left character list of w, or the right character of w does not appear in the right character list of w.

Second step, abnormal word determination

If the frequency of w in W2F is equal to 0, w is directly determined to be an abnormal word. If w is a candidate abnormal word, a new context conflict judgment is performed on w to determine whether it is an abnormal word. Let lw and rw denote the left and right neighbors of w in the word segmentation list of the target text; the new context conflict conditions are then: lw and rw are both candidate abnormal words; or lw is a candidate abnormal word, rw is not a candidate abnormal word, and the first character of rw does not appear in the right character list of w; or lw is not a candidate abnormal word, rw is a candidate abnormal word, and the last character of lw does not appear in the left character list of w.
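The two-step judgment can be sketched as follows. Several details are assumptions made only for this illustration: the text is passed in already segmented, a missing neighbor at either end of the text is treated as a non-candidate whose adjacent character is a boundary marker, only words flagged in the first step count as candidates when neighbors are inspected, and the threshold values and sample data are invented.

```python
START, END = "^", "$"  # assumed boundary markers for missing neighbors


def context_conflict(w, lc, rc, w2lcl, w2rcl):
    """True if lc is not a known left character of w, or rc is not a known right one."""
    return lc not in w2lcl.get(w, ()) or rc not in w2rcl.get(w, ())


def extract_abnormal_words(tokens, w2f, w2lcl, w2rcl, t1, t2):
    n = len(tokens)
    left_ch = [tokens[i - 1][-1] if i > 0 else START for i in range(n)]
    right_ch = [tokens[i + 1][0] if i < n - 1 else END for i in range(n)]

    # First step: candidate abnormal words.
    cand = []
    for i, w in enumerate(tokens):
        f = w2f.get(w, 0)
        cand.append(
            0 < f < t1
            or (t1 <= f < t2
                and context_conflict(w, left_ch[i], right_ch[i], w2lcl, w2rcl))
        )

    # Second step: abnormal words.
    abnormal = []
    for i, w in enumerate(tokens):
        if w2f.get(w, 0) == 0:  # unseen word: abnormal outright
            abnormal.append(w)
            continue
        if not cand[i]:
            continue
        l_cand = i > 0 and cand[i - 1]
        r_cand = i < n - 1 and cand[i + 1]
        if (
            (l_cand and r_cand)
            or (l_cand and not r_cand and right_ch[i] not in w2rcl.get(w, ()))
            or (r_cand and not l_cand and left_ch[i] not in w2lcl.get(w, ()))
        ):
            abnormal.append(w)
    return abnormal


w2f = {"去": 50, "智慧": 2, "暂停": 1}
w2lcl = {"智慧": {"去"}, "暂停": {"先"}}
w2rcl = {"智慧": {"展"}, "暂停": {"播"}}
result = extract_abnormal_words(["去", "智慧", "暂停", "吧"], w2f, w2lcl, w2rcl, t1=3, t2=10)
```

In this toy run, "吧" is flagged because it has frequency 0, "暂停" is flagged because its left neighbor is a candidate and its right neighbor's first character is not in its right character list, and "智慧" stays a mere candidate because its left character is one it was seen with.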

The AEC in the ETF sub-module and the AWC in the WTF sub-module follow the same algorithm: the target text containing the abnormal words is fuzzy-matched against the entities (or words) in the EL (or WL), and the matching entity (or matching word) and the corresponding abnormal entity (or abnormal matching word) in the target text are output according to the similarity calculated by a set similarity algorithm; the abnormal entity (or abnormal matching word) in the target text is then replaced with the obtained matching entity (or matching word).

In the fuzzy matching process, the following two points need to be noted:

(1) The abnormal entity (or abnormal matching word) in the target text that is selected by the similarity calculation must contain an abnormal word extracted by AWE. Because of word segmentation, the abnormal matching word often differs from the abnormal word that AWE extracted from the target text;

(2) The similarity between the target text and a given matching entity (or matching word), calculated by the set similarity algorithm, must be greater than a set threshold. If it is not greater than the threshold, the similarity value is set to 0; if it is, it is then determined whether the matching entity (or matching word) conflicts with the context in the target text. Let lc and rc denote the left and right neighbor characters of the abnormal entity (or abnormal matching word) in the target text; the context condition checked is: lc exists in the left character list of the matching entity (or matching word) and rc exists in the right character list of the matching entity (or matching word).
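The shared AEC/AWC logic can be sketched as one function that is called with either the entity parameters or the word parameters. Everything specific here is an assumption for illustration: the patent only speaks of a "set similarity algorithm", so `difflib.SequenceMatcher` over a toy pinyin table is used as a stand-in; only windows of the candidate's length that cover the abnormal word are compared; and a candidate is accepted when lc and rc both appear in its character lists, in line with the conflict definition given for AWE.

```python
from difflib import SequenceMatcher

START, END = "^", "$"  # assumed boundary markers for text start and end

# Toy pinyin table; a real system would use a full pinyin conversion library.
PINYIN = {"去": "qu", "智": "zhi", "慧": "hui", "暂": "zan", "停": "ting",
          "展": "zhan", "厅": "ting"}


def to_pinyin(text):
    return " ".join(PINYIN.get(ch, ch) for ch in text)


def similarity(a, b):
    """Stand-in similarity: difflib ratio over the pinyin of both strings."""
    return SequenceMatcher(None, to_pinyin(a), to_pinyin(b)).ratio()


def correct(text, abnormal, candidates, lcl, rcl, threshold=0.8):
    """Replace the best-matching segment that covers `abnormal` with a candidate."""
    pos = text.find(abnormal)
    if pos == -1:
        return text
    best = (0.0, None, None, None)  # (score, start, end, candidate)
    for cand in candidates:
        size = len(cand)
        # Windows of the candidate's length that fully contain the abnormal word.
        lo = max(0, pos + len(abnormal) - size)
        hi = min(pos, len(text) - size)
        for start in range(lo, hi + 1):
            end = start + size
            score = similarity(text[start:end], cand)
            if score <= threshold:
                continue  # at or below the threshold counts as similarity 0
            lc = text[start - 1] if start > 0 else START
            rc = text[end] if end < len(text) else END
            fits = lc in lcl.get(cand, ()) and rc in rcl.get(cand, ())
            if fits and score > best[0]:
                best = (score, start, end, cand)
    _, start, end, cand = best
    return text if cand is None else text[:start] + cand + text[end:]


el = ["智慧展厅"]
e2lcl = {"智慧展厅": {"去", "^"}}
e2rcl = {"智慧展厅": {"$", "在"}}
fixed = correct("去智慧暂停", "暂停", el, e2lcl, e2rcl)
```

Passing the word list with the word context mappings instead of `el`, `e2lcl` and `e2rcl` would give the word-level variant of the same routine.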

Fig. 2 is a flowchart of a text error correction method for a dialogue system based on pinyin according to an embodiment of the present invention, which includes the following specific steps:

step 201, performing entity list extraction on the domain entity data to obtain an entity list, and performing entity context parameter extraction based on the entity list and the domain problem data to obtain entity context parameters;

step 202, respectively extracting a word list and a word frequency parameter from the field problem data to respectively obtain a word list and a word frequency parameter, and extracting a word context parameter based on the word list and the field problem data to obtain a word context parameter;

step 203, after abnormal words are extracted from the pinyin-based dialog system text by adopting word context parameters and word frequency parameters, abnormal entity error correction is carried out by adopting an entity list and the entity context parameters, and the pinyin-based dialog system text after the entity error correction is output;

step 204, extracting abnormal words from the entity fault-tolerant pinyin-based dialog system text by using the word context parameters and the word frequency parameters, correcting the abnormal words by using the word list and the word context parameters, and outputting the word fault-tolerant pinyin-based dialog system text.

In the method, the entity list extraction is a deduplication process;

the obtaining of the word list comprises: traversing the word segmentation result of the field problem data, and then performing duplicate removal processing;

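In this method, the AWE comprises:

performing word segmentation processing on the pinyin-based dialog system text; carrying out abnormality judgment on each word; adding the words determined to be abnormal into an abnormal word list; and returning the abnormal word list of the pinyin-based dialog system text.

Carrying out the abnormality judgment includes:

setting a threshold value T1 and a threshold value T2, wherein 1 is smaller than T1, and T1 is smaller than T2;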

if the frequency of the word in the word-frequency mapping table is less than T1 but greater than 0, determining the word as a candidate abnormal word; if the frequency number of the word in the mapping table from the word to the frequency number is not less than T1 but less than T2, further judging whether the word conflicts with a first context set in the pinyin-based dialog system text, and if so, determining the word as a candidate abnormal word;

In this method, the AEC comprises: fuzzy matching the dialog system text containing the abnormal words against the entities in the entity list, and outputting the matching entity and the corresponding abnormal entity according to the similarity calculated by a set similarity algorithm; and replacing the abnormal entity in the pinyin-based dialog system text with the obtained matching entity;

the AWC includes: fuzzy matching the dialog system text containing the abnormal words against the words in the word list, and outputting the matching word and the corresponding abnormal word according to the similarity calculated by the set similarity algorithm; and replacing the abnormal word with the obtained matching word.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A pinyin-based dialog system text error correction system, characterized by comprising a parameter extraction (PE) module and a text fault-tolerance (TF) module, wherein the PE module consists of an entity parameter extraction (EPE) sub-module and a word parameter extraction (WPE) sub-module, and the TF module consists of an entity text fault-tolerance (ETF) sub-module and a word text fault-tolerance (WTF) sub-module, wherein,

the ETF sub-module is used for receiving the pinyin-based dialog system text, performing abnormal word extraction (AWE) using word context parameters and word frequency parameters, performing abnormal entity error correction (AEC) using an entity list and entity context parameters, and outputting the entity fault-tolerant pinyin-based dialog system text;

the WTF sub-module is used for performing AWE on the entity fault-tolerant pinyin-based dialog system text using the word context parameters and the word frequency parameters, performing abnormal word correction (AWC) on the abnormal words using the word list and the word context parameters, and outputting the word fault-tolerant pinyin-based dialog system text;

the EPE sub-module is used for performing entity list extraction on the domain entity data, the entity list extraction being a deduplication process; the extracted entity context parameters are the mapping from each entity to a corresponding left character list and the mapping from each entity to a corresponding right character list;

obtaining the word context parameter includes: mapping words to the corresponding left character list and mapping words to the corresponding right character list;

the AWE comprises:

2. The system of claim 1, wherein the making an anomaly determination comprises:

if the frequency of the word in the word-frequency mapping table is less than T1 but greater than 0, determining the word as a candidate abnormal word; if the frequency of the word in the word-frequency mapping table is not less than T1 but less than T2, further judging whether the word conflicts with a first set context in the pinyin-based dialog system text, and if so, determining the word as a candidate abnormal word;

3. The system of claim 1, in which the AEC comprises: fuzzy matching the dialog system text containing the abnormal words against the entities in the entity list, and outputting the matching entity and the corresponding abnormal entity according to the similarity calculated by a set similarity algorithm; and replacing the abnormal entity in the pinyin-based dialog system text with the obtained matching entity;

4. A pinyin-based dialog system text error correction method is characterized by comprising the following steps:

respectively extracting a word list and a word frequency parameter from the field problem data to respectively obtain a word list and a word frequency parameter, and extracting a word context parameter based on the word list and the field problem data to obtain a word context parameter;

performing abnormal word extraction (AWE) on a pinyin-based dialog system text using word context parameters and word frequency parameters, then performing abnormal entity error correction (AEC) using an entity list and entity context parameters, and outputting the entity fault-tolerant pinyin-based dialog system text;

after abnormal words are extracted from the entity fault-tolerant pinyin-based dialog system text by adopting word context parameters and word frequency parameters, abnormal word correction AWC is carried out by adopting a word list and the word context parameters, and the word fault-tolerant pinyin-based dialog system text is output;

wherein the extracting of the entity list is a duplicate removal treatment;

extracting the entity context parameters into mapping from the entity to a corresponding left character list and mapping from the entity to a corresponding right character list;

the obtaining of the word context parameter comprises: mapping words to a corresponding left character list and mapping words to a corresponding right character list;

the AWE comprises:

performing word segmentation processing on a dialogue system text based on pinyin; carrying out abnormity judgment on each word; adding the words determined to be abnormal into an abnormal word list; returning an abnormal word list of the pinyin-based dialog system text;

the performing an abnormality determination includes:

if the frequency number of the word in the mapping table from the word to the frequency number is smaller than T1 but larger than 0, determining the word as a candidate abnormal word; if the frequency of the word in the word-frequency mapping table is not less than T1 but less than T2, further judging whether the word conflicts with a first context set in the pinyin-based dialog system text, and if so, determining the word as a candidate abnormal word;

5. The method of claim 4, in which the AEC comprises: fuzzy matching the dialog system text containing the abnormal words against the entities in the entity list, and outputting the matching entity and the corresponding abnormal entity according to the similarity calculated by a set similarity algorithm; and replacing the abnormal entity in the pinyin-based dialog system text with the obtained matching entity;