CN110727780A - System and method for automatically expanding acquaintance text

Info

Publication number: CN110727780A
Application number: CN201910988927.3A
Authority: CN
Inventors: 刘德建; 梁益冰; 林剑锋; 林琛
Original assignee: Fujian TQ Digital Co Ltd
Current assignee: Fujian TQ Digital Co Ltd
Priority date: 2019-10-17
Filing date: 2019-10-17
Publication date: 2020-01-24

Abstract

The invention provides a system for automatically expanding a recognition text, which comprises a skill configuration module, a product management module, a skill synchronization module and a corpus generation module. The skill configuration module is responsible for creating extension types and configuring the corresponding extension templates; the product management module gives a product its expansion capability by configuring skills for the product, and a product may configure several skills at once to improve that capability; the skill synchronization module is responsible for integrating all expansion templates under all skills configured for the current product, the result serving as the knowledge base of the product's expansion capability; the corpus generation module automatically expands the corpus set imported for the product using the skill capability and the knowledge base of the current product.

Description

System and method for automatically expanding acquaintance text

Technical Field

The invention relates to the technical field of computers, in particular to a system and a method for automatically expanding a recognition text.

Background

With the rapid development of artificial intelligence, natural language understanding has become a very important direction. Good semantic understanding makes a product (e.g., a robot) appear more intelligent. Taking the customer service robot, currently the most common case, as an example, every answer the robot gives comes from preset content (chit-chat robots excepted), and the text the robot can understand is likewise configured manually. For the robot to recognize different questions, the configurator must give the robot enough corpora, and the tester must write enough corpora to test the correctness of the robot's responses. For example, suppose the original requirement corpus contains one sentence: "how to handle the access certificate of the library". Besides recognizing this sentence and giving an answer, the robot must also recognize other phrasings of the same question, such as asking what the handling process for the library access certificate is. Such corpus expansion is achieved mainly by people thinking up variants, and sometimes, to cover more corpora, configuration and testing personnel also have to expand the corpus by replacing keywords in the query sentences with synonyms.

The existing way for expanding the corpus has the following disadvantages:

First, to let a robot recognize enough questions, the corpus is currently expanded by hand and then given to the robot. However, if all of the robot's recognition capability depends on manual expansion, the person doing the expansion may, limited by his or her own language ability, fail to think of every variant and expand the corpus insufficiently, so the robot's recognition capability remains limited.

Second, in addition to the sentence pattern expansion described in the first disadvantage, synonym expansion is also necessary. But looking up synonyms and substituting them over a long period is tedious and easily wears down the person doing the work (example of synonym replacement: where is the address of the library).

Third, although existing Chinese text can be expanded by replacing words with recommended synonyms, and such replacement is necessary as described in the second disadvantage, recommended synonyms cannot be substituted unconditionally. Replacement depends on the scene, and in some scenes the semantics change after a synonym is substituted. The final decision is therefore left to manual processing, which makes corpus expansion very costly.

Fourth, after configuration personnel or testing personnel receive the dialogue requirements, they need to expand the contents of those requirements: configuration personnel do so to let the robot support different ways of asking, and testing personnel do so to verify whether the program supports those ways of asking. Both parties depend on a rich corpus. However, if each party expands the whole corpus independently, both spend a great deal of time on corpus expansion, yet the expansion cannot be skipped.

Fifth, the intelligence of a customer service system lies in being correct and timely, so each product continuously refines its corpus, adding new entries or updating answers to the latest version, and in this situation new and old question sentences easily end up with different answers. Most such cases are found by testers during testing. However, once testing finds them, the configuration has to be readjusted, which adds test round trips; in addition, responses may be random, so testers can miss the problem.

Disclosure of Invention

In order to overcome the problems, the invention aims to provide a system for automatically expanding the acquaintance texts, which realizes the automatic expansion of some common template questioning methods, does not need manual input and effectively improves the execution efficiency.

The invention is realized by adopting the following scheme: a system for automatically expanding a recognition text, the system comprising a skill configuration module, a product management module, a skill synchronization module and a corpus generation module;

the skill configuration module is responsible for creating an extension type and configuring a corresponding extension template;

the product management module provides the product with expansion capability by configuring skills for the product, and the product may configure a plurality of skills simultaneously to perfect the expansion capability;

the skill synchronization module is responsible for integrating all expansion templates under all skills configured by the current product and is used as a knowledge base of product expansion capability;

and the corpus generating module is used for automatically expanding a corpus set of the imported product through the skill capability and the knowledge base of the current product, wherein the corpus comprises two fields of questions and answers.

Further, the skill configuration module is further specifically configured to: providing a skill platform interface, wherein a user can create a user-defined skill through the skill platform interface, can create various extension types and descriptions under the corresponding skill, and can configure an extension template under each extension type; the skill configuration module also provides a part-of-speech query function for a user to query the part-of-speech combination condition in the original sentence text; the structure of the part of speech combination is as follows: word segmentation + part of speech name + part of speech code; the user can configure a corresponding expansion template under the expansion according to the condition of the original sentence part-of-speech combination, and the structure of the expansion template is as follows: extension type name + extension description + original sentence + extension sentence pattern.

Further, the product management module further specifically includes: providing a product platform interface, through which a dedicated product is created and to which personally customized skills, as well as skills created by other personnel, can be added; the product management module also provides a synonym importing function, a product configuration function, a synonym expansion switch and a RESTful access interface, wherein the synonym importing function is used for importing synonyms into the system, and the importing format is not limited; the product configuration function is used for setting whether the synonym expansion capability is enabled; the synonym expansion switch is used for loading all expansion templates of the skills under the current product into memory, providing the knowledge base through which the expansion capability is offered externally; the RESTful access interface is used for calls from other services that need the capability: as long as the body of the incoming request contains request content such as "this is a field", the interface returns, in a list attribute field, all sentences that can be expanded under the current product.

Further, the skill synchronization module is further specifically: reading all configured skills according to the skill list configured by the product; reading the extension type and the extension template under each skill and integrating; and finally configuring all integrated expansion templates as an expanded knowledge base.

Furthermore, the corpus generating module further comprises a corpus duplication checking unit and a sentence pattern extension unit, wherein the corpus duplication checking unit is used for filtering the corpus and giving out a list of existing conflicting corpora; the sentence pattern extension unit is used for reading the corpus set to be expanded, in which each corpus is stored in a list as two fields, a question and an answer; performing word segmentation and part-of-speech tagging on each corpus to be expanded, then comparing the segmentation and tagging results with the expansion templates in the knowledge base; if a matching template exists, finding the expansion type of that template, substituting the vocabulary of the question into the templates under that expansion type, and outputting the expanded sentences; and adding each expanded question to the expanded corpus set as an object, and continuing until all corpora have been expanded.

Further, the corpus duplication checking unit is further specifically: comparing each corpus object with the other corpus objects and judging whether an identical question exists after word segmentation; if so, comparing whether the answers of the two identical questions are the same; if the answers are the same, deleting one of the questions, and if not, adding both question objects to a conflict list; and finally, after all corpora have been compared, outputting the conflict list for the user to adjust.

In addition, the present invention also provides a method for automatically expanding a recognition text, wherein the expanding method is expanded by using the expanding system according to claim 1, and the expanding method comprises the following steps: step S1, a skill configuration module in the system creates extension types by using the skills, and configures extension templates under each extension type, wherein the configuration content comprises: an extended type name, an extended type description, and an extended sentence pattern template list;

step S2, the product management module provides the product with expansion capability by means of product configuration skills, and the product can configure a plurality of skills to perfect the expansion capability;

step S3, the skill synchronization module integrates all expansion templates under all skills configured by the current product to be used as a knowledge base of the product expansion capability;

step S4, when the corpus needs to be expanded, the corpus generating module is run: the corpus is uploaded, or the database link where the corpus is stored is configured; the synonym expansion switch in the skill synchronization module is turned on or off as required; conversion is started; and the corpus generating module performs the expansion operation using the skill capability and the knowledge base of the current product.

Furthermore, the corpus generating module further comprises a corpus duplication checking unit and a sentence pattern extension unit, wherein the corpus duplication checking unit is used for filtering the corpus and giving out a list of existing conflicting corpora; the expansion operation performed by the corpus generating module through the skill capability and the knowledge base of the current product is further specifically: reading the corpus set to be expanded through the sentence pattern extension unit, in which each corpus is stored in a list as two fields, a question and an answer; performing word segmentation and part-of-speech tagging on each corpus to be expanded, then comparing the segmentation and tagging results with the expansion templates in the knowledge base; if a matching template exists, finding the expansion type of that template, substituting the vocabulary of the question into the templates under that expansion type, and outputting the expanded sentences; and adding each expanded question to the expanded corpus set as an object, and continuing until all corpora have been expanded.

The invention has the beneficial effects that: 1. by means of the text corpus expansion method, a corpus configuration person can quickly acquire expandable synonyms needing configuration, only unnecessary words need to be filtered, and repeated synonym query is not needed.

2. Text corpus expansion is carried out in the mode provided by the text, automatic expansion of some common template questioning methods can be achieved, and manual investment is not needed.

3. Through the duplication checking module provided by the text, personnel related to the product can quickly locate the question sentence with conflict in the temporary corpus and the expanded corpus, and the time investment in the aspect of filtering is effectively reduced.

4. By combining the sentence pattern template configuration and the synonym expansion scheme provided by the text, testers do not need to manually perform repeated expansion of corresponding synonym replacement, automatic expansion replacement can be realized, and the execution efficiency is effectively improved.

5. Through the RESTful interface provided by the system, the capability can be accessed externally and user-defined calls are supported, which makes the system more practical.

Drawings

FIG. 1 is a schematic block diagram of the system of the present invention.

FIG. 2 is a functional diagram of the skill configuration module of the present invention.

FIG. 3 is a schematic diagram of the operation of the product management module of the present invention.

FIG. 4 is a skill synchronization module operational schematic of the present invention.

FIG. 5 is a diagram illustrating the operation of the corpus generating module according to the present invention.

FIG. 6 is a diagram of the operation of the sentence pattern extension unit of the present invention.

FIG. 7 is a diagram illustrating the operation of the corpus duplication checking unit according to the present invention.

FIG. 8 is a schematic flow chart of the method of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to fig. 1 to 7, a system for automatically expanding a recognition text according to the present invention comprises a skill configuration module, a product management module, a skill synchronization module and a corpus generation module;

the skill configuration module is responsible for creating an extension type and configuring a corresponding extension template; the concept of skill is introduced to facilitate customization of each product to its own proprietary expansion capability.

The product management module provides expansion capability for the product by configuring skills for the product, and the product may configure a plurality of skills simultaneously to perfect the expansion capability. Besides applying skills, the product can import and modify a synonym library exclusive to the product, and the user can turn the synonym expansion capability on or off as needed.

and the corpus generating module is used for automatically expanding the corpus set imported for the product through the skill capability and the knowledge base of the current product, wherein each corpus comprises two fields, a question and an answer. The user only needs to upload a file in a specified format, such as the commonly used Excel, containing the two fields of question and answer;

as shown in fig. 2, the skill configuration module further specifically includes: providing a skill platform interface, through which a user can create a user-defined skill, create various extension types and their descriptions under that skill, and configure expansion templates under each extension type. The skill configuration module also provides a part-of-speech query function with which the user can query the part-of-speech combination of an original sentence text; for example, entering the text "how to adjust the mental state" returns the result shown in Table 1 below;

TABLE 1

Word segmentation	How to do	Adjustment of	Mental state
Part-of-speech name	Adverb	Common verb	Common noun
Part-of-speech code	ad	vv	nn

The structure of the part of speech combination is as follows: word segmentation + part of speech name + part of speech code; the user can configure a corresponding expansion template under the expansion according to the condition of the original sentence part-of-speech combination, and the structure of the expansion template is as follows: extension type name + extension description + original sentence + extension sentence pattern. (specific examples of minimum units of the extended template are shown in table 2 below);

TABLE 2

That is, Table 2 uses { } to delimit the variable parts of a sentence pattern; other separators may also be used, since the separator is only a marker by which the program recognizes where to apply the expansion.
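The template structure described above (extension type name + extension description + original sentence + extension sentence patterns) can be sketched as a small record type. This is only an illustration: the class name `ExpansionTemplate`, its field names, and the helper `slots` are assumptions, not names from the patent, and the sample sentences are borrowed from the "location query" example in scene one.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExpansionTemplate:
    """One minimum unit of an expansion template (hypothetical field names)."""
    type_name: str                 # extension type name
    description: str               # extension description
    original: str                  # original sentence pattern
    patterns: list = field(default_factory=list)  # extension sentence patterns


# A slot is a part-of-speech code wrapped in { }, e.g. {nn} for a common noun.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def slots(sentence_pattern):
    """Return the part-of-speech codes named by the { } slots, in order."""
    return PLACEHOLDER.findall(sentence_pattern)


location_query = ExpansionTemplate(
    type_name="location query",
    description="questions asking where a place is",
    original="the address of {nn} is",
    patterns=["the position of {nn} is", "where {nn} is"],
)
```

Because the separator is only a marker for the program, switching from `{ }` to another delimiter would only require changing `PLACEHOLDER`.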

As shown in fig. 3, the product management module further specifically includes: providing a product platform interface, through which a dedicated product is created and to which personally customized skills, as well as skills created by other personnel, can be added. The product management module also provides a synonym importing function, a product configuration function, a synonym expansion switch and a RESTful access interface. The synonym importing function is used for importing synonyms into the system, and the importing format is not limited (for example, it can simply be a TXT file in which each line is one class of words separated by spaces). The product configuration function is used for setting whether the synonym expansion capability is enabled. The synonym expansion switch is used for loading all expansion templates of the skills under the current product into memory, providing the knowledge base through which the expansion capability is offered externally; that is, the system supports turning synonym expansion on and off through this configuration function. When the user turns it on, expansion includes synonym replacement; when the user turns it off, only sentence pattern expansion is performed and no synonym expansion is added. The RESTful access interface is used for calls from other services that need the capability: as long as the body of the incoming request contains request content such as "this is a field", the interface returns, in a list attribute field, all sentences that can be expanded under the current product.
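As a rough sketch of the TXT import format mentioned above (one class of words per line, separated by spaces), the following parser builds a lookup table from each word to the other words in its class. The function name `parse_synonyms` and the sample words are illustrative assumptions; the patent does not prescribe a storage structure.

```python
def parse_synonyms(text):
    """Map every word to the other words on its line (one synonym class per line)."""
    table = {}
    for line in text.splitlines():
        words = line.split()
        if len(words) < 2:
            continue  # blank lines and single words carry no synonym information
        for word in words:
            entry = table.setdefault(word, [])
            for other in words:
                if other != word and other not in entry:
                    entry.append(other)
    return table


# Hypothetical file content: two synonym classes.
sample = "address location position\nhandle process\n"
synonyms = parse_synonyms(sample)
```

A plain dictionary keeps lookups cheap during expansion, and skipping malformed lines fits the statement that the import format is not strictly limited.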

As shown in fig. 4, the skill synchronization module further specifically includes: reading all configured skills according to the skill list configured by the product; reading the extension type and the extension template under each skill and integrating; and finally configuring all integrated expansion templates as an expanded knowledge base.
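The three synchronization steps can be sketched as a merge over the product's skill list. This is a minimal illustration under stated assumptions: skills are modelled as a dictionary from extension type name to a list of sentence patterns, and the names `build_knowledge_base` and `skill_store` are hypothetical.

```python
def build_knowledge_base(product_skills, skill_store):
    """Merge the templates of every skill configured for a product.

    product_skills: skill names configured for the product.
    skill_store:    skill name -> {extension type name -> [sentence patterns]}.
    """
    knowledge_base = {}
    for skill_name in product_skills:                      # read configured skills
        for type_name, patterns in skill_store.get(skill_name, {}).items():
            merged = knowledge_base.setdefault(type_name, [])
            for pattern in patterns:                       # integrate, skipping repeats
                if pattern not in merged:
                    merged.append(pattern)
    return knowledge_base


skill_store = {
    "A's skill": {"location query": ["the address of {nn} is", "where {nn} is"]},
    "shared skill": {
        "location query": ["where {nn} is", "the position of {nn} is"],
        "method query": ["how to {vv} {nn}"],
    },
    "unused skill": {"time query": ["when does {nn} open"]},
}
kb = build_knowledge_base(["A's skill", "shared skill"], skill_store)
```

Only skills on the product's list contribute, so a skill created by someone else affects a product only after it has been added to that product.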

As shown in fig. 5, the corpus generating module specifically includes the following contents:

1. importing dialogue corpora according to the requirements of the system (for example, uploading Excel corpora, or configuring the database link where the corpora are stored) and setting the switch that decides whether synonym expansion is needed;

2. the system carries out sentence pattern expansion of the sentence pattern expansion unit according to the imported linguistic data and outputs an expansion result;

3. when the synonym expansion switch is not turned on, the program directly goes to the corpus duplicate checking unit for duplicate checking;

4. when the synonym expansion switch is turned on, the program can replace synonyms for the corpus and generate a question sentence after synonym replacement. After all the corpora are integrated, the program goes to the corpus duplication checking unit for duplication checking.
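The four steps above amount to a short pipeline in which the synonym stage is optional and both branches end in the duplication check. The sketch below passes the individual stages in as functions so that it stays self-contained; `generate_corpus` and the stand-in stage functions are hypothetical names used only for illustration.

```python
def generate_corpus(corpus, expand_patterns, expand_synonyms,
                    check_duplicates, synonym_switch):
    """Run sentence pattern expansion, optional synonym expansion, then duplicate checking."""
    expanded = list(corpus)
    for item in corpus:
        expanded.extend(expand_patterns(item))        # step 2: sentence pattern expansion
    if synonym_switch:                                # step 4: only when the switch is on
        for item in list(expanded):
            expanded.extend(expand_synonyms(item))
    return check_duplicates(expanded)                 # steps 3 and 4 both end here


# Stand-in stages so the flow can be exercised without a real tagger or synonym library.
def fake_patterns(item):
    return [{"query": item["query"] + " (pattern variant)", "answer": item["answer"]}]


def fake_synonyms(item):
    return [{"query": item["query"] + " (synonym variant)", "answer": item["answer"]}]


def no_check(items):
    return items


corpus = [{"query": "where is the library", "answer": "Building 3"}]
off = generate_corpus(corpus, fake_patterns, fake_synonyms, no_check, synonym_switch=False)
on = generate_corpus(corpus, fake_patterns, fake_synonyms, no_check, synonym_switch=True)
```

Running synonym replacement over the already pattern-expanded set, as sketched here, is one reading of step 4; the text does not say whether synonyms are applied to the original corpora only.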

The corpus generating module further comprises a corpus duplication checking unit and a sentence pattern extension unit, wherein the corpus duplication checking unit is used for filtering the corpus and giving a list of existing conflicting corpora. As shown in FIG. 6, the sentence pattern extension unit is used for reading the corpus set to be expanded, in which each corpus is stored in a list as an object with two fields, a question and an answer (namely, in the form {"query": query_value, "answer": answer_value}). Word segmentation and part-of-speech tagging are performed on each corpus to be expanded, and the segmentation and tagging results are then compared with the expansion templates in the knowledge base. If a matching template exists, the expansion type of that template is found, the vocabulary of the question is substituted into the templates under that expansion type, and the expanded sentences are output (for example, an input question beginning with "how to" is segmented and tagged, matched against a template, and rewritten using the other sentence patterns under the same expansion type). Each expanded question is added to the expanded corpus set as an object, and expansion continues until every corpus has been processed.
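The matching and substitution performed by the sentence pattern extension unit might look roughly like the sketch below. It makes several simplifying assumptions that are not in the patent: words are split on whitespace, part-of-speech codes come from a toy lexicon instead of a real tagger, a template matches only when it has the same number of tokens as the question, and each part-of-speech code appears at most once per pattern. All function names are hypothetical.

```python
import re

SLOT = re.compile(r"^\{(\w+)\}$")  # a whole token such as {nn}


def tag(sentence, lexicon):
    """Toy segmentation and tagging: whitespace split plus dictionary lookup."""
    return [(word, lexicon.get(word, "x")) for word in sentence.split()]


def match(pattern, tagged):
    """Return {slot token -> word} if the pattern fits the tagged sentence, else None."""
    parts = pattern.split()
    if len(parts) != len(tagged):
        return None
    binding = {}
    for part, (word, code) in zip(parts, tagged):
        slot = SLOT.match(part)
        if slot:
            if slot.group(1) != code:
                return None
            binding[part] = word
        elif part != word:
            return None
    return binding


def expand(item, knowledge_base, lexicon):
    """Rewrite a question with the other patterns of the extension type it matches."""
    tagged = tag(item["query"], lexicon)
    results = []
    for patterns in knowledge_base.values():
        for pattern in patterns:
            binding = match(pattern, tagged)
            if binding is None:
                continue
            for other in patterns:
                parts = other.split()
                if other == pattern or any(SLOT.match(p) and p not in binding for p in parts):
                    continue
                query = " ".join(binding.get(p, p) for p in parts)
                results.append({"query": query, "answer": item["answer"]})
            break  # one matching pattern per extension type is enough
    return results


kb = {"location query": ["the address of {nn} is", "where {nn} is", "the position of {nn} is"]}
item = {"query": "the address of library is", "answer": "Building 3"}
expanded = expand(item, kb, lexicon={"library": "nn"})
```

Each expanded question inherits the answer of the question it was derived from, which is what later allows the duplication check to compare answers.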

As shown in fig. 7, the corpus duplication checking unit further specifically includes: comparing each corpus object with the other corpus objects and judging whether an identical question exists after word segmentation; if so, comparing whether the answers of the two identical questions are the same; if the answers are the same, deleting one of the questions, and if not, adding both question objects to a conflict list; and finally, after all corpora have been compared, outputting the conflict list for the user to adjust.
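A compact sketch of this duplication check follows. It assumes whitespace splitting as a stand-in for word segmentation and keeps the first occurrence of each question; both choices, like the name `check_duplicates` and the sample data, are illustrative and not taken from the patent.

```python
def check_duplicates(corpus, segment=str.split):
    """Drop exact duplicates and collect question pairs whose answers disagree."""
    kept = {}        # segmented question -> first corpus object seen
    conflicts = []   # pairs of objects with the same question but different answers
    for item in corpus:
        key = tuple(segment(item["query"]))
        first = kept.get(key)
        if first is None:
            kept[key] = item
        elif first["answer"] != item["answer"]:
            conflicts.append((first, item))
        # same question and same answer: the repeat is simply dropped
    return list(kept.values()), conflicts


corpus = [
    {"query": "who is someone", "answer": "a senior engineer"},
    {"query": "who is  someone", "answer": "a senior engineer"},  # same after segmentation
    {"query": "who is someone", "answer": "a team leader"},       # conflicting answer
    {"query": "where is the library", "answer": "Building 3"},
]
unique, conflicts = check_duplicates(corpus)
```

Using the segmented question as a dictionary key avoids comparing every pair of corpus objects directly, which matters for corpus sets of tens of thousands of entries.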

In addition, as shown in fig. 8, the present invention further provides a method for automatically expanding a recognition text, wherein the expanding method is expanded by using the expanding system according to claim 1, and the expanding method comprises the following steps: step S1, a skill configuration module in the system creates extension types by using the skills, and configures extension templates under each extension type, wherein the configuration content comprises: an extended type name, an extended type description, and an extended sentence pattern template list;

Finally, when the expansion task is completed, the system presents the conflicting corpus information for the user to delete or modify. When the modification is completed, the entire corpus can be exported. The system also provides an external RESTful interface, through which a user can obtain a list of recommended recognition texts.

The invention is further illustrated by the following specific examples:

scene one:

Classmate A is responsible for the operation and configuration of a customer service system whose main function is text conversation. To let the robot customer service answer questions in place of operators or customer service colleagues, A compiles the product FAQ and deploys it into the robot customer service. However, after the robot has been online for a period of time, A finds that the robot recognizes the configured way of asking but fails as soon as most people phrase the same question differently. A therefore has to think up other ways of asking and add them to the configuration. In the course of this expansion, A finds that many questions are variations of the same sentence pattern (for example, for place queries, the original question has the form "(-) address is").

According to the scheme of this patent, classmate A creates a skill of his or her own, configures a common query type in the skill configuration module, and configures different query sentence pattern templates under that type. For example, A creates a personal skill named "A's skill", creates an extension type named "location query" in the product management module, and configures under the "location query" type the templates ① the address of {nn} is, ② the position of {nn} is, ③ the address of {nn} is, ④ where {nn} is, ⑤ and the like, where nn is the code for a common noun.

Scene two:

Classmate B is responsible for the operation and configuration of an intelligent customer service system. So that each intent supports more ways of asking, B not only tries various sentence variations but also looks up synonyms online and uses them to replace keywords in the sentences. However, B finds that after synonyms found online are substituted into the sentences, the semantics sometimes change, so manual filtering is needed. It also takes a lot of time to look up usable synonyms and substitute them into each sentence pattern.

According to the scheme of this patent, classmate B can maintain his or her own synonym library and import it into the system. After B clicks the synonym expansion option, the program automatically presents all replaceable synonyms in the question sentences. If some synonym recommendations do not fit the scene, B can delete them through the platform interface. After confirming the expandable synonyms for the corpus, B clicks continue, and the confirmed synonyms are substituted into the expanded sentence patterns. B no longer needs to configure each replacement manually.
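The synonym step in this scene can be sketched as follows: every word that has synonyms in the imported library yields one variant per synonym, and recommendations the user has deleted are skipped. The function name `synonym_variants`, the `rejected` parameter, and the sample data are assumptions for illustration; the sketch replaces one word at a time and splits on whitespace, which is a simplification.

```python
def synonym_variants(query, synonyms, rejected=frozenset()):
    """Return one variant per (word, synonym) pair that the user has not rejected."""
    words = query.split()
    variants = []
    for i, word in enumerate(words):
        for alt in synonyms.get(word, []):
            if (word, alt) in rejected:
                continue  # recommendation removed through the platform interface
            variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants


synonyms = {"address": ["location", "position"]}
question = "where is the address of the library"

all_variants = synonym_variants(question, synonyms)
filtered = synonym_variants(question, synonyms, rejected={("address", "position")})
```

Keeping the rejected pairs separate from the synonym library means a recommendation can be suppressed for one corpus without editing the library itself.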

Scene three:

Classmate C and classmate D are responsible for the configuration and the testing of a dialogue system, and the product they are responsible for has accumulated a corpus set of tens of thousands of entries over long-term operation. Some outdated corpora conflict with the latest corpora. (A corpus conflict means that the same question is configured more than once in the system with different answers; this typically results from expired corpora, manual misoperation and the like. For example, for the question "who is someone", the answer configured last year was "someone is a senior engineer", but the answer configured this year is "someone is a team leader".) Each time C and D receive the corpora they start their respective work: one is responsible for configuration, and the other for expanding the corpus into test cases. As a result, after each version test, some wrong answers caused by corpus conflicts are found. C then checks the configuration against D's results, finds the conflicting content, submits it to the product personnel for confirmation, fixes the conflict, and hands the result back to D for retesting. Because the problem is discovered late, D has to retest many times, which raises the test cost, and C spends a lot of time checking the conflicting content.

Through the scheme of this patent, C can import the latest corpora into the system, and the system automatically lists the conflicts between the new and old corpora one by one. C can hand the result directly to the product personnel for confirmation and then adjust the configuration. When the configured content is delivered to D for testing, such conflict problems no longer occur, and the time cost caused by conflicts is reduced accordingly.

The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims

1. A system for automatically expanding a recognition text, the system comprising: a skill configuration module, a product management module, a skill synchronization module and a corpus generation module;

2. The system for automatically expanding the acquaintance text according to claim 1, wherein: the skill configuration module is further specifically: providing a skill platform interface, wherein a user can create a user-defined skill through the skill platform interface, can create various extension types and descriptions under the corresponding skill, and can configure an extension template under each extension type; the skill configuration module also provides a part-of-speech query function for a user to query the part-of-speech combination condition in the original sentence text; the structure of the part of speech combination is as follows: word segmentation + part of speech name + part of speech code; the user can configure a corresponding expansion template under the expansion according to the condition of the original sentence part-of-speech combination, and the structure of the expansion template is as follows: extension type name + extension description + original sentence + extension sentence pattern.

3. The system for automatically expanding the acquaintance text according to claim 1, wherein: the product management module is further specifically: providing a product platform interface, wherein the product platform interface creates a special product, adds personal customized skills and can also add skills created by other personnel; the product management module also provides a synonym importing function, a product configuration function, a synonym expansion switch and an access interface of restful, wherein the synonym importing function is used for importing synonyms into the system, and the importing format is not limited; the product configuration function is used for supporting whether the synonym expansion capability is enabled or not; the synonym expansion switch is used for loading all expansion templates of skills under the current product, adding the expansion templates into the memory and providing a knowledge base for externally providing expansion capability; the access interface of restful is used for service calling of other requirements, and as long as the input document subject attribute body contains the request content of queue = 'this is a field', the access interface displays all sentences which can be expanded under the returned current product by using the list attribute field.

4. The system for automatically expanding the acquaintance text according to claim 1, wherein the skill synchronization module is further specifically configured to: read all configured skills according to the skill list configured for the product; read and integrate the extension types and expansion templates under each skill; and finally configure all integrated expansion templates as the knowledge base for expansion.
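The integration step of claim 4 can be sketched as a merge over the skills configured for a product. The in-memory layout below (skill name, then extension type, then a list of template strings), the function name `build_knowledge_base`, and the sample data are assumptions made for illustration; the claims do not specify a storage format.

```python
from typing import Dict, List

# Hypothetical skill store: skill name -> extension type -> expansion templates.
SKILL_STORE: Dict[str, Dict[str, List[str]]] = {
    "how_to": {"ask_method": ["how to {x}", "what is the way to {x}"]},
    "process": {
        "ask_process": ["what is the process for {x}"],
        "ask_method": ["how to {x}", "how do I {x}"],
    },
}


def build_knowledge_base(
    product_skills: List[str],
    skill_store: Dict[str, Dict[str, List[str]]],
) -> Dict[str, List[str]]:
    """Read every skill configured for the product and merge its templates.

    Templates are grouped by extension type; a template that appears under
    several skills is kept only once (an assumption, not stated in the claims).
    """
    knowledge_base: Dict[str, List[str]] = {}
    for skill in product_skills:
        for ext_type, templates in skill_store.get(skill, {}).items():
            bucket = knowledge_base.setdefault(ext_type, [])
            for template in templates:
                if template not in bucket:
                    bucket.append(template)
    return knowledge_base


kb = build_knowledge_base(["how_to", "process"], SKILL_STORE)
```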

5. The system for automatically expanding the acquaintance text according to claim 1, wherein the corpus generating module further comprises a corpus duplication checking unit and a sentence pattern expansion unit; the corpus duplication checking unit is used for filtering the corpus and outputting a list of conflicting corpora; the sentence pattern expansion unit is used for reading a corpus set to be expanded, in which each corpus entry is stored in a list as an object with two fields, question and answer; performing word segmentation and part-of-speech tagging on each corpus entry to be expanded, then comparing the segmentation and tagging results with the expansion templates in the knowledge base; if a matching template exists, finding the expansion type of that template, performing vocabulary replacement on the templates under that expansion type, and outputting the expanded sentences obtained after the vocabulary replacement; and adding each expanded question sentence to the expanded corpus set as an object, continuing until all expansion is completed.
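The sentence pattern expansion flow of claim 5 could look roughly like the sketch below. Everything concrete here is assumed for illustration: the toy `LEXICON` stands in for a real word segmenter and part-of-speech tagger, the `<code>` slot notation and the template dictionary layout are invented, and English whitespace splitting replaces word segmentation.

```python
from typing import Dict, List, Optional, Tuple

# Toy lexicon standing in for a real segmenter/tagger (assumed codes).
LEXICON = {"how": "r", "to": "p", "get": "v", "a": "m", "library": "n", "card": "n"}


def tag(sentence: str) -> List[Tuple[str, str]]:
    """Segment on whitespace and attach a part-of-speech code to each word."""
    return [(w, LEXICON.get(w, "x")) for w in sentence.lower().split()]


# Hypothetical template: literal words must match, "<code>" slots capture a word.
TEMPLATES = [
    {
        "type": "ask_method",
        "original": ["how", "to", "<v>", "a", "<n>", "<n>"],
        "expansions": [
            "what is the way to {0} a {1} {2}",
            "please tell me how to {0} a {1} {2}",
        ],
    }
]


def match(tokens: List[Tuple[str, str]], pattern: List[str]) -> Optional[List[str]]:
    """Return the captured slot words if the tagged sentence fits the pattern."""
    if len(tokens) != len(pattern):
        return None
    slots = []
    for (word, code), element in zip(tokens, pattern):
        if element.startswith("<") and element.endswith(">"):
            if code != element[1:-1]:
                return None
            slots.append(word)
        elif word != element:
            return None
    return slots


def expand(corpus: List[Dict[str, str]], templates: List[dict]) -> List[Dict[str, str]]:
    """Generate expanded question objects that reuse the original answer."""
    expanded = []
    for item in corpus:
        tokens = tag(item["question"])
        for tpl in templates:
            slots = match(tokens, tpl["original"])
            if slots is None:
                continue
            for sentence_pattern in tpl["expansions"]:
                expanded.append(
                    {"question": sentence_pattern.format(*slots), "answer": item["answer"]}
                )
    return expanded


corpus = [{"question": "How to get a library card", "answer": "Visit the front desk."}]
result = expand(corpus, TEMPLATES)
```

In this sketch each expanded question keeps the answer of the corpus entry it was derived from, which is one plausible reading of adding the expanded question "as an object".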

6. The system for automatically expanding a recognition text according to claim 5, wherein the corpus duplication checking unit is further specifically configured to: compare each corpus object with the other corpus objects and judge whether an identical question exists after word segmentation; if so, compare whether the answers of the two identical questions are the same; if the answers are the same, delete one of the questions; if not, add the two question objects to a conflict list; and, after all corpora have been compared, output the conflict list for the user to adjust.
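A minimal sketch of this duplication check follows, under stated assumptions: whitespace splitting stands in for word segmentation, the function name `check_duplicates` is hypothetical, and keeping the first of two conflicting entries in the filtered set is a choice made here that the claim does not address.

```python
from typing import Dict, List, Tuple

Corpus = Dict[str, str]  # {"question": ..., "answer": ...}


def check_duplicates(corpus: List[Corpus]) -> Tuple[List[Corpus], List[Tuple[Corpus, Corpus]]]:
    """Filter exact duplicates and collect question pairs with differing answers."""
    kept: Dict[Tuple[str, ...], Corpus] = {}
    conflicts: List[Tuple[Corpus, Corpus]] = []
    for item in corpus:
        # Segmented question used as the comparison key (whitespace split is assumed).
        key = tuple(item["question"].lower().split())
        if key not in kept:
            kept[key] = item
        elif kept[key]["answer"] == item["answer"]:
            continue  # same question, same answer: drop the duplicate
        else:
            conflicts.append((kept[key], item))  # same question, different answers
    return list(kept.values()), conflicts


sample = [
    {"question": "how to get a library card", "answer": "Visit the front desk."},
    {"question": "How to get a library card", "answer": "Visit the front desk."},
    {"question": "how to get a library card", "answer": "Apply online."},
    {"question": "where is the library", "answer": "Building 3."},
]
filtered, conflict_list = check_duplicates(sample)
```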

7. A method for automatically expanding a recognition text, characterized in that the method uses the system according to claim 1 and comprises the following steps: step S1, the skill configuration module in the system creates extension types under the skills and configures extension templates under each extension type, wherein the configuration content comprises: an extension type name, an extension type description, and a list of extension sentence pattern templates;

8. The method of claim 7, wherein the corpus generating module further comprises a corpus duplication checking unit and a sentence pattern expansion unit; the corpus duplication checking unit is used for filtering the corpus and outputting a list of conflicting corpora; the corpus generating module performs the expansion operation through the skill capability and the knowledge base of the current product, specifically: reading a corpus set to be expanded through the sentence pattern expansion unit, in which each corpus entry is stored in a list as an object with two fields, question and answer; performing word segmentation and part-of-speech tagging on each corpus entry to be expanded, then comparing the segmentation and tagging results with the expansion templates in the knowledge base; if a matching template exists, finding the expansion type of that template, performing vocabulary replacement on the templates under that expansion type, and outputting the expanded sentences obtained after the vocabulary replacement; and adding each expanded question sentence to the expanded corpus set as an object, continuing until all expansion is completed.

9. The method of claim 8, wherein the corpus duplication checking unit is further specifically configured to: compare each corpus object with the other corpus objects and judge whether an identical question exists after word segmentation; if so, compare whether the answers of the two identical questions are the same; if the answers are the same, delete one of the questions; if not, add the two question objects to a conflict list; and, after all corpora have been compared, output the conflict list for the user to adjust.