CN110059168A

CN110059168A - Method for training a human-computer interaction system based on natural intelligence

Info

Publication number: CN110059168A
Application number: CN201910065179.1A
Authority: CN
Inventors: Inventor not disclosed
Original assignee: Echo Technologies Inc
Current assignee: Echo Technologies Inc
Priority date: 2019-01-23
Filing date: 2019-01-23
Publication date: 2019-07-26

Abstract

A method for training a human-computer interaction system based on natural intelligence, comprising: generating text scripts corresponding to standard expressions; obtaining speech corresponding to the text scripts by means of a text-to-speech tool; converting each speech item into secondary language information, wherein the order of magnitude of the information granularity of the secondary language information is smaller than that of the information granularity of text; storing the secondary language information and the corresponding standard expression in a database as paired data; and, for the paired secondary language information and standard expressions already in the database, performing loop iteration over the various permutations and combinations of the elements of the secondary language information and the standard expression (or the various permutations and combinations of the elements of the standard expression), establishing correspondences between them, thereby obtaining more paired data of secondary language information and standard expressions, and storing that data in the database.

Description

Method for training a human-computer interaction system based on natural intelligence

Technical Field

The invention relates to a method for training a human-computer interaction system, in particular to a method for training a human-computer interaction system based on natural intelligence, and corresponding human-computer interaction equipment and a human-computer interaction system.

Background

The currently used Machine Intelligence (MI) technology mainly includes Artificial Intelligence (AI) technology. Natural Language Processing (NLP) technology based on artificial intelligence is a common technology.

In artificial-intelligence-based natural language processing (AI-NLP for short), speech must first be converted into text, after which semantic understanding is achieved through established grammar and semantic models. This approach is limited by the accuracy of the speech recognizer. For example, for a sentence of 10 words a speech recognizer may achieve 90% accuracy, but if the error falls on a keyword, existing AI-NLP technology cannot arrive at the correct semantic understanding. In noisy environments, where the accuracy of the speech recognizer drops, semantic understanding with AI-NLP techniques becomes even harder. In addition, AI-NLP requires the manual construction of massive grammar and semantic models, which incurs a significant labor cost. In fact, the major enterprises currently engaged in AI-NLP development and application worldwide employ thousands of people or more for manual speech labeling and model building.

Disclosure of Invention

According to an aspect of the present invention, there is provided a natural expression processing method based on natural intelligence, including: receiving input of a natural expression and obtaining first language information having a first information granularity; converting the first language information into second language information having a second information granularity, wherein the order of magnitude of the second information granularity lies between that of the first information granularity and that of the information granularity of characters; and converting the second language information into third language information, which serves as the result of understanding the natural expression; wherein the second language information and its corresponding third language information are stored in a database as pairing data, and, for the paired second language information and third language information already in the database, loop iteration is performed over the various permutations and combinations of the elements of the second language information and the third language information (or the various permutations and combinations of the elements of the third language information), correspondences are established between the permutations and combinations of the elements of the second language information and the third language information (or the permutations and combinations of its elements), and more pairing data of second language information and third language information is thereby obtained and stored in the database.
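One possible reading of the loop iteration over permutations and combinations of elements can be sketched as follows. The sketch assumes that second language information is a sequence of discrete symbols and that a "combination of elements" is a contiguous sub-sequence; both are illustrative assumptions, as are all names and sample values. It records which third-language item each sub-sequence co-occurs with in the stored pairs and keeps the sub-sequences that point to exactly one item as additional pairing data.

```python
from collections import defaultdict

def subsequences(symbols, min_len=2, max_len=3):
    """Yield contiguous sub-sequences (an assumed notion of 'element combinations')."""
    for n in range(min_len, max_len + 1):
        for i in range(len(symbols) - n + 1):
            yield tuple(symbols[i:i + n])

def expand_pairs(pairs, min_len=2, max_len=3):
    """Derive extra (sub-sequence -> label) pairs from stored (symbols, label) pairs."""
    seen = defaultdict(set)
    for symbols, label in pairs:
        for sub in subsequences(symbols, min_len, max_len):
            seen[sub].add(label)
    # Keep only sub-sequences that co-occur with a single label.
    return {sub: next(iter(labels)) for sub, labels in seen.items() if len(labels) == 1}

# Hypothetical pairing data: symbol sequences paired with third-language items.
database = [
    (["a1", "b7", "c3", "d2"], "CHECK_BALANCE"),
    (["a1", "b7", "e5", "f9"], "TRANSFER_MONEY"),
]
new_pairs = expand_pairs(database)
```

Here the sub-sequence `("a1", "b7")` is discarded because it occurs with both items, while `("b7", "c3")` becomes new pairing data for `CHECK_BALANCE`.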

According to the natural-intelligence-based natural expression processing method of the embodiment of the invention, after second language information is obtained from the input first language information, it is compared with the second language information already in the database; the third language information corresponding to it is then determined from the comparison result, or the accuracy with which it corresponds to a given third language information is calculated. If the machine's comprehension capability is not yet mature enough, so that it is unable or uncertain to convert the second language information into a given third language information, manually assisted understanding is performed: the input first language information is understood by a human to obtain the third language information corresponding to the meaning of the natural expression, the second language information obtained from the first language information (or the first language information itself) is associated with that third language information, and the resulting new pairing data is stored in the database.
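The fallback to manually assisted understanding described above can be sketched as a small function. This is a minimal illustration only: an exact dictionary lookup stands in for the comparison and accuracy calculation, and `ask_human` is a hypothetical callback representing the human operator.

```python
def understand(symbols, database, ask_human):
    """Return the third-language item for `symbols`, falling back to a human."""
    key = tuple(symbols)
    if key in database:                 # machine understanding: known pairing
        return database[key], "machine"
    label = ask_human(symbols)          # manually assisted understanding
    database[key] = label               # store the new pairing data
    return label, "human"

# Hypothetical database with one known pairing.
database = {("a1", "b7"): "CHECK_BALANCE"}
first = understand(["e5", "f9"], database, ask_human=lambda s: "TRANSFER_MONEY")
second = understand(["e5", "f9"], database, ask_human=lambda s: "NOT_CALLED")
```

The first call is resolved by the human callback and stored; the second call with the same input is resolved from the database without consulting the callback.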

According to the natural expression processing method based on natural intelligence, for the new pairing data of second language information and third language information, or of first language information and third language information, loop iteration is performed over the various permutations and combinations of the elements of the second language information (or of the second language information obtained by converting the first language information) and the third language information or the various permutations and combinations of its elements; correspondences between them are established, and more pairing data of second language information and third language information is obtained and stored in the database.

According to the natural expression processing method based on natural intelligence, erroneous correspondences between the second language information and the third language information in the database are corrected through manually assisted understanding.

According to the natural expression processing method based on natural intelligence, the machine's comprehension capability is measured by a confidence value, which is calculated from the correspondence between the second language information and the third language information.

According to the natural expression processing method based on natural intelligence, after the second language information is obtained from the first language information, log probabilities or similar scores for the third language information are generated by one or more of a deep neural network, a finite state transducer and an auto-encoder/decoder, and the confidence of the third language information is then calculated with a normalized exponential function (softmax).
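The confidence calculation described above, in which scores are passed through a normalized exponential function, can be illustrated with a short sketch. The scores below are made-up values standing in for the output of whichever model produces them; the function and label names are hypothetical.

```python
import math

def softmax_confidence(log_scores):
    """Map {label: log-probability or similar score} to normalized confidences."""
    peak = max(log_scores.values())          # subtract the max for numerical stability
    exps = {k: math.exp(v - peak) for k, v in log_scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical scores for three candidate third-language items.
scores = {"CHECK_BALANCE": 2.0, "TRANSFER_MONEY": 0.5, "REPORT_LOSS": -1.0}
confidence = softmax_confidence(scores)
best = max(confidence, key=confidence.get)
```

The resulting values are positive and sum to 1, so they can be compared directly against fixed thresholds.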

According to the natural expression processing method based on natural intelligence of the embodiment of the invention, the information granularity of the second language information is 1/10 to 1/1000 of the information granularity of characters.

According to the natural expression processing method based on natural intelligence, when the pairing data of the second language information and the third language information is subjected to loop iteration, a conversion model from the second language information to the third language information is also subjected to loop optimization.

According to the natural-intelligence-based natural expression processing method, the second language information obtained by loop iteration is used to test the conversion of second language information into third language information; any second language information that cannot be converted correctly is written into a comparison table together with the third language information that correctly corresponds to it, and, for subsequently input natural expressions, the second language information converted from the natural expression is compared with the second language information stored in the comparison table.
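A comparison table of this kind can be sketched as a lookup that overrides the conversion model for inputs it is known to get wrong. The class below is a hypothetical illustration: the source only says the input is "compared" with the table, so the exact-match lookup is an assumption, and `toy_model` is a stand-in for a trained conversion model.

```python
class Converter:
    def __init__(self, model):
        self.model = model            # callable: symbols -> label (hypothetical)
        self.comparison_table = {}    # known failures: symbols -> correct label

    def test_and_record(self, test_pairs):
        """Test the model and record every input it converts incorrectly."""
        for symbols, expected in test_pairs:
            if self.model(symbols) != expected:
                self.comparison_table[tuple(symbols)] = expected

    def convert(self, symbols):
        """Consult the comparison table first, then fall back to the model."""
        key = tuple(symbols)
        if key in self.comparison_table:
            return self.comparison_table[key]
        return self.model(symbols)

def toy_model(symbols):
    # Hypothetical stand-in for the trained conversion model.
    return "CHECK_BALANCE" if "c3" in symbols else "UNKNOWN"

converter = Converter(toy_model)
converter.test_and_record([
    (["c3", "d2"], "CHECK_BALANCE"),
    (["e5", "f9"], "TRANSFER_MONEY"),
])
answer = converter.convert(["e5", "f9"])
```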

According to an aspect of the present invention, there is provided a natural expression processing and responding method based on natural intelligence, including: obtaining third language information by a natural expression processing method according to the foregoing; calling or generating a standard response matched with the third language information; outputting the standard response in a manner corresponding to the first language information.

According to the natural expression processing and responding method based on natural intelligence of the embodiment of the invention, the standard response is fixed data stored in the response database in advance, or is generated based on variable parameters and basic data of the standard response stored in the response database in advance.
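The two kinds of standard response, fixed data versus a response generated from variable parameters and stored basic data, can be illustrated as follows. The response database contents, keys and parameter names are invented for the example.

```python
from string import Template

# Hypothetical response database: one fixed response, one generated from a template.
RESPONSE_DB = {
    "GREETING": {"fixed": "Hello, how can I help you?"},
    "BALANCE": {"template": Template("Your balance is $amount $currency.")},
}

def make_response(standard_expression, **params):
    """Return the fixed response, or fill the stored template with parameters."""
    entry = RESPONSE_DB[standard_expression]
    if "fixed" in entry:
        return entry["fixed"]
    return entry["template"].substitute(**params)

greeting = make_response("GREETING")
balance = make_response("BALANCE", amount="120.50", currency="CNY")
```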

According to an aspect of the present invention, there is provided a natural expression processing and responding apparatus based on natural intelligence, including: a dialogue gateway, a central controller, an MAU workstation, a robot, an expression database, a response database and a response generator, wherein the dialogue gateway receives a natural expression from a user, sends it to the central controller for subsequent processing, and sends the response to the natural expression back to the user; the central controller receives the natural expression from the dialogue gateway, cooperates with the robot and the MAU workstation to convert the natural expression into a standard expression representing its meaning, and instructs the response generator to generate a standard response corresponding to that standard expression; the robot, on instruction from the central controller, converts the natural expression into secondary language information, wherein the order of magnitude of the information granularity of the secondary language information lies between that of the natural expression and that of characters, and converts the secondary language information into a standard expression; the MAU workstation presents the natural expression to an external MAU human agent, the MAU human agent inputs or selects a standard expression through the MAU workstation, and the MAU workstation then sends the standard expression to the central controller; the training database stores the pairing data of secondary language information and standard expressions; the response database stores response-related data, including standard response data to be called and/or data for generating a response; the response generator receives instructions from the central controller and generates the response to the user's natural expression by calling and/or operating on the data in the response database; wherein the apparatus further comprises a trainer for training the robot to convert natural expressions into standard expressions, wherein, for the paired secondary language information and standard expressions already in the training database, the trainer causes the robot to perform loop-iteration comparison over the various permutations and combinations of the elements of the secondary language information and the standard expression, establishes correspondences between the permutations and combinations of the elements of the secondary language information and the standard expression or the permutations and combinations of its elements, obtains more pairing data of secondary language information and standard expressions, and stores that data in the training database.

According to an aspect of the present invention, there is provided a human-computer interaction system based on natural intelligence, including: a natural expression processing and responding device and a calling device, wherein the user communicates with the natural expression processing and responding device through the calling device, and an MAU human agent operates the natural expression processing and responding device manually, and the natural expression processing and responding device comprises: a dialogue gateway, a central controller, an MAU workstation, a robot, an expression database, a response database and a response generator, wherein the dialogue gateway receives a natural expression from a user, sends it to the central controller for subsequent processing, and sends the response to the natural expression back to the user; the central controller receives the natural expression from the dialogue gateway, cooperates with the robot and the MAU workstation to convert the natural expression into a standard expression representing its meaning, and instructs the response generator to generate a standard response corresponding to that standard expression; the robot, on instruction from the central controller, converts the natural expression into secondary language information, wherein the order of magnitude of the information granularity of the secondary language information lies between that of the natural expression and that of characters, and converts the secondary language information into a standard expression; the MAU workstation presents the natural expression to the MAU human agent, the MAU human agent inputs or selects a standard expression through the MAU workstation, and the MAU workstation then sends the standard expression to the central controller; the training database stores the pairing data of secondary language information and standard expressions; the response database stores response-related data, including standard response data to be called and/or data for generating a response; the response generator receives instructions from the central controller and generates the response to the user's natural expression by calling and/or operating on the data in the response database; wherein the device further comprises a trainer for training the robot to convert natural expressions into standard expressions, wherein, for the paired secondary language information and standard expressions already in the training database, the trainer causes the robot to perform loop iteration over the various permutations and combinations of the elements of the secondary language information and the standard expression, establishes correspondences between the permutations and combinations of the elements of the secondary language information and the standard expression or the permutations and combinations of its elements, obtains more pairing data of secondary language information and standard expressions, and stores that data in the training database.

According to an aspect of the present invention, there is provided a natural expression processing method based on natural intelligence, including: receiving a first natural expression; converting the first natural expression into secondary language information; calculating the confidence with which the secondary language information converted from the first natural expression is converted into a standard expression in a database; and, when the calculated confidence for a given standard expression is not lower than a first confidence threshold, outputting that standard expression as the result of understanding the first natural expression.

According to the natural expression processing method based on natural intelligence, when all of the calculated confidences are lower than a second confidence threshold, the user is prompted to input a second natural expression having the same meaning as the first natural expression.

According to the natural intelligence-based natural expression processing method, the second natural expression is converted into the secondary language information, the confidence degree of the secondary language information converted from the second natural expression into the standard expression in the database is calculated, and when the confidence degree of a certain standard expression is not lower than the first confidence threshold value, the standard expression is output as a result of understanding the first natural expression.

According to the natural expression processing method based on natural intelligence, when the calculated confidence degree of a standard expression is lower than a first confidence degree threshold value but not lower than a second confidence degree threshold value, a third natural expression is prompted to be input to confirm whether the standard expression corresponds to the meaning of the first natural expression.

According to the natural intelligence-based natural expression processing method of the embodiment of the invention, the third natural expression is converted into the secondary language information, the confidence level of converting the secondary language information converted from the third natural expression into the second standard expression representing the meaning of "confirmation" is calculated, and if the confidence level is not lower than the first confidence level threshold value, the first standard expression is output as a result of understanding the first natural expression.

According to the natural intelligence-based natural expression processing method of the embodiment of the invention, the secondary language information converted by the first natural expression and the first standard expression are stored in the database as the paired data.

According to the natural expression processing method based on natural intelligence of the embodiment of the invention, if the calculated confidence is lower than the first confidence threshold or another confidence threshold, the first natural expression is passed to manually assisted understanding or other manual processing.

According to the natural-intelligence-based natural expression processing method, the confidence is calculated from the correspondence between the secondary language information and the standard expression: log probabilities or similar scores for one or more standard expressions are generated by one or more of a deep neural network, a finite state transducer and an auto-encoder/decoder, and the confidence of the one or more standard expressions is then calculated with a normalized exponential function (softmax).
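The two-threshold logic described in the preceding paragraphs can be summarized in a small decision function. The threshold values 0.8 and 0.5 and all names are illustrative assumptions; the source does not specify concrete values.

```python
from enum import Enum

class Action(Enum):
    ACCEPT = "accept"        # output the standard expression
    CONFIRM = "confirm"      # prompt for a confirming (third) natural expression
    REPHRASE = "rephrase"    # prompt for a second natural expression with the same meaning

def decide(confidences, first_threshold=0.8, second_threshold=0.5):
    """Pick an action from {standard_expression: confidence} using two thresholds."""
    best = max(confidences, key=confidences.get)
    score = confidences[best]
    if score >= first_threshold:
        return Action.ACCEPT, best
    if score >= second_threshold:
        return Action.CONFIRM, best
    return Action.REPHRASE, None

accepted = decide({"CHECK_BALANCE": 0.9, "TRANSFER_MONEY": 0.1})
to_confirm = decide({"CHECK_BALANCE": 0.6, "TRANSFER_MONEY": 0.4})
to_rephrase = decide({"CHECK_BALANCE": 0.4, "TRANSFER_MONEY": 0.3, "REPORT_LOSS": 0.3})
```

A deployment could equally route the `CONFIRM` or `REPHRASE` cases to manually assisted understanding instead of prompting the user; the sketch shows only the automated branches.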

According to the natural expression processing method based on natural intelligence, the magnitude of information granularity of the secondary language information is smaller than that of the information granularity of the characters.

According to the natural expression processing method based on natural intelligence of the embodiment of the invention, the information granularity of the secondary language information is 1/10 to 1/1000 of the information granularity of characters.

According to the natural expression processing method based on natural intelligence, for existing paired secondary language information and standard expression in a database, various permutation and combination of elements of the secondary language information and various permutation and combination of elements of the standard expression or the standard expression are subjected to loop iteration, corresponding relations between various permutation and combination of elements of the secondary language information and various permutation and combination of elements of the standard expression or the standard expression are established, and more pairing data of the secondary language information and the standard expression are obtained and stored in the database.

According to the natural-intelligence-based natural expression processing method, the secondary language information obtained by loop iteration is used to test the conversion of secondary language information into standard expressions; any secondary language information that cannot be converted correctly is written into a comparison table together with the standard expression that correctly corresponds to it, and, for subsequently input natural expressions, the secondary language information converted from the natural expression is compared with the secondary language information stored in the comparison table.

According to the natural expression processing method based on natural intelligence, when the pairing data of the secondary language information and the standard expression is subjected to loop iteration, a conversion model from the secondary language information to the standard expression is also subjected to loop optimization.

According to an aspect of the present invention, there is provided a natural expression processing and responding method based on natural intelligence, including: obtaining a first standard expression by the foregoing natural expression processing method; calling or generating a standard response matched with the first standard expression; and outputting the standard response in a manner corresponding to the first natural expression.

According to an aspect of the present invention, there is provided a natural expression processing and responding apparatus based on natural intelligence, including: a dialogue gateway, a central controller, an MAU workstation, a robot, a training database, a response database and a response generator, wherein the dialogue gateway receives a natural expression from a user, sends it to the central controller for subsequent processing, and sends the response to the natural expression back to the user; the central controller receives the natural expression from the dialogue gateway, cooperates with the robot and the MAU workstation to convert the natural expression into a standard expression representing its meaning, and instructs the response generator to generate a standard response corresponding to that standard expression; the robot, on instruction from the central controller, converts the natural expression into secondary language information, calculates the confidence with which the secondary language information converted from the natural expression is converted into a standard expression in the training database, and converts the secondary language information into a given standard expression when the calculated confidence for that standard expression is not lower than a first confidence threshold; the MAU workstation presents the natural expression to an external MAU human agent, the MAU human agent inputs or selects a standard expression through the MAU workstation, and the MAU workstation then sends the standard expression to the central controller; the training database stores the pairing data of secondary language information and standard expressions; the response database stores response-related data, including standard response data to be called and/or data for generating a response; and the response generator receives instructions from the central controller and generates responses to the user's natural expressions by invoking and/or executing data in the response database.

According to an aspect of the present invention, there is provided a human-computer interaction system based on natural intelligence, including: a natural expression processing and responding device and a calling device, wherein the user communicates with the natural expression processing and responding device through the calling device, and an MAU human agent operates the natural expression processing and responding device manually, and the natural expression processing and responding device comprises: a dialogue gateway, a central controller, an MAU workstation, a robot, a training database, a response database and a response generator, wherein the dialogue gateway receives a natural expression from a user, sends it to the central controller for subsequent processing, and sends the response to the natural expression back to the user; the central controller receives the natural expression from the dialogue gateway, cooperates with the robot and the MAU workstation to convert the natural expression into a standard expression representing its meaning, and instructs the response generator to generate a standard response corresponding to that standard expression; the robot, on instruction from the central controller, converts the natural expression into secondary language information, calculates the confidence with which the secondary language information converted from the natural expression is converted into a standard expression in the training database, and converts the secondary language information into a given standard expression when the calculated confidence for that standard expression is not lower than a first confidence threshold; the MAU workstation presents the natural expression to the MAU human agent, the MAU human agent inputs or selects a standard expression through the MAU workstation, and the MAU workstation then sends the standard expression to the central controller; the training database stores the pairing data of secondary language information and standard expressions; the response database stores response-related data, including standard response data to be called and/or data for generating a response; and the response generator receives instructions from the central controller and generates responses to the user's natural expressions by invoking and/or executing data in the response database.

According to an aspect of the present invention, there is provided a natural expression processing method based on natural intelligence, including: setting a plurality of standard expressions respectively corresponding to a plurality of intentions in a database, receiving the natural expressions, converting the natural expressions into secondary language information, acquiring parts corresponding to the plurality of intentions from the secondary language information, and respectively converting the acquired parts of the secondary language information corresponding to the plurality of intentions into the standard expressions, wherein the order of magnitude of information granularity of the secondary language information is smaller than that of the information granularity of characters.
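Acquiring the parts of the secondary language information that correspond to several intentions can be pictured as scanning a symbol sequence for known patterns. The sketch below assumes that each intention is recognized by a contiguous symbol pattern; this matching rule, the pattern table and all names are hypothetical and stand in for whatever trained conversion is actually used.

```python
# Hypothetical table: symbol pattern -> (intention, standard expression).
INTENT_PATTERNS = {
    ("d4", "d5"): ("destination", "DEST_SHANGHAI"),
    ("t1", "t2"): ("date", "DATE_TOMORROW"),
    ("c9",): ("cabin", "CABIN_ECONOMY"),
}

def extract_intents(symbols):
    """Scan the sequence once and convert each matched part to a standard expression."""
    found = {}
    i = 0
    while i < len(symbols):
        for pattern, (intent, expression) in INTENT_PATTERNS.items():
            if tuple(symbols[i:i + len(pattern)]) == pattern:
                found[intent] = expression
                i += len(pattern)     # skip past the matched part
                break
        else:
            i += 1                    # no pattern starts here; move on
    return found

result = extract_intents(["x0", "d4", "d5", "x1", "t1", "t2", "c9"])
```

A single natural expression thus yields one standard expression per intention, here a destination, a date and a cabin class.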

According to the natural intelligence-based natural expression processing method of the embodiment of the invention, secondary language information obtained by conversion of natural expression and a plurality of standard expressions obtained by conversion of the secondary language information and corresponding to a plurality of intentions respectively are stored as pairing data in a database, loop iteration is performed on various permutation combinations of elements of the secondary language information and combinations of the plurality of standard expressions or various permutation combinations of elements of the combination of the plurality of standard expressions, a correspondence relationship between the various permutation combinations of the elements of the secondary language information and the combinations of the plurality of standard expressions or the various permutation combinations of the elements of the combination of the plurality of standard expressions is established, and more pairing data of the secondary language information and the standard expression combinations are obtained and stored in the database.

According to the natural-intelligence-based natural expression processing method of the embodiment of the present invention, after secondary language information is obtained from an input natural expression, it is compared with the secondary language information already in the database; the standard expression or standard expression combination corresponding to it is then determined from the comparison result, and/or the probability that it correctly corresponds to a given standard expression is calculated. If the machine's understanding ability is not yet mature enough, so that it is unable or uncertain to convert the secondary language information into a given standard expression, manually assisted understanding is performed: the input natural expression is understood by a human to obtain the standard expression or standard expression combination corresponding to one or more intentions, the secondary language information obtained from the natural expression (or the natural expression itself) is associated with that standard expression or standard expression combination, and the resulting new pairing data is stored in the database.

According to the natural intelligence-based natural expression processing method of the embodiment of the invention, for the pairing data of the new secondary language information and the standard expression or the standard expression combination or the pairing data of the new natural expression and the standard expression or the standard expression combination, the various permutation combinations of the elements of the secondary language information or the secondary language information converted from the natural expression and the standard expression or the standard expression combination or the various permutation combinations of the elements of the standard expression or the standard expression combination are subjected to cyclic iteration to establish the corresponding relation between the various permutation combinations of the elements of the secondary language information and the standard expression or the standard expression combination or the various permutation combinations of the elements of the standard expression or the standard expression combination, so as to obtain more pairing data of the secondary language information and the standard expression or the standard expression combination, and stored in a database.

According to the natural expression processing method based on natural intelligence, erroneous correspondences between secondary language information and standard expressions or standard expression combinations in the database are corrected through human-assisted understanding.

According to the natural expression processing method based on natural intelligence, the machine's understanding ability is measured by a confidence value, and the confidence is calculated based on the correspondence between secondary language information and standard expressions.

According to the natural intelligence-based natural expression processing method, after the secondary language information is obtained from the natural expression, log probabilities or similar scores for one or more standard expressions are generated by one or more of a deep neural network, a finite state transducer, and an auto-encoder/decoder, and the confidence for each of those standard expressions is then calculated with a normalized exponential function (softmax).
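
The confidence calculation described above can be illustrated with a minimal sketch. It assumes that some upstream model has already produced a log-probability-like score for each candidate standard expression; the expression names and score values below are hypothetical, and the code only shows the normalized exponential function (commonly called softmax), not the patent's actual implementation.

```python
import math

def softmax_confidence(log_scores):
    """Turn per-expression log-probabilities (or similar scores) into confidences."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    peak = max(log_scores.values())
    exps = {name: math.exp(score - peak) for name, score in log_scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

# Hypothetical scores for three candidate standard expressions.
scores = {"check_balance": -0.2, "transfer_funds": -2.3, "report_lost_card": -4.0}
confidences = softmax_confidence(scores)
best = max(confidences, key=confidences.get)
```

The resulting values sum to one, so the largest of them can be compared directly against a confidence threshold.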

According to the natural expression processing method based on natural intelligence, the information granularity of the secondary language information is 1/10 to 1/1000 of the information granularity of characters.

According to the natural expression processing method based on natural intelligence of the embodiment of the present invention, the portions corresponding to a plurality of intentions are acquired from the secondary language information through multiple rounds of understanding or multiple sessions.

According to the natural expression processing method based on natural intelligence of the embodiment of the present invention, a plurality of upper intentions are set in a database, a plurality of lower intentions are set for each upper intention, portions corresponding to the respective lower intentions of different upper intentions are acquired from secondary language information in one intention acquisition operation, and the portions are converted into standard expressions.
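
As a rough illustration of the upper/lower intention structure, the sketch below stores each upper intention with its lower intentions and, in a single pass, picks the lower intention present in the input for each upper intention. Everything here is an assumption for illustration: the intention names are invented, and plain word strings stand in for the elements of secondary language information, which in the described method are much finer-grained than characters.

```python
# Hypothetical intention hierarchy: each upper intention has several lower intentions.
INTENT_TREE = {
    "destination": ["beijing", "shanghai"],
    "travel_class": ["economy", "business"],
}

def acquire_intents(elements, intent_tree):
    """In one acquisition pass, find the lower intention matched under each upper intention."""
    found = {}
    for upper, lowers in intent_tree.items():
        for lower in lowers:
            if lower in elements:
                found[upper] = lower
                break  # at most one lower intention per upper intention in this sketch
    return found

# Word strings stand in for elements of secondary language information.
elements = ["book", "business", "ticket", "to", "shanghai"]
result = acquire_intents(elements, INTENT_TREE)
```

Each entry of `result` would then be converted into its own standard expression.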

According to the natural expression processing method based on natural intelligence of the embodiment of the invention, for a standard expression corresponding to one of a plurality of intentions, or a standard expression combination corresponding to some of the plurality of intentions, the standard expression or standard expression combination and the natural expression or secondary language information corresponding to it are stored in a database in advance as paired training data, and training is performed using the paired training data.

According to an aspect of the present invention, there is provided a natural expression processing and responding method based on natural intelligence, including: obtaining a standard expression or a combination of standard expressions by the foregoing natural expression processing method; invoking or generating a standard response that matches the standard expression or combination of standard expressions; and outputting the standard response in a manner corresponding to the natural expression.

According to the natural expression processing and responding method of an embodiment of the present invention, the standard response is either fixed data stored in advance in the response database, or is generated from variable parameters and basic standard-response data stored in advance in the response database.
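
The two kinds of standard response can be sketched as follows. This is only an illustration under assumed names: `RESPONSE_DB`, the expression keys, and the parameter names are hypothetical, and Python's `string.Template` is used merely as one convenient way to combine stored basic data with variable parameters.

```python
from string import Template

# Hypothetical response database: fixed responses, and templates filled from variable parameters.
RESPONSE_DB = {
    "greeting": {"fixed": "Hello, how can I help you?"},
    "balance_query": {"template": Template("Your balance is $amount $currency.")},
}

def make_response(standard_expression, params=None):
    """Return the stored fixed response, or generate one from the template and parameters."""
    entry = RESPONSE_DB[standard_expression]
    if "fixed" in entry:
        return entry["fixed"]
    return entry["template"].substitute(params or {})

fixed_reply = make_response("greeting")
generated_reply = make_response("balance_query", {"amount": "100", "currency": "CNY"})
```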

According to an aspect of the present invention, there is provided a natural expression processing and responding apparatus based on natural intelligence, comprising a dialogue gateway, a central controller, an MAU workstation, a robot, an expression database, a response database, and a response generator, wherein: the dialogue gateway receives a natural expression from a user, sends it to the central controller for subsequent processing, and sends the response to the natural expression back to the user; the central controller receives the natural expression from the dialogue gateway, works cooperatively with the robot and the MAU workstation to convert the natural expression into a plurality of standard expressions corresponding to a plurality of set intentions, and instructs the response generator, according to the standard expressions, to generate the corresponding standard responses; the robot, as instructed by the central controller, converts the natural expression into secondary language information, acquires from it the portions corresponding to the plurality of intentions, and converts each acquired portion into a standard expression, the order of magnitude of the information granularity of the secondary language information being smaller than that of characters; the MAU workstation presents the natural expression to an external MAU human agent, who inputs or selects a standard expression through the MAU workstation, which then sends the standard expression to the central controller; the training database stores pairing data of secondary language information and standard expressions or standard expression combinations; the response database stores response-related data, including standard response data to be called and/or data for generating responses; the response generator receives instructions from the central controller and generates the response to the user's natural expression by calling and/or running data in the response database; and a trainer is used to train the robot to convert natural expressions into standard expressions or standard expression combinations.

According to an aspect of the present invention, there is provided a human-computer interaction system based on natural intelligence, comprising natural expression processing and responding equipment and calling equipment, wherein a user communicates with the natural expression processing and responding equipment through the calling equipment, and an MAU human agent manually operates the natural expression processing and responding equipment. The natural expression processing and responding equipment comprises a dialogue gateway, a central controller, an MAU workstation, a robot, an expression database, a response database, and a response generator, wherein: the dialogue gateway receives a natural expression from the user, sends it to the central controller for subsequent processing, and sends the response to the natural expression back to the user; the central controller receives the natural expression from the dialogue gateway, works cooperatively with the robot and the MAU workstation to convert the natural expression into a plurality of standard expressions corresponding to a plurality of set intentions, and instructs the response generator, according to the standard expressions, to generate the corresponding standard responses; the robot, as instructed by the central controller, converts the natural expression into secondary language information, acquires from it the portions corresponding to the plurality of intentions, and converts each acquired portion into a standard expression, the order of magnitude of the information granularity of the secondary language information being smaller than that of characters; the MAU workstation presents the natural expression to an external MAU human agent, who inputs or selects a standard expression through the MAU workstation, which then sends the standard expression to the central controller; the training database stores pairing data of secondary language information and standard expressions or standard expression combinations; the response database stores response-related data, including standard response data to be called and/or data for generating responses; the response generator receives instructions from the central controller and generates the response to the user's natural expression by calling and/or running data in the response database; and a trainer is used to train the robot to convert natural expressions into standard expressions or standard expression combinations.

According to an aspect of the present invention, there is provided a natural expression processing method based on natural intelligence, including: receiving and storing a natural expression; converting the natural expression into secondary language information; calculating the confidence with which the secondary language information converted from the natural expression is converted into a standard expression in a database; when the confidence calculated for a first standard expression is not lower than a first confidence threshold, outputting the first standard expression as the result of understanding the natural expression; when the confidence is lower than the first confidence threshold, having a silent agent (silence seat) interpret the stored natural expression; when the silent agent can understand the natural expression, having the silent agent input a second standard expression obtained by that understanding; and when the silent agent cannot understand the natural expression, having the silent agent prompt the user to input a natural expression with the same meaning again, or transfer the stored natural expression to a senior agent (high-level seat) to be understood and answered.
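
The threshold-based hand-off between the robot and the human agents can be sketched as a small routing function. This is a simplified illustration, not the patent's implementation: the confidences are assumed to be already computed, the silent agent (the "silence seat" above) is modelled as a hypothetical callable that returns a standard expression or `None`, and the final fallback is collapsed into a single "re-prompt or escalate" outcome.

```python
def route(confidences, threshold, silent_agent):
    """Decide who supplies the standard expression for one stored natural expression."""
    best = max(confidences, key=confidences.get)
    if confidences[best] >= threshold:
        # Confidence is not lower than the first threshold: the robot's result is used.
        return ("robot", best)
    answer = silent_agent()
    if answer is not None:
        # The silent agent understood the expression and entered a second standard expression.
        return ("silent_agent", answer)
    # Otherwise the user is asked to rephrase, or the case goes to the senior agent.
    return ("reprompt_or_escalate", None)

confident = route({"check_balance": 0.9, "transfer_funds": 0.1}, 0.8, lambda: None)
assisted = route({"check_balance": 0.5, "transfer_funds": 0.5}, 0.8, lambda: "transfer_funds")
unresolved = route({"check_balance": 0.5, "transfer_funds": 0.5}, 0.8, lambda: None)
```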

According to the natural expression processing method based on natural intelligence, a knowledge base designer builds the scripted phrasing (jargon) in the back end according to the senior agent's (high-level seat's) responses to natural expressions that the silent agent (silence seat) could not understand.

According to the natural intelligence-based natural expression processing method of the embodiment of the invention, the secondary language information converted from the natural expression and the second standard expression are stored in the database as pairing data.

According to the natural intelligence-based natural expression processing method, the confidence is calculated based on the correspondence between secondary language information and standard expressions: log probabilities or similar scores for one or more standard expressions are generated by one or more of a deep neural network, a finite state transducer, and an auto-encoder/decoder, and the confidence for each of those standard expressions is then calculated with a normalized exponential function (softmax).

According to an aspect of the present invention, there is provided a natural expression processing and responding method based on natural intelligence, including: obtaining a first standard expression or a second standard expression by the natural expression processing method; calling or generating a standard response matched with the first standard expression or the second standard expression; the standard response is output in a manner corresponding to the natural expression.

According to an aspect of the present invention, there is provided a natural expression processing and responding apparatus based on natural intelligence, comprising a dialogue gateway, a central controller, an MAU workstation, a robot, a training database, a response database, and a response generator, wherein: the dialogue gateway receives a natural expression from a user, sends it to the central controller for subsequent processing, and sends the response to the natural expression back to the user; the central controller receives the natural expression from the dialogue gateway, works cooperatively with the robot and the MAU workstation to convert the natural expression into a standard expression representing its meaning, and instructs the response generator, according to the standard expression, to generate the corresponding standard response; the robot, as instructed by the central controller, converts the natural expression into secondary language information, calculates the confidence with which that secondary language information is converted into a standard expression in the training database, and converts the secondary language information into a first standard expression when the confidence calculated for the first standard expression is not lower than a first confidence threshold; the MAU workstation presents the natural expression to an external MAU human agent, the MAU human agents comprising a silent agent (silence seat) and a senior agent (high-level seat); the silent agent inputs or selects a standard expression through the MAU workstation, which then sends the standard expression to the central controller; when the calculated confidence is lower than the first confidence threshold, the silent agent interprets the stored natural expression; when the silent agent can understand the natural expression, the silent agent inputs the second standard expression obtained by that understanding; when the silent agent cannot understand the natural expression, the silent agent prompts the user to input a natural expression with the same meaning again, or transfers the stored natural expression to the senior agent to be understood and answered; the training database stores pairing data of secondary language information and standard expressions; the response database stores response-related data, including standard response data to be called and/or data for generating responses; and the response generator receives instructions from the central controller and generates responses to the user's natural expressions by calling and/or executing data in the response database.

According to an aspect of the present invention, there is provided a human-computer interaction system based on natural intelligence, comprising natural expression processing and responding equipment and calling equipment, wherein a user communicates with the natural expression processing and responding equipment through the calling equipment, and an MAU human agent manually operates the natural expression processing and responding equipment. The natural expression processing and responding equipment comprises a dialogue gateway, a central controller, an MAU workstation, a robot, a training database, a response database, and a response generator, wherein: the dialogue gateway receives a natural expression from the user, sends it to the central controller for subsequent processing, and sends the response to the natural expression back to the user; the central controller receives the natural expression from the dialogue gateway, works cooperatively with the robot and the MAU workstation to convert the natural expression into a standard expression representing its meaning, and instructs the response generator, according to the standard expression, to generate the corresponding standard response; the robot, as instructed by the central controller, converts the natural expression into secondary language information, calculates the confidence with which that secondary language information is converted into a standard expression in the training database, and converts the secondary language information into a first standard expression when the confidence calculated for the first standard expression is not lower than a first confidence threshold; the MAU workstation presents the natural expression to the MAU human agent, the MAU human agents comprising a silent agent (silence seat) and a senior agent (high-level seat); the silent agent inputs or selects a standard expression through the MAU workstation, which then sends the standard expression to the central controller; when the calculated confidence is lower than the first confidence threshold, the silent agent interprets the stored natural expression; when the silent agent can understand the natural expression, the silent agent inputs the second standard expression obtained by that understanding; when the silent agent cannot understand the natural expression, the silent agent prompts the user to input a natural expression with the same meaning again, or transfers the stored natural expression to the senior agent to be understood and answered; the training database stores pairing data of secondary language information and standard expressions; the response database stores response-related data, including standard response data to be called and/or data for generating responses; and the response generator receives instructions from the central controller and generates responses to the user's natural expressions by calling and/or executing data in the response database.

According to an aspect of the present invention, there is provided a method for training a human-computer interaction system based on natural intelligence, comprising: generating text scripts corresponding to a standard expression; obtaining speech corresponding to the text scripts through a text-to-speech tool; converting each piece of speech into secondary language information, the order of magnitude of the information granularity of the secondary language information being smaller than that of characters; storing the secondary language information and the corresponding standard expression in a database as pairing data; and, for the pairs of secondary language information and standard expression already in the database, performing loop iteration between the various permutations and combinations of the elements of the secondary language information and the standard expression, or the various permutations and combinations of the elements of the standard expression, so as to establish the correspondences between them, obtain further pairing data of secondary language information and standard expressions, and store them in the database.
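
The overall shape of this training procedure can be sketched as below. The sketch is heavily simplified and rests on stated assumptions: `text_to_speech` and `to_secondary` are placeholders (a real system would synthesize audio and emit units far finer than characters, whereas here lower-cased words stand in for them), and the "loop iteration over permutations and combinations" is approximated by one possible reading, namely enumerating order-preserving pairs of elements and keeping those that occur with exactly one standard expression.

```python
from collections import defaultdict
from itertools import combinations

def text_to_speech(script):
    # Placeholder: a real system would call a text-to-speech tool here.
    return script.lower()

def to_secondary(speech):
    # Placeholder: words stand in for the much finer-grained secondary language elements.
    return tuple(speech.split())

def build_pairs(scripts_by_expression):
    """Create (secondary language information, standard expression) pairing data."""
    pairs = []
    for expression, scripts in scripts_by_expression.items():
        for script in scripts:
            pairs.append((to_secondary(text_to_speech(script)), expression))
    return pairs

def expand_pairs(pairs, size=2):
    """Map each order-preserving combination of `size` elements to its standard expression."""
    seen = defaultdict(set)
    for elements, expression in pairs:
        for combo in combinations(elements, size):
            seen[combo].add(expression)
    # Keep only combinations that point to exactly one standard expression.
    return {combo: next(iter(exprs)) for combo, exprs in seen.items() if len(exprs) == 1}

# Hypothetical text scripts for two standard expressions.
scripts = {
    "check_balance": ["What is my balance", "Show my balance"],
    "report_lost_card": ["I lost my card"],
}
pairs = build_pairs(scripts)
extra = expand_pairs(pairs)
```

In this toy run the element combinations in `extra` act as additional pairing data derived from the three original pairs.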

According to the method for training the human-computer interaction system based on natural intelligence of the embodiment of the invention, speech is input and converted into secondary language information, which is compared with the secondary language information already in the database. Based on the comparison result, the standard expression corresponding to the secondary language information is determined, and/or the probability that the secondary language information correctly corresponds to a given standard expression is calculated. If the machine's understanding ability is not yet mature enough to convert the secondary language information into a standard expression, or the conversion is uncertain, human-assisted understanding is performed: the input speech is understood manually to obtain a standard expression, and the secondary language information obtained from the speech is associated with that standard expression to form new pairing data, which are stored in the database.

According to the method for training the human-computer interaction system based on natural intelligence, for new pairing data of secondary language information (or of a natural expression) and a standard expression or standard expression combination, loop iteration is performed between the various permutations and combinations of the elements of the secondary language information (including secondary language information converted from the natural expression) and the standard expression or standard expression combination, or the various permutations and combinations of the elements of the standard expression or standard expression combination. This establishes the correspondences between them and yields further pairing data of secondary language information and standard expressions or standard expression combinations, which are stored in the database.

According to the method for training the human-computer interaction system based on natural intelligence, erroneous correspondences between secondary language information and standard expressions or standard expression combinations in the database are corrected through human-assisted understanding.

According to the method for training the human-computer interaction system based on natural intelligence, the machine's understanding ability is measured by a confidence value, and the confidence is calculated based on the correspondence between secondary language information and standard expressions.

According to the method for training the human-computer interaction system based on natural intelligence, after the secondary language information is obtained from the natural expression, log probabilities or similar scores for one or more standard expressions are generated by one or more of a deep neural network, a finite state transducer, and an auto-encoder/decoder, and the confidence for each of those standard expressions is then calculated with a normalized exponential function (softmax).

According to the method for training the human-computer interaction system based on natural intelligence of the embodiment of the invention, the information granularity of the secondary language information is 1/10 to 1/1000 of the information granularity of characters.

According to the method for training the human-computer interaction system based on natural intelligence, the secondary language information obtained by loop iteration is used to test the conversion from secondary language information to standard expressions; secondary language information that cannot be converted correctly is written into a comparison table together with its corresponding standard expression; and for a subsequently input natural expression, the secondary language information converted from it is compared with the secondary language information stored in the comparison table.
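
A comparison table of this kind can be sketched as a simple lookup that records the cases the conversion model gets wrong. The sketch makes several assumptions: `toy_convert` is a deliberately weak hypothetical stand-in for the conversion model, tuples of words stand in for secondary language information, and consulting the table before the model is one plausible ordering rather than something the text specifies.

```python
def build_comparison_table(pairs, convert):
    """Record secondary language information that the model converts incorrectly."""
    table = {}
    for secondary, expected in pairs:
        if convert(secondary) != expected:
            table[secondary] = expected
    return table

def understand(secondary, table, convert):
    # Consult the comparison table first (an assumed ordering), then fall back to the model.
    if secondary in table:
        return table[secondary]
    return convert(secondary)

def toy_convert(secondary):
    # Hypothetical, deliberately weak model: it only recognizes balance queries.
    return "check_balance" if "balance" in secondary else "unknown"

pairs = [
    (("my", "balance"), "check_balance"),
    (("lost", "card"), "report_lost_card"),
]
table = build_comparison_table(pairs, toy_convert)
corrected = understand(("lost", "card"), table, toy_convert)
```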

According to the method for training the human-computer interaction system based on natural intelligence of the embodiment of the invention, when loop iteration is performed on the pairing data of secondary language information and standard expressions, the conversion model from secondary language information to standard expressions is also iteratively optimized.

According to an aspect of the present invention, there is provided a speech processing and responding apparatus based on natural intelligence, comprising a dialogue gateway, a central controller, an MAU workstation, a robot, a training database, a response generator, and a text-to-speech converter, wherein: the dialogue gateway receives speech from a user, sends it to the central controller for subsequent processing, and sends the response to the speech back to the user; the central controller receives the speech from the dialogue gateway, works cooperatively with the robot and the MAU workstation to convert the speech into a standard expression representing its meaning, and instructs the response generator, according to the standard expression, to generate the corresponding standard response; the robot, as instructed by the central controller, converts the speech into secondary language information, the order of magnitude of the information granularity of the secondary language information being smaller than that of characters, and converts the secondary language information into a standard expression; the MAU workstation presents the speech to an external MAU human agent, who inputs or selects a standard expression through the MAU workstation, which then sends the standard expression to the central controller; the training database stores pairing data of secondary language information and standard expressions; the response database stores response-related data, including standard response data to be called and/or data for generating responses; the response generator receives instructions from the central controller and generates the response to the user's speech by calling and/or running data in the response database; the text-to-speech converter generates, from a text script corresponding to a standard expression, the speech corresponding to that text script; the robot converts the speech produced by the text-to-speech converter into secondary language information, and the pairing data formed by that secondary language information and the standard expression corresponding to the text script are stored in the training database; the apparatus further comprises a trainer for training the robot to convert speech into standard expressions; and the robot performs loop iteration between the various permutations and combinations of the elements of the secondary language information and the corresponding standard expression, or the various permutations and combinations of the elements of the standard expression, establishes the correspondences between them, and stores the resulting pairing data of secondary language information and standard expressions in the training database.

According to an aspect of the present invention, there is provided a human-computer interaction system based on natural intelligence, comprising natural expression processing and responding equipment and calling equipment, wherein a user communicates with the natural expression processing and responding equipment through the calling equipment, and an MAU human agent manually operates the natural expression processing and responding equipment. The natural expression processing and responding equipment comprises a dialogue gateway, a central controller, an MAU workstation, a robot, a training database, a response generator, and a text-to-speech converter, wherein: the dialogue gateway receives speech from the user, sends it to the central controller for subsequent processing, and sends the response to the speech back to the user; the central controller receives the speech from the dialogue gateway, works cooperatively with the robot and the MAU workstation to convert the speech into a standard expression representing its meaning, and instructs the response generator, according to the standard expression, to generate the corresponding standard response; the robot, as instructed by the central controller, converts the speech into secondary language information, the order of magnitude of the information granularity of the secondary language information being smaller than that of characters, and converts the secondary language information into a standard expression; the MAU workstation presents the speech to an external MAU human agent, who inputs or selects a standard expression through the MAU workstation, which then sends the standard expression to the central controller; the training database stores pairing data of secondary language information and standard expressions; the response database stores response-related data, including standard response data to be called and/or data for generating responses; the response generator receives instructions from the central controller and generates the response to the user's speech by calling and/or running data in the response database; the text-to-speech converter generates, from a text script corresponding to a standard expression, the speech corresponding to that text script; the robot converts the speech produced by the text-to-speech converter into secondary language information, and the pairing data formed by that secondary language information and the standard expression corresponding to the text script are stored in the training database; the equipment further comprises a trainer for training the robot to convert speech into standard expressions; and the robot performs loop iteration between the various permutations and combinations of the elements of the secondary language information and the corresponding standard expression, or the various permutations and combinations of the elements of the standard expression, establishes the correspondences between them, and stores the resulting pairing data of secondary language information and standard expressions in the training database.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description only relate to some embodiments of the present invention and are not limiting on the present invention.

FIG. 1 generally illustrates a layer-by-layer conversion process from collected sound waves (A language information) to Y language information;

FIG. 2 shows an example of conversion from collected sound waves (A language information) into Y language information;

FIG. 3 illustrates an example of identifying voice information;

FIG. 4 is a schematic diagram of multi-layer sensing;

FIG. 5 illustrates an example of converting a collected sound wave into X language information using a Gaussian mixture model;

FIG. 6 schematically shows a flow of a natural expression processing method according to one embodiment of the invention;

FIG. 7 schematically illustrates a flow of a natural expression processing and responding method according to one embodiment of the invention;

FIG. 8 schematically illustrates an information extraction and slot filling process for a natural intelligence based human-computer interaction system, according to one embodiment of the invention;

FIG. 9 further illustrates an exemplary natural expression slot filling process flow under an "order ticket" query term;

FIG. 10 schematically illustrates an intelligent human-computer interaction system according to an embodiment of the invention;

FIG. 11 further illustrates a portion of the structure of the intelligent answering device in the system of FIG. 10;

FIGS. 12A-12P schematically illustrate an operational interface of an intent retrieval and slot filling system according to an embodiment of the present invention;

FIG. 13 schematically illustrates a natural expression process of robotic understanding in combination with human-assisted understanding (MAU) according to an embodiment of the invention;

FIG. 14 schematically illustrates an example of an operator interface presented by the MAU workstation to the MAU human agent 9;

FIG. 15 illustrates an example of intelligent human-machine interaction.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the drawings of the embodiments. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the described embodiments without inventive effort fall within the scope of protection of the invention.

Unless defined otherwise, technical or scientific terms used herein have the ordinary meaning understood by one of ordinary skill in the art to which this invention belongs. The terms "first," "second," and similar terms in the description and claims of the present application do not denote any order, quantity, or importance; they are used only to distinguish one element from another. Likewise, the terms "a" or "an" and the like do not denote a limitation of quantity but denote the presence of at least one.

Artificial Intelligence (AI) is built on computer technology and traditional IT technology. Put simply, it simulates human intelligent activity by artificially constructing rules that a computer can execute, and the core logic behind it is black-and-white logic (discrete logic). Natural Intelligence (NI) is also implemented with computer technology, but it imitates human mental activity itself: the computer constructs the rules on its own, and the core logic behind it is gray logic (fuzzy logic).

In the field of Natural Language Processing (NLP), artificial-intelligence-based natural language processing (AI-NLP) requires that natural language speech first be transcribed into text, after which Natural Language Understanding (NLU) is performed. As described below, this approach has significant drawbacks, although there are historical reasons for it.

On the one hand, artificial intelligence techniques in natural language processing grew out of machine translation, i.e. translating a piece of text in one language into text in another language, with accurate correspondence between the two languages as the guiding value. Natural language is processed on the basis of grammar rules using probabilistic and statistical methods. However, for spoken (non-written) language with strong randomness, or for long sentences whose grammar is highly ambiguous, analysis with a realistic grammar may yield thousands of possible readings. Although methods such as corpora and Markov models can reduce the ambiguity, corpora containing large amounts of data still have to be built for the computer to learn from and use. Whether for the data collection needed to produce the paired corpus (often by manual annotation), for building the syntactic and semantic models, or for the computation needed for retrieval and feedback, very large computing resources and costs are required.

As an extension of the machine translation technique described above, the same methodology and similar methods are used to build speech-text corpus pairs by collection and labeling, and learning is performed with grammar models, semantic models, and the like, still with speech recognition accuracy as the guiding value.

However, these ideas and methods actually waste enormous resources, because natural language understanding or intent recognition does not require every word to be translated or recognized accurately and rigorously.

On the other hand, recognizing speech as text and then understanding the recognized text itself imposes a theoretical limit on the accuracy of language understanding. This is because recognizing speech as text loses a large amount of information: for example, 5 minutes of uncompressed two-channel sound with a 16-bit sample depth and a 44.1 kHz sampling frequency amounts to about 50 MB of data, whereas at a speech rate of 200 Chinese characters per minute the 1000 Chinese characters spoken in those five minutes amount to about 2 KB, a difference of roughly 25,000 times. The lost information is likely to contain key information required for language understanding. In other words, the information granularity of the recognized text is very coarse compared with the original speech. For example, for a sentence of 10 words of which 3 are keywords, if misrecognizing any one keyword causes an understanding error, then even with a guaranteed recognition accuracy of 90% the probability of correct understanding is only about 70%; even if the associated information between words can somewhat reduce the probability that misrecognizing one or more keywords causes an understanding error, the overall understanding accuracy is not much higher than 70%. The industry accordingly calls the 80% accuracy bottleneck the "curse of the century" for commercial AI.
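The figures quoted above can be checked with a short back-of-the-envelope calculation. The sketch below makes two assumptions that the text does not state explicitly: each Chinese character is stored in 2 bytes, and the three keywords are recognized independently of one another. All variable names are illustrative.

```python
# Audio side: uncompressed PCM, as described in the text.
SAMPLE_RATE_HZ = 44_100   # 44.1 kHz sampling frequency
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 2              # two-channel sound
DURATION_S = 5 * 60       # five minutes

audio_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * DURATION_S

# Text side: 200 characters per minute for five minutes.
CHARS_PER_MINUTE = 200
BYTES_PER_CHAR = 2        # assumption: 2 bytes per Chinese character
text_bytes = CHARS_PER_MINUTE * 5 * BYTES_PER_CHAR

ratio = audio_bytes / text_bytes

# Keyword side: all 3 keywords must be recognized correctly.
WORD_ACCURACY = 0.9
KEYWORDS = 3
p_all_keywords_correct = WORD_ACCURACY ** KEYWORDS  # assumes independence

print(f"audio: {audio_bytes / 2**20:.1f} MiB, text: {text_bytes} bytes")
print(f"ratio: about {ratio:,.0f} to 1")
print(f"probability that all keywords are correct: {p_all_keywords_correct:.3f}")
```

Under these assumptions the ratio comes out slightly above the rounded 25,000 in the text, and the keyword probability lands close to the quoted 70%.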

Natural intelligence imitates human intelligent behavior, which is itself based on gray logic. In particular, when interacting with the outside world, the human brain does not convert an expression sensed from outside (sound, image, touch, taste, etc.) into text and then understand the text; it directly analyzes and understands the information obtained through the sense organs. This understanding draws on prior knowledge and experience (experience can itself be regarded as probabilistic) to infer the meaning of an external expression from the perceived information, and so may in fact be wrong. For example, an observer who sees a person shaking his head will generally conclude that the person is expressing a negative attitude. However, if the observer knows that he is in India, that the person observed is Indian, and that Indians shake the head slightly (wobble the head) to express a positive attitude, the observer will understand the expression (the head shake) correctly, even though this contradicts general knowledge. An observer without such knowledge and experience may judge wrongly, but can still correct the judgment from the facial expression, body movements, gestures, etc. of the person observed. (In practice the matter is more complicated: if the observed Indian person is expressionless or smiling but shakes his head up and down closer to the center of the eyebrow, he is more likely expressing a noncommittal attitude.)

Artificial intelligence can, through image or video recognition, recognize the external expression as text describing the action, such as "shaking head" or "somebody is shaking his head gently", or as a longer sentence, but from the recognized text alone it cannot verify whether the result obtained by applying general cognitive rules is correct here. This is because artificial intelligence collects only the information needed to identify the action itself (which may include, for example, the subject of the action) and filters out everything else, yet the filtered-out information happens to include the key information for judging the true intention expressed by the action. The loss of such key information is irreversible; it cannot be recovered from the text describing the recognized action.

To give another example, this time from natural language understanding: a sentence heard as "tian shang hui hui" might be recognized as "ashy sky" or "ashy sky chicken"; but a listener who can understand the babbling of a small child or the speech of someone with unclear articulation, or even constantly evolving internet slang, will understand that the utterance really means "an airplane flying in the sky".

From the above example of natural language understanding we can also see that children, or primitive civilizations with no knowledge of writing, can clearly express and communicate their intentions through spoken language. This is also why natural intelligence is closer to human intelligence than artificial intelligence is: natural intelligence can process text but does not depend on it, whereas artificial intelligence relies on text as the medium for recognition and understanding.

In the following, we take natural language processing as an example to further illustrate the similarities and differences between natural intelligence and artificial intelligence.

As shown in FIG. 1, for ease of illustration, we denote the different kinds of language information by the letters A-D and X-Y.

"A language", i.e. sound waves, is physical-layer data collected by a sound collection device such as a microphone.

"B language" is a language formed by the various permutations and combinations of B elements. The B elements may be phonemes, and certain permutations and combinations of B elements constitute syllables. The terms "phoneme" and "syllable" here have the same meaning as in linguistics. Examples of B elements are shown in FIG. 2; they are Chinese phonemes.

"C language" is a language formed by the various permutations and combinations of C elements. All or some of the B elements are permuted and combined to form C elements; in other words, B language is converted into C elements, and the C elements make up the C language. The conversion relationship from B language to C language is therefore "many-to-many". Following the linguistic hierarchy of phonemes and syllables, a C element corresponds to a "character" in natural language. Examples of C elements are shown in FIG. 2; they are Chinese characters.

"D language" is a language formed by the various permutations and combinations of D elements. All or some of the C elements are permuted and combined to form D elements; in other words, C language is converted into D elements, and the D elements make up the D language. The conversion relationship from C language to D language is therefore also "many-to-many". Following the linguistic hierarchy of phonemes, syllables, and characters, a D element corresponds to a "word" or "phrase" in natural language. Examples of D elements are shown in FIG. 2; they are Chinese words.

The "C language" example and the "D language" example in FIG. 2 look the same, since both are made up of characters, but anyone familiar with Chinese will recognize that understanding from C language alone is highly ambiguous, whereas after conversion to "D language" the meaning of the expression is relatively definite. In other languages, too, the conversion from characters to words or phrases is important for semantic understanding, especially when speech recognition is performed by an intelligent system (speech robot). Depending on the natural language, "characters" and "words", i.e. C language information and D language information, may also be merged into a single level of language information.

If natural language processing or natural language understanding (NLU) is performed on the AI principle described above, it follows, as shown in FIG. 3, the path A → B → C or D, or alternatively A → B → C → D; that is, the aforementioned process of converting collected voice (sound waves) into text.

The conversion from A language information (sound waves) to B language information (phonemes) can be performed automatically and relatively accurately by a robot. However, the conversion from B language information (phonemes) to C language information (characters) may have a high error rate. In the Chinese example shown in FIG. 3, the original language information input by the customer is "ping-pong auction completed"; possibly because of the customer's pronunciation or accent, "ping-pong" may be recognized as "flat" and "bat" as "afraid", so that the sound wave is finally converted into the seven characters "flat afraid of selling". To improve the robot's recognition accuracy, especially for problems such as pronunciation or accent, the robot's recognition result must be corrected, usually by human-assisted recognition. Human-assisted recognition at this stage is called transcription. In transcription, a transcriber uses a dedicated transcription tool to cut the sound wave (A language information) precisely and then converts each cut segment into the corresponding character (C language information), thereby defining for the robot the conversion/translation relationship A language (sound wave) → C language (character). Whether the cutting is accurate depends on how careful the transcriber is and how well he or she has mastered the transcription tool; whether each segment is converted into the right character depends on whether the transcriber accurately understands the context of the segment (the other sound waves before and after it). Chinese in particular has many characters and homophones, which makes accurate transcription harder.

Next, D language information (words, phrases) is obtained from the C language information (characters). The conversion from characters to words is likewise ambiguous. In the previous example, even if recognition from sound wave to characters is accurate and yields the correct seven-character sequence of the "ping-pong auction" sentence, that sequence still admits at least two segmentations, "table tennis bat + sell + end" and "table tennis + auction + end", whose meanings are obviously different. Human-assisted recognition can again be used for correction. At this stage it is called keyword segmentation ("word segmentation" for short): segmentation personnel combine the transcribed characters (C language information) into words (keywords) (D language information), thereby defining for the robot the conversion/translation relationship C language (characters) → D language (words). Whether segmentation is accurate usually depends on how well the segmentation personnel know the business. Each field needs personnel familiar with its business content and terminology to perform segmentation, so the cost is higher than that of transcription.
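The segmentation ambiguity can be made concrete with a few lines of code. The sketch below enumerates every way of splitting a character sequence into dictionary words. The Chinese string is our reconstruction of the translated seven-character example, and the small lexicon is hypothetical; neither is taken from the original text.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into consecutive entries of `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Assumed reconstruction of the seven-character example sentence.
SENTENCE = "乒乓球拍卖完了"
# Hypothetical lexicon: table tennis, table tennis bat, auction, sell, finished.
LEXICON = {"乒乓球", "乒乓球拍", "拍卖", "卖", "完了"}

ALL_SPLITS = segmentations(SENTENCE, LEXICON)
for split in ALL_SPLITS:
    print(" + ".join(split))
```

With this lexicon the enumeration yields exactly the two readings discussed in the text, and nothing in the character sequence itself says which one the speaker intended.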

Finally, the meaning (i.e., the Y language information) is understood from the D language information. A mere sequence of words does not reveal precisely what the customer actually means. For example, a customer says "my credit card is missing"; the robot cannot recognize the meaning, and a technician adds "my", "credit card", and "missing" to the grammar table of the database as new keywords. Another customer says "my swipe card is lost"; again the robot cannot recognize the meaning, and the technician adds "my", "swipe card" (meaning credit card), and "lost" to the grammar table as new keywords. In this way the customer's meaning or requirement is understood with human assistance and generalized into the database. This kind of human-assisted recognition is called keyword pile-up ("pile-up" for short): permutations and combinations of words are accumulated and stored in the database according to their meaning. The workload is likewise enormous, and trained personnel with expert knowledge are needed to assist understanding.

The multi-layer "many-to-many" conversion along A → B → C → D → Y is technically called Multi-Layer Perception (MLP); its principle is shown in FIG. 4. Its disadvantage is that every conversion distorts the original information to some extent and adds processing load to the system, causing further performance loss. The more conversions there are, the more the original information is distorted and the slower the system processes. Moreover, training the robot at each stage requires human-assisted recognition, which on the one hand generates a heavy workload and high cost and, on the other hand, raises the probability of error because of the repeated manual intervention.

X language is logic-layer data obtained by performing speech signal processing (SSP) on A language data; it is called "X language" in the embodiments of the present invention. X language is a language formed by the various permutations and combinations of X elements. X elements are produced by the system using a modeling tool, for example a Gaussian mixture model (GMM), which automatically cuts the sound wave into a number of columnar elements of different heights. FIG. 5 shows an example of converting a collected sound wave (represented by a histogram) into X elements (represented by a vector-quantized histogram) using a Gaussian mixture model.

Depending on the modeling tool and the set of natural speech to which it is applied, the number of X elements can be kept within a certain range (e.g., below 200). According to an embodiment of the present invention, a combination of two ASCII characters may be defined as the ID of an X element, as shown in FIG. 2. The number of X elements can then reach 16,384 (128 × 128 = 16,384), which leaves room for growth in the number of X elements as acoustic modeling technology develops. The cut sound wave units correspond one-to-one to X elements; since A language information can be regarded as a combination of sound wave units and X language information is a combination of X elements, the conversion relationship from A language to X language is "many-to-many". An example of X elements represented by ASCII characters is also shown in FIG. 3.

Natural language processing or understanding based on the AI principle described above does not involve X language information. The X language (X element) layer is nevertheless marked in FIGS. 1, 2, and 3: on the one hand, this shows that, in terms of information granularity, X elements lie between sound waves and phonemes; on the other hand, it shows that natural language processing or understanding could also follow the path A → X → B → C or D, or A → X → B → C → D, i.e. X elements can serve as intermediate data in the conversion between A language (sound waves) and B language (phonemes).
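The two-character ID scheme can be sketched as follows. The text only states that an ID is a combination of two ASCII characters; the mapping between a numeric element index and its ID shown here (high character first) is an assumption, and the function names are hypothetical.

```python
ASCII_SIZE = 128                  # number of 7-bit ASCII code points
MAX_X_ELEMENTS = ASCII_SIZE ** 2  # 128 * 128 possible two-character IDs


def x_element_id(index: int) -> str:
    """Map an element index in [0, 16384) to a two-character ASCII ID."""
    if not 0 <= index < MAX_X_ELEMENTS:
        raise ValueError(f"index out of range: {index}")
    high, low = divmod(index, ASCII_SIZE)
    return chr(high) + chr(low)


def x_element_index(element_id: str) -> int:
    """Inverse of x_element_id."""
    if len(element_id) != 2 or any(ord(c) >= ASCII_SIZE for c in element_id):
        raise ValueError(f"not a two-character ASCII ID: {element_id!r}")
    return ord(element_id[0]) * ASCII_SIZE + ord(element_id[1])


print(MAX_X_ELEMENTS)
print(repr(x_element_id(65 * 128 + 66)))
```

Note that the full 128-character range includes control characters; a practical system would probably restrict itself to printable characters, which still leaves far more than the roughly 200 elements mentioned in the text.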

"Y language" is language information representing the "meaning" obtained by understanding the original natural language information A, as shown in FIGS. 1 and 2. The "standard expression" defined by the embodiments of the present invention is one form of "Y language". According to an embodiment of the invention, for example, the banking industry may use the service code "21" to represent "report credit card loss"; the service code "252" to represent "partial credit card repayment", and "252-5000" (requirement code 252, requirement parameter 5000) to represent "repay 5000 yuan on the credit card"; the entertainment industry may use the code "24" to represent "watch a movie", and "24-Chinese Partner" (requirement code 24, requirement parameter "Chinese Partner") to represent "watch the movie Chinese Partner".
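A standard expression of the "code-parameter" form can be split mechanically. The sketch below assumes that the first hyphen separates the requirement code from the requirement parameter, which matches the examples in the text but is not stated as a rule; the type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class StandardExpression(NamedTuple):
    code: str                  # requirement (service) code, e.g. "252"
    parameter: Optional[str]   # requirement parameter, e.g. "5000", or None


def parse_standard_expression(text: str) -> StandardExpression:
    """Split 'code' or 'code-parameter' at the first hyphen."""
    code, separator, parameter = text.partition("-")
    if not code.isdigit():
        raise ValueError(f"requirement code must be numeric: {text!r}")
    return StandardExpression(code, parameter if separator else None)


# Meanings taken from the examples in the text.
MEANINGS = {
    "21": "report credit card loss",
    "252": "partial credit card repayment",
    "24": "watch a movie",
}

for raw in ("21", "252-5000", "24-Chinese Partner"):
    expression = parse_standard_expression(raw)
    print(expression.code, MEANINGS[expression.code], expression.parameter)
```

Keeping the code and the parameter separate lets a business system dispatch on the code alone and treat the parameter as free-form data.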

In brief, natural language processing or understanding based on natural intelligence first uses a modeling tool to perform basic automatic recognition or conversion of irregular natural expression information expressed as physical data, for example sound waves from a user ("A language information"), obtaining language information ("X language information") expressed as permutations and combinations of a number of basic elements ("X elements"); it then converts the X language information recognized or converted from the A language information into a standard expression of some form ("Y language information"). That is, it follows the processing path A → X → Y, with no conversion into "characters" and "words" ("C language information" and "D language information") and no conversion into phonemes ("B language information"). As described above, this is an important difference between natural intelligence and artificial intelligence in natural expression processing; it is a difference in processing path and also a difference in methodology. The multi-layer "many-to-many" conversion B → C → D → Y can thus be omitted, which improves the accuracy and efficiency of converting expression information and reduces the workload and error rate of human-assisted recognition.

Compared with the natural expression processing based on artificial intelligence described above, natural expression processing based on natural intelligence does not need to convert an expression of non-text information into text; it converts the expression into X language information whose information granularity is much finer than that of text, and therefore, as described above, recognizes key information more accurately. The information granularity of X language information differs from that of text by orders of magnitude: if the information granularity of text is taken as 1, that of X language information is on the order of 1/10, 1/100, 1/1000, and so on. On the other hand, because X language information is obtained by sampling and converting A language information (sound waves, images, etc.), it is coarser than A language information: on the same scale, with the information granularity of text taken as 1, that of sound waves is on the order of 1/10000, 1/100000, 1/1000000, and so on. The aforementioned B language information (composed of phonemes), C language information (composed of syllables), and D language information (composed of words or phrases) are of essentially the same order of magnitude as text in information granularity, and so, compared with X language information at the level of information granularity, they behave like text.

As to the relationship between information granularity and understanding accuracy, consider the expression "I lost my credit cart", where the correct expression is "I lost my credit card". If the word is taken as the unit of information granularity, the last word "card" is wrong, the understanding error rate is 1/5 = 20%, and the understanding accuracy is 80%. If the character (including spaces) is taken as the unit, only the last character "d" is wrong, the understanding error rate is 1/22 ≈ 4.55%, and the understanding accuracy is 95.45%.
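The two error rates above can be reproduced with a small calculation. The misheard string used below, and the trailing period that brings the character count to the 22 used in the text, are both assumptions made for this sketch; `unit_error_rate` is a hypothetical helper that only compares sequences of equal length position by position.

```python
def unit_error_rate(reference, hypothesis):
    """Fraction of aligned units that differ (sequences must have equal length)."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same number of units")
    wrong = sum(r != h for r, h in zip(reference, hypothesis))
    return wrong / len(reference)


REFERENCE = "I lost my credit card."  # assumed to include the final period
HEARD = "I lost my credit cart."      # assumed misrecognition

word_error_rate = unit_error_rate(REFERENCE.split(), HEARD.split())
char_error_rate = unit_error_rate(list(REFERENCE), list(HEARD))

print(f"word level:      {word_error_rate:.2%} error")
print(f"character level: {char_error_rate:.2%} error")
```

The same single mistake costs one unit in five at word granularity but only one unit in twenty-two at character granularity, which is the argument the text extends to the even finer X elements.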

By the same principle, finer information granularity theoretically gives higher accuracy. According to measured data from online products, using X elements with finer information granularity yields an understanding accuracy of 95% or higher, easily breaking the "curse of the century" for commercial AI.

On the other hand, when an expression in the form of text is processed (i.e., the A language information is text information), the Y language information is matched to the A language information on the basis of the gray logic described above; that is, the correspondence from A language information to Y language information tolerates a large degree of ambiguity. Advantageously, because the A language information does not need to be converted into text in a human language, the method is constrained neither by homophones and grammar nor by the manner of speaking. The natural-intelligence correspondence from A language information to Y language information can therefore be applied to dialects and to mixed languages or mixed speech, for example English mixed into Chinese, Mandarin mixed into Cantonese, English and Mandarin mixed into Shanghainese, and to more finely subdivided languages, dialects, and their mixtures, and even to mixtures of several manners of speaking, without affecting understanding accuracy. AI-based NLP techniques cannot achieve high understanding accuracy here even if massive grammar models are built at great cost. In fact, for AI, mixing languages, dialects, and manners of speaking makes the grammar model grow exponentially, which is simply infeasible.

The difference between natural intelligence and artificial intelligence has been described above using natural language processing and natural language understanding as examples. Humans express themselves naturally in many ways. A natural expression from a customer, i.e. "A language information", can for example be classified into four broad categories: text information, voice information, image information, and animation information. A text expression may be a customer expressing himself by typing on a keyboard; for example, the customer types "How much money is left in my savings account?" in the user interface of a bank's internet-channel call center. An image expression may be a customer expressing himself through an image; for example, the customer uses a desktop screen capture tool to show, as an image, the error message of some software and thereby the problem encountered. A voice expression may be a customer expressing himself by speaking; for example, while talking with a service agent of a bank (telephone-channel call center), the customer says on the telephone: "What do you mean? I don't quite understand." An animation (or "video") expression may be a customer expressing disagreement by shaking his head in front of the camera (similar to the head-shaking case described above).

When natural expression processing is performed based on natural intelligence, the customer's natural expression (A language information) is first automatically recognized or converted to obtain information expressed in a certain language format. If the A language information is voice information, for example, acoustic waveform information may be collected by a modeling tool and automatically recognized or converted by the system (intelligent robot) into an X language (corresponding to voice information); if the A language information is image information, image pixel information may be collected by a modeling tool and automatically recognized or converted by the system (intelligent robot) into an X language (corresponding to image information); if the A language information is animation information, image pixel information and image change rate information may be collected by a modeling tool and automatically recognized or converted by the system (intelligent robot) into an X language (corresponding to animation information); if the A language information is text information, it is converted into X language with characters as the units (elements), or no conversion is needed.

Then the X language information obtained by automatic conversion from the A language information, or the text information that needed no conversion, is further processed to obtain a regularized standard expression (Y language information) that a computer or other processing device can "understand". The Y language information can be processed automatically by computer business systems.

According to an embodiment of the present invention, the regularized standard expression (Y language information) may be implemented as a regularized code. For example, a code made up of digits and English letters may be used, comprising an industry code, an industry service code, an organization code, an organization service code, and an expression information code.

(1) Industry code

Main industry (2 English letters; at most 26 × 26 = 676 main industries)

Sub-industry (3 English letters; at most 26 × 26 × 26 = 17,576 sub-industries per main industry)

(2) Industry service code

First-level industry service category (1 digit, 0-9)

Second-level industry service category (1 digit, 0-9)

Third-level industry service category (1 digit, 0-9)

Fourth-level industry service category (1 digit, 0-9)

Fifth-level industry service category (1 digit, 0-9)

Sixth-level industry service category (1 digit, 0-9)

Seventh-level industry service category (1 digit, 0-9)

Eighth-level industry service category (1 digit, 0-9)

Ninth-level industry service category (1 digit, 0-9)

Tenth-level industry service category (1 digit, 0-9)

(3) Organization code (UID) (24 digits: 3-digit country code + 3-digit city code + 18-digit organization code)

(4) Organization service code

First-level organization business category (1 digit, 0-9)

Second-level organization business category (1 digit, 0-9)

Third-level organization business category (1 digit, 0-9)

Fourth-level organization business category (1 digit, 0-9)

Fifth-level organization business category (1 digit, 0-9)

(5) Expression information code

Information type code (2-digit number, 1-99)

Language code (using the RFC 3066 standard: http://tools.ietf.org/html/RFC3066; e.g. zh-CN stands for "Simplified Chinese")

Dialect code (3-digit number, 1-999)

The industry code identifies the industry of the service provider to which the customer's irregular natural expression (A language information) is directed. It may, for example, be represented by 2 English letters, which covers 676 industries; optionally, a 3-letter sub-industry code is added, covering 17,576 sub-industries within each industry. The code can thus cover essentially all common industries. The industry service code identifies the service requirement to which the customer's A language information is directed and may likewise be represented by several digits; for example, a 10-digit code can cover a large number of industry service categories. The organization code identifies the service provider to which the customer's A language information is directed and may, for example, indicate the country and city in which the organization is located. The organization service code represents the provider's own internal division of its business, which facilitates individualized internal management. The expression information code carries identifying information about the customer's A language information itself and may include the information type, the language, and so on, represented by digits and letters.
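The capacity of each field follows directly from its alphabet size and length. The short check below restates that arithmetic; the variable names are illustrative only.

```python
LETTERS = 26  # English letters A-Z
DIGITS = 10   # decimal digits 0-9

main_industries = LETTERS ** 2               # 2-letter main industry code
sub_industries_per_main = LETTERS ** 3       # 3-letter sub-industry code
industry_service_codes = DIGITS ** 10        # ten 1-digit category levels
organization_business_codes = DIGITS ** 5    # five 1-digit category levels

print(main_industries, sub_industries_per_main)
print(industry_service_codes, organization_business_codes)
```

The 17,576 figure for sub-industries is the cube of 26, since the sub-industry code has three letters rather than two.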

The following are two examples of regularized standard expressions (Y language information) according to the above coding scheme:

Example one: FSBNK27100000000860109558800000000000000000002zh-CN003

Where:

Industry code:

FS = Financial Service (main industry)

BNK = Bank (sub-industry)

Industry business code:

2710000000 = first-level business category 2 (credit card), second-level business category 7 (credit limit adjustment), third-level business category 1 (credit limit increase), 0000000 (no further subdivision)

Organization code:

086010955880000000000000 = country code 086 (China), city code 010 (Beijing), organization number 955880000000000000 (head office of the Industrial and Commercial Bank of China)

Organization business code:

00000 = no organization business category (this Y language information contains no business category defined by the organization itself; that is, it belongs entirely to the industry business categories and is common to the banking industry)

Expression information code:

02 = speech (the type of the A language information provided by the customer is "speech")

zh-CN = Mainland Chinese

003 = Cantonese dialect

In this example, the A language information corresponding to the Y language information may be voice information such as "my credit limit is too low", "I want to increase my credit limit", "I want to decrease my credit limit", "I want to adjust my credit limit", and so on.

In some specific application scenarios, particularly where the service provider is already determined, the aforementioned industry code, organization code, and organization business code may all be preset as system defaults. That is, only the industry business code and the expression information code need to be obtained from the A language information provided by the customer, in which case the Y language information may be expressed as "271000000002zh-CN003". If a 3-digit industry business code is sufficient for a particular application, it may be further shortened to "27102zh-CN003"; if the application handles voice only, it may be expressed as "271zh-CN003"; and if only the customer's requirement matters and the type information of the expression itself is of no concern, "271" alone is enough.

Example two: TVTKT11200000000014047730305000000000001240003fr-CH000

TV = Travel Service (main industry)

TKT = Ticketing (sub-industry)

1120000000 = first-level business category 1 (airline ticket), second-level business category 1 (ticket change), third-level business category 2 (delay), 0000000 (no further subdivision)

001404773030500000000000 = country code 001 (USA), city code 404 (Atlanta, Georgia), organization number 773030500000000000 (Delta Air Lines)

12400 = first-level organization business category 1 (discount ticket), second-level organization business category 2 (off season), third-level organization business category 4 (Asia-Pacific region), 00 (no further subdivision)

03 = image (the type of the A language information provided by the customer is "image"; for example, the customer encounters a system error while changing a ticket on the Delta official website and sends a screenshot to Delta customer service as a natural expression seeking help)

fr-CH = Swiss French

000 = no dialect

In this example, the A language information corresponding to the Y language information is recognized by means of image recognition. Similarly, where the service provider is already determined, the above industry code and organization code may be preset as system defaults, in which case the Y language information may be expressed as "11200000001240003fr-CH000"; if only the customer's requirement matters and the type information of the expression itself is of no concern, "112000000012400" is enough; and if, for a particular application, 3 digits suffice for the industry business code and 3 digits for the organization business code, "112124" alone is sufficient.

The above is only an example of the regularized standard expression (Y language information) according to the embodiment of the present invention; different numbers of code digits and different code orders may be adopted, as may different code representations or encoding schemes. Expressing the understood intention in such coded form reduces the manual input workload required for human-aided understanding and makes human-aided understanding more convenient and closer to real time.
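The coding scheme described above can be illustrated with a minimal Python sketch. The class name, field names, and the `encode`/`decode` helpers are hypothetical and do not come from the original text. The sketch also assumes that every field has a fixed width, including a 5-character language tag such as "zh-CN", so that a full-length code can be split without separators.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StandardExpression:
    """One full-length Y language code, split into its fields (names are illustrative)."""
    industry: str           # 2 letters, e.g. "FS"
    sub_industry: str       # 3 letters, e.g. "BNK"
    industry_business: str  # 10 digits, one digit per category level
    organization: str       # 24 digits: 3 country + 3 city + 18 organization
    org_business: str       # 5 digits, one digit per category level
    info_type: str          # 2 digits, e.g. "02" for speech
    language: str           # language tag, assumed here to be 5 characters
    dialect: str            # 3 digits

# Field widths in declaration order (fixed widths are an assumption of this sketch).
WIDTHS = (2, 3, 10, 24, 5, 2, 5, 3)


def encode(expr: StandardExpression) -> str:
    """Concatenate the fields into one code string after checking their widths."""
    parts = []
    for field, width in zip(fields(expr), WIDTHS):
        value = getattr(expr, field.name)
        if len(value) != width:
            raise ValueError(f"{field.name} must be {width} characters, got {value!r}")
        parts.append(value)
    return "".join(parts)


def decode(code: str) -> StandardExpression:
    """Split a full-length code string back into its fields."""
    if len(code) != sum(WIDTHS):
        raise ValueError(f"expected {sum(WIDTHS)} characters, got {len(code)}")
    values, pos = [], 0
    for width in WIDTHS:
        values.append(code[pos:pos + width])
        pos += width
    return StandardExpression(*values)


# Field values taken from example one; zero padding is built with ljust.
example = StandardExpression(
    industry="FS",
    sub_industry="BNK",
    industry_business="271".ljust(10, "0"),
    organization="086" + "010" + "95588".ljust(18, "0"),
    org_business="00000",
    info_type="02",
    language="zh-CN",
    dialect="003",
)
```

Calling `decode(encode(example))` returns the same field values. The shortened forms mentioned above, such as "271", would need a separate convention for which fields are defaulted, which this sketch does not cover.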

The natural expression (A language information) from the customer usually reflects a specific need of the customer. As mentioned above, the customer's A language information is first automatically converted into X language information, or is used as text without conversion (when the A language information is text information), and the X language information or the text information is then converted into a standard expression (Y language information) in coded form. In the foregoing examples, the Y language information may include an industry code, an industry business code, an organization code, an organization business code, and an expression information code. Optionally, the A language information may also contain specific parameters within the category of the customer requirement (which may be referred to as "requirement parameters"), for example: "transfer 5000 yuan to Zhang San" (case one), or "I want to watch a movie called 'Chinese Partner'" (case two). A particular requirement code (e.g., one or more of the aforementioned industry code, industry business code, organization code, organization business code, and expression information code) corresponds to a particular parameter set. In case two above, if the requirement code of "watch a movie" is 123, its parameter set may include the parameter "movie name". The Y language information corresponding to this A language information is then "123<Chinese Partner>", where 123 is the requirement code and the movie title inside "< >" (five Chinese characters in the original) is the requirement parameter. There are various ways to separate the requirement code from the requirement parameters in the Y language information, such as a symbol like "< >", a space, or a fixed ordering.
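As a minimal sketch of the "< >" convention just described, the following Python function separates a requirement code from its requirement parameters. The function name is hypothetical. Allowing several bracketed parameters in a row is an assumption, and the code "456" in the usage note is invented for illustration.

```python
import re

# Requirement code first, then zero or more <...> parameter groups.
_Y_INFO = re.compile(r"^(?P<code>[^<>]+?)\s*(?P<params>(?:<[^<>]*>)*)$")


def split_requirement(y_info: str) -> tuple[str, list[str]]:
    """Return (requirement code, list of requirement parameters)."""
    match = _Y_INFO.match(y_info.strip())
    if match is None:
        raise ValueError(f"not a valid Y language string: {y_info!r}")
    params = re.findall(r"<([^<>]*)>", match.group("params"))
    return match.group("code"), params
```

For example, `split_requirement("123<Chinese Partner>")` yields the code "123" and one parameter. A transfer request could be written with two parameters, such as "456<Zhang San><5000>".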

The above example of expressing the regularized standard expression by codes is particularly applicable to an Interactive Voice Response System (IVRS). Interactive Voice Response (IVR) is a generic term for telephone-based voice value-added services. Many organizations (such as banks, credit card centers, and telecom operators) provide self-service to customers through an IVRS: the customer dials a designated telephone number to enter the system and, following the system's prompts, keys in options or personal data to listen to prerecorded information, or has the computer system assemble data according to a preset program (call flow) and read out specific data (such as an account balance or an amount due) by voice; the customer can also input transaction instructions through the system to perform preset transactions (such as transferring funds, changing a password, or changing a contact telephone number). An organization can define its own coding rules for standard expressions and its own dialogue scripts corresponding to those standard expressions, thereby standardizing its internal customer service dialogue data and protecting that data (even if a third party learns the organization's Y language information, it cannot learn the dialogue data corresponding to the codes). Likewise, even if the service provider of the intelligent expression processing engine, in providing the conversion service from A language information to Y language information, knows the A language information and the Y language information obtained by conversion (for example, a string of digits, or a code of digits plus letters), it does not know what that Y language information means in the customer service dialogue database of the user organization, so the intelligent expression processing service can be provided with data security.

The foregoing conversion of the A language information into the X language information can be realized by voice signal processing, voice recognition, image recognition, and video processing techniques, which may be existing techniques. In fact, the idea of coded standard expressions according to the embodiment of the present invention may also be applied to the recognition of natural expressions, that is, the X language information may itself be expressed by regularized codes.

According to the natural expression processing method of the embodiment of the invention, the customer's natural expression (A language information) is first automatically converted to obtain X language information, or C language information is obtained directly without conversion (when the A language information is text information); the X language information or C language information is then converted into Y language information. The multi-layer "many-to-many" conversion of A → B → C → D → Y is thereby avoided, which can improve the accuracy and efficiency of the information conversion and reduce the workload and error rate of human-aided recognition.

Specifically, according to the technology of the embodiment of the present invention, firstly, non-regularized natural expression information, such as text, voice, graphics, and video, is converted into X language information by a modeling tool; then, with the X language as the left language and the Y language as the right language, the conversion of the X language information into the Y language information is realized by using a Machine Translation (MT) technique.

Specifically, taking the processing of irregular natural expression information such as speech as an example, speech signal processing technology is first used to automatically convert the A language into the X language (with current speech signal processing technology, the accuracy of the A → X conversion can reach 95% or more, and with improved technology that performs better noise reduction it can reach 99% or more); machine translation technology is then used to translate X → Y automatically, without the multi-layer conversion X → B → C → D → Y.

The resulting irregular natural expression (X language information) can be converted into a regularized standard expression (Y language information) using a machine translation algorithm similar to one based on statistical analysis of example samples. Such an algorithm requires that the correspondence data between the X language and the Y language be sufficiently large in quantity and accurate.

Further, to accumulate correspondence data between the X language and the Y language, it suffices to accumulate correspondence data between the A language and the Y language, given that accurate automatic conversion of A → X is already possible. The scheme based on natural intelligence therefore provides a new working mode for human agents, MAU (Manual Aided Understanding), in which correspondence data between the A language and the Y language is accumulated by combining human understanding with code input. As before, the requirement code "271" can indicate a credit limit adjustment; similarly, the requirement code "21" can indicate reporting a credit card as lost, so that "21" corresponds to natural expressions such as "my credit card is missing" or "I lost my credit card".

Through a simple code-input mode, MAU turns the traditional "speaking agent" into a "non-speaking agent", that is, a silent agent. The agent's work becomes more comfortable and far more efficient, and the most valuable human capability, understanding, is put to fuller use: massive correspondence data between the A/X language and the Y language is collected accurately and quickly and supplied to the MT engine for loop iteration, so that the engine learns the A/X → Y conversion/translation rules by itself and forms and optimizes the A/X → Y translation model.

The working principle of the machine translation technique and the machine translation robot training technique according to the present invention will be described below.

Machine translation is an artificial intelligence technique for automatically translating between two languages. "Language" here does not mean a national language in the narrow sense (e.g., Chinese or English), but an information expression method in the broad sense. As mentioned above, by manner of expression, languages can be divided into four broad categories: text, speech, image, and animation (or "video").

A language is information formed by various permutations and combinations of the elements in an element set. For example, English text is a language formed by various one-dimensional (serial) permutations and combinations of the 128 ASCII characters (elements) in the ASCII character set (element set); Chinese is formed by unlimited permutations and combinations of the thousands of Chinese characters and punctuation marks in the national standard code (which form the basic elements of Chinese information); and an RGB planar image is another language, formed by various two-dimensional (length and width) arrangements of three kinds of sub-pixels: red, green, and blue.

Any two languages have a certain conversion/translation rule, and the automatic conversion/translation rule between the two languages can be found by analyzing the corresponding relation of the permutation and combination of the two language elements. Firstly, corresponding data (or called translation samples) of two languages are manually collected, and then an automatic conversion/translation rule between the two languages is automatically found through an iterative loop of permutation and combination of the two language elements to form a translation model of the two languages.

Machine translation requires two data tables: a "training data table" and a "testing data table".

The two tables have similar data structures: each stores pairs of data, in which the left value is in the "left language" (or "source language") and the right value is in the "right language" (or "target language"). By way of analogy, the training data table is the textbook that humans give the MT robot to study, and the testing data table is the exam that humans set for the MT robot to evaluate how well it has taught itself.

The following are examples of a "training data table" and a "testing data table" for English → Chinese MT:

Training data table (table not reproduced in this text)

Testing data table (table not reproduced in this text)

The MT robot iterates over permutations and combinations of the elements that make up each language. In the above example, from data pairs #3 and #4 in the training data table, it finds that the combination of 15 ASCII character elements in the English "May I have your" (3 letters "May" + 1 space + 1 letter "I" + 1 space + 4 letters "have" + 1 space + 4 letters "your") corresponds to a combination of 3 national standard code Chinese characters (the Chinese for "may I ask your"); from data pairs #2 and #5, it finds that the combination of 3 ASCII character elements in the English "age" corresponds to a combination of 2 national standard code Chinese characters (the Chinese for "age").

Therefore, if the robot can correctly translate the English "May I have your age?" in the testing data table into the corresponding Chinese sentence, this proves that the robot has learned the English-to-Chinese translation of this sentence. If not, the robot has not learned it; it must then revise its learning method (e.g., try learning along a different path) and digest the training data table once more, which constitutes another iteration. Repeating this "iterative correction" makes the robot's translation accuracy climb steadily. Once it has climbed to a certain level (for example, a translation accuracy of 70%), the accuracy may hover at that level and be difficult to raise further; that is, the robot hits a bottleneck of "machine self-learning", and more data must be added to the MT training data table. Such data may be imported from an external database, or may be generated and added through human-aided understanding.
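The train, test, and retry cycle described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the function names are hypothetical, accuracy is measured as exact match on the testing data table, and a "bottleneck" is taken to mean that accuracy improved by less than `min_gain` between two consecutive rounds.

```python
def exact_match_accuracy(translate, testing_table):
    """Fraction of (left, right) test pairs that translate() reproduces exactly."""
    hits = sum(1 for left, right in testing_table if translate(left) == right)
    return hits / len(testing_table)


def train_until_plateau(train_once, evaluate, min_gain=0.01, max_rounds=50):
    """Repeat train/evaluate rounds and stop once accuracy stops improving.

    train_once: callable that performs one more pass over the training data table.
    evaluate:   callable that returns the current accuracy on the testing data table.
    Returns the accuracy history; its last entry is the plateau level.
    """
    history = []
    for _ in range(max_rounds):
        train_once()
        history.append(evaluate())
        if len(history) > 1 and history[-1] - history[-2] < min_gain:
            break  # bottleneck reached: more training data is needed
    return history
```

When the loop stops at a plateau, the next step in the text is to add pairs to the training data table, either imported or produced through human-aided understanding, and then run the loop again.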

For example, continuing the earlier credit card example, suppose the irregular natural expression obtained is "the amount I can overdraw on my credit card is too small". When the robot's understanding is not yet mature enough, human-aided understanding can intervene: a person interprets the expression as "I want to increase my credit limit" and inputs the corresponding Y language information. The human-aided understanding process need not record how the natural expression was understood or the intermediate understanding result; it only needs to record the corresponding standard expression (Y language information) as the final processing result, which simplifies the manual operation and saves resources. For example, the operator only needs to enter "271" as the corresponding standard expression to finish processing this natural expression. The new natural expression instance and its corresponding standard expression "271" are then added to the existing MT training data table, thereby extending and updating the table. Through human-aided understanding, therefore, accurate and stable conversion of the target natural expression into a standard expression (Y language information) is achieved on the one hand, and efficient addition to and updating of the MT training data table is achieved on the other, so that the system's MT training data becomes richer and more accurate and the robot's translation (conversion) accuracy improves efficiently.

In theory, for the left value of #3 the MT robot needs to enumerate exhaustively all permutations and combinations of its 20 ASCII character elements ("May I have your time"), and for the right value of #3 it needs to enumerate exhaustively all permutations and combinations of its 10 Chinese character elements (the corresponding Chinese sentence). In other words, for every data pair in the training data table, the MT robot needs to enumerate exhaustively all permutations and combinations of the elements on both the left and the right. Through such element-level enumeration, the MT robot is bound to find many permutations and combinations that occur repeatedly (e.g., "your", "May I have your", "age", "time", and their Chinese counterparts), and from these it finds the correspondences between repeatedly occurring combinations of left-language and right-language elements, that is, a translation model between the two languages. The more left/right data pairs the training data table contains, the more repeated combinations of elements of the two languages the MT robot can find, the more correspondences between them it can establish, the better it masters the conversion/translation rules between the two languages, and the more mature the translation model becomes. Therefore, by adopting the "regularized standard expression" and "human-aided understanding" according to the technical idea of the invention, MT training data can be accumulated more efficiently, and robot self-learning and automatic machine translation can be realized.
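The element-level enumeration described above can be illustrated with a small Python sketch. It makes two simplifying assumptions that are not in the original text: "permutations and combinations" are restricted to contiguous sub-sequences, and correspondences are ranked by a simple co-occurrence score (joint count squared, divided by the product of the individual counts). The function names and the toy pairs are hypothetical.

```python
from collections import Counter
from itertools import product


def fragments(seq):
    """All contiguous sub-sequences of seq (a simplification of the full enumeration)."""
    return {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}


def count_cooccurrences(pairs):
    """Count, over all (left, right) pairs, how often each fragment and fragment pair occurs."""
    left_counts, right_counts, joint_counts = Counter(), Counter(), Counter()
    for left, right in pairs:
        left_frags, right_frags = fragments(left), fragments(right)
        left_counts.update(left_frags)      # each fragment counted once per pair
        right_counts.update(right_frags)
        joint_counts.update(product(left_frags, right_frags))
    return left_counts, right_counts, joint_counts


def best_match(fragment, counts):
    """Right-language fragment that co-occurs most consistently with a left fragment."""
    left_counts, right_counts, joint_counts = counts
    scores = {
        right: joint ** 2 / (left_counts[left] * right_counts[right])
        for (left, right), joint in joint_counts.items()
        if left == fragment
    }
    if not scores:
        return None
    # Prefer the highest score; break ties in favour of the longer fragment.
    return max(scores, key=lambda right: (scores[right], len(right)))


# Toy training table with abstract symbols in place of real languages.
TOY_PAIRS = [("ab", "XY"), ("ac", "XZ"), ("db", "WY")]
```

With `TOY_PAIRS`, the element "a" always appears together with "X" and "b" always with "Y", so those are the correspondences the score picks out. The number of fragments grows quadratically with the length of each value, which shows why such enumeration is costly on long inputs.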

Machine translation from the X language to the Y language in the invention follows the same principle as the English-to-Chinese machine translation above, except that English is replaced by the X language and Chinese by the Y language; the left and right languages simply have different element sets.

As previously mentioned, machine translation techniques can be used to automatically translate one language into another. The technical principle is that basic element level analysis is carried out on collected pairing information (left language and right language) of the two languages, and various permutation and combination of basic elements of a large number of language information pairs are subjected to cyclic iteration comparison, so that a conversion/translation rule between the two languages is found out, and a translation model of the two languages is formed.

The technology of the invention extends the application range of the machine translation technology from automatic translation between different national languages to automatic conversion of all non-regular multimedia natural expression information (characters, voice, images and video, namely, A language information) into the regular standard information (Y language information) so that business systems of all walks of life can process the information, thereby realizing real and practical natural expression processing.

For natural language processing, because multi-layer linguistic analysis required by traditional machine translation is not needed, and a mode of performing basic element level analysis on an instance is adopted, the accuracy and the rapidness of translation can be increased, and meanwhile, updating and expansion can be easily performed by adding natural expression instances and standard expressions.

For the natural expression processing of the embodiment of the present invention, only the conversion from the natural expression (A language information) to the standard expression (Y language information) needs to be performed; in other words, only the A/X → Y translation model needs to be established, and no natural-language translation of the text is produced, so no post-processing to revise a translation result is needed.

In addition, the natural expression processing according to the embodiment of the present invention can be limited to be used in specific business of specific industry and institution, for example, the above-mentioned credit card business, and the MT training data table size required by the processing system can be greatly reduced, so that the cost for constructing and maintaining the MT training data table is reduced while the robot understanding maturity threshold is improved, and meanwhile, the maturity period of the a/X → Y translation model can be effectively shortened.

As described above, according to the natural expression processing system of the embodiment of the present invention, conversion from a natural expression to a coded standard expression is realized. The conversion is based on an MT training data table (i.e., a training database) storing paired data of A/X language information and Y language information, and an A/X → Y translation model obtained on the basis of the MT training data table. Therefore, it is necessary to collect a certain amount of accurate a/X language data and Y language data to generate an MT training data table, and to form a translation model of a/X → Y by self-learning (self-training) of a robot (information processing system). And forming the MT training data table may be performed by human-aided understanding.

The above method of converting A language information into Y language information is also applicable when the A language information is text information. In this case, the A → Y translation (conversion) can be realized by using the words (for example, Chinese words or English words) or the characters (for example, English letters, Chinese characters, or German letters) of the A language information directly as the X language information, or by converting the A language information into X language information whose X elements are characters, and then training the X → Y translation model by the method described above. In addition, the A → Y conversion requires no text recognition or grammatical analysis and no support from a word segmentation library or grammar table, and it is not restricted by language or by the mixing of languages.

A natural intelligence methodology according to an embodiment of the present invention is summarized here. Under the natural intelligence (NI) methodology, the problem of machine understanding of natural expressions is treated as a process of converting A language information into Y language information: X language information with a coarser information granularity than the A language information is obtained from the A language information, and the Y language information corresponding to the A language information is then obtained by mapping the X language information to Y language information.

More specifically, the X language information may be words, characters, or the like, or may be information with a much finer information granularity than words. Mapping the X language information to the Y language information (which may also be called conversion or recognition) with an algorithm similar to machine translation is not bound by the grammar rules of text processing, and no model such as a grammar model or rule base needs to be built. Because no manual model building or rule base maintenance is required, this machine-translation-like algorithm also allows one hundred percent machine self-learning. A training database of X language information and Y language information is constructed and correct pairing data of X language information and Y language information is input. The machine (that is, the robot or processing engine) then performs loop iteration over the various permutations and combinations of the elements of the existing X language information, on one side, and the existing Y language information or the various permutations and combinations of its elements, on the other, thereby establishing correspondences between the two sides; more pairing data of X language information and Y language information is obtained in this way and stored in the training database. Such loop iteration may include training on the training data multiple times: the data resulting from one round of training (the data used for training plus the new data obtained by training) is used as the training data for the next round, new training data is obtained continuously over multiple loops, and all of the data is stored in the training database. If new pairing data is input, loop iteration is performed over all the data (the new data and the existing data).
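The loop iteration described above, in which each round trains on the previous data plus whatever the previous round produced, can be sketched as a fixed-point loop. This is a hedged illustration only: `expand_database` and its `derive` argument are hypothetical names, and the text does not specify how new pairs are derived, so that step is left as a pluggable function.

```python
def expand_database(seed_pairs, derive, max_rounds=10):
    """Grow a set of (X, Y) pairs by repeatedly applying a derivation step.

    seed_pairs: iterable of manually supplied (X, Y) pairs.
    derive:     callable taking the current set of pairs and returning a set of
                candidate pairs (a placeholder for the training step in the text).
    The loop stops when a round yields nothing new or max_rounds is reached.
    """
    database = set(seed_pairs)
    for _ in range(max_rounds):
        new_pairs = derive(database) - database
        if not new_pairs:
            break
        database |= new_pairs  # data used for training plus the newly obtained data
    return database
```

When new pairing data is input later, the same function can be called again on the union of the stored pairs and the new ones, which corresponds to iterating over all data.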

Thus, under the NI methodology, the database of X language information and Y language information expands automatically: it contains both the pairing data that was input and the training data obtained by expansion through permutation and combination of elements and through training iterations. This is why the database is called a training database. Under the AI methodology, rules must be added to a rule base manually and the rule base cannot expand itself, whereas under the NI methodology the machine can, through self-training, expand the database used to convert X language information into Y language information. While new training data is obtained through loop iteration, the machine's understanding model (including the conversion model from X language information to Y language information) is also iterated, so that the model is optimized and its accuracy improves. Under the natural intelligence methodology, this model optimization is likewise completed automatically by the machine without the work of a model engineer, which can greatly reduce the cost of machine learning.

When X language information is obtained from newly input A language information, it may be compared with the existing X language information in the database (including the X language information expanded by permutation and combination of elements described above), and the Y language information corresponding to it may then be determined from the comparison result, or the accuracy with which it corresponds to a given piece of Y language information may be calculated. If the machine's comprehension is not yet sufficient to map the X language information to a piece of Y language information correctly or with certainty, human-aided understanding is usually required. The newly input A language information is understood by a person, that is, human comprehension is applied to the original natural expression rather than to the converted X language information, to obtain the Y language information that matches the meaning of the natural expression. The X language information obtained from the A language information is paired with this Y language information to form new pairing data, which is stored in the training database and subjected to data expansion and training, so that new data is added to the training database and the model is optimized. Because of this data expansion, each act of human-aided understanding adds not just a single piece of pairing data but one or more groups of pairing data, so the amount of pairing data in the database grows rapidly and the machine's comprehension improves.
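The comparison-then-fallback flow described above can be sketched as follows. The sketch rests on several assumptions: X language information is represented as a plain string, similarity is measured with `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever comparison the system actually uses, and the 0.8 threshold is arbitrary. The names `convert` and `ask_human` are hypothetical.

```python
from difflib import SequenceMatcher


def convert(x_info, training_db, ask_human, threshold=0.8):
    """Map X language information to Y language information.

    training_db: dict of known X -> Y pairs; it is updated in place.
    ask_human:   callable used for human-aided understanding when the best
                 match in the database is not similar enough.
    """
    best_y, best_score = None, 0.0
    for known_x, y in training_db.items():
        score = SequenceMatcher(None, x_info, known_x).ratio()
        if score > best_score:
            best_y, best_score = y, score
    if best_score >= threshold:
        return best_y
    # Not confident enough: a person supplies the Y code, and the new pair
    # is stored so that it is available for later expansion and training.
    y = ask_human(x_info)
    training_db[x_info] = y
    return y
```

In this sketch the human is consulted at most once per distinct input, because the answer is stored and matched exactly the next time.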

In addition, erroneous correspondences between X language information and Y language information in the training database can be corrected through human-aided understanding. For example, a piece of A language information (natural expression) is designated, through human-aided understanding, as corresponding to a particular piece of Y language information, and this Y language information replaces the Y language information previously associated with that A language information; alternatively, the robot is told, through human-aided understanding, that this A language information corresponds to the designated Y language information with higher accuracy than to the previous Y language information. In either case, the correspondence between the Y language information and the X language information converted from the A language information is corrected or optimized.

In the natural intelligence methodology, the above machine learning for expanding and training the paired data can be implemented using one or more models based on statistics, deep learning, probability calculation, fast optimization path search, and the like. The model itself may introduce minor errors, which may be referred to as inherent errors. The effect of such inherent errors becomes visible when a large volume of data is expanded and trained. For example, if the number of expanded data items is 5 million and the error rate obtained by testing with these paired data is 0.2%, then 10,000 pieces of expanded paired data are erroneous. To compensate for such inherent errors, the known X language information that cannot be correctly recognized, together with the Y language information to which it should correctly correspond, is written into a comparison table (look-up table). Each time a natural expression is converted, the X language information converted from the natural expression can be compared with the X language information stored in the comparison table. If the new X language information is in the comparison table, the correct corresponding Y language information is obtained from the comparison table; if it is not, the same or similar X language information is searched for in the training database, and the Y language information corresponding to the new X language information is determined by means such as confidence judgment. Because a comparison table is provided for error correction, and the accuracy of the data in the database outside the comparison table is ensured by screening, a zero error rate can be obtained on data that has already been seen. After new pairing data are added to the training database (usually in batches, to save computation and time) and data expansion training is performed, the comparison table can be extended correspondingly, so as to improve the accuracy of conversion and recognition.
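The comparison-table lookup described above can be sketched as follows. This is only an illustrative sketch: the function name `resolve_y`, the string keys standing in for X language information, and the toy model are assumptions made for the example and do not come from the embodiment.

```python
from typing import Callable, Dict, Optional


def resolve_y(
    x_info: str,
    comparison_table: Dict[str, str],
    model_predict: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the Y code for x_info, consulting the comparison table first.

    comparison_table holds X information that the model is known to get wrong,
    mapped to the Y information it should correspond to (hypothetical layout).
    """
    if x_info in comparison_table:
        # Known-bad case: the corrected answer overrides the model.
        return comparison_table[x_info]
    # Otherwise fall back to the trained model / database search.
    return model_predict(x_info)


# Toy stand-in for the trained model: it mislabels "x-017".
def toy_model(x_info: str) -> Optional[str]:
    return {"x-001": "111000", "x-017": "999999"}.get(x_info)


comparison_table = {"x-017": "112124"}
print(resolve_y("x-017", comparison_table, toy_model))
print(resolve_y("x-001", comparison_table, toy_model))
```

Checking the table before the model is what gives the zero-error behaviour on already-seen data: any X information recorded as misrecognized never reaches the model again.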

FIG. 6 schematically shows a flow of a natural expression processing method according to an embodiment of the present invention.

In step S20, the system receives natural expression information (A language information), which, as previously described, may be text information, voice information, image information, video information, etc.

In step S21, it is determined whether the understanding ability of the robot is mature. This judgment can be made as follows: within a certain time interval (set according to specific application requirements), the robot converts the A language information into X language information and then converts the X language information into Y language information, and the result is compared with the result of manually converting the same A language information into Y language information. The number of identical results divided by the total number of comparisons, expressed as a percentage, is the robot understanding accuracy. The robot itself may also be used to judge whether its understanding ability is mature, that is, the robot estimates the probability or accuracy with which it correctly converts one or more pieces of A language information into the determined Y language information, which is also referred to as the "confidence level" or "confidence value" of the robot. With human-assisted training and the self-learning of the robot, the robot's confidence in converting to specific Y language information improves continuously. The confidence level or confidence value of the robot is calculated based on the correspondence between the X language information and the Y language information. Specifically, after X language information is obtained by conversion or extraction from A language information, a log-probability or similar score for the Y language information is generated by one or more recognizers/classifiers, such as a deep neural network, a finite-state transducer, or an auto-encoder/decoder, and the robot's confidence is calculated using a normalized exponential function (softmax).
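The two maturity measures in step S21 can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the dictionary of per-expression scores, and the example values are hypothetical, and the scores stand in for whatever log-probabilities a recognizer or classifier would produce.

```python
import math
from typing import Dict, List


def softmax_confidence(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn log-probability-like scores into confidences via softmax."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - peak) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def understanding_accuracy(robot_results: List[str], manual_results: List[str]) -> float:
    """Share of comparisons in which robot and manual conversion agree."""
    if not robot_results or len(robot_results) != len(manual_results):
        raise ValueError("need two non-empty result lists of equal length")
    same = sum(r == m for r, m in zip(robot_results, manual_results))
    return same / len(robot_results)


confidences = softmax_confidence({"Y1": 2.0, "Y2": 1.0, "Y3": 0.0})
accuracy = understanding_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
print(confidences, accuracy)
```

The accuracy is a statistic over many accumulated comparisons, whereas the softmax confidence is computed per input, which mirrors the distinction the description draws between the two measures.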

The confidence may be calculated in correspondence with specified Y language information (standard expression), which may be a single piece or a plurality of pieces (more than one piece). For example, for a certain natural expression, the converted standard expression may be "standard expression 1", "standard expression 2", or "standard expression 3", or it is necessary to identify an intention corresponding to one of "standard expression 1", "standard expression 2", and "standard expression 3" from a certain natural expression.

If the confidence is calculated independently for each single standard expression, that is, the confidence that the X language information (secondary language information) converted from the natural expression corresponds to "standard expression 1", "standard expression 2", or "standard expression 3" is calculated separately, the result may be, for example: the confidence of conversion into "standard expression 1" is 80%, the confidence of conversion into "standard expression 2" is 40%, and the confidence of conversion into "standard expression 3" is 10%. If the confidence threshold is set to 80%, only standard expression 1 meets the threshold requirement; if the confidence threshold is set to 90%, none of the three standard expressions meets the threshold requirement; if the confidence threshold is set to 40%, two standard expressions meet the threshold requirement, and the one with the higher confidence may be output as the understanding result, although such a low confidence threshold is generally not set. Under this scheme, the confidence for each standard expression is independent, so the confidences need not sum to 100%. It is also possible to calculate the confidence of understanding with respect to other standard expressions for the same natural expression, except that for the set dialect, it is generally not necessary to calculate the confidence of understanding with respect to standard expressions other than the dialect.
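The threshold comparison in this example can be sketched as follows, using the 80%, 40%, and 10% figures from the text. The function name `select_understanding` and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, Optional


def select_understanding(confidences: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the standard expression with the highest confidence at or above
    the threshold, or None when no standard expression qualifies."""
    qualifying = {y: c for y, c in confidences.items() if c >= threshold}
    if not qualifying:
        return None
    return max(qualifying, key=qualifying.get)


# Independently calculated confidences; they need not sum to 100%.
independent = {
    "standard expression 1": 0.80,
    "standard expression 2": 0.40,
    "standard expression 3": 0.10,
}

print(select_understanding(independent, 0.80))
print(select_understanding(independent, 0.90))
print(select_understanding(independent, 0.40))
```

With a threshold of 40%, two expressions qualify and the one with the higher confidence is returned, as described above.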

If the confidence is calculated for multiple standard expressions at once, a softmax-like function can be used so that the confidences of the multiple standard expressions sum to 100%. For example, the calculated confidence may be 70% for conversion into "standard expression 1", 20% for conversion into "standard expression 2", and 10% for conversion into "standard expression 3". This makes it easier to distinguish between understandings of a natural expression by means of the confidence threshold.

In addition, relative confidence may also be used, i.e., after understanding probabilities are calculated for different standard expressions, the confidence may be further calculated using the mutual numerical comparison relationship between the probabilities. For example, the understanding (recognition) probability for "standard expression 1" is 65%, the understanding (recognition) probability for "standard expression 2" is 35%, and the calculated confidence may be 80% for "standard expression 1" and 20% for "standard expression 2".

The robot understanding accuracy and the robot confidence are both indexes for measuring the maturity of the robot's understanding. The robot understanding accuracy may be a statistical result over a larger amount of training data, in which case it measures the maturity of the robot's understanding more accurately than with less training data. Its calculation generally requires a certain time for data accumulation, because more data represent the diversity of expressions more widely and thereby reflect the actual application situation more accurately. The robot confidence is an evaluation of the robot's comprehension of one or more specific pieces of A language information, and it can evaluate the robot's comprehension ability more accurately when less training data is available. The confidence value is used to measure the reliability of the robot's answer. Generally, when the language metaphor for semantic understanding obtained from the user expression is fuzzy, the robot has a lower confidence than when the language metaphor is clearer, which reflects cases where the user's natural expression is semantically ambiguous or close to multiple meanings. From another perspective, the robot understanding accuracy is a maturity assessment for a particular application, while the confidence reflects the robot's own uncertainty about its answer.

The value of the robot understanding accuracy or of the confidence that is set according to application needs is called the robot understanding maturity threshold. If the robot understanding accuracy or the confidence is lower than the robot understanding maturity threshold, the system considers that the robot's understanding is not yet mature; the robot's conversion result is not adopted, and the manual conversion result Y2 continues to be adopted, so that the accuracy and stability of the system's understanding of A language information are guaranteed. Meanwhile, the system adds the X language information (left language) automatically converted by the machine from the A language information, together with the manual conversion result Y2 (right language), into an MT training data table for self-training of the MT robot.

If the robot's understanding is mature, the robot automatically converts the natural expression A directly into the standard expression Y at step S22. If the robot's understanding is not yet mature, the robot attempts to convert the natural expression A into the standard expression Y1 at step S23, while the MAU seat converts the natural expression A into the standard expression Y2 at step S24.

In step S26, if it was determined in step S21 that the robot's understanding ability has matured, the result Y automatically converted by the robot is output; otherwise, the result Y2 manually converted by the MAU seat is output.

Optionally, in step S25, the natural expression A, the result Y1 of the robot's trial conversion, and the result Y2 of the MAU seat's manual conversion are subjected to the following post-processing: the X language information automatically converted from A (left language) and Y2 (right language) are placed in the MT training data table as a new pair of pairing data; and Y1 and Y2 are compared, the result being used as statistical data for "judging whether the robot understanding is mature". Alternatively, the original data A is retained, and the left-language data of the MT training data table is updated as the A → X conversion technique matures further (achieves higher conversion accuracy).
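The flow of steps S21 to S26 can be sketched as follows. This is a rough sketch only: `TrainingStore`, `process_expression`, and the three converter callables are hypothetical names, and the toy converters merely stand in for the A → X conversion, the robot's X → Y conversion, and the MAU seat.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TrainingStore:
    """Hypothetical stand-in for the MT training data table and statistics."""
    pairs: List[Tuple[str, str]] = field(default_factory=list)  # (X, Y2)
    agreements: List[bool] = field(default_factory=list)        # Y1 == Y2


def process_expression(
    a_info: str,
    to_x: Callable[[str], str],
    robot_convert: Callable[[str], str],
    mau_convert: Callable[[str], str],
    is_mature: bool,
    store: TrainingStore,
) -> str:
    x_info = to_x(a_info)
    if is_mature:
        return robot_convert(x_info)        # S22, output at S26
    y1 = robot_convert(x_info)              # S23: trial conversion by the robot
    y2 = mau_convert(a_info)                # S24: human works from the original A
    store.pairs.append((x_info, y2))        # S25: new pairing data (left, right)
    store.agreements.append(y1 == y2)       # S25: statistics for maturity judgment
    return y2                               # S26: manual result is output


def toy_robot(x_info: str) -> str:
    return "Y-greet" if "hello" in x_info else "Y-unknown"


store = TrainingStore()
immature_out = process_expression("Hi there", str.lower, toy_robot, lambda a: "Y-greet", False, store)
mature_out = process_expression("Hello", str.lower, toy_robot, lambda a: "Y-greet", True, store)
print(immature_out, mature_out, store)
```

Note that the manual conversion receives the original natural expression A, while the pair stored for training uses the machine-converted X information on the left, consistent with the post-processing described for step S25.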

FIG. 7 schematically shows a flow of a natural expression processing and responding method according to an embodiment of the present invention.

In the process shown in FIG. 7, the natural expression A is first received in step S30.

It is then determined at step S31 whether the natural expression A can be converted into the standard expression Y by machine conversion. This step is equivalent to step S21 in FIG. 6. As in the process of FIG. 6, when it is judged at step S31 that the desired standard expression cannot be obtained by machine conversion, manual conversion is performed at step S32.

In actual applications, there may be cases in which the natural expression cannot be recognized, or the client's need cannot be understood, even by manual processing. In such a case, a response prompting the client to re-enter is made at step S33, and the process then returns to step S30 to receive the natural expression information A input again by the client. The response prompting the client to re-enter may be, for example, a voice prompt such as "Sorry, could you please say that again?" or "Could you please speak more slowly?"; a text prompt such as "Sorry, could you please write that more specifically?"; or an image prompt, etc.

The standard expression obtained by machine conversion or manual conversion is output at step S34. A standard response matching the standard expression is queried at step S35. The standard response may be fixed data pre-stored in a database; alternatively, basic data of the standard response may be pre-stored in the database, and the system then combines the basic data with case-specific variable parameters at run time to generate the standard response. In one embodiment, a standard response ID is set as the primary key of the response data, and a correspondence table between the request codes of standard expressions (Y language information) and standard response IDs is set in the database, thereby associating the request code of a standard expression (Y language information) with the response data. Tables 1 to 3 below schematically show examples of an expression data table, an expression-response relation table, and a response data table, respectively. Alternatively, the relationship between standard expressions and standard response IDs may be many-to-one, as shown in Table 4. In other embodiments, since the request code of the standard expression (Y language information) is itself a code, it may be used directly as the primary key of the response data.

TABLE 1

TABLE 2

TABLE 3

TABLE 4

As previously described, the standard expression may include information related to the natural expression, such as expression type, language type, dialect type, and the like. For example, if the natural expression from the customer is the speech "Received", the standard response found by querying with the converted standard expression is the speech "OK, understood, thank you!"; as another example, if the natural expression from the customer is an image, "screenshot of the transfer failure page", the standard response found by querying with the converted standard expression is a video, "simple tutorial on correcting transfer errors".

If no standard response in the database matches the standard expression, a corresponding response may be matched manually at step S36. In manual matching, the standard expression may be associated with a standard response ID by entering or selecting the standard response ID, the standard expression may be associated directly with response data, or new response data may be created. A standard response may fail to be found because the standard expression was newly added manually, or because no standard response of the same type has been matched.
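The query at step S35 and the manual matching at step S36 can be sketched with two in-memory tables. The layout below is an assumption made for illustration and is not the content of Tables 1 to 4; the response IDs, codes, and function names are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical expression-response relation table: request code of a standard
# expression -> standard response ID (many-to-one is allowed).
expression_to_response_id: Dict[str, str] = {"112124": "R001", "112224": "R001"}

# Hypothetical response data table keyed by standard response ID.
response_data: Dict[str, str] = {"R001": "Your ticket change request has been received."}


def find_standard_response(request_code: str) -> Optional[str]:
    """Step S35: look up the response for a standard expression, if any."""
    response_id = expression_to_response_id.get(request_code)
    if response_id is None:
        return None  # no match: fall through to manual matching (S36)
    return response_data.get(response_id)


def match_manually(request_code: str, response_id: str, new_data: Optional[str] = None) -> None:
    """Step S36: associate an expression with an existing or newly created response."""
    if new_data is not None:
        response_data[response_id] = new_data
    if response_id not in response_data:
        raise KeyError(f"unknown standard response ID: {response_id}")
    expression_to_response_id[request_code] = response_id


before = find_standard_response("131000")
match_manually("131000", "R002", "Transfer completed.")
after = find_standard_response("131000")
print(before, after)
```

Keeping the relation table separate from the response data is what allows several standard expressions to share one response without duplicating the response content.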

Then, the response obtained by machine matching or manual matching is output at step S37. The content of the response may be invoked or generated according to the information type. For example, for a voice response, a recording of a human voice may be played back, or speech generated by text-to-speech (TTS) synthesis may be output; for a numeric operation by the user, for example the telephone key sequence "2-5-1000", the operation "repay 1000 yuan to the credit card" is completed by running a program.

For example, for the text information "transfer 5000 yuan to my mother", the operation "transfer 5000 yuan to Ms. X" needs to be performed by running a program, but the system may not hold the account information of "Ms. X" in advance. On the one hand, the account information may need to be added manually to complete the conversion into a standard expression; on the other hand, even if the conversion into a standard expression is completed, a corresponding standard response may not be found, and a response needs to be produced manually. In that case, new response data (e.g., an operating program) is generated, and a new standard response ID is assigned to the response data, either manually or automatically, and associated with the converted standard expression. Therefore, while responding to the client's natural expression, human-assisted understanding and training are also carried out, and the expression-response database is updated.

According to the natural expression processing and responding method provided by the embodiment of the invention, the standard expression can be used for quickly pointing to response, so that a client does not need to traverse a complex conventional function menu for a long time to find self-service required by the client.

On the other hand, unlike the conventional response mode, the manual operation is mainly limited to "decision" work in the background, including determining the requirement code of the standard expression (Y language information), selecting the response (or response ID), or generating the response operation. It is not necessary for the operator to communicate with the client directly in the foreground by voice call or text input (other than inputting the requirement parameters of the standard expression (Y language information)); this is the aforementioned silent seat mode. Human resources can thereby be saved in large quantities and work efficiency greatly improved. In addition, compared with the free-form response that a human agent traditionally provides directly to the customer, the standardized response that the system provides to the customer is not affected by factors such as the agent's emotion, voice, accent, and business proficiency, and can therefore better ensure a stable customer experience.

Moreover, through automatic learning, training and artificial assistant understanding of the system (robot), a converted natural expression (X language information) -standard expression-standard response database can be established, and automatic understanding and response of the system are gradually realized. And the X language information data in the database can also have the advantages of fine information granularity, narrow service category, high data fidelity and the like, so that the training difficulty of the robot is reduced, and the intelligent mature period of the robot is shortened.

According to the embodiment of the invention, the flow of the human-computer interaction can be controlled by setting the confidence threshold value of the robot. For example, a first confidence threshold is set as a criterion for determining whether the robot comprehension is mature. The human-computer interaction can be intelligently controlled by setting other thresholds of confidence. For example, a second confidence threshold is set, and when the confidence of the robot is lower than the first confidence threshold but not lower than the second confidence threshold, the robot asks the user to confirm whether the meaning of the natural expression input by the user is a certain standard expression. For another example, a third confidence threshold value is set, when the confidence of the robot is lower than the second confidence threshold value but not lower than the third confidence threshold value, the robot asks the user to repeat the input of natural expression, and when the confidence of the robot is lower than the third confidence threshold value, the robot automatically switches to the artificial assistance understanding.

For example, FIG. 15 illustrates an example of intelligent human-machine interaction for identity authentication. In the example shown in FIG. 15, the robot asks the user a question to confirm whether the user is the person in question. The natural expression input by the user in reply is speech, which can express one of four meanings: "yes", "no", "did not hear clearly", or "not interested or unwilling".

In the example shown in FIG. 15, the first confidence threshold is set to 80% and the second confidence threshold is set to 60%. When the robot's confidence in its understanding of the meaning of the user's voice reply (the first voice answer) is not less than 80%, that is, when the robot understands the voice reply as one of "yes", "no", "did not hear clearly", or "not interested or unwilling" with a confidence of not less than 80%, the robot takes that meaning as the meaning corresponding to the user's voice. In general, when the standard expressions (corresponding to the meanings or intentions of natural expressions) are designed to be well differentiated, the confidence of more than one standard expression does not exceed 50% at the same time. However, confidences may be skewed high because of bias in the training data and the like; in that case, if the confidence of more than one standard expression exceeds the threshold, the standard expression with the highest confidence can be automatically selected as the understanding result. It is also possible to calculate the confidence for the standard expressions jointly so that the confidences sum to 100%, in which case no more than one standard expression can have a confidence of more than 50%.

When the robot's confidence in its understanding of the meaning of the user's first voice answer is less than 80% but not less than 60%, for example when the robot understands that the meaning of the user's answer may be "no" but is not sure (60% ≤ CL < 80%, where CL denotes the confidence level), the robot asks the user to confirm whether the meaning of the answer is "no". The user then inputs a second voice answer, and the robot interprets it. If the robot's confidence in its understanding of the second voice answer is not less than 80% and the robot thereby obtains the user's confirmation ("yes") of the meaning of the first voice answer, the robot, according to the user's confirmation, understands the meaning corresponding to the user's voice as the confirmed one of "yes", "no", "did not hear clearly", or "not interested or unwilling". If the robot's confidence in its understanding of the second voice answer is less than 80%, or the user's confirmation result is negative, the robot turns to human-assisted understanding or a manual response.

When the robot's confidence in its understanding of the meaning of the user's first voice answer is less than 60%, or the robot cannot understand the meaning of the answer at all, the robot asks the user to answer again. The user then inputs a voice answer again, and the robot interprets it. If the robot's confidence in its understanding of this second voice answer is not less than 80%, the robot understands the meaning corresponding to the user's voice as one of "yes", "no", "did not hear clearly", or "not interested or unwilling"; if the confidence is still less than 80%, the robot turns to human-assisted understanding or a manual response.

Considering the influence of the number of interactions on the user experience, the above description of the example shown in fig. 15 only sets the user to answer in two rounds. Alternatively, the number of rounds of user responses may be increased, for example, in the second round, if the robot's understanding confidence level for the user's second voice response is less than 80% but not less than 60%, the user may still be asked to make a third voice response to confirm whether the robot's understanding of the meaning of the second voice response is correct.

In addition, a third confidence threshold value can be set, for example, 40%, when the confidence level of the robot for understanding the meaning of the first voice answer of the user is lower than 40%, the robot automatically switches to the manual auxiliary understanding process, so that the interaction times can be reduced, and the user experience can be improved.

Of course, the first confidence threshold, the second confidence threshold, the third confidence threshold, and the like may be set as needed, for example, the first confidence threshold is 90%, the second confidence threshold is 50%, and the like.
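The threshold-based control of the first round of interaction can be sketched as follows. The names `Action` and `route`, and the default values of 80%, 60%, and 40% taken from the FIG. 15 example, are assumptions made for illustration; an actual system would set the thresholds as needed.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept the robot's understanding"
    CONFIRM = "ask the user to confirm the understood meaning"
    REPEAT = "ask the user to repeat the expression"
    HANDOFF = "switch to human-assisted understanding"


def route(confidence: float, first: float = 0.80, second: float = 0.60, third: float = 0.40) -> Action:
    """Map an understanding confidence to the next interaction step."""
    if not (first > second > third):
        raise ValueError("thresholds must satisfy first > second > third")
    if confidence >= first:
        return Action.ACCEPT
    if confidence >= second:
        return Action.CONFIRM
    if confidence >= third:
        return Action.REPEAT
    return Action.HANDOFF


for value in (0.85, 0.70, 0.50, 0.30):
    print(value, route(value).value)
```

Changing the defaults, for example to a first threshold of 90% and a second threshold of 50%, only shifts the band boundaries; the control logic stays the same.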

The above manner of performing multiple rounds of interaction based on the understanding confidence can be regarded as feedback control of the robot through the user's expression input: a logic output is generated according to the robot's confidence in understanding the expression input by the user, and the interaction process is logically controlled by this output. The direct effect of this scheme is that the workload of human-assisted understanding or manual response can be greatly reduced. For example, if the robot's actual accuracy in understanding the user's expression of a certain meaning is 60%, then after the user expresses the same meaning twice, the probability that the robot understands correctly at least once (treating the two attempts as independent) is theoretically 100% - (100% - 60%) × (100% - 60%) = 84%. Likewise, if the probability that the robot's confidence in understanding the user's expression of a certain meaning is not less than 80% is 60%, then after the user expresses the same meaning twice, the probability that the robot understands correctly at least once is theoretically 100% - (100% - 60%) × (100% - 60%) = 84%. In fact, after the user is prompted with, for example, "Sorry, I did not hear you clearly, could you please say that again?", the clarity of meaning and/or pronunciation of the user's second voice input often improves, so the robot's rate of correctly understanding the user's second expression of the same meaning is generally higher. Thus, the understanding accuracy of the robot can be increased by automatically prompting the user to repeat the meaning.
Further, the user may be automatically prompted to confirm the meaning. Since the expressions users employ for confirmation are usually simple ones such as "yes", "right", and "no", the robot's understanding accuracy or confidence for such confirming expressions is usually high, for example 90% to 100%. By combining the meaning repeatedly expressed by the user with the meaning confirmed by the user over multiple rounds of interaction, the robot can therefore respond automatically in most cases, so that the volume of user expressions transferred to manual processing (including silent seats performing human-assisted understanding, or manual customer service) is greatly reduced.
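The 84% figure given above for two rounds can be checked with a short calculation. The sketch assumes that the attempts are independent and equally likely to succeed, which is a simplification; the function name is hypothetical.

```python
def prob_understood_at_least_once(per_attempt: float, attempts: int) -> float:
    """Probability of at least one correct understanding in `attempts`
    independent tries, each succeeding with probability `per_attempt`."""
    if not 0.0 <= per_attempt <= 1.0 or attempts < 1:
        raise ValueError("per_attempt must be in [0, 1] and attempts >= 1")
    return 1.0 - (1.0 - per_attempt) ** attempts


two_rounds = prob_understood_at_least_once(0.60, 2)
three_rounds = prob_understood_at_least_once(0.60, 3)
print(round(two_rounds, 4), round(three_rounds, 4))
```

Because repeated or confirming inputs are usually clearer than the first input, the independence assumption tends to understate the benefit of a second round.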

On the other hand, in the case where the user confirms the meaning by the expression input, the X language information obtained by converting the expression input by the user for the first time and the preset meaning (the standard expression indicated by the Y language information) corresponding to the expression may be stored as the paired data in the training database, and the paired data may be trained in the manner described above. In this way, the paired data of the extended training database can be generated only by the expression input of the user without understanding or confirming the meaning of the expression of the user by human assistance. That is, intelligent data accumulation and robot (engine) automatic learning at the server can be realized.

The user expressions involved in the human-computer interaction scheme shown in the example of fig. 15 may be speech or other expressions. If the scheme is a man-machine voice interaction scheme, the implementation can be controlled through IVR.

The natural expression processing principle and method based on natural intelligence and the interaction method are set forth above, and accurate information extraction based on human-computer interaction of natural intelligence is described next. It should be emphasized that the precise information extraction in human-computer interaction according to the embodiment of the present invention may completely cover the above natural expression processing method based on natural intelligence, and it can also be said that the precise information extraction is realized by adding or adjusting partial steps on the basis of the above method.

Colloquially, the "precise information extraction method" obtains multiple intentions from one natural expression. As described above, the natural expression is not limited to natural language and may also be a still image, a moving image, or the like. In terms of the natural intelligence system above, it can also be understood as obtaining the Y language information corresponding to each of the intentions from one natural expression. Specifically, the robot converts the A language information into X language information, then identifies by analysis and calculation the parts of the X language information that correspond to the preset intentions, and then converts each of these parts into Y language information. That is, compared with the foregoing natural expression processing procedure, the method performs recognition and extraction of key information at the X language information layer and adopts precise (local) conversion instead of whole conversion. This approach may improve the accuracy of robot understanding: especially for natural expressions containing multiple pieces of key information that embody intentions, precise conversion is more accurate than whole conversion. After the intentions in the natural expression are acquired by the precise information extraction method, each intention is filled into the slot corresponding to its intention category, thereby realizing slot filling.

For example, in the code "112124" (Y language information), the digits represent, in order: 1 (airline ticket), 1 (ticket change), 2 (postponement), 1 (discount ticket), 2 (off season), and 4 (Asia-Pacific region), where the first three digits correspond to the operation and the last three digits correspond to the object. The full intention is to change an off-season discount ticket for the Asia-Pacific region to a later date. Assuming that the key information of the demand is classified into slot 1 (operation) and slot 2 (object), slot filling may fill "postponing ticket change" into slot 1 and "Asia-Pacific off-season discount ticket" into slot 2. The slot arrangement, like the encoding of the Y language information described earlier, can be extended, reduced, or adjusted as desired. For example, if slot filling is defined for ticketing operations involving airline tickets, then "postponing ticket change" may be filled into slot 1; and if the object of the ticketing operation is refined so that slot 2 corresponds to the discount type of the airline ticket and slot 3 corresponds to the destination, then "off-season discount ticket" is filled into slot 2 and "Asia-Pacific region" is filled into slot 3. Slot filling is therefore a process of classifying and extracting intentions at the granularity required and storing them by category. As before, if the subdivided intentions are represented by codes, "12" may be filled into slot 1, "12" into slot 2, and "4" into slot 3.
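The two slot arrangements for the code "112124" can be sketched as follows. The function names and slot keys are hypothetical; only the digit positions and the resulting slot contents are taken from the example above.

```python
from typing import Dict


def _check_code(code: str) -> None:
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected a six-digit Y language code")


def fill_two_slots(code: str) -> Dict[str, str]:
    """Slot 1 = operation (first three digits), slot 2 = object (last three)."""
    _check_code(code)
    return {"slot 1": code[:3], "slot 2": code[3:]}


def fill_refined_slots(code: str) -> Dict[str, str]:
    """Refined arrangement for airline-ticket operations: the leading
    'airline ticket' digit is implied by the slot definition and dropped."""
    _check_code(code)
    return {
        "slot 1": code[1:3],  # ticketing operation, e.g. "12" = postponing ticket change
        "slot 2": code[3:5],  # discount type, e.g. "12" = off-season discount ticket
        "slot 3": code[5],    # destination, e.g. "4" = Asia-Pacific region
    }


print(fill_two_slots("112124"))
print(fill_refined_slots("112124"))
```

Both functions read the same code; only the slot boundaries differ, which illustrates how the slot arrangement can be adjusted without changing the underlying encoding.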

Further, the specific intentions may be very numerous. For example, the ticket destinations corresponding to the aforementioned slot 3 may include hundreds of international airports and thousands of domestic airports; in that case, a specific airport may be indicated by a code formed from a combination of letters (such as the three-letter airport codes established by the International Air Transport Association (IATA)) or a combination of letters and numbers. However, such codes are sometimes not easy for human-assistance personnel to memorize and enter. These personnel can instead fill the slot directly with an expression of the specific destination (such as the airport name), which the system automatically maps to the code, for example by directly filling in a city name or city code, e.g., Shanghai, hun, etc.

Under the AI-based natural expression processing methodology, "slot filling" through intention acquisition or information acquisition also applies a grammar model and a semantic model to obtain the key information to be extracted from the text obtained by converting the natural expression. Under the NI-based natural expression processing methodology, the conversion of a natural expression into a standard expression is in essence already an understanding of the natural expression, so there is still no need to go through a grammar model and a semantic model; instead, the natural expression is mapped to the part of the standard expression that carries the intent to be extracted, based on the aforementioned machine translation principles. In other words, in information acquisition (extraction), only the information related to the intent to be extracted is converted into the standard expression (Y language information). For example, for the natural expression "I fly from Shanghai to Beijing tomorrow and then go home", if an information extraction and slot filling operation is performed for a flight reservation, and slot 1 corresponds to "departure place", slot 2 to "destination", and slot 3 to "date", then slot 1 is filled with "Shanghai", slot 2 with "Beijing", and slot 3 with "tomorrow" or with the specific date corresponding to "tomorrow" as automatically determined by the system. Other information, such as ticket orderer information or information on the destination after arrival, is not used in this information extraction and slot filling operation.

In the conversion process, the robot converts the A language information into X language information, extracts from the X language information the part corresponding to the Y language information of the slot to be filled, converts that part into Y language information, and fills the slot. That is, compared with the natural expression processing procedure described earlier, this method identifies and extracts the key information at the X language information layer and performs a precise (partial) conversion rather than a conversion of the whole expression. This approach may improve the accuracy of the robot's understanding: especially for natural expressions containing multiple pieces of key information that embody intent, precise conversion is more accurate than whole-expression conversion.

FIG. 8 schematically illustrates an information extraction and slot filling process for a natural intelligence based human-computer interaction system, according to one embodiment of the invention.

In step S40, the system receives natural expression information (A language information), which may be text information, voice information, image information, video information, etc., as previously described.

In step S41, it is determined whether the accurate information extraction capability (also simply called the "intention acquisition capability") of the robot is mature. The judgment is based on comparisons made within a certain time interval (set according to the specific application requirements). In each comparison, the robot converts the A language information into X language information, extracts from the X language information the part corresponding to the Y language information of the slot to be filled, and converts that part into Y language information; this result is compared with the Y language information of the slot to be filled that a human obtains directly from the A language information. The number of comparisons in which the two results are the same, divided by the total number of comparisons, gives a percentage, namely the accurate information extraction accuracy or intention acquisition accuracy of the robot.

Similarly, the robot may itself judge whether its understanding ability is mature. That is, the robot estimates the probability that it correctly obtains the information for a certain intention from a certain piece or a certain class of A language information; this estimate is called the "accurate information extraction self-credibility" or "intention acquisition self-credibility" (colloquially, the "slot filling self-credibility"). With human-assisted training and the robot's self-learning, the robot's intention acquisition self-credibility for a specific intention continuously improves. The robot's estimate of the probability of correct intention acquisition may be based on a comparison, at the X element level, between the A language information it is currently processing and the A language information already in the MT training data table. Specifically, the confidence is calculated from the correspondence between the X language information (secondary language information) and the Y language information (standard expression) using a normalized exponential function (softmax), where the log probabilities or similar scores for the Y language information are generated by one or more of a deep neural network, a finite state transducer, and an auto-encoder/decoder.
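
The normalized exponential function mentioned above can be illustrated with a minimal sketch. The scores passed in are assumed to be log probabilities or similar scores produced by whatever model the robot uses; the function names `softmax` and `self_credibility` and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Normalized exponential function over a list of scores."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_credibility(candidate_scores: dict):
    """Return the best-scoring Y candidate and its softmax probability.

    candidate_scores maps each candidate Y language value to its score.
    """
    labels = list(candidate_scores)
    probs = softmax([candidate_scores[label] for label in labels])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]
```

For instance, `self_credibility({"PEK": 2.0, "PVG": 0.0})` selects "PEK"; the returned probability could then be compared against a maturity threshold.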

The robot intention acquisition accuracy or intention acquisition self-credibility required by the application is called the "robot intention acquisition maturity threshold". If the robot's intention acquisition accuracy or intention acquisition self-credibility is lower than this threshold, the system considers that the robot's intention acquisition capability is not yet mature; the robot's intention acquisition result YF1 is not adopted, and the human intention acquisition result YF2 continues to be adopted, which ensures the accuracy and stability of the system's intention acquisition for the A language information. Meanwhile, the system adds the X language information (left language) automatically converted by the machine from the A language information, together with the human intention acquisition result YF2 (right language), to the MT training data table (i.e., the training database) for self-training of the MT robot.

If the robot's accurate information extraction capability is mature, then in step S42 the robot automatically performs the intention acquisition and slot filling operation: it converts the A language information into X language information, extracts from the X language information the part corresponding to the Y language information of the slot to be filled, converts that part into Y language information, and fills the slot. If the robot's accurate information extraction capability is not yet mature, then in step S43 the robot attempts to obtain, by conversion from the natural expression A, the standard expression YF1 to be extracted and fills it into the slot, while in step S44 the MAU seat directly obtains from the A language information the Y language information YF2 needed to fill the slot and fills the slot.

Optionally, in step S45, the following post-processing is performed on the natural expression A, the extraction and conversion result YF1 that the robot attempted to obtain from the natural expression A, and the result YF2 of manual extraction and conversion by the MAU seat: A is automatically converted into X language information (left language), which is put into the MT training data table together with YF2 (right language) as a new pair of paired data; and YF1 and YF2 are compared, the comparison being used as statistical data for judging whether the robot's accurate information extraction capability is mature. Optionally, the original data A is retained, and the left language data of the MT training data table is updated as the A → X conversion technique matures further (i.e., achieves higher conversion accuracy). If this post-processing is performed directly on the extracted and converted YF1 and YF2, it is not actually necessary to fill YF1 into the slot in step S43. The slot data filled with YF1 and with YF2 respectively can also be used as training data or statistical data.
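
Steps S41 to S45 can be summarized in a short sketch. It is not the system's actual implementation: the class `SlotFillingSession` and its callables `to_x`, `robot_extract` and `human_extract` are hypothetical stand-ins for the A → X conversion, the robot's extraction and the MAU seat, and the accuracy is computed over all recorded comparisons rather than over a time interval.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Slots = Dict[str, str]


@dataclass
class SlotFillingSession:
    to_x: Callable[[str], str]             # A -> X conversion (stand-in)
    robot_extract: Callable[[str], Slots]  # X -> Y for the slots, yields YF1
    human_extract: Callable[[str], Slots]  # MAU seat, A -> YF2
    maturity_threshold: float
    mt_training_table: List[Tuple[str, Slots]] = field(default_factory=list)
    comparisons: List[bool] = field(default_factory=list)

    def accuracy(self) -> float:
        """Share of comparisons in which YF1 equalled YF2."""
        if not self.comparisons:
            return 0.0
        return sum(self.comparisons) / len(self.comparisons)

    def handle(self, a_info: str) -> Slots:
        x_info = self.to_x(a_info)
        if self.accuracy() >= self.maturity_threshold:  # S41: mature?
            return self.robot_extract(x_info)           # S42: robot fills slots
        yf1 = self.robot_extract(x_info)                # S43: robot attempt
        yf2 = self.human_extract(a_info)                # S44: human result
        self.mt_training_table.append((x_info, yf2))    # S45: new (left, right) pair
        self.comparisons.append(yf1 == yf2)             # S45: maturity statistics
        return yf2                                      # human result is adopted
```

While the robot is below the threshold every request both serves the user with the human result and enlarges the training table; once the threshold is reached, requests bypass the human path.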

If the A language information is text, the characters themselves are taken as X elements, or the text is converted into X language information as described above, and the subsequent operations are then performed.

FIG. 9 further illustrates the natural expression intent acquisition and slot filling process flow under the "ticket booking" query item. As shown in FIG. 9, after the process starts, the natural expression "Tomorrow I fly from Shanghai to Beijing and go home in the evening" is received in step ES11; as previously described, it may be in the form of voice, text, etc. In step ES12 it is determined whether the intention embodied by the expression belongs to the "ticket booking" query item. If not, the user is prompted that the current query item is "ticket booking", or is requested to confirm that the current demand is "ticket booking", and is then asked to re-enter the expression. Step ES12 may also be performed at the start of the process, before the user enters the expression, i.e., the user is prompted with the current query item. Then, in step ES13, it is further determined whether the user is booking the air ticket for himself or for someone else. The user may input "my mom", "my wife", "the director" and so on; if the robot can recognize the specific persons corresponding to these expressions and has their information, the ticket orderer information may be automatically filled into the corresponding slot. If step ES13 determines that the user is booking for himself, the robot further extracts the "departure place", "destination" and "date" information in step ES15. This extraction is consistent in principle and basic method with the aforementioned conversion from X language information to Y language information, except that only the "departure place", "destination" and "date" information is precisely extracted and converted into Y language information. After the intent acquisition and slot filling operation in step ES15, the robot may ask or autonomously determine whether the user has other intentions. In this example the user's expression also includes "go home", so after finding this expression the robot may perform further processing, for example asking or proactively prompting whether the user needs an airport pickup service, and may fill the user's home address (if the robot's knowledge base contains it) into the "airport pickup destination" slot. After the required information is filled into the slots, the robot may perform a corresponding response operation, such as displaying or audibly informing the customer of flight information that may meet the customer's requirements.

It is worth noting that the steps in FIG. 9 may each include the aforementioned process (e.g., as shown in FIG. 7 or FIG. 8) of converting the natural expression into a standard expression or of performing intent acquisition and slot filling based on the natural expression. For example, from the natural expression, the query item is judged (ES12), the ticket orderer is determined (ES13, ES14), the "departure place", "destination" and "date" are determined (ES15), other intentions are determined (ES16), and subsequent slot filling processing is performed (ES17).

The precise extraction of this information does not have to be completed in a single conversion of the natural expression into a standard expression. If the extraction were carried out at one time, it would actually involve 7 slots: the "query item" slot is filled with "air ticket booking", the "ticket orderer" slot with "self", the "departure place" slot with "Shanghai", the "destination" slot with "Beijing", the "date" slot with "tomorrow" or the corresponding date, the "other intention" slot with "airport pickup", and the "airport pickup destination" slot with "home" or a specific home address. This greatly increases the amount of computation and data flow required for the robot's understanding, and also increases the operational complexity of human-assisted understanding. Therefore, the extraction of the information for the respective slots and the filling of those slots can be achieved through multiple understandings. For example, the first understanding determines that the "query item" (i.e., the frequently asked question "FAQ" described later) is "air ticket booking", so that the A/X → Y database can be narrowed down to the range corresponding to that query item, which greatly reduces the amount of data and computation required for the robot's understanding and training and greatly increases the convergence speed of the iterative operations. The "query item" may also be determined by default or by user selection. Likewise, whether the user is booking for himself may be determined by a further understanding, by default, or by user selection. The information corresponding to each slot can also be obtained step by step through a multi-turn conversation. For example, the robot first asks, by means of text, speech, images and the like, which service the user requires, namely "ticket booking"; then asks for whom the user is booking the ticket; then asks for information such as the "departure place", "destination", "date", "preferred time period" and "price"; and finally asks whether there are other requirements (such as airport pickup).
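
The two ideas in this paragraph, narrowing the pairing database by query item and collecting slot values over several turns, can be sketched as follows. The table `PAIRING_TABLE`, the list `SLOT_ORDER` and both functions are hypothetical; the X and Y values are placeholders rather than real X or Y language information.

```python
# Hypothetical X -> Y pairing records, each tagged with its query item.
PAIRING_TABLE = [
    {"query_item": "ticket booking", "x": "x-seq-1", "y": "departure=Shanghai"},
    {"query_item": "ticket booking", "x": "x-seq-2", "y": "destination=Beijing"},
    {"query_item": "change password", "x": "x-seq-3", "y": "reset"},
]

# Assumed order in which the robot asks for slots that are still empty.
SLOT_ORDER = ["ticket orderer", "departure place", "destination", "date"]


def narrow_by_query_item(table, query_item):
    """Keep only the pairs that belong to the already understood query item."""
    return [row for row in table if row["query_item"] == query_item]


def next_slot_to_ask(filled):
    """Return the next empty slot to ask about, or None when all are filled."""
    for slot in SLOT_ORDER:
        if slot not in filled:
            return slot
    return None
```

After the first understanding yields "ticket booking", later matching only needs to search the narrowed table, and `next_slot_to_ask` drives the remaining turns of the conversation.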

An upper-level slot above the query item, namely an application scenario slot, can also be set in order to classify query items. For example, the query item "ticket booking" may be attributed to the application scenario "travel". The upper-level slot corresponds to the intention of the layer above.

Not only can multiple slots of one query item be filled in one slot filling operation, but slots spanning multiple query items can also be filled in one slot filling operation. For example, the expression "I want to go from Shanghai to Beijing and have a meal at Majord at 7 am" may involve two application scenarios, "travel" and "dining", whose query items are "ticket booking" and "meal ordering" respectively. Then, in one slot filling operation, in addition to filling the "departure place", "destination", "date" and "time" slots under "ticket booking", the "restaurant name" and "number of diners" slots under "meal ordering" are filled at the same time. This way of processing an expression, in which multiple slots are filled through a single understanding, can reduce the number of conversation turns, save the user's time and greatly improve the user experience. An intelligent human-computer interaction system with such a slot filling function can serve as a Virtual Personal Assistant (VPA) and achieve synchronized cross-platform docking, for example docking with a travel website for the "travel" application scenario, and with Dianping (Meituan) for the "dining" application scenario.
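
One way to hold the result of such a cross-query-item slot filling operation is a nesting of application scenario, query item and slot. The sketch below assumes this layout; the mapping `SCENARIO_OF`, the function `group_slots` and the sample values are illustrative only.

```python
from collections import defaultdict

# Assumed mapping of query items to their upper-level application scenario.
SCENARIO_OF = {"ticket booking": "travel", "meal ordering": "dining"}


def group_slots(extracted):
    """Group (query_item, slot, value) triples from one understanding
    into scenario -> query item -> slot -> value."""
    grouped = defaultdict(lambda: defaultdict(dict))
    for query_item, slot, value in extracted:
        grouped[SCENARIO_OF[query_item]][query_item][slot] = value
    # Convert the nested defaultdicts to plain dicts for storage or display.
    return {scenario: dict(items) for scenario, items in grouped.items()}
```

Each top-level key can then be handed to the platform that serves the corresponding application scenario.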

If the number of slots to be filled at a time is too large, the amount of data required for operation and the complexity of operation are greatly increased, and therefore, a preset balance or dynamic balance between user experience and processing efficiency/resource consumption can be sought in a specific application.

The natural-intelligence-based natural expression processing method, human-computer interaction method and accurate information extraction method can in particular be applied to customer service systems such as an Interactive Voice Response (IVR) system or an Internet Call Center System (ICCS), or to other remote customer contact systems (such as a telephone sales system, a network sales system, a VTM intelligent remote terminal, etc.). As previously mentioned, in such applications machine translation is not required to reproduce the exact literal meaning; rather, the customer's natural expression must be converted into information that the system can understand, so that the customer can be given a response corresponding to the expression. That is, machine translation here focuses on understanding the essential meaning behind human language, so that the customer's actual intent or need, as "understood" from the natural expression, is represented in a form that a computer program can handle more easily.

FIG. 10 schematically shows an intelligent human-computer interaction system according to an embodiment of the invention. As shown in FIG. 10, the intelligent human-computer interaction system includes an intelligent answering device 1 (corresponding to the server side) and a calling device 2 (corresponding to the client side). A client 8 communicates with the intelligent answering device 1 through the calling device 2, and an MAU human agent 9 (system service staff) operates the intelligent answering device 1 manually. The intelligent answering device 1 comprises a dialogue gateway 11, a central controller 12, an MAU workstation 13 and a robot 14. Optionally, the intelligent answering device 1 further comprises a trainer 15.

The client 8 is the target of an organization's remote sales and remote services. Remote sales generally means that the organization actively contacts customers, in the form of "outgoing calls" through its own dedicated telephone or internet channel, in an attempt to market its products and services. Remote service generally means that the organization's customers actively contact the organization, in the form of "incoming calls" through the organization's own telephone or internet channel, to ask about or use its products and services.

The calling device 2 may be a dedicated telephone channel or internet channel set up by an organization for remote sales to the client 8 (outgoing traffic) and for providing remote services to the client (incoming traffic). A telephone channel call system, such as an Automatic Call Distribution (ACD) system, is a conversation channel through which an organization interacts with the client 8 by voice via a background automatic service system (e.g., a traditional IVR system based on telephone keypad technology, or a newer Voice Portal (VP) system based on intelligent voice technology) and human agents.

An internet channel call system, such as an Internet Call Center (ICC) based on Instant Messaging (IM) technology, is a conversation channel through which an organization interacts with the client 8 in the form of text, voice, image, video, etc. via a background customer self-service system (e.g., one based on Natural Language Processing (NLP)) and human agents.

The intelligent answering device 1 enables an organization to control its background automatic service system and human agents and to converse with the client 8 in multimedia forms such as text, voice, images and video, thereby achieving standardized and automated interactive conversation between the organization and the client.

The dialogue gateway 11 plays the role of "front portal" in the intelligent answering device 1. Its main functions include: receiving, from the client 8 through the calling device 2, irregular natural expressions (in the form of text, voice, images or video) and regularized non-natural expressions (in the form of telephone keypad presses and the like), and sending them to the central controller 12 for subsequent processing; and receiving instructions from the central controller 12 and carrying out the responses (in the form of text, voice, images, video, programs, etc.) to the expressions of the client 8.

As shown in fig. 11, the dialogue gateway 11 includes an expression receiver 111, an identity authenticator 112, a response database 113 and a response generator 114.

The expression receiver 111 receives the expression from the client 8 through the calling device 2. The expression may be any of the irregular natural expressions and regularized non-natural expressions described previously.

Optionally, the identity authenticator 112 is provided upstream of the expression receiver 111. The identity authenticator 112 may identify and verify the identity of the client 8 in the initial phase of the conversation. Traditional password input (such as entering a password with telephone keys, or entering a website login password with a keyboard) may be used; "pass-phrase + voiceprint recognition" may also be used; or the two techniques may be used in combination.

Providing the identity authenticator 112 and adopting the "pass-phrase + voiceprint recognition" method of client identification and verification can improve the client experience, since the client no longer needs to memorize several different passwords, and reduces the security risk of password theft inherent in traditional password input. In addition, combining the "pass-phrase + voiceprint recognition" method with the traditional "password input" method can ease market acceptance while improving the security of client identification and verification.

Response database 113 stores response data for responding to the customer. Similar to that illustrated in the table above, this data may include the following types:

Text: pre-written text, for example the textual answers in an online banking FAQ (frequently asked questions).

Voice: a prerecorded human recording, or a TTS synthesized recording without variables, such as: "Hello! This is Future Bank. How may I help you?".

Image: pre-made images, such as a Beijing subway network map. Non-video animations are also included, such as a GIF file or FLASH file with which a bank shows a customer how to perform an international remittance in the online banking system.

Video: pre-made video, for example an electric iron supplier demonstrating to customers how to use its new product.

Program: a pre-programmed series of instructions. For example, when the client says "I want to watch Chinese Partners", the cloud-based smart TV responds by operating as requested: it automatically turns on the TV, automatically downloads and caches the movie "Chinese Partners" from the cloud server, and finally starts playback.

Template: text, voice, image and program templates into which variables can be filled.

The response generator 114 receives instructions from the central controller 12 and generates responses to the expressions of the client 8 by calling and/or executing data in the response database 113. Specifically, according to the standard response ID in the instruction, it may query and call response data from the response database 113 and then display the text or image, play the voice or video, or execute the program. It may also call a template from the response database 113 according to the instruction and fill in the variable parameters transmitted in the instruction, and then play TTS speech synthesized in real time (for example, "Your credit card repayment of 5000 yuan was successful", where "5000 yuan" is the variable in the instruction), display a text, display a picture or animation generated in real time, or execute a program.
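
The lookup by standard response ID and the filling of template variables can be sketched as follows. The response IDs "R001" and "R002", the structure of `RESPONSE_DB`, the instruction format and the function `generate_response` are assumptions made for illustration; only text responses are shown, using Python's standard `string.Template`.

```python
from string import Template

# Hypothetical response database keyed by standard response ID.
RESPONSE_DB = {
    "R001": {"type": "text", "data": "Hello! This is Future Bank. How may I help you?"},
    "R002": {
        "type": "template",
        "data": Template("Your credit card repayment of $amount was successful."),
    },
}


def generate_response(instruction: dict) -> str:
    """Build the response text for an instruction from the central controller."""
    entry = RESPONSE_DB[instruction["standard_response_id"]]
    if entry["type"] == "template":
        # Fill in the variable parameters carried by the instruction.
        return entry["data"].substitute(instruction.get("variables", {}))
    return entry["data"]
```

An instruction such as `{"standard_response_id": "R002", "variables": {"amount": "5000 yuan"}}` produces the sentence from the example above, which could then be passed to TTS or displayed.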

Optionally, central controller 12 may maintain and update data in response database 113, including response data, standard response IDs, and the like.

The central controller 12 receives the customer demand expression information (including irregular natural expressions and regularized non-natural expressions) from the expression receiver 111 and, via the MAU workstation 13, works together with the robot 14 and the MAU human agent 9 to convert the customer's irregular natural expression information into a standard expression by the aforementioned method, or to extract the required part, convert it into the required standard expression and fill the corresponding slot. It then determines the corresponding standard response ID according to the standard expression conversion result or the intention acquisition result, and transmits the standard response ID to the response generator 114. Optionally, the central controller 12 may update the data in the MT training data table.

The robot 14 is an application robot that implements the machine intelligence techniques described above. The robot 14 may convert natural expressions (A language information), such as text information, voice information, image information and video information, into a standard expression (Y language information), and/or perform the aforementioned intent acquisition and slot filling operations. As described above, when the understanding ability or accurate information extraction ability of the robot 14 reaches a certain level, for example when its understanding ability or accurate information extraction ability for a certain category is judged to be mature, it can perform the A → X → Y conversion or the slot filling operation independently, without the assistance of a human agent. The MT training data table may be provided within the robot 14 or may be an external database; the requirement codes of the standard expression data or the slot filling result data (right language) stored in it may be associated with standard response IDs. This database may be updated by the central controller 12. The databases for text translation, voice recognition, image recognition, video processing and the like may likewise be external databases or may be provided in the robot 14.

The MAU workstation 13 is the interface between the intelligent answering device 1 and the MAU human agent 9. The MAU workstation 13 presents the recognized natural expression or the customer's original expression to the MAU human agent 9. The MAU human agent 9 inputs or selects a standard expression, or inputs or selects slot filling content, via the MAU workstation 13, which then sends the standard expression or slot filling content to the central controller 12. Optionally, if human assistance is required to determine the response, the MAU human agent 9 inputs or selects a response (or a standard response ID) via the MAU workstation 13.

Optionally, a trainer 15 may also be included in the intelligent answering device 1. The trainer 15 is used to train the robot 14's ability to convert natural expressions into standard expressions and/or its ability to obtain intents from natural expressions. For example, the trainer 15 trains the robot 14 using the judgment results of the MAU human agent 9, and continuously raises the robot 14's understanding accuracy or intention acquisition accuracy in each category (for example, the aforementioned business category, secondary business category, and the like). For each category, while the robot understanding accuracy has not reached the "robot understanding maturity threshold", the trainer 15 compares the standard expression conversion result of the MAU human agent 9 with that of the robot 14; if the results are the same, the "number of accurate robot judgments" and the "number of robot judgments" in the category are each increased by one. Similarly, while the robot intention acquisition accuracy has not reached the "robot intention acquisition maturity threshold", the trainer 15 compares the intention acquisition result of the MAU human agent 9 with that of the robot 14; if the results are the same, the "number of accurate robot intention acquisitions" and the "number of robot intention acquisitions" in the category are each increased by one. Otherwise, the manual conversion result or intention acquisition result (which may also be represented by the manual slot filling result) is added to the MT training data table as new robot training data. The trainer 15 may also instruct the robot 14 to perform the aforementioned "self-learning".
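
The per-category bookkeeping of the trainer 15 can be sketched as below. The class name `Trainer` and its method names are hypothetical. One assumption goes beyond the text: on a mismatch the total count is also increased, so that the accuracy equals the number of identical results divided by the total number of comparisons.

```python
from collections import defaultdict


class Trainer:
    """Tracks, per category, how often the robot agreed with the human agent."""

    def __init__(self):
        self.correct = defaultdict(int)  # accurate robot results per category
        self.total = defaultdict(int)    # compared robot results per category
        self.mt_training_table = []      # (left language, right language) pairs

    def record(self, category, x_info, robot_result, human_result):
        self.total[category] += 1  # assumption: counted on match and mismatch
        if robot_result == human_result:
            self.correct[category] += 1
        else:
            # The human result becomes new training data for the robot.
            self.mt_training_table.append((x_info, human_result))

    def accuracy(self, category):
        count = self.total[category]
        return self.correct[category] / count if count else 0.0
```

The value returned by `accuracy` for a category is what would be compared against that category's maturity threshold.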

In addition, trainer 15 may also be used to train robot 14 for machine intelligence techniques such as word translation, speech recognition, image recognition, video processing, and the like. The trainer 15 may also maintain and update MT training data tables, databases for word translation, speech recognition, image recognition, video processing.

Alternatively, the trainer 15 may also be integrated with the central controller 12.

Alternatively, the response generator 114 and response database 113 may be separate from the dialog gateway 11 or may be integrated in the central controller 12.

The intelligent answering device 1 can implement the natural expression processing and answering method described above. For example, the dialogue gateway 11 receives natural expression information of the client 8 from the calling device 2 through the expression receiver 111 and sends it to the central controller 12. The central controller 12 instructs the robot 14 to recognize the natural expression information as language information of a form that a computer can process (for example, X language information) together with related expression information, and then instructs the robot 14 to convert the language information and related expression information into a standard expression. If the understanding of the robot 14 is not mature enough, or no corpus match is found, so that the conversion into a standard expression cannot be completed, the central controller 12 instructs the MAU workstation 13 to prompt the MAU human agent 9 to perform the conversion manually. The MAU human agent 9 converts the language information and related expression information recognized by the robot 14 into a standard expression and inputs it through the MAU workstation 13, which sends it to the central controller 12; optionally, the MAU human agent 9 can directly convert unrecognized irregular natural expression information into a standard expression. The central controller 12 queries the expression-response database and retrieves the standard response ID matching the standard expression; if there is no match, it prompts the MAU human agent 9 through the MAU workstation 13 to select a standard response and input the corresponding standard response ID. Optionally, the MAU human agent 9 may also directly associate the standard expression with response data, or create new response data. The central controller 12 instructs the response generator 114 to call and/or run the data in the response database 113 to generate a response to the expression of the client 8. The dialogue gateway 11 then feeds the response back to the client 8 through the calling device 2. Optionally, the central controller 12 maintains and updates the MT training data table or the response database, and correspondingly the expression-response database, based on the standard expressions or standard responses determined or added by the MAU human agent 9.
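
The control flow of this paragraph, with its two human fallbacks, can be condensed into a sketch. All names here are hypothetical: `robot_convert` returns `None` when the robot cannot complete the conversion, `human_convert` and `human_pick_response` stand in for the MAU human agent working at the MAU workstation, and the expression-response database is modeled as a plain dictionary.

```python
def answer(expression, robot_convert, human_convert,
           expression_response_db, human_pick_response):
    """Return (standard expression, standard response ID) for one expression."""
    standard = robot_convert(expression)
    if standard is None:
        # Robot not mature enough or no corpus match: manual conversion.
        standard = human_convert(expression)
    response_id = expression_response_db.get(standard)
    if response_id is None:
        # No matching response: the human agent selects one, and the
        # expression-response database is updated accordingly.
        response_id = human_pick_response(standard)
        expression_response_db[standard] = response_id
    return standard, response_id
```

The returned standard response ID is what the central controller would pass on to the response generator.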

The smart answering device 1 can also implement the intent retrieval and slot filling method described earlier. For example, the dialogue gateway 11 receives natural expression information of the client 8 from the calling device 2 through the expression receiver 111 and sends it to the central controller 12. The central controller 12 instructs the robot 11 to recognize the natural expression information as language information of a certain form that can be processed by the computer (for example, X language information), and then instructs the robot 11 to extract the part corresponding to a desired standard expression from the language information, convert it into the standard expression, and fill in a slot. If the accurate information extraction capability of the robot 11 is not mature enough, or no corpus match is found, so that slot filling cannot be completed, the central controller 12 instructs the MAU workstation 13 to prompt the MAU human agent 9 to fill the slot manually. The MAU human agent 9 understands the natural expression directly, performs the slot filling operation according to the understanding result or the standard expression obtained by understanding, and inputs the result through the MAU workstation 13, which sends it to the central controller 12. The central controller 12 queries the expression-response database and retrieves the standard response ID matching the standard expression that corresponds to the slot filling result; if there is no matching result, it prompts the MAU human agent 9 through the MAU workstation 13 to select a standard response and input the corresponding standard response ID. Optionally, the MAU human agent 9 may also directly associate the standard expression (including the slot filling result) with response data, or establish new response data. The central controller 12 instructs the response generator 114 to invoke and/or run the data in the response database 113 to generate a response expressed to the client 8; the dialogue gateway 11 then feeds back the response to the client 8 through the calling device 2. Optionally, the central controller 12 maintains and updates the MT training data table or the response database, respectively, based on the standard expressions (including the slot filling results) or standard responses determined or added by the MAU human agent 9, and maintains and updates the expression-response database accordingly.
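
The division of work described above (robot first, human agent as fallback, then a lookup of the standard response ID) can be sketched as follows. This is only an illustrative sketch: the function names `handle_expression`, `robot_fill`, `ask_human` and `make_key`, and the use of in-memory dictionaries in place of the expression-response database and the MT training data table, are assumptions made for the example and are not taken from the described system.

```python
from typing import Callable, Dict, Optional, Tuple

SlotResult = Dict[str, str]
# Hypothetical key of the expression-response database:
# (FAQ id, sorted slot items) -> standard response ID.
ResponseKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def make_key(faq_id: str, slots: SlotResult) -> ResponseKey:
    """Build a hashable lookup key from an FAQ id and a slot filling result."""
    return faq_id, tuple(sorted(slots.items()))


def handle_expression(
    expression: str,
    robot_fill: Callable[[str], Optional[Tuple[str, SlotResult]]],
    ask_human: Callable[[str], Tuple[str, SlotResult]],
    response_db: Dict[ResponseKey, str],
    training_table: list,
) -> Optional[str]:
    """Return a standard response ID, falling back to the human agent."""
    result = robot_fill(expression)        # robot attempt
    if result is None:                     # slot filling could not be completed
        result = ask_human(expression)     # human agent fills the slots
        training_table.append((expression, result))  # new paired data
    faq_id, slots = result
    # None means no matching response: an agent would have to choose one.
    return response_db.get(make_key(faq_id, slots))


# Usage with stand-in callables (the robot here always fails).
db = {make_key("Flight", {"FROM": "HKG", "TO": "PEK"}): "R1"}
table: list = []
response_id = handle_expression(
    "a ticket from hong kong to beijing",
    lambda e: None,
    lambda e: ("Flight", {"FROM": "HKG", "TO": "PEK"}),
    db,
    table,
)
```

In this sketch every manual correction is appended to `training_table`, mirroring the way the manually produced results are described as being added to the training data.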

Fig. 12A-12P schematically illustrate an operational interface of an intent retrieval and slot filling system, according to an embodiment of the present invention.

Fig. 12A shows an interface for setting the "FAQ". An "FAQ" is a common question in human-computer interaction (also referred to as a "query item"), such as "Change Password", "Check credit balance", and "Customer Service" (after-sales service) in the figure. An "Id" is a unique identification assigned to an FAQ so that the FAQ can be conveniently queried, entered, or selected. The interface of fig. 12A may be used to display and set the FAQs; for example, when a new FAQ needs to be added, such as one for booking an air ticket, the content description "Flight" and the "Id" of the FAQ may be input manually. The FAQ data table in the database can also be added to and modified by batch upload. The interface of FIG. 12A may also be used to set higher-level application scenarios, by which FAQs can be categorized. For example, "Check credit" is classified into a credit card service scenario, and "Flight" is classified into a travel service scenario. As described above, the intent acquisition and slot filling operations for multiple FAQs in one application scenario, or for multiple FAQs in multiple application scenarios, may be implemented through multiple rounds of human-machine interaction; intent acquisition and slot filling across FAQs, and even across application scenarios, may also be implemented simultaneously through a single intent acquisition and slot filling operation.
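
As a rough illustration of the FAQ table and its grouping by application scenario, the sketch below models each FAQ as a small record. The class `Faq`, its field names and the function `group_by_scenario` are hypothetical; only the two example rows follow the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Faq:
    faq_id: str        # the "Id" column, e.g. "Flight"
    description: str   # content description shown in the interface
    scenario: str      # application scenario (hypothetical field name)


# Illustrative rows only; scenario names follow the examples in the text.
FAQS = [
    Faq("Check credit", "Check credit balance", "credit card service"),
    Faq("Flight", "Flight", "travel service"),
]


def group_by_scenario(faqs: List[Faq]) -> Dict[str, List[str]]:
    """Group FAQ ids by application scenario."""
    grouped = defaultdict(list)
    for faq in faqs:
        grouped[faq.scenario].append(faq.faq_id)
    return dict(grouped)
```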

The dialog and response display interface of FIG. 12B shows the initial data state for the FAQs. In the "Chat" section on the left side of FIG. 12B, "Customer support" represents the utterance of the robot and "test" represents the expression of the user; the "Engine Response" section on the right side of FIG. 12B shows the FAQID and FAQ as understood by the system (the engine). A user expression may be entered in the input box (marked "Type a message here") or by batch upload. If no user expression data has previously been stored in the database, the original user expression data is constructed in this manner. In the initial state of training shown in FIG. 12B, since no training has taken place, for each user expression in the "Chat" section, for example "there is no ticket to Beijing from Shanghai", "Check Credit Balance", and "Change password", the robot answers "Sorry, I don't know", and the results displayed in the "Engine Response" section on the right side are all the default "ID: Change password" "Change password". The "AUTO" state shown in the drop-down menu at the bottom of the interface shown in FIG. 12B indicates that the system is currently processing automatically, without manual assistance. In addition, the number "80" in the input box after "Confidence" in the lower left corner of the interface shown in fig. 12B indicates that the currently set robot confidence threshold is 80%; other numbers may be set as the threshold. The higher the threshold, the higher the accuracy required for the robot to understand automatically, and the more human assistance is usually needed to help train the robot. The control "Send" at the lower right of fig. 12B is used to send the message entered in the input box on its left, and the control "New Session" is used to open a new session, clearing the data on the left side of the screen.

In the interface shown in FIG. 12C, a user expression ("Question") is assigned a corresponding FAQID/FAQ with human assistance. As shown in fig. 12C, since the robot has not been trained, or its understanding maturity for the FAQ/FAQID does not reach the threshold, the correct FAQID cannot be obtained by understanding, and the FAQID ("Faqid" in the interface) therefore takes the default value "-1". In this case, the FAQID corresponding to the "Question" ("Expected Faqid" in the interface) can be filled in with manual assistance. For example, in fig. 12C, the "Expected Faqid" corresponding to the "Question" entries with "ID" 322-326 is in each case input or selected as "Flight", that is, the aforementioned ticket booking; the "Expected Faqid" corresponding to the "Question" entries with "ID" of 327-. Here "ID" is the identification of the entry and "QID" is the identification of the "Question". In this example of the interface, the user expression "Question" is the only entry type, so "QID" is set to "default" and "ID" serves as the identification; a "Question" may also be identified by "QID", which is a good index when "Question" entries are entered by bulk upload. "Timestamp" indicates the time at which each Question data entry was completed; with it, Question data can be retrieved so that data within a particular time window can be selected for operation. "CL%" is the robot's confidence in its understanding of the Question; since the robot has not yet been trained, the default "CL%" is "0.00". "ResponseId" is the identifier of the system response corresponding to the "Question"; different "Question" entries may correspond to different "ResponseId" values or to the same "ResponseId". The input box and "Go" in the upper left corner of the interface in fig. 12C are used to select the page of the data table below, and a jump can be made by inputting the page number. In the upper right portion of the interface in fig. 12C: the radio box "included training data" is used for selecting whether the search results include existing training data; the radio box "Mismatch FAQID" is used to select whether the search results include entries whose FAQID differs from the Expected Faqid, so that mismatched data can be checked before it is corrected manually; the control key "Reset" is used for resetting all retrieval conditions for "Question" at once; the control key "Search" is used for retrieving "Question" entries and their related data according to the set retrieval conditions; the control key "Train Engine" is used to start training of the engine (which may be considered as the aforementioned robot or a part of the robot). That is, after a corresponding FAQID (and hence FAQ) has been manually assigned to the user expressions ("Question"), the robot is trained by clicking the control key "Train Engine".
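
The columns of the data table and the "Mismatch FAQID" filter can be pictured with the following sketch. The class `QuestionRecord` and the helper functions are hypothetical and only mirror the column names mentioned in the text; the sample questions and IDs are invented for illustration and are not the entries of fig. 12C.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class QuestionRecord:
    id: int                                # "ID": identification of the entry
    question: str                          # "Question": the user expression
    faqid: str = "-1"                      # "Faqid": "-1" when not understood
    expected_faqid: Optional[str] = None   # "Expected Faqid": set manually
    cl_percent: float = 0.0                # "CL%": 0.00 before any training
    qid: str = "default"                   # "QID"


def mismatched(records: List[QuestionRecord]) -> List[QuestionRecord]:
    """Records whose engine FAQID differs from the manually assigned one."""
    return [r for r in records
            if r.expected_faqid is not None and r.faqid != r.expected_faqid]


def training_pairs(records: List[QuestionRecord]) -> List[Tuple[str, str]]:
    """(Question, Expected Faqid) pairs that could be handed to training."""
    return [(r.question, r.expected_faqid)
            for r in records if r.expected_faqid is not None]


# Illustrative records: the first has been labelled by an agent.
records = [
    QuestionRecord(1, "a ticket from Shanghai to Beijing", expected_faqid="Flight"),
    QuestionRecord(2, "hello"),
]
```

Only labelled records contribute training pairs here; an unlabelled record such as the second one is simply skipped until an agent assigns an "Expected Faqid".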

The interface shown in FIG. 12D is used to generate slots for the FAQ that corresponds to the intent. On the left side of the interface of FIG. 12D is the system menu bar. Under the "FAQ" item: "Tree Editor" is used for editing human-computer interactive dialogs, i.e., the dialog logic that responds based on the understanding of a user expression; "Import/Export" is used to import or bulk upload FAQ data, or to export FAQ data. The "Chat" item is used for operations such as displaying, selecting, and editing human-computer interaction dialogs. Under the "Response" item: "Report" is used for generating reports on engine responses; "Import/Export" is used to import or export engine responses. Under the "Slot Filling" item: "Report" is used for generating reports on slot filling; "Slots editor" is used to create, modify, and delete slots. The "User" item is used to edit user data. The "Engine Config" item is used to configure the engine.

When a slot is generated for an FAQ, first click "Slots editor", then select the FAQ for which a slot is to be generated (added), e.g., "Flight", in the "FAQ" drop-down menu (labeled "Please Select"), and then click the "Add" control on the right side of the drop-down menu. The information of the slot to be generated (added) is then filled in a pop-up window (as shown in fig. 12E-1 and 12E-2).

As shown in FIGS. 12E-1 and 12E-2, the selected FAQ is "Flight", and a slot identification (Slot ID) is filled in the "ID" input box, e.g., "FROM" denoting the origin, "TO" denoting the destination, and so on. The "Sort" column is used to input a shortcut key (i.e., hot key) corresponding to the slot, for quick input when a silence agent manually assigns a value to the slot; for example, "FROM" corresponds to "1" and "TO" corresponds to "2", so that the slot can be quickly specified by inputting its "Sort" value or code when a slot value is later input or queried with manual assistance. The "Description" input box is used to input a description of the content filled in the slot, for example "FROM" for the origin and "TO" for the destination. "Valid Values" are the values that can validly fill the slot. A slot valid value can be regarded as a standard expression converted and extracted from the user's natural expression. For example, as shown in FIGS. 12E-1 and 12E-2, "PEK", "PVG", "HKG", etc., entered in the edit box labeled "Valid Values" are all unique airport codes. Because the origin and destination of a flight vary with the journey while airport codes generally do not change, the same slot valid value may be used to fill different slots. A given slot valid value may also be used in different FAQs or even in different application scenarios, with the same meaning in each. For example, the slot valid value "PEK", corresponding to Beijing Capital International Airport, may also be used in a "dining" or "shopping" application scenario, and may also be used in another FAQ, "Pick Up" (airport pick-up), in the "travel" application scenario. For application scenario products that employ separate databases and engines, the same slot valid value may also be used to represent different meanings. After each input box is filled, clicking the "Add" button adds the content of the slot corresponding to the slot identifier to the background database, thereby completing the addition of the slot.
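
A slot definition with its "ID", "Sort" shortcut, "Description" and "Valid Values" might be modelled as below. This is a sketch under assumptions: the class `Slot`, the dictionary `FLIGHT_SLOTS` and the helper functions are hypothetical names chosen for the example, and only the slot names, shortcut keys and airport codes come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Slot:
    slot_id: str                  # "ID", e.g. "FROM" or "TO"
    sort: str                     # "Sort": shortcut key used by the agent
    description: str              # "Description"
    valid_values: Set[str] = field(default_factory=set)  # "Valid Values"


AIRPORTS = {"PEK", "PVG", "HKG"}  # the codes named in the text

# Slots of the FAQ "Flight"; both slots accept the same valid values.
FLIGHT_SLOTS: Dict[str, Slot] = {
    "FROM": Slot("FROM", "1", "origin", set(AIRPORTS)),
    "TO": Slot("TO", "2", "destination", set(AIRPORTS)),
}


def slot_by_hotkey(slots: Dict[str, Slot], key: str) -> Slot:
    """Resolve a slot from its "Sort" shortcut key."""
    for slot in slots.values():
        if slot.sort == key:
            return slot
    raise KeyError(f"no slot with shortcut {key!r}")


def fill(slot: Slot, value: str) -> Dict[str, str]:
    """Return a one-entry slot filling result, rejecting invalid values."""
    if value not in slot.valid_values:
        raise ValueError(f"{value!r} is not a valid value for {slot.slot_id}")
    return {slot.slot_id: value}
```

Because "FROM" and "TO" share one set of valid values, the same code such as "PVG" can fill either slot, which is the reuse the paragraph describes.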

Each filled slot valid value corresponds to a standard expression; each standard expression can correspond to multiple pieces of X language information, and the A language information (natural expressions) from which that X language information is converted is varied. For example, in the example shown in fig. 12E-1 and 12E-2, each of the following pieces of A language information may correspond to PVG: SH, Shanghai Pu Dong, Shanghai, ShanghaiPudong, Pudong, Pu Dong, Shanghai purdong international airport, Shanghai purdong. That is, when any one of these expressions appears in a user's natural expression, it may be considered to correspond to the slot valid value PVG and be converted to PVG to fill the corresponding slot. On the other hand, when the accurate information extraction capability of the robot is trained with manual assistance, the silence agent can understand from the natural expression that the place of departure is Shanghai Pudong airport, and then input PVG into the slot corresponding to the place of departure. The matching data of the correct X language information and the slot filling result is stored in the database (i.e., the MT training data table) for the robot to learn.

The robot learns from the matching data of correct X language information and slot filling results to improve its understanding accuracy and confidence; training of the robot can therefore be accelerated by importing training data from outside.

In addition, training can be performed with local paired data. As shown in fig. 12F, clicking the control key opens the pop-up window shown in FIG. 12G; clicking the control button "Choose File" then allows a slot data file to be uploaded for the slot "FROM". The slot data file includes, for example, data such as: PVG, SH; PVG, Shanghai Pu Dong; PVG, Shanghai; PVG, Shanghai Pudong; PVG, Pudong; PVG, Pu Dong; PVG, shanghai purdong international airport; PVG, the purdong international airport; PVG, shanghai; PVG, Pudong, Shanghai; HKG, Hong Kong International Airport; HKG, Hong Kong Airport; HKG, Hong Kong; HKG, HK; HKG, Hongkong; HKG, hong hao international airport in hong kong; HKG, hong kong international airport; HKG, hong kong airport; HKG, hong kong; PEK, BJ; PEK, Beijing; PEK, Beijing Capital International Airport; PEK, Beijing Shou Du Ji Chang; PEK, Beijing Shou Du Guo Jiji Chang; PEK, beijing capital airport; PEK, capital airport; PEK, capital international airport; PEK, beijing capital international airport; PEK, beijing, etc. These slot data include multiple expressions corresponding to PVG, HKG, and PEK, respectively; after training with them, the robot can more accurately or confidently identify the parts corresponding to PVG, HKG and PEK in the X language information obtained by converting natural expressions.
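
To make the idea of local paired data concrete, the sketch below parses a slot data file into (slot valid value, expression) pairs and uses them for a naive lookup. Two assumptions are made and should be read as such: the file is taken to hold one "VALUE, expression" pair per line (the actual file layout is not specified in the text), and plain substring matching is swapped in for the robot's learned recognition over X language information, which this example does not attempt to reproduce. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_slot_data(text: str) -> List[Tuple[str, str]]:
    """Parse "VALID_VALUE, expression" lines into local paired data."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first comma only, so expressions may contain commas.
        value, _, expression = line.partition(",")
        pairs.append((value.strip(), expression.strip()))
    return pairs


def build_lookup(pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each normalised expression to its slot valid value."""
    return {expr.lower().replace(" ", ""): value for value, expr in pairs}


def find_valid_values(expression: str, lookup: Dict[str, str]) -> List[str]:
    """Valid values whose aliases occur in the expression, longest alias first."""
    text = expression.lower().replace(" ", "")
    found: List[str] = []
    for alias in sorted(lookup, key=len, reverse=True):
        if alias in text and lookup[alias] not in found:
            found.append(lookup[alias])
    return found


sample = "PVG, SH\nPVG, Shanghai Pu Dong\n\nHKG, Hong Kong\nPEK, Beijing\n"
lookup = build_lookup(parse_slot_data(sample))
```

Note that this lookup finds which valid values are mentioned but not whether each one is the origin or the destination, and that very short aliases such as "SH" can match inside unrelated words when plain substring matching is used.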

For slot filling, the overall understanding of the expression must be considered. For example, even if PVG is understood, it must be known from the overall expression whether it should be filled in as the departure place "FROM" or as the destination "TO". This requires converting the natural expression as a whole, covering the departure and destination information associated with PVG as well as the information corresponding to PVG itself, into X language information, and training with paired data formed by that X language information and the filling result "FROM", "PVG". Training only on the slot data corresponding to the slot valid values cannot by itself fully realize the robot's ability to fill slots automatically, because such data does not include the slot information (such as "FROM" or "TO") associated with the slot valid value; paired data formed by a slot valid value and its corresponding slot data can therefore be called local paired data. Nevertheless, this training can effectively improve the understanding accuracy and confidence of the robot, and thus its slot filling ability. The training can also be performed in advance of the human-assisted training, which speeds up convergence of the iterative operation and reduces the workload of human-assisted training. Training based on local paired data can thus be seen as pre-training performed entirely by the robot itself. According to the self-learning principle, the data actually used in the training is still paired data formed by the slot valid value and the X language information obtained by converting the slot data corresponding to that slot valid value.

In fig. 12G, when the above-mentioned slot data file is uploaded, the control key "Update" is clicked to Update the slot data file. The control key "Train Slot Values" can then be clicked, and as shown in fig. 12H, the robot performs training using the existing paired data of the Slot data and the Slot valid value.

The manner of training with local paired data shown in fig. 12F to 12H can also be used in the aforementioned standard expression understanding conversion as an alternative means of training.

The preceding steps prepare for the manual assisted slot filling process that follows: the FAQ "Flight" and its Id are added in the step shown in fig. 12A; an "Expected Faqid" is assigned to each input expression and the engine is trained in the step shown in fig. 12C; slots and their valid values are input and added in the steps shown in fig. 12D, 12E-1, and 12E-2; and, in the optional steps shown in fig. 12F, 12G, and 12H, slot data is uploaded and the slot values are trained (i.e., the engine is trained with paired data of slot data and slot valid values).

FIG. 12I illustrates the main interface for human-assisted slot filling. A number of input boxes for data filtering are arranged in the upper part of the interface. For example, "Update Date From … To …" filters by update date, "Create Date From … To …" filters by creation date, and "Confidence Min: … Max …" filters by confidence level; "QID", "Question", "Faqid", "Expected Faqid", and "ResponseId" have the same meanings as described above and can also be used as data filtering conditions. The "Search" key performs a search based on the set search conditions, and the "Reset" key resets all search conditions at once. The radio box "included training data" selects whether the search results include existing training data; the radio box "Mismatch FAQID" selects whether the search results include entries with a mismatched FAQID.
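
The filter boxes can be thought of as optional conditions that are combined with a logical AND. The following sketch illustrates this with a hypothetical `Entry` record and `search` function; the field names, the default ranges and the sample dates are assumptions for the example only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class Entry:
    question: str
    confidence: float                     # "CL%"
    created: date                         # creation date
    expected_faqid: Optional[str] = None  # "Expected Faqid"


def search(entries: Iterable[Entry], *, conf_min: float = 0.0,
           conf_max: float = 100.0, created_from: Optional[date] = None,
           created_to: Optional[date] = None,
           expected_faqid: Optional[str] = None) -> List[Entry]:
    """Apply the filter conditions; unset conditions match everything."""
    hits = []
    for e in entries:
        if not conf_min <= e.confidence <= conf_max:
            continue
        if created_from is not None and e.created < created_from:
            continue
        if created_to is not None and e.created > created_to:
            continue
        if expected_faqid is not None and e.expected_faqid != expected_faqid:
            continue
        hits.append(e)
    return hits


# Illustrative data: find low-confidence entries that may need manual filling.
entries = [
    Entry("q1", 69.12, date(2019, 1, 10), "Flight"),
    Entry("q2", 95.0, date(2019, 2, 1), "Flight"),
    Entry("q3", 10.0, date(2019, 1, 15)),
]
low_confidence = search(entries, conf_max=80)
```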

The lower table portion of FIG. 12I is similar to FIG. 12C, but with a control key shown on the far right side for designating slots. When slots have been added for an "Expected Faqid" by the aforementioned operation, this control key is displayed at the end of each user expression record having that "Expected Faqid".

Clicking the control key displays entries such as "ResponseId", "Faq", and "Question" in a pop-up window. The manual operation is performed by using an input tool such as a mouse or keyboard to select the part of the expression that is to fill a slot, for example the blue highlighted part "hong kong" in FIG. 12J-1, and then clicking the control key of a slot, or inputting the shortcut key ("1" or "2") corresponding to a slot, to select the slot to be filled with that part, for example slot 1 "FROM". "hong kong" is then displayed in the text box in the middle of the row corresponding to "FROM", and the silence agent uses the pull-down menu on the right side of the same row to select the corresponding slot valid value; for the selected expression "hong kong", the slot valid value is "HKG". Similarly, as shown in FIG. 12J-2, slot 2 "TO" is selected with its control key, "shanghai" is selected in the "Question" item, and the corresponding slot valid value "PVG" is selected through the pull-down menu. Selecting the radio box "Template" as shown in FIG. 12K causes the user expression "I want to buy a ticket from Hong Kong to Shanghai" to be used as a template for training all slots. After the operations shown in FIGS. 12J-1, 12J-2, and 12K are completed, clicking the "Update" button uploads the data and closes the window. Thereafter, the other expressions in FIG. 12I that are provided with the control key may be processed similarly.
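
The result of one manual slot filling operation can be represented as an annotated expression: which part of the text fills which slot, and with which slot valid value. The sketch below is illustrative only; the class `SlotAnnotation`, its fields and the sample sentence are hypothetical and are not taken from the described interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SlotAnnotation:
    question: str
    faq_id: str
    template: bool = False   # corresponds to ticking the "Template" box
    # Each span is (slot id, start offset, end offset, slot valid value).
    spans: List[Tuple[str, int, int, str]] = field(default_factory=list)

    def mark(self, slot_id: str, text: str, valid_value: str) -> None:
        """Record that `text` inside the question fills `slot_id`."""
        start = self.question.find(text)
        if start < 0:
            raise ValueError(f"{text!r} does not occur in the question")
        self.spans.append((slot_id, start, start + len(text), valid_value))

    def slot_result(self) -> Dict[str, str]:
        """The slot filling result, e.g. {"FROM": "HKG", "TO": "PVG"}."""
        return {slot_id: value for slot_id, _, _, value in self.spans}


# Illustrative annotation of a ticket-booking expression.
ann = SlotAnnotation("I want to buy a ticket from Hong Kong to Shanghai",
                     "Flight", template=True)
ann.mark("FROM", "Hong Kong", "HKG")
ann.mark("TO", "Shanghai", "PVG")
paired = (ann.question, ann.slot_result())
```

The pair `(question, slot_result)` corresponds to the paired data of a complete expression and its slot filling result; in the described method the expression would additionally be converted into X language information before training.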

Then, as shown in FIG. 12L, the engine is trained by clicking the control key "Train Engine". The training at this point is no longer performed with local paired data, but with paired data consisting of the X language information converted from the complete user expression and the slot filling result, so that the accurate information extraction capability for the FAQ to which the paired data points is obtained or improved. In the example shown in fig. 12L, the training targets the accurate information extraction capability for the FAQ "Flight" (ticket booking).

FIG. 12M illustrates an example of accessing the engine to verify the training effect. As shown in the "Chat" dialog, when the input expression is "i want to buy a ticket from hong kong to beijing", the robot correctly recognizes that the FAQ is "Flight"; as can be seen in the "Engine Response" section on the right, the robot correctly records the FAQ ID as "Flight" and automatically obtains the correct slot filling contents "From", "HKG", "To", "PEK". When the input expression is "I want to buy a ticket from Beijing to shanghai", the robot likewise correctly recognizes that the FAQ is "Flight"; as can be seen in the "Engine Response" section on the right, the robot correctly records the FAQ ID as "Flight" and automatically obtains the correct slot filling contents "From", "PEK", "To", "PVG".

FIG. 12N illustrates another example of accessing the engine to verify the training effect. In this example, the engine response has a missing slot value or an incorrect slot filling value. As shown in fig. 12N, when the input expression is "i want to buy a ticket to beijing, fly from hong kong", the robot correctly recognizes that the FAQ is "Flight"; as can be seen in the "Engine Response" section on the right, the robot correctly records the FAQ ID as "Flight" and automatically obtains the correct slot filling contents "From", "HKG", but lacks the slot filling contents "To", "PEK". In this case, the interface shown in FIG. 12I can be entered by clicking "Report" under the "Response" item in the navigation bar on the left side of the interface shown in FIG. 12D, the slots can be filled manually for the expression, and the engine can be trained as described above. The pairing of an expression with the slot filling result obtained by manual correction after such a mistake is valuable training data, so it is preferable to select "Template" so that this data serves as template data for training all slots in the future. In this example a text expression is processed; since the information granularity of text is coarser than that of X language information and its information amount is smaller, the text expression and the slot filling result may be stored as the paired data. That is, paired data for training may be formed from the slot filling result and the X language information obtained by converting the text expression, or paired data for storage may be formed from the text expression and the slot filling result, with the text expression converted into X language information at training time.

There are other situations in which the correct slot filling content is not obtained, as shown in fig. 12O. In fig. 12O, for the expression "I want to go to Shanghai, departing from Beijing" shown on the left side, it can be seen in the "Engine Response" section on the right side that the robot correctly records the FAQ ID as "Flight" but cannot automatically obtain the correct slot filling content. This occurs because the robot's confidence is below the set threshold: the confidence threshold "Confidence" shown in fig. 12O is 80, while the robot's current confidence "CL%" for this expression, shown in fig. 12P, is 69.12. In this case, the robot may be retrained by clicking "Report" under the "Response" item in the navigation bar on the left side of the interface shown in FIG. 12D to enter the window described previously; alternatively, the confidence threshold may be lowered, allowing the robot to automatically fill the corresponding slots with the content it has understood at the current confidence level of 69.12.

Fig. 13 schematically shows a process of natural expression processing in combination of robot understanding and human-aided understanding (MAU) according to an embodiment of the present invention. As shown in fig. 13, four layers of processing are included from top to bottom.

The first layer of processing is done automatically by the robot. As described above, a robot understanding maturity threshold, which may be a robot understanding accuracy threshold or a robot confidence threshold, may be set as the condition for the robot to perform processing automatically. For example, if the robot understanding maturity threshold is set to 90, a natural expression for which the robot understanding accuracy or robot confidence is below 90 will not be processed automatically by the robot, but will be transferred to the silence agent for normalization processing, as shown in fig. 13.

The second layer of processing is done by the silence agent. As described above, a silence agent is an agent who provides a standardized understanding result using a customer service person's ability to understand natural expressions, thereby assisting the robot in responding and forming paired data for training the robot. Specifically, when the robot understanding maturity is below the threshold, the robot passes the natural expression to be understood to the silence agent. After being prompted by the system, the silence agent receives the natural expression through his or her own senses, by reading, listening, and so on, understands it based on his or her own comprehension, and outputs the understanding result as a standard expression; the robot then responds automatically according to the understanding result. The comprehension required of a silence agent is that of an ordinary customer service person; because silence agents do not respond directly to customer expressions, there are no requirements on their voice, accent, response proficiency, and the like, which lowers the professional requirements on customer service personnel and is beneficial to employment.

From the perspective of intelligent customer service, on the one hand, the robot automatically receives expressions and responds, while the silence agent is only responsible for understanding and does not need to respond, which greatly saves human resources; in this mode a silence agent can also perform understanding operations on several conversations simultaneously, further improving working efficiency. On the other hand, the understanding result that the silence agent outputs for a natural expression is a standard expression, so the natural expression and the corresponding standard expression form paired data. This paired data is added to the MT training data table and can be used to train the robot and improve its understanding capability. As the robot's understanding capability improves, and with the understanding maturity threshold unchanged, a smaller and smaller proportion of customer expressions is transferred to silence agents, so the number of human agents can be further reduced and labor costs lowered, realizing closed-loop positive feedback in the system.

Because the robot responds automatically according to the understanding result of the silence agent, the response is guaranteed not to be affected by factors such as the emotion, voice, accent, and business proficiency of customer service personnel. For a specific category (or a specific vertical application), if the number of standard responses is not too large, responses may be made with pre-recorded voice, video, etc., which may bring a better user experience than speech synthesized by TTS technology or synthesized animation.

Regarding the operation mode and operation interface of the silence agent, reference may also be made to the aforementioned fig. 12A to 12P and the corresponding description. In the normal work of a silence agent on human-assisted intent acquisition and slot filling, only slot filling operations like those of fig. 12J-1 and 12J-2 need be performed, without FAQ setting, slot setting, robot pre-training, and so on.

Fig. 14 schematically shows an example of an operator interface presented by the MAU workstation to the MAU human agent 9, here the MAU human agent 9 is the silence agent. As shown in fig. 14, the operation interface of the MAU workstation 13 includes: a customer expression display area 131, a dialogue state display area 132, a navigation area 133, a category selection area 134, and a shortcut area 135.

The customer expression display area 131 displays a natural expression of the customer (i.e., user), for example, a form of text converted from characters, images, or voices, or displays the image itself as a natural expression, or may present a link or the like, and the MAU human agent 9 may select to click and listen to the voice expression.

The dialog state display area 132 displays dialog real-time state information between the client 8 and the MAU human agent 9 or the robot 14, such as: number of sessions to and from, total duration of sessions, customer information, etc. The display area may not be provided.

The navigation area 133 displays the category that the MAU human agent 9 has currently selected. The left end of the area shows the text version of the current category path (in the figure: bank → credit card), and the right end shows the corresponding category code (in the figure: "12", where "1" represents the category "bank" and "2" represents the category "credit card" at the next level under "bank").

The category selection area 134 allows the MAU human agent 9 to select the next-level category. As shown in the figure, the MAU human agent 9 has entered the category "credit card" under the "bank" category, and the "credit card" category has 7 sub-categories: "activate new card", "apply for new card and apply for progress inquiry", "repayment" … …. If the expression of the client 8 is "my credit card can overdraw too little.", the MAU human agent 9 selects "7" in the current category "bank → credit card", and the navigation area updates to display "bank → credit card → adjust credit line … … 127" and enters the next category. The MAU human agent 9 may also enter "127" directly on the keyboard after receiving and understanding the expression of the client 8, arriving at the target category "bank → credit card → adjust credit line". The client 8 therefore does not need to spend a long time traversing a complex function menu tree to find the self-service he or she needs, but only needs to state the requirement directly; the MAU human agent 9 can then quickly and conveniently help the client start the "credit card limit adjustment" process directly, so that the user experience becomes easy and convenient, and the utilization rate of the self-service flows of a traditional IVR system is greatly improved.
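
The way a typed code such as "127" resolves to a category path can be sketched with a small tree keyed by the digit entered at each level. The tree structure `TREE` and the function `resolve` are hypothetical; only the three categories named in the text are filled in, and the other sub-categories are deliberately omitted.

```python
from typing import Dict, List

# Hypothetical category tree keyed by the digit typed at each level.
# Only the entries named in the text are present.
TREE: Dict[str, dict] = {
    "1": {"name": "bank", "children": {
        "2": {"name": "credit card", "children": {
            "7": {"name": "adjust credit line", "children": {}},
        }},
    }},
}


def resolve(code: str, tree: Dict[str, dict] = TREE) -> List[str]:
    """Turn a category code such as "127" into its category path."""
    path, level = [], tree
    for digit in code:
        node = level[digit]      # KeyError if the code is not in the tree
        path.append(node["name"])
        level = node["children"]
    return path


print(" → ".join(resolve("127")))  # bank → credit card → adjust credit line
```

Typing the digits one at a time and typing the whole code at once follow the same path through the tree, which is why the agent can either step through the categories or enter "127" directly.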

The shortcut area 135 provides the MAU human agent 9 with common shortcuts, e.g., "-" returns to the previous-level category, "0" transfers to a human agent, and "+" returns to the top category (in this example, the root category "bank"). The shortcut area 135 may also provide other shortcuts to the MAU human agent 9. The shortcut area 135 may increase the processing speed of the MAU human agent 9. The shortcut area 135 is also an optional area.

Only one example of an operator interface for the MAU workstation 13 is given here, which may be used for the conversion process of the MAU human agent 9 for standard expressions. Manual processing of responses may also be performed through a similar operator interface.

The third layer of processing is performed by senior agents. When the silence agent encounters a non-standard situation, i.e., when he or she is unsure whether his or her understanding of the customer expression is correct, or finds that no standard expression is available in the system to correspond to the expression, or finds that no standard response in the system can accurately respond to the customer expression, the silence agent can forward the processing to a senior agent, who communicates directly with the customer by voice or text. That is, senior agents are generally responsible for handling non-standard situations (including newly emerging situations). Of course, the silence agent can also feed back to the client that the expression was not heard clearly or could not be understood, and ask the client to repeat it or express it in another way; if the silence agent still considers that he or she cannot process it, the conversation is transferred to a senior agent. A senior agent is similar to the agent supervisor in traditional customer service and resolves the problem.
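
The first three layers can be summarised as a routing decision. The sketch below is a simplification under stated assumptions: the names `Handler` and `route` are hypothetical, the comparison `confidence >= threshold` is an assumed reading of "below the threshold is not processed automatically", and the silence agent's judgement is reduced to a single flag.

```python
from enum import Enum
from typing import Optional


class Handler(Enum):
    ROBOT = "robot"
    SILENCE_AGENT = "silence agent"
    SENIOR_AGENT = "senior agent"


def route(confidence: float, threshold: float = 90.0,
          agent_resolved: Optional[bool] = None) -> Handler:
    """Pick the layer that handles one customer expression.

    agent_resolved is None while the silence agent has not been consulted,
    True if the agent produced a standard expression, and False if the
    agent could not (a non-standard situation).
    """
    if confidence >= threshold:
        return Handler.ROBOT           # first layer: automatic processing
    if agent_resolved is False:
        return Handler.SENIOR_AGENT    # third layer: escalation
    return Handler.SILENCE_AGENT       # second layer: human-aided understanding


# Illustrative calls using figures that appear in the description.
first = route(95.0)                          # above the threshold of 90
second = route(69.12, threshold=80.0)        # below 80: handed to the agent
third = route(50.0, agent_resolved=False)    # agent could not resolve it
```

With the threshold held fixed, a rising share of expressions takes the first branch as the robot's confidence improves through training, which is the closed-loop effect described for the system.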

Senior agents may also provide positive feedback to the system. Specifically, the senior agents compile the customer questions they encounter (concrete expressions) and the solutions (answers) into Q&A (questions and answers), which are provided to the knowledge base designers in the background. The knowledge base designers carry out background construction of dialogs, such as designing a tree-like dialog scheme for a specific category or for the sub-categories below it. As shown in FIG. 13, the knowledge base designer designs a new frequently asked question "FAQ-12" under the sub-category "Branch-11" of the business category "Branch-1" based on the Q&A provided by the senior agents. The FAQ may include a standard expression corresponding to the customer expression, a slot filling result, a standard response corresponding to the standard expression and the slot filling result, and the like.

The MAU human agents 9 may include the silent agents, the senior agents, and the knowledge base designers.

According to the artificial intelligence-based natural expression processing method, the artificial intelligence-based natural expression processing device and the human-computer interaction system, pairing data is formed from X language information (namely secondary language information) obtained by converting a natural expression and Y language information (namely a standard expression) corresponding to the meaning (intention) of the natural expression, and self-learning (training) is carried out through iterative comparison of permutations and combinations of elements. That is, the basis of machine self-learning (training) is the pairing data of a natural expression with the standard expression corresponding to its meaning. As described above, such pairing data can be obtained through human-assisted understanding by a silent agent or the like, and can also be obtained through verification by the user who input the natural expression. Such pairing data may also be obtained automatically by the machine.
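One possible reading of learning from pairing data by iterating over element combinations is sketched below. It is a simplification under stated assumptions, not the disclosed algorithm: "permutation and combination" is restricted here to contiguous runs of X-language elements, correspondences are plain co-occurrence counts, and the names `contiguous_combinations`, `learn_correspondences`, and `best_expression` are hypothetical.

```python
from collections import Counter, defaultdict


def contiguous_combinations(elements, max_len=3):
    """Yield every contiguous run of up to max_len elements as a tuple."""
    for start in range(len(elements)):
        stop = min(start + max_len, len(elements))
        for end in range(start + 1, stop + 1):
            yield tuple(elements[start:end])


def learn_correspondences(pairs, max_len=3):
    """Count how often each element combination co-occurs with each standard expression."""
    counts = defaultdict(Counter)
    for x_elements, y_expression in pairs:
        # set() so a combination repeated inside one sample is counted once.
        for combo in set(contiguous_combinations(x_elements, max_len)):
            counts[combo][y_expression] += 1
    return counts


def best_expression(counts, x_elements, max_len=3):
    """Vote over the combinations found in a new X sequence; None if nothing matches."""
    votes = Counter()
    for combo in set(contiguous_combinations(x_elements, max_len)):
        votes.update(counts.get(combo, {}))
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    # Toy pairing data: (X-language elements, standard expression).
    pairs = [(["a", "b", "c"], "YES"), (["a", "b", "d"], "YES"), (["x", "y"], "NO")]
    counts = learn_correspondences(pairs)
    print(best_expression(counts, ["q", "a", "b"]))
```

In this toy setting, an unseen X sequence that shares sub-runs with stored pairs is matched to the same standard expression, which illustrates how stored pairings can generalize beyond exact matches.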

Taking a natural expression in the form of voice (sound waves) as an example, text scripts corresponding to a standard expression may be generated first. For example, if the standard expression carries the meaning "Yes", a plurality of text scripts may be written to correspond to that meaning, such as "Yes" (English), "right", "e", and the like; the text scripts may be written manually or retrieved from a database. The corresponding voice is then obtained by conversion with a text-to-speech (TTS) tool, which yields standard expression-voice pairing data. Because the standard expression can be designed in advance and the TTS tool converts text to voice accurately, accurate pairing data can be obtained. The voice is further converted into secondary language information whose information granularity is smaller than that of characters, and the resulting pairing data of the secondary language information and the standard expression forms the data for machine self-learning. This approach can also be described as pre-training of the natural intelligence robot.
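The pre-training flow described above (text script, then TTS voice, then secondary language information paired with the standard expression) can be sketched as follows. This is an illustrative outline only: `synthesize` and `to_secondary` are hypothetical callables standing in for a real TTS tool and for the speech-to-secondary-language conversion, and the toy stand-ins `fake_tts` and `fake_to_secondary` exist only so the sketch runs without any speech engine.

```python
def build_pretraining_pairs(scripts_by_expression, synthesize, to_secondary):
    """Return a list of (secondary_language_info, standard_expression) pairs.

    synthesize:   hypothetical callable, text -> waveform (stands in for a TTS tool)
    to_secondary: hypothetical callable, waveform -> sequence of sub-character units
    """
    pairs = []
    for expression, scripts in scripts_by_expression.items():
        for script in scripts:
            waveform = synthesize(script)
            pairs.append((to_secondary(waveform), expression))
    return pairs


def fake_tts(text):
    """Toy stand-in: one fake 'sample' in [0, 1] per UTF-8 byte of the script."""
    return [b / 255.0 for b in text.encode("utf-8")]


def fake_to_secondary(waveform, levels=8):
    """Toy stand-in: quantize each sample into one of `levels` integer bins."""
    return tuple(min(int(s * levels), levels - 1) for s in waveform)


if __name__ == "__main__":
    scripts = {"YES": ["yes", "right"]}
    for x_info, y_expr in build_pretraining_pairs(scripts, fake_tts, fake_to_secondary):
        print(y_expr, x_info)
```

Because the standard expression is known before synthesis, every generated pair is labeled correctly by construction, which is the property the text relies on for accurate pairing data.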

The voice corresponding to a standard expression can also be enriched and expanded through the TTS tool, increasing the amount of paired corpus data. For example, the TTS tool can vary one or more parameters of the speech, such as its speech rate, volume, tone, and intonation (for example to 1.1 times or 0.9 times the original value), and can fine-tune the speech sound waves by a random variable; which parameters to vary and the extent of the variation can be determined based on a big-data statistical model of human speech. TTS tools with voice models of different genders, of different languages or dialects, and of different speaking habits and speaking manners can also be used to generate the voice for training.
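The parameter variations mentioned above can be illustrated on a raw waveform held as a list of floats in [-1, 1]. This is a rough sketch under stated assumptions: a real TTS tool would change rate, volume, and tone inside the synthesizer, whereas the naive resampling below changes speed and pitch together; the function names and the default noise scale are hypothetical.

```python
import random


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 gives faster (shorter) audio.

    Note: this naive method shifts pitch along with speed.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / factor)))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - int(pos)
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def change_volume(samples, gain):
    """Scale amplitude and clip to the valid range [-1, 1]."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def jitter(samples, scale, rng):
    """Add a small uniform random perturbation to every sample."""
    return [s + rng.uniform(-scale, scale) for s in samples]


def augment(samples, factors=(0.9, 1.0, 1.1), gains=(0.9, 1.0, 1.1), noise=0.001, seed=0):
    """Return one variant per (speed factor, gain) combination."""
    rng = random.Random(seed)
    return [
        jitter(change_volume(change_speed(samples, f), g), noise, rng)
        for f in factors
        for g in gains
    ]


if __name__ == "__main__":
    base = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5, 0.0, 0.5]
    print(len(augment(base)), "variants generated")
```

Each variant would then be converted to secondary language information and paired with the same standard expression as the original voice.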

Such pre-training pairing data, and the pairing data generated from it and stored in the training database, may be copied to the training databases of other vertical domains or fields as needed, or may be removed from the current training database.

In the human-computer interaction system shown in FIGS. 10 and 11, the response generator 114 may be used as the TTS tool described above to generate the speech corresponding to the standard expression.

The human-computer interaction system based on natural intelligence according to the embodiment of the invention can comprise one or more computers, mobile terminals or other data processing devices, wherein the data processing devices can be used for automatic conversion processing of natural expressions into standard expressions or precise information extraction based on the natural expressions. The system may also implement closed loop feedback and pre-training.

According to the natural expression processing and responding method, device and system, and the multi-intention acquisition method and system, a response can be located quickly by using the standard expression (including the intention acquisition result), so that a customer does not need to spend a long time traversing a complicated conventional function menu to find the required self-service.

According to the human-computer interaction system based on natural intelligence disclosed by the embodiment of the invention, a database of converted natural expression information (X language information), standard expressions (including intention acquisition information) and standard responses can be established through the robot's automatic learning and training together with human-assisted understanding, so that automatic understanding and response by the system are realized step by step. The converted natural expression information stored in the database also has advantages such as a narrow business scope and high fidelity, which reduces the difficulty of training the robot and shortens the time the robot's intelligence needs to mature.

Unlike the conventional response mode, the silent agent mainly performs "decision" work in the background, including determining the standard expression (Y language information) or intention, selecting the response (or response ID), or generating the response operation, and does not need to communicate directly with the customer in the foreground by conversation, text input, or the like. Human resources can therefore be greatly saved and working efficiency improved. In addition, compared with the free-form responses that a traditional human agent gives directly to the customer, the standard responses that the system provides to the customer are not affected by factors such as the human agent's mood, voice quality, accent, and business proficiency, so the stability of the customer experience can be better ensured.

In addition, the robot can automatically learn, train, and evaluate its maturity and confidence with a specific application scenario (business category) as the unit, so that the whole system becomes intelligent point by point. In practical applications, this mechanism of "robot understanding maturing point by point" is more readily accepted by organizations, because the risk is relatively low, the cost of modifying the legacy system is not high, and daily operation is not negatively affected.

The above description is intended to be illustrative of the present invention and not to limit the scope of the invention, which is defined by the claims appended hereto.

Claims

1. A method for training a human-computer interaction system based on natural intelligence, wherein,

the method comprises the following steps:

generating a text script corresponding to a standard expression,

obtaining the voice corresponding to the text script through a text-to-speech tool,

converting each piece of voice into secondary language information,

wherein the order of magnitude of the information granularity of the secondary language information is smaller than that of the information granularity of characters,

storing the secondary language information and the standard expression corresponding thereto as pairing data in a database,

for the paired secondary language information and standard expression already existing in the database, performing loop iteration on the various permutations and combinations of the elements of the secondary language information and the standard expression or the various permutations and combinations of the elements of the standard expression, establishing the correspondence between the various permutations and combinations of the elements of the secondary language information and the standard expression or the various permutations and combinations of the elements of the standard expression, obtaining more pairing data of secondary language information and standard expressions, and storing the pairing data in the database.

2. The method for training a natural intelligence based human-computer interaction system of claim 1, wherein,

voice is input,

the input voice is converted into secondary language information,

the secondary language information converted from the input voice is compared with the secondary language information existing in the database, and a standard expression corresponding to the secondary language information is then determined according to the comparison result, and/or the probability that the secondary language information correctly corresponds to a certain standard expression is calculated,

if the machine's understanding capability is not mature enough to convert the secondary language information into a standard expression, or the machine is not certain of the conversion, human-assisted understanding is performed,

the input voice is understood manually to obtain a standard expression, and the secondary language information obtained from the voice is matched to that standard expression to obtain new pairing data, which is stored in the database.

3. The method for training a human-computer interaction system based on natural intelligence according to claim 2, wherein, for new pairing data of secondary language information, or of a new natural expression, with a standard expression or a combination of standard expressions, loop iteration is performed on the various permutations and combinations of the elements of the secondary language information, or of the secondary language information converted from the natural expression, and the standard expression or combination of standard expressions; the correspondence between the various permutations and combinations of the elements of the secondary language information and the standard expression or combination of standard expressions is established; and more pairing data of secondary language information with standard expressions or combinations of standard expressions is obtained and stored in the database.

4. A method of training a natural intelligence based human-computer interaction system as claimed in claim 2, wherein incorrect correspondences between secondary linguistic information in the database and standard expressions or combinations of standard expressions are corrected by human-aided understanding.

5. The method for training a natural intelligence based human-computer interaction system of claim 2, wherein,

machine understanding capability is measured by confidence,

wherein the confidence is calculated based on the correspondence of secondary language information to a standard expression.

6. A method for training a natural intelligence based human-computer interaction system as claimed in claim 5, wherein, after the secondary language information is obtained from the natural expression, log probabilities or similar scores for one or more standard expressions are generated by one or more of a deep neural network, a finite state transducer, and an automatic encoder-decoder, and the confidence for the one or more standard expressions is calculated using the normalized exponential function (softmax).

7. The method for training a human-computer interaction system based on natural intelligence as claimed in claim 1, wherein the information granularity of the secondary language information is 1/10-1/1000 of the information granularity of the characters.

8. The method of claim 1, wherein the secondary language information obtained by loop iteration is used to test the machine's conversion of secondary language information into standard expressions; secondary language information that cannot be converted correctly, together with the standard expression to which it should correctly correspond, is written into a look-up table; and, for a subsequently input natural expression, the secondary language information converted from that natural expression is compared with the secondary language information stored in the look-up table.

9. A method for training a human-computer interaction system based on natural intelligence as claimed in claim 1, wherein, when loop iteration is performed on the pairing data of the secondary language information and the standard expression, loop optimization is performed on the model for converting secondary language information into standard expressions.

10. The method of training a human-computer interaction system based on natural intelligence of claim 1, wherein one or more of the speech rate, volume, tone, and intonation of the voice are varied by the text-to-speech tool.

11. A natural intelligence based speech processing and responding device (1), comprising: a dialogue gateway (11), a central controller (12), a MAU workstation (13), a robot (14), a training database, a response database (113), a response generator (114), a text-to-speech converter, wherein,

the dialogue gateway (11) receives the voice from the user (8), sends the voice to the central controller (12) for subsequent processing, and sends the response to the voice to the user (8);

the central controller (12) receives the voice from the dialogue gateway (11), works together with the robot (14) and the MAU workstation (13) to convert the voice into a standard expression representing the meaning of the voice, and, according to the standard expression, instructs the response generator (114) to generate a standard response corresponding to the standard expression;

the robot (14) converts the voice into secondary language information according to the instruction of the central controller (12), wherein the order of magnitude of the information granularity of the secondary language information is smaller than that of the information granularity of characters, and converts the secondary language information into the standard expression;

the MAU workstation (13) presents the voice to an external MAU human agent (9), the MAU human agent (9) inputs or selects a standard expression through the MAU workstation (13), and the MAU workstation (13) then sends the standard expression to the central controller (12);

the training database is used for storing the secondary language information and the pairing data of the standard expression;

a response database (113) stores response-related data, including standard response data for invocation and/or data for generating responses;

the response generator (114) receives the instruction of the central controller (12), generates the response to the voice of the user (8) by calling and/or operating the data in the response database (113),

the text-to-speech converter generates voice from the text script corresponding to a standard expression, and the robot (14) converts the voice obtained by the text-to-speech converter into secondary language information and stores the secondary language information and the standard expression corresponding to that text script as pairing data in the training database,

wherein,

the device (1) further comprises a trainer (15), the trainer (15) being adapted to train the robot (14) to convert the voice into the standard expression,

the robot (14) performs loop iteration on the various permutations and combinations of the elements of the secondary language information and the corresponding standard expression or the various permutations and combinations of the elements of the standard expression, establishes the correspondence between the various permutations and combinations of the elements of the secondary language information and the standard expression or the various permutations and combinations of the elements of the standard expression, and stores the obtained pairing data of secondary language information and standard expressions in the training database.

12. A man-machine interaction system based on natural intelligence, wherein,

the system comprises: a natural expression processing and responding device (1) and a calling device (2), wherein a user (8) communicates with the natural expression processing and responding device (1) through the calling device (2), and an MAU human agent (9) manually operates the natural expression processing and responding device (1),

the natural expression processing and responding apparatus (1) includes: a dialogue gateway (11), a central controller (12), a MAU workstation (13), a robot (14), a training database, a response database (113), a response generator (114), a text-to-speech converter, wherein,

wherein,