CN112507104B - Dialog system acquisition method, apparatus, storage medium and computer program product - Google Patents


Info

Publication number
CN112507104B
CN112507104B (application CN202011510559.0A)
Authority
CN
China
Prior art keywords
agent
interaction
reward value
score
dialogue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011510559.0A
Other languages
Chinese (zh)
Other versions
CN112507104A (en)
Inventor
王凡
鲍思琪
何煌
吴华
何径舟
牛正雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011510559.0A priority Critical patent/CN112507104B/en
Publication of CN112507104A publication Critical patent/CN112507104A/en
Application granted granted Critical
Publication of CN112507104B publication Critical patent/CN112507104B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/3343: Query execution using phonetics
    • G06F16/3344: Query execution using natural language analysis
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural network learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides a dialog system acquisition method, apparatus, storage medium, and computer program product, relating to the artificial intelligence fields of intelligent speech, natural language processing, and deep learning. The method may include: modeling at least two agents using a neural network model; forming a virtual interaction system from the at least two agents; and, for each agent, respectively: after the agent performs an interactive action toward other agents, determining a reward value corresponding to the interactive action; continuing to train the agent according to the determined reward value, with the goal of obtaining higher reward values; and, when training is completed, using the agent as a dialog system for conducting human-computer dialogue. Applying the disclosed scheme can improve the training effect and performance of the dialog system.

Description

Dialog system acquisition method, apparatus, storage medium and computer program product
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for obtaining a dialog system in the fields of intelligent speech, natural language processing, and deep learning, a storage medium, and a computer program product.
Background
Currently, Artificial Intelligence (AI) based dialog systems are receiving increasing attention.
Traditional dialog systems are trained primarily on human-to-human dialogue corpora, optimizing the model with objective functions such as maximum likelihood. However, human conversations depend on substantial background information, such as the scene in which the dialogue occurs and the backgrounds and states of the two parties. Because this information is absent from human-human dialogue corpora, training of the dialog system is easily affected by noise, which degrades the training effect.
Disclosure of Invention
The present disclosure provides a dialog system acquisition method, apparatus, storage medium, and computer program product.
A dialog system acquisition method, comprising:
modeling at least two agents using a neural network model;
forming a virtual interactive system by utilizing the at least two agents;
for each agent, respectively performing the following processing:
after the agent performs an interactive action toward an agent other than itself, determining a reward value corresponding to the interactive action;
continuing to train the agent according to the determined reward value, with the goal of obtaining higher reward values; and
after training is finished, using the agent as a dialog system for human-computer dialogue.
A dialog system acquisition apparatus comprising: the system comprises a first building module, a second building module and a training module;
the first building module is used for modeling at least two agents by utilizing a neural network model;
the second building module is used for forming a virtual interactive system by utilizing the at least two agents;
the training module is used for performing the following processing for each agent: after the agent performs an interactive action toward an agent other than itself, determining a reward value corresponding to the interactive action; continuing to train the agent according to the determined reward value, with the goal of obtaining higher reward values; and, after training is finished, using the agent as a dialog system for human-computer dialogue.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described above.
A computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
One embodiment of the above disclosure has the following advantages or benefits: agents can be trained in a virtual interaction system comprising at least two agents, and a trained agent can be used as a dialog system for human-computer dialogue; human-human dialogue corpora are therefore not needed, the problems in the prior art are avoided, and the training effect and dialog system performance are improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of an embodiment of a dialog system acquisition method according to the present disclosure;
FIG. 2 is a schematic diagram of the interaction between two agents of the present disclosure;
FIG. 3 is a schematic diagram of a training process for the first agent and the second agent of FIG. 2;
fig. 4 is a schematic structural diagram of a component of the dialog system acquisition device 40 according to an embodiment of the present disclosure;
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" herein generally indicates an "or" relationship between the objects before and after it.
Fig. 1 is a flowchart of an embodiment of a dialog system acquisition method according to the present disclosure. As shown in fig. 1, the following detailed implementation is included.
In step 101, at least two agents (agents) are modeled using a neural network model.
In step 102, a virtual interactive system is composed with the at least two agents.
In step 103, for each agent, the following processing is respectively performed: after the agent performs an interactive action toward an agent other than itself, determining a reward value (Reward) corresponding to the interactive action; continuing to train the agent according to the determined reward value, with the goal of obtaining higher reward values; and, when training is completed, using the agent as a dialog system for conducting human-computer dialogue.
Existing dialog system solutions are based on imitating human dialogues and lack a discussion of the essential nature of dialogue. The scheme of this method embodiment instead trains the dialog system in a virtual multi-robot (agent) world, returning to the essence of dialogue: exchanging information efficiently between agents. Driven by this need for information exchange, agents that can communicate effectively according to their own needs can be trained, and by migrating from the virtual world to the real world, such agents can communicate with humans more efficiently and purposefully.
Each agent is a system capable of executing certain actions according to its interaction records with other agents, and it can be modeled with a neural network model; the specific modeling approach is not limited in this disclosure and can be determined according to actual needs.
The specific number of modeled agents may likewise depend on the actual need, but is at least two. With multiple agents modeled, a virtual interactive system can be composed.
Any two agents may interact with each other; that is, any agent may perform interactive actions toward any agent other than itself. Preferably, the interactive actions may include: dialogue exchange and recommendation.
Conversational communication may refer to sending conversational messages or the like, and recommendation may refer to recommending content resources or the like. The specific type of content resource is not limited, and may be, for example, an article, a video, etc.
In addition, a content resource pool C = {c1, c2, ..., cN} may be maintained for each agent, where N denotes the number of content resources included in the pool. The specific types of content resources included in each pool, the specific number of resources of different types, and so on can be determined according to actual needs. The content resource pools corresponding to different agents may contain the same or different content resources.
Accordingly, the content resource recommended by the agent may be a content resource in the corresponding content resource pool, such as recommending an article.
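As an illustrative sketch of the content resource pool described above (the resource titles, categories, and helper function here are hypothetical assumptions, not from the disclosure), each agent's pool C = {c1, ..., cN} can be modeled as a list of categorized resources from which recommendations are drawn:

```python
import random
from dataclasses import dataclass

@dataclass
class ContentResource:
    title: str
    category: str

# Hypothetical content resource pool C = {c1, ..., cN} for one agent.
pool = [
    ContentResource("Season preview: New Orleans Pelicans", "sports"),
    ContentResource("This week's album reviews", "entertainment"),
    ContentResource("Beginner marathon training plan", "sports"),
]

def recommend(pool: list) -> ContentResource:
    """Select one resource from the agent's own pool to recommend."""
    return random.choice(pool)

chosen = recommend(pool)
```

In practice the choice would come from the agent's learned policy rather than a uniform draw; the pool itself is the only fixed structure.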
When one agent performs an interaction with another agent, a reward value corresponding to the interaction may be determined.
For example, when the executed interactive action is a dialogue exchange, the rationality score, the word-count penalty score, and the language model score corresponding to the sent dialogue message may be obtained. The difference between the rationality score and the word-count penalty score is calculated to obtain a first calculation result; the difference between the first calculation result and the language model score is then calculated to obtain a second calculation result, which is used as the reward value corresponding to the interactive action.
That is:

r1 = r_coh - r_cha - r_fluency    (1)

where r1 denotes the reward value corresponding to the dialogue exchange, r_coh denotes the rationality score, r_cha denotes the word-count penalty score, and r_fluency denotes the language model score.
Preferably, the rationality score corresponding to the dialogue message is determined using a rationality score model, and the language model score corresponding to the dialogue message is determined using a language model. Both the rationality score model and the language model may be pre-trained.
In addition, the number of words contained in the dialogue message can be counted, and the word-count penalty score corresponding to the dialogue message determined according to the counted number. For example, the count may be used directly as the word-count penalty score, or first converted in a predetermined manner, such as by normalization; how the conversion is performed can be determined according to actual needs.
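Formula (1) can be sketched as follows. The normalized word-count penalty here is one possible conversion, and the rationality and language-model scores are passed in as plain numbers; in the disclosure they would come from a pre-trained rationality score model and language model, so these stand-ins are assumptions:

```python
def word_count_penalty(message: str, max_free_words: int = 10) -> float:
    """Hypothetical r_cha: penalize messages longer than max_free_words,
    normalized so short messages incur no penalty."""
    n = len(message.split())
    return max(0.0, (n - max_free_words) / max_free_words)

def dialogue_reward(r_coh: float, r_cha: float, r_fluency: float) -> float:
    """Formula (1): r1 = r_coh - r_cha - r_fluency."""
    return r_coh - r_cha - r_fluency

# Short, coherent messages score well; the penalty only bites on long ones.
r_cha = word_count_penalty("Do you like music?")  # 4 words, no penalty
r1 = dialogue_reward(r_coh=0.9, r_cha=r_cha, r_fluency=0.2)
```

Note that, as in formula (1), the language model score is subtracted alongside the penalty, so all three terms must be on comparable scales for the reward to be meaningful.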
When the executed interactive action is a recommendation, the recommended agent's score for the recommended content resource can be obtained, and the difference between this score and the recommendation cost is calculated and used as the reward value corresponding to the interactive action, where the recommendation cost is a preset value.
That is:

r2 = s - r_rec    (2)

where r2 denotes the reward value corresponding to the recommendation, s denotes the recommended agent's score for the recommended content resource, and r_rec denotes the recommendation cost, usually a fixed value set in advance.
Preferably, a preset scoring mechanism can be used to determine the score of the recommended agent for the recommended content resource according to the interest distribution information of the recommended agent and the content information of the recommended content resource.
A scoring mechanism f (u) can be setj,ci) For determining a rating of the recommended content resource by the recommended agent. Wherein u isjPrior art for how to obtain interest distribution information of a recommended agent, e.g. which categories (e.g. sports, entertainment, etc.) of content resources are of interest, ciThe content information indicating the recommended content resource may include, for example, a category to which the content resource belongs. The specific implementation of the scoring mechanism in this disclosure is not limiting.
With the above processing, the way reward values are calculated for different interactive actions is reasonably designed, so that the reward values can be evaluated efficiently and accurately.
For any agent, the next interactive action, such as whether to engage in dialogue exchange or recommendation, and the specific content of the dialogue message or the content resource to recommend, can be determined by the agent's own policy according to its interaction records with other agents; the specifics are prior art.
In addition, according to the determined reward values, the agent can be continuously trained with the goal of obtaining higher reward values, for example by reinforcement learning, until training is complete; this too is prior art.
That is, each agent has a goal to accomplish and a quantitative evaluation, namely the reward value, and through continuous training the agent can obtain a higher evaluation.
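The "train toward higher reward" loop can be sketched with a REINFORCE-style policy-gradient update over a toy two-action policy. The tabular softmax policy and the synthetic rewards are illustrative assumptions, since the disclosure does not fix a particular reinforcement learning algorithm:

```python
import math
import random

# Hypothetical two-action policy over interactive actions.
logits = {"dialogue": 0.0, "recommend": 0.0}

def reinforce_step(logits: dict, lr: float = 0.1) -> float:
    """Sample an action from a softmax policy and push its log-probability
    up in proportion to the reward received (REINFORCE)."""
    z = sum(math.exp(v) for v in logits.values())
    probs = {a: math.exp(v) / z for a, v in logits.items()}
    action = random.choices(list(probs), weights=list(probs.values()))[0]
    # Synthetic reward: pretend recommendations earn more on average.
    reward = 1.0 if action == "recommend" else 0.2
    # Gradient of log p(action) w.r.t. logit_a is 1[a == action] - p_a.
    for a in probs:
        logits[a] += lr * reward * ((1.0 if a == action else 0.0) - probs[a])
    return reward

random.seed(0)
for _ in range(500):
    reinforce_step(logits)
# After training, the policy should favor the higher-reward action.
```

In the actual scheme the "policy" is the agent's neural network and the rewards come from formulas (1) and (2), but the update principle is the same: actions that earn higher reward values become more probable.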
Fig. 2 is a schematic diagram of the interaction between two agents according to the present disclosure. As shown in fig. 2, the first agent sends the dialogue message "Do you like music?" to the second agent, and the reward value corresponding to this message may be determined from its rationality score, word-count penalty score, and language model score. The second agent then sends the dialogue message "Sports # $ Like" to the first agent, whose reward value is determined in the same way; this message contains some meaningless characters, illustrating that, without training, the dialogue messages an agent sends may contain arbitrary content. The first agent then recommends an article about Brandon Ingram of the Pelicans to the second agent, and the second agent's score for the recommended article (e.g., a reply along the lines of "I like it, you know I like sports") is used to determine the reward value corresponding to the recommendation.
Fig. 3 is a schematic diagram of a training process of the first agent and the second agent shown in fig. 2, and please refer to the related description for specific implementation, which is not repeated.
The trained agent can then be used as a dialog system for human-computer dialogue, interacting with humans.
As can be seen from the above processing, in the solution of the present disclosure an agent can be trained in a virtual interaction system comprising at least two agents, and the trained agent can be used as a dialog system for human-computer dialogue. Human-human dialogue corpora are therefore not needed, the problems in the prior art are avoided, and the training effect and dialog system performance are improved. Moreover, the obtained dialog system can communicate with humans more efficiently, learn their interests, and make reasonable recommendations of content resources from its content resource pool.
It should be noted that for simplicity of description, the aforementioned method embodiments are presented as a series of combinations of acts, but those skilled in the art will appreciate that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and/or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required for the disclosure.
The above is a description of embodiments of the method, and the embodiments of the apparatus are described below to further illustrate the aspects of the disclosure.
Fig. 4 is a schematic structural diagram of a component of the dialog system acquiring apparatus 40 according to an embodiment of the present disclosure. As shown in fig. 4, includes: the method comprises the following steps: a first building block 401, a second building block 402, and a training block 403.
A first building block 401 for modeling at least two agents using a neural network model.
A second building module 402 for composing a virtual interactive system with the at least two agents.
A training module 403, configured to perform the following processing for any agent: after the intelligent agent executes the interactive action to other intelligent agents except the intelligent agent, determining a reward value corresponding to the interactive action; continuously training the intelligent agent according to the determined reward value and with the goal of obtaining a higher reward value; when training is completed, the agent is used as a dialog system for conducting a human-computer dialog.
Each agent is a system capable of executing certain actions according to its interaction records with other agents, and it can be modeled with a neural network model; the specific modeling approach is not limited in this disclosure and can be determined according to actual needs.
The specific number of modeled agents may also depend on the actual needs, but is at least two. The second building module 402 may compose a virtual interactive system using a plurality of agents modeled by the first building module 401.
Any two agents may interact with each other. For example, for any agent, the interaction may be performed to any agent other than the agent. Preferably, the interaction may comprise: dialogue exchange and recommendation.
Conversational communication may refer to sending conversational messages or the like, and recommendation may refer to recommending content resources or the like. The specific type of content asset is not limited.
Accordingly, the second building module 402 may maintain a content resource pool for each agent; the recommended content resource may be a content resource in a content resource pool corresponding to the agent, such as recommending an article.
When one agent performs an interaction with another agent, a reward value corresponding to the interaction may be determined.
For example, when the interactive action is a dialogue exchange, the training module 403 may respectively obtain the rationality score, the word number penalty score, and the language model score corresponding to the sent dialogue message, and may calculate a difference between the rationality score and the word number penalty score to obtain a first calculation result, further may calculate a difference between the first calculation result and the language model score to obtain a second calculation result, and use the second calculation result as the reward value corresponding to the interactive action.
Preferably, the training module 403 determines the reasonableness score corresponding to the dialogue message by using the reasonableness score model, and determines the language model score corresponding to the dialogue message by using the language model. Both the rationality score model and the language model may be pre-trained.
In addition, the training module 403 may count the number of words included in the dialog message, and determine a word number penalty score corresponding to the dialog message according to the counted number of words. For example, the counted word number may be used as a word number penalty score, or the counted word number may be converted in a predetermined manner and used as a word number penalty score. How the conversion is performed can be determined according to actual needs, for example, normalization processing and the like can be performed.
When the interaction is recommendation, the training module 403 may obtain a score of the recommended agent on the recommended content resource, calculate a difference between the score and the recommendation cost, and use the difference as a reward value corresponding to the interaction, where the recommendation cost is a predetermined value.
Preferably, the training module 403 may determine, by using a preset scoring mechanism, a score of the recommended agent on the recommended content resource according to the interest distribution information of the recommended agent and the content information of the recommended content resource.
For any agent, the next interaction action, such as conversation communication or recommendation, specific content of a conversation message or content resource to be recommended, and the like, can be determined through own policy and the like according to the interaction records and the like with other agents.
In addition, the training module 403 may continue to train the agent with the goal of obtaining a higher reward value based on the determined reward value. For example, the agent may continue to be trained in a reinforcement learning manner with the goal of obtaining a higher reward value until training is complete.
The trained agent can then be used as a dialog system for human-computer dialogue, interacting with humans.
For a specific work flow of the apparatus embodiment shown in fig. 4, reference is made to the related description in the foregoing method embodiment, and details are not repeated.
In a word, with the solution of the apparatus embodiment of the present disclosure, agents can be trained in a virtual interaction system comprising at least two agents, and a trained agent can be used as a dialog system for human-computer dialogue, so that human-human dialogue corpora are not needed, the problems in the prior art are avoided, and the training effect and dialog system performance are further improved.
The scheme disclosed by the disclosure can be applied to the field of artificial intelligence, in particular to the fields of intelligent voice, natural language processing, deep learning and the like.
Artificial intelligence is the discipline of studying how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning). It comprises both hardware and software technologies: artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501, which may perform various suitable actions and processes according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 may also store various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the various methods and processes described above, such as the methods described in this disclosure. For example, in some embodiments, the methods described in this disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 500 via ROM 502 and/or communications unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of the methods described in the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured in any other suitable manner (e.g., by way of firmware) to perform the methods described in this disclosure.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (11)

1. A dialog system acquisition method, comprising:
modeling at least two agents using a neural network model;
forming a virtual interactive system from the at least two agents;
for each agent, respectively performing the following processing:
after the agent performs an interaction action on an agent other than itself, determining a reward value corresponding to the interaction action according to a reward value calculation method corresponding to the interaction action, wherein different interaction actions correspond to different reward value calculation methods; when the interaction action is a recommendation action, the determining of the reward value corresponding to the interaction action comprises: obtaining a score given by the recommended agent to the recommended content resource, calculating a difference between the score and a recommendation cost, and taking the difference as the reward value corresponding to the interaction action, wherein the recommendation cost is a preset value; when the interaction action is a dialogue action, the determining of the reward value corresponding to the interaction action comprises: respectively obtaining a rationality score, a word count penalty score, and a language model score corresponding to a sent dialogue message, calculating a difference between the rationality score and the word count penalty score to obtain a first calculation result, calculating a difference between the first calculation result and the language model score to obtain a second calculation result, and taking the second calculation result as the reward value corresponding to the interaction action;
continuously training the agent according to the determined reward value, with the goal of obtaining a higher reward value; and
after the training is completed, using the agent as a dialogue system for man-machine dialogue.
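The two reward value calculations recited in claim 1 can be sketched as follows (an illustrative sketch only, not the claimed implementation; the function names and the numeric values, including the preset recommendation cost, are assumptions):

```python
def recommendation_reward(score: float, recommendation_cost: float = 0.5) -> float:
    # Reward for a recommendation action: the recommended agent's score for
    # the recommended content resource minus a preset recommendation cost.
    return score - recommendation_cost


def dialogue_reward(rationality_score: float,
                    word_count_penalty_score: float,
                    language_model_score: float) -> float:
    # Reward for a dialogue action:
    #   first result  = rationality score - word count penalty score
    #   second result = first result - language model score
    first_result = rationality_score - word_count_penalty_score
    second_result = first_result - language_model_score
    return second_result


r1 = recommendation_reward(0.9)        # 0.9 - 0.5
r2 = dialogue_reward(0.8, 0.1, 0.2)    # (0.8 - 0.1) - 0.2
```

A reinforcement-learning trainer would then update each agent's policy to increase these per-action rewards.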
2. The method of claim 1, further comprising: maintaining a content resource pool for each agent respectively; wherein the recommended content resource is a content resource in the content resource pool corresponding to the agent.
3. The method of claim 1, wherein the obtaining of the rationality score, the word count penalty score, and the language model score corresponding to the sent dialogue message comprises:
determining the rationality score corresponding to the dialogue message by using a rationality scoring model;
determining the language model score corresponding to the dialogue message by using a language model; and
counting the number of words contained in the dialogue message, and determining the word count penalty score corresponding to the dialogue message according to the number of words.
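A minimal sketch of how the three scores of claim 3 might be produced (the rationality model and language model below are trivial placeholders, and the linear per-word penalty is an assumption; the claim does not fix these details):

```python
def rationality_score(message: str) -> float:
    # Placeholder for the rationality scoring model of claim 3; a real
    # system would use a trained model. Here any non-empty message scores 1.0.
    return 1.0 if message.strip() else 0.0


def language_model_score(message: str) -> float:
    # Placeholder for the language model score (e.g., a normalized
    # negative log-likelihood in a real system).
    return 0.1 if message else 0.0


def word_count_penalty_score(message: str, per_word: float = 0.01) -> float:
    # Count the words in the dialogue message and derive a penalty from the
    # count; a linear per-word penalty is an assumed, illustrative choice.
    return per_word * len(message.split())


msg = "hello there"
scores = (rationality_score(msg),
          language_model_score(msg),
          word_count_penalty_score(msg))
```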
4. The method of claim 1, wherein the obtaining of the score given by the recommended agent to the recommended content resource comprises:
determining, by using a preset scoring mechanism, the score given by the recommended agent to the recommended content resource according to interest distribution information of the recommended agent and content information of the recommended content resource.
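Claim 4 leaves the "preset scoring mechanism" open; one plausible instance is a similarity between the recommended agent's interest distribution and the content resource's topic distribution. The cosine similarity below is purely an illustrative assumption:

```python
import math


def content_score(interest_dist: list[float], content_dist: list[float]) -> float:
    # Score the recommended content resource as the cosine similarity between
    # the agent's interest distribution and the resource's content distribution
    # (an assumed instance of the preset scoring mechanism of claim 4).
    dot = sum(a * b for a, b in zip(interest_dist, content_dist))
    norm = (math.sqrt(sum(a * a for a in interest_dist))
            * math.sqrt(sum(b * b for b in content_dist)))
    return dot / norm if norm else 0.0


score = content_score([0.6, 0.3, 0.1], [0.5, 0.4, 0.1])
```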
5. A dialog system acquisition apparatus, comprising: a first building module, a second building module, and a training module; wherein
the first building module is configured to model at least two agents using a neural network model;
the second building module is configured to form a virtual interactive system from the at least two agents; and
the training module is configured to respectively perform the following processing for each agent: after the agent performs an interaction action on an agent other than itself, determining a reward value corresponding to the interaction action according to a reward value calculation method corresponding to the interaction action, wherein different interaction actions correspond to different reward value calculation methods; when the interaction action is a recommendation action, the determining of the reward value corresponding to the interaction action comprises: obtaining a score given by the recommended agent to the recommended content resource, calculating a difference between the score and a recommendation cost, and taking the difference as the reward value corresponding to the interaction action, wherein the recommendation cost is a preset value; when the interaction action is a dialogue action, the determining of the reward value corresponding to the interaction action comprises: respectively obtaining a rationality score, a word count penalty score, and a language model score corresponding to a sent dialogue message, calculating a difference between the rationality score and the word count penalty score to obtain a first calculation result, calculating a difference between the first calculation result and the language model score to obtain a second calculation result, and taking the second calculation result as the reward value corresponding to the interaction action; continuously training the agent according to the determined reward value, with the goal of obtaining a higher reward value; and after the training is completed, using the agent as a dialogue system for man-machine dialogue.
6. The apparatus of claim 5, wherein,
the second building module is further configured to maintain a content resource pool for each agent respectively; wherein the recommended content resource is a content resource in the content resource pool corresponding to the agent.
7. The apparatus of claim 5, wherein,
the training module determines the rationality score corresponding to the dialogue message by using a rationality scoring model, determines the language model score corresponding to the dialogue message by using a language model, counts the number of words contained in the dialogue message, and determines the word count penalty score corresponding to the dialogue message according to the number of words.
8. The apparatus of claim 5, wherein,
the training module determines, by using a preset scoring mechanism, the score given by the recommended agent to the recommended content resource according to interest distribution information of the recommended agent and content information of the recommended content resource.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-4.
CN202011510559.0A 2020-12-18 2020-12-18 Dialog system acquisition method, apparatus, storage medium and computer program product Active CN112507104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011510559.0A CN112507104B (en) 2020-12-18 2020-12-18 Dialog system acquisition method, apparatus, storage medium and computer program product

Publications (2)

Publication Number Publication Date
CN112507104A CN112507104A (en) 2021-03-16
CN112507104B true CN112507104B (en) 2022-07-22

Family

ID=74922876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011510559.0A Active CN112507104B (en) 2020-12-18 2020-12-18 Dialog system acquisition method, apparatus, storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN112507104B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806512B (en) * 2021-09-30 2024-08-09 中国平安人寿保险股份有限公司 Training method, device, equipment and storage medium for robot dialogue model
CN117556864B (en) * 2024-01-12 2024-04-16 阿里云计算有限公司 Information processing method, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8260655B2 (en) * 2010-06-02 2012-09-04 Xerox Corporation Price optimization with robust learning constraint
CN107515909A (en) * 2017-08-11 2017-12-26 深圳市耐飞科技有限公司 A kind of video recommendation method and system
CN109597876A (en) * 2018-11-07 2019-04-09 中山大学 A kind of more wheels dialogue answer preference pattern and its method based on intensified learning
CN110430471A (en) * 2019-07-24 2019-11-08 山东海看新媒体研究院有限公司 It is a kind of based on the television recommendations method and system instantaneously calculated
CN110796477A (en) * 2019-09-23 2020-02-14 北京三快在线科技有限公司 Advertisement display method and device, electronic equipment and readable storage medium
CN111309880A (en) * 2020-01-21 2020-06-19 清华大学 Multi-agent action strategy learning method, device, medium and computing equipment

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8548796B2 (en) * 2010-01-20 2013-10-01 Xerox Corporation Statistical machine translation system and method for translation of text into languages which produce closed compound words
WO2011109781A2 (en) * 2010-03-04 2011-09-09 Milewise, Inc. Payment method decision engine
CN108921298B (en) * 2018-06-12 2022-04-19 中国科学技术大学 Multi-agent communication and decision-making method for reinforcement learning
US11397888B2 (en) * 2018-06-14 2022-07-26 Accenture Global Solutions Limited Virtual agent with a dialogue management system and method of training a dialogue management system
CN111079412B (en) * 2018-10-18 2024-01-23 北京嘀嘀无限科技发展有限公司 Text error correction method and device
CN111414549A (en) * 2019-05-14 2020-07-14 北京大学 Intelligent general assessment method and system for vulnerability of recommendation system
CN110413754B (en) * 2019-07-22 2023-01-13 清华大学 Conversational (in) reward evaluation and conversational methods, media, apparatuses, and computing devices
CN110404264B (en) * 2019-07-25 2022-11-01 哈尔滨工业大学(深圳) Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
CN111292122A (en) * 2020-01-16 2020-06-16 支付宝(杭州)信息技术有限公司 Method and apparatus for facilitating user to perform target behavior for target object
CN111612126B (en) * 2020-04-18 2024-06-21 华为技术有限公司 Method and apparatus for reinforcement learning
CN112001585B (en) * 2020-07-14 2023-09-22 北京百度网讯科技有限公司 Multi-agent decision method, device, electronic equipment and storage medium
CN111753076B (en) * 2020-08-12 2022-08-26 腾讯科技(深圳)有限公司 Dialogue method, dialogue device, electronic equipment and readable storage medium


Also Published As

Publication number Publication date
CN112507104A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN112507040B (en) Training method and device for multivariate relation generation model, electronic equipment and medium
CN112487173B (en) Man-machine conversation method, device and storage medium
CN112560496A (en) Training method and device of semantic analysis model, electronic equipment and storage medium
CN112507104B (en) Dialog system acquisition method, apparatus, storage medium and computer program product
CN113053388B (en) Voice interaction method, device, equipment and storage medium
CN113836278B (en) Training and dialogue generation method and device for universal dialogue model
CN114722171B (en) Multi-round dialogue processing method and device, electronic equipment and storage medium
CN113360711A (en) Model training and executing method, device, equipment and medium for video understanding task
CN113742457B (en) Response processing method, device, electronic equipment and storage medium
CN115840867A (en) Generation method and device of mathematical problem solving model, electronic equipment and storage medium
CN112527127B (en) Training method and device for input method long sentence prediction model, electronic equipment and medium
CN117935788A (en) Man-machine interaction method and device, electronic equipment and storage medium
CN113361574A (en) Training method and device of data processing model, electronic equipment and storage medium
CN117633184A (en) Model construction and intelligent reply method, device and medium
CN112860995A (en) Interaction method, device, client, server and storage medium
CN117391094A (en) Training method of intelligent customer service model, dialogue method and equipment based on model
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN112530415B (en) Negative reply recognition model acquisition and negative reply recognition method and device
CN109002498A (en) Interactive method, device, equipment and storage medium
CN114328821A (en) Multi-round conversation control method and device based on control slot position and service data slot position
CN114490967A (en) Training method of dialogue model, dialogue method and device of dialogue robot and electronic equipment
CN114118937A (en) Information recommendation method and device based on task, electronic equipment and storage medium
CN115169549B (en) Artificial intelligent model updating method and device, electronic equipment and storage medium
CN116932714B (en) Method and device for training generated dialogue model and realizing generated dialogue
CN113220997B (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant