CN114239687A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN114239687A
CN114239687A (application CN202111389006.9A)
Authority
CN
China
Prior art keywords
action
state data
matching degree
target
actions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111389006.9A
Other languages
Chinese (zh)
Inventor
李旭
黄泰然
孙明明
李平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111389006.9A
Publication of CN114239687A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data processing method and apparatus, an electronic device and a storage medium, and relates to the field of artificial intelligence, in particular to deep learning. The scheme is as follows: state data of the environment in which a target object is located are acquired; the state data are input into a policy network of a reinforcement model, and a plurality of actions corresponding to the state data are sampled from an action set; the plurality of actions and the state data are input into a guidance network of the reinforcement model, which outputs a target matching degree between each action and the state data; and a target action of the target object is determined from the sampled actions according to the target matching degree of each action. Sampling a plurality of actions from the action set output by the policy network improves the efficiency of subsequent processing. Combining the state data with the sampled actions as input to the guidance network and computing a target matching degree for each action yields matching degrees that better reflect the relevance between each action and the state data, so the accuracy of target action determination is improved.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to deep learning, and provides a data processing method and apparatus, an electronic device, and a storage medium.
Background
In recent years, reinforcement learning has been applied in fields such as games, robotics, and recommendation systems. However, training a reinforcement learning model is time-consuming and costly, because it requires a large amount of interaction with the environment to determine matching actions. Meanwhile, a naive exploration strategy slows down learning and may even lead the model to take actions that harm the environment. How to improve the accuracy of action determination is therefore an urgent technical problem.
Disclosure of Invention
The disclosure provides a data processing method, a data processing device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a data processing method including:
acquiring state data of an environment where a target object is located;
inputting the state data into a policy network of a reinforcement model, so as to sample a plurality of actions corresponding to the state data from an action set;
inputting the plurality of actions and the state data into a guidance network of the reinforcement model, so as to output a target matching degree between each action and the state data;
and determining the target action of the target object from the plurality of sampled actions according to the target matching degree between each action and the state data.
According to another aspect of the present disclosure, there is provided a data processing apparatus including:
the acquisition module is used for acquiring state data of the environment where the target object is located;
the first determining module is used for inputting the state data into a policy network of a reinforcement model so as to obtain a plurality of actions corresponding to the state data by sampling from an action set;
a second determining module, configured to input the plurality of actions and the state data into a guidance network of the reinforcement model, so as to output a target matching degree between each of the actions and the state data;
and the third determining module is used for determining the target action of the target object from the plurality of sampled actions according to the target matching degree between each action and the state data.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of the preceding aspect.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the steps of the method of the preceding aspect.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method of the preceding aspect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of another data processing method provided in the embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a reinforcement model according to an embodiment of the disclosure;
fig. 4 is a schematic flow chart of another data processing method provided in the embodiment of the present disclosure;
fig. 5 is a schematic diagram of an action distribution provided by an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of an example electronic device of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A data processing method, an apparatus, an electronic device, and a storage medium of the embodiments of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure.
The data processing method of the embodiment of the present disclosure may be executed by the data processing apparatus of the embodiment of the present disclosure. The apparatus may be disposed in an electronic device, and the electronic device may be a robot, a mobile phone, a wearable device, or the like, which is not limited in this embodiment.
As shown in fig. 1, the method comprises the following steps:
step 101, acquiring state data of the environment where the target object is located.
In the embodiment of the present disclosure, the state data indicates the state of the environment in which the target object is located. For example, the target object is a robot used to grasp an item and the robot's mechanical arm is in a raised state; or the robot is in a target-following state; or, in a question-and-answer scenario, the robot is in a state of having received a question. The target object may also be a vehicle in an intelligent driving scenario, and so on; the target object differs with the service scenario, which is not limited in this embodiment. The state data may be in the form of image data, text data, voice data, or video data, which is not limited in this embodiment either.
In the technical solution of the present disclosure, the acquisition, storage, and use of the personal information of users involved all comply with relevant laws and regulations and do not violate public order or good customs.
Step 102, inputting the state data into a policy network of the reinforcement model to obtain a plurality of actions corresponding to the state data by sampling from the action set.
As one implementation, the reinforcement model may be a Soft Actor-Critic (SAC) model, where the Actor serves as the policy network and the Critic serves as the guidance network.
In the embodiment of the present disclosure, the state data is input into the policy network of the reinforcement model to obtain an action set, output by the policy network, consisting of actions corresponding to the state data, and a plurality of actions corresponding to the state data are sampled from this action set. As a second implementation, the plurality of actions corresponding to the state data may be sampled from the action set based on a set condition, where the set condition may be an association or matching relationship between actions and the state data, or an environmental state such as a sudden change of the environment.
Step 103, inputting the plurality of actions and the state data into the guidance network of the reinforcement model to output a target matching degree between each action and the state data.
In the embodiment of the present disclosure, the plurality of actions and the state data are converted into corresponding action vectors and a state vector, the action vectors and the state vector are fused to obtain fusion vectors, and the fusion vectors are input into the guidance network of the reinforcement model to output a target matching degree between each action and the state data. The target matching degree indicates the likelihood that the target object selects the corresponding action given the state data of the current environment: the higher the target matching degree, the more likely the corresponding action is to be selected.
In the embodiment of the present disclosure, training of the policy network depends on the guidance network, so that the expected cumulative return of an action selected based on the guidance network is greater than or equal to that of an action selected by the policy network alone; in other words, the matching degree between each action and the state data output by the guidance network is more accurate than that output by the policy network. Therefore, the state data of the environment in which the target object is located and the plurality of actions are combined and input into the guidance network, the target matching degree of each action is computed, and the target action is selected based on these target matching degrees, which improves the accuracy of target action determination.
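The fusion described above can be illustrated with a short sketch. The following PyTorch code is a minimal illustration, not the implementation of the disclosure: the concatenation-plus-MLP design and the layer sizes are assumptions, and the only point shown is that each sampled action vector is paired with the state vector and mapped to a single scalar matching degree.

```python
# Minimal sketch of a guidance (critic-style) network: each action vector is
# fused with the state vector by concatenation and mapped to one scalar
# matching degree. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn


class GuidanceNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one matching degree per (state, action) pair
        )

    def forward(self, state: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # state: (state_dim,), actions: (num_actions, action_dim)
        state = state.unsqueeze(0).expand(actions.shape[0], -1)  # pair the state with every action
        fusion = torch.cat([state, actions], dim=-1)             # fusion vector per action
        return self.mlp(fusion).squeeze(-1)                      # (num_actions,) matching degrees
```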
Step 104, determining a target action of the target object from the plurality of sampled actions according to the target matching degree between each action and the state data.
In the embodiment of the present disclosure, according to the target matching degree between each action and the state data, the action whose target matching degree satisfies a set threshold is taken as the target action of the target object. For example, in a question-answer scenario, if the state data is the current question and the actions are candidate answers, the answer with the highest target matching degree is selected as the target answer.
In the data processing method of the embodiment of the present disclosure, state data of the environment in which the target object is located are acquired; the state data are input into the policy network of the reinforcement model to sample a plurality of actions corresponding to the state data from the action set; the plurality of actions and the state data are input into the guidance network of the reinforcement model to output a target matching degree between each action and the state data; and the target action of the target object is determined from the sampled actions according to these target matching degrees. Sampling a plurality of actions from the action set output by the policy network improves the efficiency of subsequent processing. Combining the state data with the sampled actions as input to the guidance network and computing the target matching degree of each action yields matching degrees that better reflect the relevance between each action and the state data, so selecting the target action based on them improves the accuracy of target action determination.
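The selection procedure of steps 101 to 104 can also be summarized in code. The sketch below is only an illustration under the SAC-style assumption mentioned above, namely that the policy network outputs a Gaussian action distribution; the names policy_net, guidance_net and num_samples are hypothetical, and the disclosure does not prescribe this exact interface.

```python
# Minimal end-to-end sketch of steps 101-104. Assumes a policy network that
# returns the mean and standard deviation of a Gaussian action distribution
# and a guidance network like the GuidanceNetwork sketched above; the names
# and the sample count are illustrative.
import torch


@torch.no_grad()
def choose_target_action(state, policy_net, guidance_net, num_samples: int = 200):
    # Step 102: the policy network parameterizes an action distribution for the
    # current state; sample several candidate actions from it.
    mean, std = policy_net(state)
    actions = torch.distributions.Normal(mean, std).sample((num_samples,))

    # Step 103: score every (state, action) pair with the guidance network.
    match = guidance_net(state, actions)        # target matching degree per action

    # Step 104: keep the action whose target matching degree is highest.
    return actions[match.argmax()]
```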
Based on the foregoing embodiments, an embodiment of the present disclosure provides another data processing method, and fig. 2 is a schematic flow chart of the another data processing method provided in the embodiment of the present disclosure, as shown in fig. 2, the method includes the following steps:
step 201, obtaining the state data of the environment where the target object is located.
Step 202, inputting the state data into a policy network of the reinforcement model to obtain a plurality of actions corresponding to the state data by sampling from the action set.
Step 201 and step 202 may refer to the explanations in the foregoing embodiments, and the principle is the same, which is not described again in this embodiment.
Step 203, inputting the plurality of actions and the state data into at least one sub-network in the guidance network, respectively, to obtain the candidate matching degree corresponding to each action output by each sub-network.
The guidance network includes at least one sub-network, for example 2 sub-networks or 3 sub-networks, which is not limited in this embodiment.
In the embodiment of the present disclosure, the plurality of actions and the state data are respectively input into each of the at least one sub-network included in the guidance network, so as to obtain the candidate matching degree of each action output by each sub-network. As an example, fig. 3 is a schematic structural diagram of a reinforcement model according to an embodiment of the present disclosure. As shown in fig. 3, the guidance network includes two sub-networks, referred to as a first sub-network and a second sub-network for convenience of distinction. The plurality of actions and the state data are input into the first sub-network and the second sub-network, respectively, so as to obtain the candidate matching degree of each action output by the first sub-network and the candidate matching degree of each action output by the second sub-network.
And step 204, determining the target matching degree corresponding to each action according to the candidate matching degree corresponding to each action output by each sub-network.
The following still takes two sub-networks as an example.
As one implementation, for each action, the candidate matching degree output by the first sub-network and the candidate matching degree output by the second sub-network are averaged, and the average is used as the target matching degree of the action. This balances the finally output target matching degree and improves its accuracy.
As another implementation, for each action, the smaller of the candidate matching degree output by the first sub-network and the candidate matching degree output by the second sub-network is determined, and that smaller candidate matching degree is taken as the target matching degree of the action.
In the embodiment of the disclosure, the guidance network is provided with a plurality of sub-networks, and the target matching degree of each action is determined from the candidate matching degrees output by the sub-networks, for example by selecting the smaller candidate matching degree as the target matching degree of the action. This prevents the over-estimation of the matching degree that a single sub-network may produce, and providing a plurality of sub-networks improves the reliability of the determined target matching degree.
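A two-sub-network guidance head of this kind can be sketched as follows, reusing the GuidanceNetwork class from the earlier sketch; the aggregate switch between the minimum and the mean is an illustrative assumption that simply mirrors the two implementations described above.

```python
# Sketch of a guidance network with two sub-networks whose candidate matching
# degrees are combined by the minimum (to curb over-estimation) or the mean
# (for a balanced estimate). Reuses the GuidanceNetwork sketch from step 103.
import torch
import torch.nn as nn


class TwinGuidanceNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, aggregate: str = "min"):
        super().__init__()
        self.sub1 = GuidanceNetwork(state_dim, action_dim)  # first sub-network
        self.sub2 = GuidanceNetwork(state_dim, action_dim)  # second sub-network
        self.aggregate = aggregate

    def forward(self, state: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        c1 = self.sub1(state, actions)  # candidate matching degrees from sub-network 1
        c2 = self.sub2(state, actions)  # candidate matching degrees from sub-network 2
        if self.aggregate == "min":
            return torch.minimum(c1, c2)  # pessimistic target matching degree
        return (c1 + c2) / 2              # averaged target matching degree
```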
Step 205, according to the target matching degree between each action and the state data, the target action of the target object is determined from the plurality of sampled actions.
Specifically, reference may be made to the explanations in the foregoing embodiments; the principle is the same and is not repeated in this embodiment.
In the data processing method of the embodiment of the present disclosure, the guidance network includes a plurality of sub-networks, and the target matching degree of each action is determined from the candidate matching degrees output by the sub-networks, for example by selecting the smaller candidate matching degree as the target matching degree. This prevents over-estimation of the matching degree by a single sub-network, and providing a plurality of sub-networks improves the reliability of the determined target matching degree.
Based on the foregoing embodiments, an embodiment of the present disclosure provides another data processing method, and fig. 4 is a schematic flow chart of the another data processing method provided in the embodiment of the present disclosure, as shown in fig. 4, the method includes the following steps:
step 401, obtaining state data of an environment where the target object is located.
In step 401, reference may be made to the explanations in the foregoing embodiments, and the principle is the same, which is not described again in this embodiment.
Step 402, inputting the state data into the policy network, and outputting the initial action distribution corresponding to the state data.
The initial action distribution indicates the initial matching degree between each action in the action set and the state data. The initial matching degree of each action indicates the probability that the corresponding action is selected as the next step given the state data of the environment in which the target object is located, or the expected value of selecting that action.
Step 403, sampling a plurality of actions corresponding to the state data from the action set according to the initial matching degree between each action and the state data in the initial action distribution.
In the embodiment of the present disclosure, a plurality of actions corresponding to the state data are sampled from the action set according to the initial matching degree. Sampling is therefore probability-based, driven by the initial matching degree, which preserves the randomness of sampling while making actions with higher initial matching degrees more likely to be sampled. A set number of actions are sampled, and the value of the set number is adjusted according to scene requirements; for example, the set number is usually 200. When sampling based on the initial matching degree, too small a sample size may miss actions with higher matching degrees, while too large a sample size increases the difficulty of data processing and lowers efficiency.
As an example of the initial action distribution, fig. 5 is a schematic diagram of an action distribution provided by an embodiment of the present disclosure. As shown in fig. 5, P is the initial action distribution, which is a Gaussian distribution indicating the initial matching degree between each action in the action set and the state data. Each white circular point a on the horizontal axis indicates a sample point drawn from the action set, one action per sample point.
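Step 403 can be sketched as follows. The action dimension, the standard deviation and the sample count of 200 are illustrative values only; the point is that sampling from the Gaussian initial action distribution P is probability-driven, so actions with a higher initial matching degree (higher density under P) are drawn more often.

```python
# Sketch of step 403: sample a set number of actions from the Gaussian initial
# action distribution P output by the policy network. All numbers are
# illustrative, not values prescribed by the disclosure.
import torch

mean = torch.zeros(4)                 # example 4-dimensional action space
std = 0.5 * torch.ones(4)             # example standard deviation from the policy head
initial_dist = torch.distributions.Normal(mean, std)   # the distribution P

num_samples = 200                                       # set number, scene-dependent
sampled_actions = initial_dist.sample((num_samples,))   # (200, 4) candidate actions

# The density under P plays the role of the initial matching degree: actions in
# high-density regions appear more often among the samples.
initial_match = initial_dist.log_prob(sampled_actions).sum(dim=-1).exp()
```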
Step 404, inputting the plurality of actions and the state data into the guidance network of the reinforcement model to output a target matching degree between each action and the state data.
Step 404 may refer to the explanations in the foregoing embodiments, and the principle is the same, which is not described again in this embodiment.
Step 405, generating a target action distribution corresponding to the state data according to the target matching degree corresponding to each action.
The target action distribution indicates the matching degree between each sampled action and the state data.
In an implementation of the embodiment of the present disclosure, the target matching degree of each action is normalized to obtain a normalized matching degree for each action, and the target action distribution corresponding to the state data is generated from the normalized matching degrees; that is, the target action distribution indicates each sampled action and the normalized matching degree between that action and the state data. Normalizing the target matching degrees maps them into the same interval, for example values between 0 and 1, which facilitates subsequent data processing and comparison and improves data processing efficiency.
As one implementation, the target matching degree of each sampled action is normalized with a normalization function, for example a softmax function, to obtain the normalized matching degree of each action. The matching degrees obtained through the normalization function make the generated target action distribution smoother, that is, the matching degrees are more evenly distributed, which improves the stability of the subsequently selected target action of the target object and thus the accuracy of target action determination.
As another implementation, normalization may be performed according to the maximum and minimum values of the target matching degrees to obtain the normalized matching degree of each action; a target action distribution is generated from the normalized matching degrees, and the target action that interacts with the environment of the target object is selected from the target action distribution.
As shown in fig. 5, the target action distribution P' corresponding to the state data is generated according to the target matching degree of each action. For a sampled action, the selection probability (matching degree) under the initial action distribution P differs from that under the target action distribution P'; that is, the initial matching degree and the normalized matching degree between a sampled action and the state data differ. Because the target matching degree output by the guidance network is more accurate than the initial matching degree output by the policy network, selecting the target action based on these matching degrees improves the accuracy of target action determination.
Step 406, determining a target action of the target object from the plurality of sampled actions according to the matching degree between each action and the state data in the target action distribution.
The step 406 may refer to the explanations in the foregoing embodiments, and the principle is the same, which is not described herein again.
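Steps 405 and 406 can be sketched as follows, assuming that match holds the target matching degrees of the sampled actions and sampled_actions holds the actions themselves (both from the earlier sketches). Both normalization options described above are shown; the small epsilon and the helper name are illustrative.

```python
# Sketch of steps 405-406: normalize the target matching degrees (softmax or
# min-max scaling), build the target action distribution P', and pick the
# target action from it. `match` and `sampled_actions` are assumed inputs.
import torch


def build_target_distribution(match: torch.Tensor, use_softmax: bool = True):
    if use_softmax:
        probs = torch.softmax(match, dim=0)          # smoother target distribution P'
    else:
        scaled = (match - match.min()) / (match.max() - match.min() + 1e-8)
        probs = (scaled + 1e-8) / (scaled + 1e-8).sum()  # min-max scaling, renormalized
    return torch.distributions.Categorical(probs=probs)


# Step 406: select the target action according to the normalized matching degrees,
# e.g. target_action = sampled_actions[build_target_distribution(match).sample()]
```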
In the data processing method of the embodiment of the present disclosure, state data of the environment in which the target object is located are acquired; the state data are input into the policy network of the reinforcement model to sample a plurality of actions corresponding to the state data from the action set; the plurality of actions and the state data are input into the guidance network of the reinforcement model to output a target matching degree between each action and the state data; and the target action of the target object is determined from the sampled actions according to these target matching degrees. Sampling a plurality of actions from the action set output by the policy network improves the efficiency of subsequent processing. Combining the state data with the sampled actions as input to the guidance network and computing the target matching degree of each action yields matching degrees that better reflect the relevance between each action and the state data, so selecting the target action based on them improves the accuracy of target action determination.
In order to implement the above embodiments, the present embodiment provides a data processing apparatus.
Fig. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure, as shown in fig. 6, the apparatus includes:
the obtaining module 61 is configured to obtain state data of an environment where the target object is located.
A first determining module 62, configured to input the state data into a policy network of a reinforcement model, so as to sample a plurality of actions corresponding to the state data from an action set.
A second determining module 63, configured to input the plurality of actions and the state data into a guidance network of the reinforcement model, so as to output a target matching degree between each action and the state data.
A third determining module 64, configured to determine a target action of the target object from the plurality of sampled actions according to the target matching degree between each action and the state data.
Further, in an implementation manner of the embodiment of the present disclosure, the second determining module 63 is specifically configured to:
inputting the plurality of actions and the state data into at least one sub-network in the guidance network to obtain candidate matching degrees corresponding to the actions output by each sub-network;
and determining the target matching degree corresponding to each action according to the candidate matching degree corresponding to each action and output by each sub-network.
In an implementation manner of the embodiment of the present disclosure, the first determining module 62 is specifically configured to:
inputting the state data into the policy network, and outputting initial action distribution corresponding to the state data; wherein the initial action distribution is used for indicating an initial matching degree between each action in the action set and the state data;
and sampling from an action set to obtain a plurality of actions corresponding to the state data according to the initial matching degree between each action in the initial action distribution and the state data.
In an implementation manner of the embodiment of the present disclosure, the third determining module 64 is specifically configured to:
generating a target action distribution corresponding to the state data according to a target matching degree corresponding to each action, wherein the target action distribution is used for indicating the matching degree between each action obtained by sampling and the state data;
and determining the target action of the target object from the plurality of actions obtained by sampling according to the matching degree between each action in the target action distribution and the state data.
In an implementation manner of the embodiment of the present disclosure, the third determining module 64 is further specifically configured to:
normalizing the target matching degree corresponding to each action to obtain the normalized matching degree corresponding to each action;
and generating target action distribution corresponding to the state data according to the matching degree corresponding to each action after normalization processing.
In an implementation manner of the embodiment of the present disclosure, the third determining module 64 is further specifically configured to:
and carrying out normalization processing on the matching degree corresponding to each action by adopting a normalization function to obtain the matching degree corresponding to each action after the normalization processing.
It should be understood that the explanations in the foregoing method embodiments also apply to the apparatus in this embodiment, and the principle is the same, and the descriptions in this embodiment are omitted.
In the data processing apparatus of the embodiment of the present disclosure, state data of the environment in which the target object is located are acquired; the state data are input into the policy network of the reinforcement model to sample a plurality of actions corresponding to the state data from the action set; the plurality of actions and the state data are input into the guidance network of the reinforcement model to output a target matching degree between each action and the state data; and the target action of the target object is determined from the sampled actions according to these target matching degrees. Sampling a plurality of actions from the action set output by the policy network improves the efficiency of subsequent processing. Combining the state data with the sampled actions as input to the guidance network and computing the target matching degree of each action yields matching degrees that better reflect the relevance between each action and the state data, so selecting the target action based on them improves the accuracy of target action determination.
In order to implement the above embodiments, an embodiment of the present disclosure further provides an electronic device, including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the method of the preceding method embodiment.
To achieve the above embodiments, the embodiments of the present disclosure further provide a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the steps of the method of the foregoing method embodiments.
To implement the above embodiments, the present disclosure also provides a computer program product including computer instructions, which when executed by a processor implement the steps of the method of the foregoing method embodiments.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 7 is a block diagram of an example electronic device of an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic apparatus 700 includes a computing unit 701, which can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 702 or a computer program loaded from a storage unit 708 into a RAM (Random Access Memory) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An I/O (Input/Output) interface 705 is also connected to the bus 704.
A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing Unit 701 include, but are not limited to, a CPU (Central Processing Unit), a GPU (graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing Units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable Processor, controller, microcontroller, and the like. The calculation unit 701 executes the respective methods and processes described above, such as the data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the data processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Electrically Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device for displaying information to the user, and a keyboard and a pointing device through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that artificial intelligence is the discipline of making computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A method of data processing, comprising:
acquiring state data of an environment where a target object is located;
inputting the state data into a policy network of a reinforcement model, so as to sample a plurality of actions corresponding to the state data from an action set;
inputting the plurality of actions and the state data into a guidance network of the reinforcement model, so as to output a target matching degree between each action and the state data;
and determining the target action of the target object from the plurality of actions obtained by sampling according to the target matching degree between each action and the state data.
2. The method of claim 1, wherein said inputting the plurality of actions and the state data into a guidance network of the reinforcement model to output a target matching degree between each of the actions and the state data comprises:
inputting the plurality of actions and the state data into at least one sub-network in the guidance network to obtain candidate matching degrees corresponding to the actions output by each sub-network;
and determining the target matching degree corresponding to each action according to the candidate matching degree corresponding to each action and output by each sub-network.
3. The method of claim 1, wherein the inputting the state data into a policy network of a reinforcement model to sample a plurality of actions corresponding to the state data from an action set comprises:
inputting the state data into the policy network, and outputting initial action distribution corresponding to the state data; wherein the initial action distribution is used for indicating an initial matching degree between each action in the action set and the state data;
and sampling from an action set to obtain a plurality of actions corresponding to the state data according to the initial matching degree between each action in the initial action distribution and the state data.
4. The method of any one of claims 1-3, wherein said determining the target action of the target object from the plurality of sampled actions according to the target matching degree between each action and the state data comprises:
generating a target action distribution corresponding to the state data according to a target matching degree corresponding to each action, wherein the target action distribution is used for indicating the matching degree between each action obtained by sampling and the state data;
and determining the target action of the target object from the plurality of actions obtained by sampling according to the matching degree between each action in the target action distribution and the state data.
5. The method of claim 4, wherein the generating a target action distribution corresponding to the state data according to the target matching degree corresponding to each action comprises:
normalizing the target matching degree corresponding to each action to obtain the normalized matching degree corresponding to each action;
and generating target action distribution corresponding to the state data according to the matching degree corresponding to each action after normalization processing.
6. The method according to claim 5, wherein the normalizing the target matching degree corresponding to each of the actions to obtain the normalized matching degree corresponding to each of the actions includes:
and carrying out normalization processing on the matching degree corresponding to each action by adopting a normalization function to obtain the matching degree corresponding to each action after the normalization processing.
7. A data processing apparatus comprising:
the acquisition module is used for acquiring state data of the environment where the target object is located;
the first determining module is used for inputting the state data into a policy network of a reinforcement model so as to obtain a plurality of actions corresponding to the state data by sampling from an action set;
a second determination module, configured to input the plurality of actions and the state data into a guidance network of the reinforcement model, so as to output a target matching degree between each of the actions and the state data;
and the third determining module is used for determining the target action of the target object from the plurality of actions obtained by sampling according to the target matching degree between each action and the state data.
8. The apparatus of claim 7, wherein the second determining module is specifically configured to:
inputting the plurality of actions and the state data into at least one sub-network in the guidance network to obtain candidate matching degrees corresponding to the actions output by each sub-network;
and determining the target matching degree corresponding to each action according to the candidate matching degree corresponding to each action and output by each sub-network.
9. The apparatus of claim 7, wherein the first determining module is specifically configured to:
inputting the state data into the policy network, and outputting initial action distribution corresponding to the state data; wherein the initial action distribution is used for indicating an initial matching degree between each action in the action set and the state data;
and sampling from an action set to obtain a plurality of actions corresponding to the state data according to the initial matching degree between each action in the initial action distribution and the state data.
10. The apparatus according to any one of claims 7 to 9, wherein the third determining module is specifically configured to:
generating a target action distribution corresponding to the state data according to a target matching degree corresponding to each action, wherein the target action distribution is used for indicating the matching degree between each action obtained by sampling and the state data;
and determining the target action of the target object from the plurality of actions obtained by sampling according to the matching degree between each action in the target action distribution and the state data.
11. The apparatus of claim 10, wherein the third determining module is further specifically configured to:
normalizing the target matching degree corresponding to each action to obtain the normalized matching degree corresponding to each action;
and generating target action distribution corresponding to the state data according to the matching degree corresponding to each action after normalization processing.
12. The apparatus of claim 11, wherein the third determining module is further specifically configured to:
and carrying out normalization processing on the matching degree corresponding to each action by adopting a normalization function to obtain the matching degree corresponding to each action after the normalization processing.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the steps of the method according to any one of claims 1-6.
15. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1-6.
CN202111389006.9A 2021-11-22 2021-11-22 Data processing method and device, electronic equipment and storage medium Pending CN114239687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111389006.9A CN114239687A (en) 2021-11-22 2021-11-22 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111389006.9A CN114239687A (en) 2021-11-22 2021-11-22 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114239687A true CN114239687A (en) 2022-03-25

Family

ID=80750389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111389006.9A Pending CN114239687A (en) 2021-11-22 2021-11-22 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114239687A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination