CN112001570B - Data processing method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN112001570B
CN112001570B (application CN202011052216.4A)
Authority
CN
China
Prior art keywords: target, strategy, score, policy, region
Prior art date
Legal status
Active
Application number
CN202011052216.4A
Other languages
Chinese (zh)
Other versions
CN112001570A (en)
Inventor
罗世楷
宋歌
朱宏图
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202011052216.4A
Publication of CN112001570A
Application granted
Publication of CN112001570B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06311Scheduling, planning or task assignment for a person or group
    • G06Q50/40

Abstract

The embodiment of the invention provides a data processing method, a data processing device, electronic equipment and a readable storage medium, and relates to the technical field of computers.

Description

Data processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a readable storage medium.
Background
Some types of services are associated with cities and are affected by regional supply-and-demand relationships, such as online car-hailing services. When supply in area A exceeds demand, car-hailing vehicles tend to leave area A; conversely, when supply in area A is less than demand, car-hailing vehicles tend to move toward area A.
In the prior art, a worker of a service platform can formulate a service strategy based on the attributes of a city, so that supply and demand parties of the city are reasonably distributed.
However, a city may contain areas where supply is greater than demand, areas where supply is less than demand, and areas where supply equals demand. A worker therefore needs to continuously adjust the policy for different areas, which makes the work inefficient; yet if a unified policy is executed in every area of the city, the supply-and-demand relationship in some areas may become even more unbalanced, resulting in supply-demand mismatch.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data processing method, an apparatus, an electronic device, and a readable storage medium, so as to improve the efficiency of policy adjustment, save labor, and solve the problem of supply and demand mismatch.
In a first aspect, a data processing method is provided, where the method is applied to a server, and the method includes:
determining a target strategy, wherein the target strategy is used for representing interaction rules between users in a target city;
based on the target strategy, determining observation data, wherein the observation data is at least used for representing strategy actions of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city;
determining a target policy score for the target policy based on the observation data;
determining a score difference between the target policy score and a preset policy score, wherein the preset policy score is a policy score determined based on a preset policy; and
in response to the score difference being a positive value, determining the target policy to be a beneficial policy.
Optionally, the determining observation data based on the target policy includes:
determining the observation data based on at least a policy action of a first region at a first time, a state of the first region at the first time, a policy score of the policy action of the first region at the first time, and states of the first region at each time within a preset time period, wherein the first region is used to represent a region in the target city, and the first time is a time within the preset time period.
Optionally, the target city satisfies a consistency hypothesis, a sequence randomization hypothesis, a Markov hypothesis, and a conditional mean-independence hypothesis;
the consistency hypothesis is used to characterize that the state of the first region at the first time is related to the policy actions of the target city from a starting time to a second time, wherein the starting time is a preset time and the second time is a time before the first time;
the sequence randomization hypothesis is used to characterize that the policy action of the target city at time t is related to the policy action history of the target city and the current state of the target city;
the Markov hypothesis is used to characterize that the state of the target city at the first time depends on the state of the target city at the second time and a policy action;
the conditional mean-independence hypothesis is used to characterize that the policy score expectation corresponding to the first region is determined based on a preset expectation algorithm, wherein the preset expectation algorithm includes the policy actions and states corresponding to the first region.
Optionally, the determining a target policy score of the target policy based on the observation data includes:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
Optionally, the determining a target policy score of the target policy based on the observation data includes:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
Optionally, the determining a score difference between the target policy score and a preset policy score includes:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
Optionally, the method further includes:
and determining the average state of each region in the target city based on the average function of the state of each region in the target city, and determining the average strategy action of each region in the target city based on the average function of the strategy action of each region in the target city.
In a second aspect, a data processing apparatus is provided, the apparatus being applied to a server, the apparatus comprising:
the target strategy module is used for determining a target strategy, and the target strategy is used for representing interaction rules between users in a target city;
the observation data module is used for determining observation data based on the target strategy, and the observation data is at least used for representing the strategy action of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city;
a target policy score module for determining a target policy score for the target policy based on the observation data;
the score difference module is used for determining the score difference between the target strategy score and a preset strategy score, wherein the preset strategy score is a strategy score determined based on a preset strategy; and
a determination module, configured to determine that the target policy is a beneficial policy in response to the score difference being a positive value.
Optionally, the observation data module is specifically configured to:
determining the observation data based on at least a policy action of a first region at a first time, a state of the first region at the first time, a policy score of the policy action of the first region at the first time, and states of the first region at each time within a preset time period, wherein the first region is used to represent a region in the target city, and the first time is a time within the preset time period.
Optionally, the target city satisfies a consistency hypothesis, a sequence randomization hypothesis, a Markov hypothesis, and a conditional mean-independence hypothesis;
the consistency hypothesis is used to characterize that the state of the first region at the first time is related to the policy actions of the target city from a starting time to a second time, wherein the starting time is a preset time and the second time is a time before the first time;
the sequence randomization hypothesis is used to characterize that the policy action of the target city at time t is related to the policy action history of the target city and the current state of the target city;
the Markov hypothesis is used to characterize that the state of the target city at the first time depends on the state of the target city at the second time and a policy action;
the conditional mean-independence hypothesis is used to characterize that the policy score expectation corresponding to the first region is determined based on a preset expectation algorithm, wherein the preset expectation algorithm includes the policy actions and states corresponding to the first region.
Optionally, the target policy score module is specifically configured to:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
Optionally, the target policy score module is specifically configured to:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
Optionally, the score difference module is specifically configured to:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
Optionally, the apparatus further comprises:
and the average state module is used for determining the average state of each area in the target city based on the average function of the state of each area in the target city, and determining the average strategy action of each area in the target city based on the average function of the strategy action of each area in the target city.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is used to store one or more computer program instructions, where the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium on which computer program instructions are stored, which when executed by a processor implement the method according to the first aspect.
According to the embodiment of the invention, the server can determine the target policy score of the target policy based on the observation data of the target policy. Because this score can be used to evaluate the target policy, the server can judge the feasibility of the target policy through the target policy score; if the target policy is a beneficial policy, the server can execute the target policy for the target city. This improves the efficiency of policy adjustment, saves labor, and alleviates the problem of supply-demand mismatch.
Drawings
The above and other objects, features and advantages of the embodiments of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present invention;
fig. 2 is a schematic diagram of areas of a target city according to an embodiment of the present invention;
fig. 3 is a flowchart of a data processing method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
As shown in fig. 1, fig. 1 is a data processing system according to an embodiment of the present invention, where the system includes: a terminal device 11 and a server 12.
The terminal device 11 may be a mobile terminal (e.g., a smart phone), or may be a vehicle-mounted terminal installed in a vehicle, and the server 12 may be a single server, or may be a server cluster configured in a distributed manner.
In one implementation, the data processing system shown in fig. 1 may be used for policy selection of a target city, the terminal device 11 may be a smart phone used by each car booking driver in the target city, and the server 12 may be a server of a car booking platform.
It should be noted that the terminal device 11 shown in fig. 1 is used to represent a plurality of terminals, the number of which is not limited to 3, and the number of the terminal devices 11 is not limited in the embodiment of the present invention.
The target city includes: region 1, region 2, region 3, region 4, and region 5, where each region of the target city corresponds to the policy action for that region, the status of that region, and the region policy score, respectively.
The server 12 may configure a target policy for the target city, where the target policy is used to characterize interaction rules between users in the target city; in the embodiment of the present invention, the target policy is a set of policy actions for region 1 to region 5.
The server 12 may collect observation data of the target city, where the observation data may include the above policy action, status, and policy score, then the server 12 may determine a target policy score of the target policy according to the observation data, then the server 12 may determine a score difference between the target policy score and a preset policy score, and then determine whether the target policy is a beneficial policy according to the score difference.
In the embodiment of the present invention, the preset policy score is a score corresponding to a preset policy, and the preset policy may be a no-policy or a basic policy (the basic policy is a common policy).
It should be noted that the target city and the areas 1 to 5 only provide an example for the embodiment of the present invention, and the embodiment of the present invention does not limit the target city and the partition thereof.
As shown in fig. 2, fig. 2 is a schematic view of each area of a target city according to an embodiment of the present invention, where the target city includes: region 1, region 2, region 3, region 4, and region 5, each region of the target city corresponding to the policy action for that region, the status of that region, and the region policy score, respectively.
In conjunction with the content shown in fig. 1, the server 12 may configure a target policy for the target city, where the target policy is used to characterize interaction rules between users in the target city, and in this embodiment of the present invention, the target policy is a set of policy actions in the area 1 to the area 5.
The server 12 may further collect observation data of the target city, where the observation data may include the policy action, the state, and the policy score, and then the server 12 may determine a target policy score of the target policy according to the observation data, and then the server 12 may determine a score difference between the target policy score and a preset policy score, and further determine whether the target policy is a beneficial policy according to the score difference.
In the embodiment of the present invention, the preset policy score is a score corresponding to a preset policy, and the preset policy may be a no-policy or a basic policy (the basic policy is a common policy).
With reference to the contents shown in fig. 1 and fig. 2, an embodiment of the present invention provides an application scenario of a target policy. The target city includes 5 areas (area 1 to area 5) in which a number of car-hailing vehicles are working; the terminal device 11 is the driver-side device used by each car-hailing driver, and the server 12 is the car-hailing policy platform.
Specifically, area 3 is the city center, where the supply of the car-hailing service exceeds demand, and area 5 is a suburb, where the supply of the car-hailing service is less than demand. The car-hailing policy platform may therefore execute a target policy for the target city, for example, executing an empty policy (i.e., not executing any policy action) in area 3 and executing an incentive policy in area 5 (e.g., a car-hailing driver who takes orders in area 5 may obtain an additional reward).
In this way, some car-hailing drivers will actively go to area 5 to take orders in order to obtain the additional reward, bringing the supply-demand relationships of area 3 and area 5 closer to equilibrium.
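The scenario above can be sketched as a per-region action map; the dictionary layout and the function name below are illustrative assumptions, not taken from the patent:

```python
# Per-region policy actions for the 5-region example city, where
# 1 = execute the region's policy action (e.g. an incentive) and
# 0 = execute the empty policy.
target_policy = {
    "region_1": 0,
    "region_2": 0,
    "region_3": 0,  # city center: supply exceeds demand, so no incentive
    "region_4": 0,
    "region_5": 1,  # suburb: supply below demand, so the incentive is active
}

def policy_action(region: str) -> int:
    """Return the {0,1} policy action for a region under the target policy."""
    return target_policy[region]
```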
In order to select the most preferable policy, a data processing method provided by the embodiment of the present invention will be described in detail below with reference to a specific implementation, where the method is applied to a server, and as shown in fig. 3, the specific steps are as follows:
in step 100, a target policy is determined.
The target strategy is used for representing interaction rules between users in the target city.
At step 200, observation data is determined based on the target policy.
The observation data is at least used for representing the strategy action of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city.
At step 300, a target policy score for the target policy is determined based on the observed data.
At step 400, a score difference between the target policy score and a preset policy score is determined.
The preset strategy score is a strategy score determined based on a preset strategy.
In step 500, in response to the score difference being positive, the target policy is determined to be a beneficial policy.
According to the embodiment of the invention, the server can determine the target policy score of the target policy based on the observation data of the target policy. Because this score can be used to evaluate the target policy, the server can judge the feasibility of the target policy through the target policy score; if the target policy is a beneficial policy, the server can execute the target policy for the target city. This improves the efficiency of policy adjustment, saves labor, and alleviates the problem of supply-demand mismatch.
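As a minimal illustrative sketch (not the patent's implementation), steps 400 and 500 reduce to a subtraction and a sign check once the two policy scores are available:

```python
def score_difference(target_policy_score: float, preset_policy_score: float) -> float:
    """Step 400: score difference between the target and preset policy scores."""
    return target_policy_score - preset_policy_score

def is_beneficial(target_policy_score: float, preset_policy_score: float) -> bool:
    """Step 500: the target policy is beneficial iff the difference is positive."""
    return score_difference(target_policy_score, preset_policy_score) > 0
```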
It should be further noted that the policy action of each region is used to characterize whether the policy action is performed in that region; the state of each region is used to characterize at least the supply and demand quantities, the supply-demand equilibrium, and the weather conditions of that region; and the regional policy score of each region is used to evaluate the feasibility of the policy executed in that region.
In the embodiment of the present invention, based on a multi-agent reinforcement learning (MARL) framework, the server may treat each region in the target city as an agent in the MARL, thereby determining the target policy score.
MARL is a distributed computing technique that can operate on the policies of all agents in the MARL to determine an overall optimal solution.
Specifically, in the process of determining the policy score (target policy score or preset policy score), the policy action, the state, and the regional policy score may be defined.
From a spatial perspective, a policy action may be defined as $A_i$, where $i$ is used to index a region in the target city, $A_i = 1$ characterizes executing the region's policy action in region $i$, and $A_i = 0$ characterizes not executing it.
Meanwhile, from the spatial perspective, the state may be defined as $S_i$, used to characterize the state of region $i$.
Further, the policy action of the target city may be expressed as $A = A_1 \times A_2 \times \cdots \times A_N = \{0,1\}^N$, and the state of the target city may be expressed as $S = S_1 \times S_2 \times \cdots \times S_N$.
From the perspective of time, the policy action history of all areas in the target city from time 0 to time $t$ can be defined as
$\bar{a}_t = (a_0, a_1, \ldots, a_t)$,
wherein $a_0, a_1, \ldots, a_t \in \{0,1\}^N$ is a sequence of $N$-dimensional vectors.
Further, from a time perspective, for each region $i \in \{1, \ldots, N\}$ in the target city, define $S^{*}_{i,t+1}(\bar{a}_t)$ as the state of region $i$ at time $t+1$ when the target city follows the policy history $\bar{a}_t$, and define $R^{*}_{i,t}(\bar{a}_t)$ as the regional policy score of region $i$ at time $t$ when the target city follows $\bar{a}_t$.
Further, the target policy of the target city may be defined as $\pi = (\pi_1, \pi_2, \ldots, \pi_N)^{T}$, wherein each $\pi_i$ is a binary function of the current state with $\pi_i(S_t) \in \{0,1\}$; under the policy $\pi$, region $i$ executes the policy action $\pi_i(S_t)$ at time $t$.
For the target policy $\pi$, $a^{\pi}_{0}$ is the initial policy action of the target policy $\pi$, and $\bar{a}^{\pi}_{t}$ is the historical policy action from the initial time to time $t$.
In the embodiment of the present invention, an expression of the target policy score may be determined based on the above definitions, specifically as follows:
$V_i(\pi) = \sum_{j \ge 0} \gamma^{j}\, \mathbb{E}\big[ R^{*}_{i,j}(\bar{a}^{\pi}_{j}) \big]$,
wherein $\bar{a}^{\pi}_{j}$ is used to characterize the target policy, $R^{*}_{i,j}(\bar{a}^{\pi}_{j})$ is used to characterize the policy score of region $i$ at time $j$ when the target city executes $\bar{a}^{\pi}_{j}$, $\gamma$ is a discount factor, and $V_i(\pi)$ is used to characterize the target policy score.
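As a minimal sketch, assuming the target policy score takes the common discounted-sum form over expected regional scores (the discount factor `gamma` and the reward sequence are illustrative assumptions, not values specified by the patent):

```python
def policy_score(region_rewards, gamma=0.9):
    """Discounted sum of a region's expected policy scores over time.

    region_rewards: sequence of expected regional policy scores at
    times j = 0, 1, ..., T (illustrative inputs).
    gamma: hypothetical discount factor in (0, 1).
    """
    return sum((gamma ** j) * r for j, r in enumerate(region_rewards))
```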
In one possible implementation, the server may determine the observation data based on at least a policy action of the first region at the first time, a status of the first region at the first time, a policy score of the policy action of the first region at the first time, and a status of the first region at each time within a preset time period.
The first area is used for representing an area in a target city, and the first time is the time within a preset time period.
Based on the above definitions, an optional implementation manner is provided in the embodiments of the present invention. Specifically, the observation data may be determined based on the following formula:
$P\big(A_t \mid \{A_{i,j}, S_{i,j}, R_{i,j}\}_{1 \le i \le N,\, 0 \le j < t} \cup \{S_{i,t}\}_{1 \le i \le N}\big)$,
wherein this expression characterizes the probability of $A_t$ conditioned on $\{A_{i,j}, S_{i,j}, R_{i,j}\}_{1 \le i \le N,\, 0 \le j < t} \cup \{S_{i,t}\}_{1 \le i \le N}$; $A_t$ is used to characterize the policy action the target city executes at time $t$, $i$ is used to characterize the area in the target city, $j$ is used to characterize the time, $A_{i,j}$ is used to characterize the policy action of region $i$ at time $j$, $S_{i,j}$ is used to characterize the state of region $i$ at time $j$, $R_{i,j}$ is used to characterize the policy score of region $i$ at time $j$, and $b$ is used to characterize the observation data of the target city.
That is, region $i$ may be used to represent the first region, time $j$ may be used to represent the first time, and the observation data may be expressed as $\{A_{i,t}, S_{i,t}, R_{i,t}\}_{1 \le i \le N,\, 0 \le t \le T}$; i.e., the observation data includes policy actions, states, and policy scores.
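To make the shape of the observation data concrete, the sketch below stores the per-region, per-time triples $\{A_{i,t}, S_{i,t}, R_{i,t}\}$; the field names and the example state contents (supply, demand, a weather code) are illustrative assumptions drawn from the states described earlier, not the patent's data format:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Observation:
    region: int                   # i, with 1 <= i <= N
    time: int                     # t, with 0 <= t <= T
    action: int                   # A_{i,t} in {0, 1}
    state: Tuple[int, int, int]   # S_{i,t}: e.g. (supply, demand, weather code)
    score: float                  # R_{i,t}, the regional policy score

# Observation data b = {A_{i,t}, S_{i,t}, R_{i,t}} for 1 <= i <= N, 0 <= t <= T
observations = [
    Observation(region=5, time=0, action=1, state=(30, 80, 0), score=0.4),
    Observation(region=3, time=0, action=0, state=(90, 60, 0), score=0.7),
]
```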
According to the embodiment of the invention, the server can determine the target strategy score of the target strategy by combining the time dimension and the space dimension by defining the strategy action, the state and the regional strategy score from the aspects of time and space, so that the evaluation result of the target strategy by the server can better accord with the actual situation.
In one possible implementation, to make the target policy score of the target policy more accurate, the target city policy action, status, and regional policy score may be conditionally defined.
Specifically, the target city satisfies a Consistency Assumption (CA), a Sequence Randomization Assumption (SRA), a Markov Assumption (MA), and a Conditional Mean Independence Assumption (CMIA).
The consistency hypothesis is used for representing the state of the first area at a first time and is related to the strategy action of the target city from a starting time to a second time, wherein the starting time is a preset time, and the second time is a time before the first time.
In conjunction with the above definitions, the consistency assumption in the embodiments of the present invention may be used to characterize that
$S_{i,t} = S^{*}_{i,t}(\bar{A}_{t-1})$
holds, wherein $\bar{A}_{t-1}$ is used to characterize the policy action history of the target city from time 0 to time $t-1$ in the observation data.
Wherein, time $t-1$ can be used to characterize the second time.
The sequence randomization assumption is used to characterize that the policy action of the target city at time $t$ is related only to the policy action history of the target city and the current state of the target city.
The Markov assumption is used to characterize that the state of the target city at the first time depends on the state of the target city at the second time and the policy action; that is, the state of the target city at time $t$ depends on the state of the target city at time $t-1$ and the policy action.
The conditional mean-independent hypothesis is used to characterize the policy score expectation corresponding to the first region based on a predetermined expectation algorithm.
The preset expectation algorithm includes the policy action and state corresponding to the first region; in combination with the above definitions, the preset expectation algorithm may be expressed as:
$\mathbb{E}\big[ R_{i,t} \mid \bar{A}_t, \bar{S}_t \big] = r_i(S_t, A_t)$,
wherein the left-hand side is used to characterize the policy score expectation, $\bar{A}_t$ is used to characterize the policy action history of the target city from time 0 to time $t$, and $r_i$ is used to characterize the expected policy score after the target city executes $A_t$ at time $t$.
According to the embodiment of the invention, the strategy action, the state and the regional strategy score of the target city meet the assumptions, so that the target strategy score determined by the server can be more targeted, and the target strategy score is more accurate.
After the server determines the observation data, it may proceed to determine the target policy score. In practical applications, however, because the target city may contain too many regions, the server may encounter the curse of dimensionality in the process of determining the policy score, causing a drastic increase in computational load.
In the embodiment of the present invention, the observation data of each region may be regarded as one dimension of a vector; in the process of calculating the target policy score, as the number of regions (i.e., the dimensionality) increases, the amount of computation required of the server grows exponentially, which causes the curse of dimensionality.
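The exponential growth is easy to see from the action space alone (a back-of-the-envelope illustration, not part of the patent):

```python
# Each of the N regions has a {0,1} policy action, so the joint action
# space of the target city is {0,1}^N; enumerating it grows as 2**N,
# which is the blow-up behind the curse of dimensionality.
def joint_action_count(n_regions: int) -> int:
    return 2 ** n_regions
```

For the 5-region example city this is 32 joint actions, but it already exceeds 10^15 at 50 regions.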
Therefore, in one possible implementation, to solve the problem of the dimensional disaster, the server may define, for each area, the area to be affected only by its neighboring areas in the calculation process, so as to solve the dimensional disaster.
In one possible implementation manner, the average state of each region in the target city may be determined based on an average function of the states of the regions in the target city. Specifically, denoting by N_i the neighborhood of region i (region i together with its neighboring regions), the average function of the states may be expressed as:

s̄_t^i = (1/|N_i|) · Σ_{j∈N_i} S_t^j

wherein s̄_t^i is the average function used for characterizing the states of the regions in the target city.
In another possible implementation manner, the average policy action of each region in the target city may be determined based on an average function of the policy actions of the regions in the target city. Specifically, the average function of the policy actions may be expressed as:

ā_t^i = (1/|N_i|) · Σ_{j∈N_i} A_t^j

wherein ā_t^i is the average function used for characterizing the policy actions of the regions in the target city.
Through the embodiment of the invention, the curse of dimensionality can be overcome by means of these average functions, after which the server can determine the target policy score of the target policy more effectively.
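The neighbor-averaging described above can be sketched as follows (the line-graph neighborhood and all values are hypothetical; the point is that each region's input shrinks to a fixed-size average regardless of how many regions the city has):

```python
def neighbor_average(values, neighbors, i):
    """Average of per-region values over region i's neighborhood,
    taken here to include region i itself.  Each region then interacts
    with the rest of the city only through this one-dimensional
    average, so the per-region input no longer grows with N."""
    idx = [i] + neighbors[i]
    return sum(values[j] for j in idx) / len(idx)

# Hypothetical 4-region line graph: 0 - 1 - 2 - 3
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
states = [10.0, 20.0, 30.0, 40.0]   # S_t^i per region
actions = [1, 0, 1, 1]              # A_t^i per region

avg_state_1 = neighbor_average(states, neighbors, 1)    # mean of regions 1, 0, 2
avg_action_1 = neighbor_average(actions, neighbors, 1)
```

The same helper serves for both the state average s̄_t^i and the policy-action average ā_t^i, so adding regions to the city only adds more calls, not more dimensions per call.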
Wherein, for each region i ∈ {1, …, N} in the target city, there is a region policy score R_t^i such that, for any state s and any policy action a ∈ {0, 1}^N, the region policy score depends on s and a only through the region's own state and action and the corresponding average functions.
Let p_b(·) and p_π(·) denote the distribution of S_t under policy b and policy π respectively, wherein policy b and policy π may be any policies.
Then let p_{i,π}(·) denote the marginal distribution, under p_π(·), of the state of region i and its neighboring regions, and let p_{i,b}(·) denote the corresponding marginal distribution under p_b(·), wherein a marginal distribution refers to, in probability theory and statistics, the probability distribution of only a subset of the variables of a multidimensional random variable.
Further, the weight

w_i(·) = p_{i,π}(·) / p_{i,b}(·)

can be determined, wherein w_i can be used to characterize the weight with which region i affects the state of the target city.
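A minimal illustration of such a weight as a ratio of marginal distributions, assuming a discretized state space and a crude histogram estimator in place of the deep learning fit used later in the text (all names and samples are hypothetical):

```python
from collections import Counter

def density_ratio(samples_pi, samples_b):
    """Estimate w(s) = p_pi(s) / p_b(s) on a discrete state space from
    samples drawn under the two policies.  This is a toy histogram
    estimator; ratios are formed only for states observed under b."""
    n_pi, n_b = len(samples_pi), len(samples_b)
    c_pi, c_b = Counter(samples_pi), Counter(samples_b)
    return {s: (c_pi[s] / n_pi) / (c_b[s] / n_b) for s in c_b}

# Hypothetical discretized marginal states of region i and its neighbors
w = density_ratio(samples_pi=["lo", "hi", "hi", "hi"],
                  samples_b=["lo", "lo", "hi", "hi"])
```

States that are more frequent under π than under b receive a weight above 1, which is how re-weighting logged data mimics the target policy's state distribution.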
Furthermore, the embodiment of the present invention provides two ways of determining the score of the target policy, that is, the present invention provides two ways of performing simulation on the target city, which are specifically as follows:
in one possible implementation, when the server determines the observation data, the target policy score may be determined based on the importance sampling model.
Specifically, the server may determine a target policy score of the target policy based on a preset first policy score algorithm, where the first policy score algorithm is constructed based on the importance sampling model.
Further, the process of determining, by the server, the target policy score of the target policy based on the first policy score algorithm may specifically be:
calculating a target strategy score of the target strategy based on the observation data and a first strategy score algorithm;
V̂(π) = (1/(N·T)) · Σ_{i=1..N} Σ_{t=1..T} ŵ_i · 1{(A_t^i, ā_t^i) is consistent with π} · R_t^i

wherein V̂(π) is used to characterize the target policy score, w is used to characterize the weight, ŵ_i is the estimated value of w_i and is determined by a preset deep learning algorithm, 1{·} is the indicator function, π is used to characterize the target policy, and ā_t^i is the policy action average function of region i in the target city.
Importance sampling is a method used in statistics for estimating properties of a distribution: samples are drawn from another distribution, different from the original one, and re-weighted so as to estimate the properties of the original distribution.
In the embodiment of the invention, the environment simulation can be carried out on the target city through the importance sampling model, and meanwhile, the environment simulation is the simulation of two dimensions of time and space, so that the target strategy score can be more real and accurate.
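The importance-sampling evaluation can be sketched in a toy, assumption-laden form (records, state keys, the deterministic target action, and the weights standing in for the estimated ŵ_i are all hypothetical):

```python
def is_policy_score(records, target_action, w_hat):
    """Importance-sampling sketch of a target policy score.

    records: (region, state_key, action, score) tuples logged under the
    behaviour policy.  Each regional score is kept only when the logged
    action matches the (deterministic) target policy — the indicator
    function — and is re-weighted by the state-density-ratio w_hat.
    """
    total = 0.0
    for _region, state_key, action, score in records:
        indicator = 1.0 if action == target_action else 0.0
        total += w_hat[state_key] * indicator * score
    return total / len(records)

records = [(0, "lo", 1, 4.0), (0, "hi", 0, 2.0), (1, "hi", 1, 6.0)]
v_hat = is_policy_score(records, target_action=1,
                        w_hat={"lo": 0.5, "hi": 1.5})
```

Only logged actions that agree with the target policy contribute, and the weights correct for the mismatch between the two policies' state distributions, giving an off-policy estimate without running the target policy in the real city.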
In another possible implementation, after the server determines the observation data, the target policy score may also be determined based on a doubly robust model.
Specifically, the server may determine the target policy score of the target policy based on a preset second policy score algorithm, wherein the second policy score algorithm is constructed based on the doubly robust model.
Further, the process of determining, by the server, the target policy score of the target policy based on the second policy score algorithm may specifically be:
calculating a target strategy score of the target strategy based on the observation data and a second strategy score algorithm;
V̂_DR(π) = (1/(N·T)) · Σ_{i=1..N} Σ_{t=1..T} [ Q(S_t^i, π) + ŵ_i · 1{A_t^i = π(S_t^i)} · (R_t^i − Q(S_t^i, A_t^i)) ]

wherein V is used to characterize the target policy score, Q is used to characterize the state-action value function (Q-function), w is used to characterize the weight, 1{·} is the indicator function, and π is used to characterize the target policy.
In the embodiment of the invention, robustness can be used to characterize tolerance to changes in the data. The server can then perform environment simulation on the target city through the doubly robust model and, because the environment simulation covers the two dimensions of time and space, the target policy score can be more realistic and accurate.
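A doubly robust evaluation step can be sketched in the same toy setting (the Q-function table and all values are hypothetical; the point is a model term plus an importance-weighted correction of its residual, so the estimate stays consistent if either the weights or the Q-function is accurate):

```python
def dr_policy_score(records, target_action, w_hat, q_hat):
    """Doubly robust sketch: the Q-function's model-based prediction
    under the target action, corrected by the importance-weighted
    residual between the observed score and the Q-function's
    prediction for the logged action."""
    total = 0.0
    for _region, state_key, action, score in records:
        model_term = q_hat[(state_key, target_action)]
        indicator = 1.0 if action == target_action else 0.0
        correction = w_hat[state_key] * indicator * (score - q_hat[(state_key, action)])
        total += model_term + correction
    return total / len(records)

records = [(0, "lo", 1, 4.0), (0, "hi", 0, 2.0), (1, "hi", 1, 6.0)]
q_hat = {("lo", 0): 1.0, ("lo", 1): 3.0, ("hi", 0): 2.0, ("hi", 1): 5.0}
v_dr = dr_policy_score(records, 1, {"lo": 0.5, "hi": 1.5}, q_hat)
```

When the Q-function is wrong, the weighted residual pulls the estimate back toward the data; when the weights are wrong, an accurate Q-function keeps the model term honest — hence "doubly" robust.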
When the server determines the target policy score, a score difference between the target policy score and a preset policy score may be determined, where the preset policy may be no policy or a policy being used by the target city.
Specifically, the server may determine a score difference between the target policy score and the preset policy score based on the policy score expectation corresponding to the target policy, the policy score expectation corresponding to the preset policy, and a preset score difference algorithm.
Wherein, the score difference may be determined based on the following formula:

ATE = (1/(N·T)) · Σ_{i=1..N} Σ_{j=1..T} ( E^{π1}[R_j^i] − E^{π0}[R_j^i] )

wherein ATE is used to characterize the score difference, π0 is used to characterize the preset policy, π1 is used to characterize the target policy, and E^{π}[R_j^i] is used to characterize the policy score of region i at time j when the target city executes policy π.
Further, in practical applications, after the server determines the score difference, it may determine, based on the score difference, whether the target policy is a beneficial policy.
If the target policy score is larger than the preset policy score, the target policy is a beneficial policy; that is, bringing the target policy online is beneficial to the supply and demand balance of the target city and helps solve the problem of supply-demand mismatch.
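The final decision reduces to a signed score difference, sketched below with hypothetical scores:

```python
def score_difference(target_score, preset_score):
    # ATE-style contrast: positive means the target policy improves
    # on the preset (or no-policy) baseline.
    return target_score - preset_score

def is_beneficial(target_score, preset_score):
    return score_difference(target_score, preset_score) > 0

diff = score_difference(5.0, 3.7)
beneficial = is_beneficial(5.0, 3.7)
```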
Based on the same technical concept, an embodiment of the present invention further provides a data processing apparatus, as shown in fig. 4, the apparatus includes: a target policy module 41, an observation data module 42, a target policy score module 43, a score difference module 44, and a determination module 45;
a target policy module 41, configured to determine a target policy, where the target policy is used to represent interaction rules between users in a target city;
an observation data module 42, configured to determine observation data based on the target policy, where the observation data is at least used to represent policy actions of each region in the target city, a state of each region in the target city, and a region policy score of each region in the target city;
a target policy score module 43, configured to determine a target policy score of the target policy based on the observation data;
a score difference module 44 configured to determine a score difference between the target policy score and a preset policy score, where the preset policy score is a policy score determined based on a preset policy; and
a determining module 45, configured to determine that the target policy is a beneficial policy in response to the score difference being a positive value.
Optionally, the observation data module 42 is specifically configured to:
and determining observation data at least based on the policy action of the first region at the first moment, the state of the first region at the first moment, the policy score of the policy action of the first region at the first moment and the state of the first region at each moment in a preset time period, wherein the first region is used for representing the region in the target city, and the first moment is the moment in the preset time period.
Optionally, the target city satisfies a consistency hypothesis, a sequence randomization hypothesis, a markov hypothesis, and a conditional mean-independent hypothesis;
the consistency hypothesis is used for representing the state of the first area at a first moment and is related to the strategy action of the target city from a starting moment to a second moment, wherein the starting moment is a preset moment, and the second moment is a moment before the first moment;
a sequence randomization assumption used for representing that the strategy action of the target city at the time t is related to the strategy action history of the target city and the current state of the target city;
a Markov assumption for characterizing the state of the target city at a first time instant, dependent on the state of the target city at a second time instant and the policy action;
and the conditional average independent hypothesis is used for representing that the strategy score expectation corresponding to the first region is determined based on a preset expectation algorithm, and the preset expectation algorithm comprises strategy actions and states corresponding to the first region.
Optionally, the target policy score module 43 is specifically configured to:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
Optionally, the target policy score module 43 is specifically configured to:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
Optionally, the score difference module 44 is specifically configured to:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
Optionally, the apparatus further comprises:
and the average state module is used for determining the average state of each area in the target city based on the average function of the state of each area in the target city, and determining the average strategy action of each area in the target city based on the average function of the strategy action of each area in the target city.
According to the embodiment of the invention, the server can determine the target strategy score of the target strategy based on the observation data of the target strategy, and the score can be used for evaluating the target strategy, so that the server can judge the feasibility of the target strategy through the target strategy score, and if the target strategy is a beneficial strategy, the server can execute the target strategy aiming at the target city, so that the strategy adjusting efficiency is improved, the manpower is saved, and the problem of supply and demand mismatch is solved.
Fig. 5 is a schematic diagram of an electronic device of an embodiment of the invention. As shown in fig. 5, the electronic device is a general-purpose computing device with a standard computer hardware structure, which includes at least a processor 51 and a memory 52. The processor 51 and the memory 52 are connected by a bus 53. The memory 52 is adapted to store instructions or programs executable by the processor 51. The processor 51 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 51 implements the processing of data and the control of other devices by executing the instructions stored in the memory 52, thereby performing the method flows of embodiments of the present invention as described above. The bus 53 connects the above components together and also connects them to a display controller 54, a display device, and an input/output (I/O) device 55. The input/output (I/O) device 55 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, or other device known in the art. Typically, the input/output device 55 is connected to the system through an input/output (I/O) controller 56.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device) or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps in the method of the above embodiments may be accomplished by specifying related hardware through a program, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps in the method of the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (16)

1. A method of data processing, the method comprising:
determining a target strategy, wherein the target strategy is used for representing interaction rules between users in a target city;
based on the target strategy, determining observation data, wherein the observation data is at least used for representing strategy actions of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city;
determining a target policy score for the target policy based on the observation data, the target city satisfying at least a Markov assumption, the Markov assumption being used to characterize that a state of the target city at a first time depends on a state of the target city at a second time and a policy action, the first time being a time within a preset time period and the second time being a time before the first time;
determining a score difference between the target policy score and a preset policy score, wherein the preset policy score is a policy score determined based on a preset policy; and
in response to the score difference being a positive value, determining the target policy to be a beneficial policy.
2. The method of claim 1, wherein determining observation data based on the target policy comprises:
determining the observation data at least based on a policy action of a first region at a first moment, a state of the first region at the first moment, a policy score of the policy action of the first region at the first moment and states of the first region at each moment in a preset time period, wherein the first region is used for representing a region in the target city.
3. The method of claim 2, wherein the target city further satisfies a consistency hypothesis, a sequence randomization hypothesis, and a conditional mean-independent hypothesis;
the consistency hypothesis is used for representing the state of the first area at the first time, and is related to the strategy action of the target city from a starting time to a second time, wherein the starting time is a preset time, and the second time is a time before the first time;
the sequence randomization hypothesis is used for representing that the strategy action of the target city at the time t is related to the strategy action history of the target city and the current state of the target city;
the conditional mean-independent hypothesis is used for representing that the strategy score expectation corresponding to the first region is determined based on a preset expectation algorithm, and the preset expectation algorithm comprises strategy actions and states corresponding to the first region.
4. The method of claim 3, wherein determining a target policy score for the target policy based on the observation data comprises:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
5. The method of claim 3, wherein determining a target policy score for the target policy based on the observation data comprises:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
6. The method of claim 3, wherein determining a score difference between the target policy score and a preset policy score comprises:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
7. The method of claim 2, further comprising:
and determining the average state of each region in the target city based on the average function of the state of each region in the target city, and determining the average strategy action of each region in the target city based on the average function of the strategy action of each region in the target city.
8. A data processing apparatus, characterized in that the apparatus comprises:
the target strategy module is used for determining a target strategy, and the target strategy is used for representing interaction rules between users in a target city;
the observation data module is used for determining observation data based on the target strategy, and the observation data is at least used for representing the strategy action of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city;
a target policy score module, configured to determine a target policy score of the target policy based on the observation data, where the target city at least satisfies a Markov assumption, the Markov assumption being used to characterize that a state of the target city at a first time depends on a state of the target city at a second time and a policy action, where the first time is a time within a preset time period and the second time is a time before the first time;
the score difference module is used for determining the score difference between the target strategy score and a preset strategy score, wherein the preset strategy score is a strategy score determined based on a preset strategy; and
a determination module, configured to determine that the target policy is a beneficial policy in response to the score difference being a positive value.
9. The apparatus of claim 8, wherein the observation data module is specifically configured to:
determining the observation data at least based on a policy action of a first region at a first moment, a state of the first region at the first moment, a policy score of the policy action of the first region at the first moment and states of the first region at each moment in a preset time period, wherein the first region is used for representing a region in the target city.
10. The apparatus of claim 9, wherein the target city further satisfies a consistency hypothesis, a sequence randomization hypothesis, and a conditional mean-independent hypothesis;
the consistency hypothesis is used for representing the state of the first area at the first time, and is related to the strategy action of the target city from a starting time to a second time, wherein the starting time is a preset time, and the second time is a time before the first time;
the sequence randomization hypothesis is used for representing that the strategy action of the target city at the time t is related to the strategy action history of the target city and the current state of the target city;
the conditional mean-independent hypothesis is used for representing that the strategy score expectation corresponding to the first region is determined based on a preset expectation algorithm, and the preset expectation algorithm comprises strategy actions and states corresponding to the first region.
11. The apparatus according to claim 10, wherein the target policy score module is specifically configured to:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
12. The apparatus according to claim 10, wherein the target policy score module is specifically configured to:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
13. The apparatus according to claim 10, wherein the score difference module is specifically configured to:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
14. The apparatus of claim 9, further comprising:
and the average state module is used for determining the average state of each area in the target city based on the average function of the state of each area in the target city, and determining the average strategy action of each area in the target city based on the average function of the strategy action of each area in the target city.
15. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-7.
16. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202011052216.4A 2020-09-29 2020-09-29 Data processing method and device, electronic equipment and readable storage medium Active CN112001570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011052216.4A CN112001570B (en) 2020-09-29 2020-09-29 Data processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011052216.4A CN112001570B (en) 2020-09-29 2020-09-29 Data processing method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112001570A CN112001570A (en) 2020-11-27
CN112001570B true CN112001570B (en) 2021-07-09

Family

ID=73475683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011052216.4A Active CN112001570B (en) 2020-09-29 2020-09-29 Data processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112001570B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112631517B (en) * 2020-12-24 2021-09-03 北京百度网讯科技有限公司 Data storage method and device, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101459965B (en) * 2007-12-12 2010-07-14 中国移动通信集团公司 Method, device and communication system for resource scheduling
CN101287157A (en) * 2008-05-27 2008-10-15 黄国灿 Bi-end satellite positioning communication and cab scheduling method and system by centralized scheduling
CN101969696B (en) * 2010-11-24 2012-09-26 武汉大学 Multi-data source resource distribution method for wireless Ad Hoc network
CN104537838B (en) * 2014-12-31 2017-02-01 哈尔滨工业大学 Link time delay dynamic prediction method which is used for highway and takes V2V in VANETs of intersection into consideration
CN104765643A (en) * 2015-03-25 2015-07-08 华迪计算机集团有限公司 Method and system for achieving hybrid scheduling of cloud computing resources
CN105825297A (en) * 2016-03-11 2016-08-03 山东大学 Markov-model-based position prediction method
US11599833B2 (en) * 2016-08-03 2023-03-07 Ford Global Technologies, Llc Vehicle ride sharing system and method using smart modules
WO2019113875A1 (en) * 2017-12-14 2019-06-20 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for optimizing order allocation
CN108389069A (en) * 2018-01-11 2018-08-10 国网山东省电力公司 Top-tier customer recognition methods based on random forest and logistic regression and device
CN111189471A (en) * 2018-11-14 2020-05-22 中移物联网有限公司 Correction method, correction device and computer storage medium
CN111310956A (en) * 2018-12-11 2020-06-19 北京嘀嘀无限科技发展有限公司 Method and device for determining scheduling strategy and electronic equipment
CN109948854B (en) * 2019-03-21 2022-07-01 华侨大学 Intercity network taxi booking order distribution method based on multi-objective optimization
CN110443517A (en) * 2019-08-12 2019-11-12 首约科技(北京)有限公司 It is a kind of to influence net Yue Che driver and go out the key index of vehicle enthusiasm to determine method

Also Published As

Publication number Publication date
CN112001570A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
US20200302322A1 (en) Machine learning system
Jin et al. Dynamic task pricing in multi-requester mobile crowd sensing with markov correlated equilibrium
US11157316B1 (en) Determining action selection policies of an execution device
CN111459993B (en) Configuration updating method, device, equipment and storage medium based on behavior analysis
US11700302B2 (en) Using reinforcement learning to scale queue-based services
CN103971170A (en) Method and device for forecasting changes of feature information
CN111461812A (en) Object recommendation method and device, electronic equipment and readable storage medium
CN111731326B (en) Obstacle avoidance strategy determination method and device and storage medium
CN113689699B (en) Traffic flow prediction method and device, electronic equipment and storage medium
CN112001570B (en) Data processing method and device, electronic equipment and readable storage medium
CN111061564A (en) Server capacity adjusting method and device and electronic equipment
CN114261400A (en) Automatic driving decision-making method, device, equipment and storage medium
CN108139930B (en) Resource scheduling method and device based on Q learning
CN111488527A (en) Position recommendation method and device, electronic equipment and computer-readable storage medium
Schuller et al. Towards heuristic optimization of complex service-based workflows for stochastic QoS attributes
Zhang et al. Home health care routing problem via off-line learning and neural network
CN110826695A (en) Data processing method, device and computer readable storage medium
CN108170404B (en) Web service combination verification method based on parameterized model
CN110516872A (en) A kind of information processing method, device, storage medium and electronic equipment
JP6608731B2 (en) Price setting device and price setting method
CN114358692A (en) Distribution time length adjusting method and device and electronic equipment
CN114489966A (en) Job scheduling method and device
CN112819507A (en) Service pushing method and device, electronic equipment and readable storage medium
CN108471362B (en) Resource allocation prediction technique and device
CN112991008A (en) Position recommendation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant