CN112001570B - Data processing method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN112001570B
CN112001570B (application CN202011052216.4A)
Authority
CN
China
Prior art keywords: target, strategy, score, policy, region
Prior art date
Legal status
Active
Application number
CN202011052216.4A
Other languages
Chinese (zh)
Other versions
CN112001570A (en)
Inventor
罗世楷
宋歌
朱宏图
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202011052216.4A
Publication of CN112001570A
Application granted
Publication of CN112001570B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06311Scheduling, planning or task assignment for a person or group
    • G06Q50/40

Abstract

The embodiment of the invention provides a data processing method, a data processing device, electronic equipment and a readable storage medium, and relates to the technical field of computers.

Description

Data processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a readable storage medium.
Background
Some types of services are associated with cities and are affected by regional supply-and-demand relationships, such as online car-hailing services. When supply in area A exceeds demand, car-hailing vehicles tend to leave area A; conversely, when supply in area A is less than demand, car-hailing vehicles tend to move toward area A.
In the prior art, a worker of a service platform can formulate a service strategy based on the attributes of a city, so that supply and demand parties of the city are reasonably distributed.
However, a city may contain areas where supply is greater than demand, areas where supply is less than demand, and areas where supply equals demand. A worker therefore needs to continuously adjust the policy for different areas, which makes the work inefficient; yet if a unified policy is executed in every area of the city, the supply-and-demand relationship in some areas may become even more unbalanced, resulting in supply-demand mismatch.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data processing method, an apparatus, an electronic device, and a readable storage medium, so as to improve the efficiency of policy adjustment, save labor, and solve the problem of supply and demand mismatch.
In a first aspect, a data processing method is provided, where the method is applied to a server, and the method includes:
determining a target strategy, wherein the target strategy is used for representing interaction rules between users in a target city;
based on the target strategy, determining observation data, wherein the observation data is at least used for representing strategy actions of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city;
determining a target policy score for the target policy based on the observation data;
determining a score difference between the target policy score and a preset policy score, wherein the preset policy score is a policy score determined based on a preset policy; and
in response to the score difference being a positive value, determining the target policy to be a beneficial policy.
Optionally, the determining observation data based on the target policy includes:
determining the observation data based on at least a policy action of a first region at a first time, a state of the first region at the first time, a policy score of the policy action of the first region at the first time, and states of the first region at each time within a preset time period, wherein the first region is used to represent a region in the target city, and the first time is a time within the preset time period.
Optionally, the target city satisfies a consistency hypothesis, a sequence randomization hypothesis, a Markov hypothesis, and a conditional mean-independence hypothesis;
the consistency hypothesis is used to characterize that the state of the first region at the first time is related to the policy actions of the target city from a starting time to a second time, wherein the starting time is a preset time and the second time is a time before the first time;
the sequence randomization hypothesis is used to characterize that the policy action of the target city at time t is related to the policy action history of the target city and the current state of the target city;
the Markov hypothesis is used to characterize that the state of the target city at the first time depends on the state of the target city at the second time and a policy action;
the conditional mean-independence hypothesis is used to characterize that the policy score expectation corresponding to the first region is determined based on a preset expectation algorithm, wherein the preset expectation algorithm includes the policy actions and states corresponding to the first region.
Optionally, the determining a target policy score of the target policy based on the observation data includes:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
Optionally, the determining a target policy score of the target policy based on the observation data includes:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
Optionally, the determining a score difference between the target policy score and a preset policy score includes:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
Optionally, the method further includes:
and determining the average state of each region in the target city based on the average function of the state of each region in the target city, and determining the average strategy action of each region in the target city based on the average function of the strategy action of each region in the target city.
In a second aspect, a data processing apparatus is provided, the apparatus being applied to a server, the apparatus comprising:
the target strategy module is used for determining a target strategy, and the target strategy is used for representing interaction rules between users in a target city;
the observation data module is used for determining observation data based on the target strategy, and the observation data is at least used for representing the strategy action of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city;
a target policy score module for determining a target policy score for the target policy based on the observation data;
the score difference module is used for determining the score difference between the target strategy score and a preset strategy score, wherein the preset strategy score is a strategy score determined based on a preset strategy; and
a determination module, configured to determine that the target policy is a beneficial policy in response to the score difference being a positive value.
Optionally, the observation data module is specifically configured to:
determining the observation data based on at least a policy action of a first region at a first time, a state of the first region at the first time, a policy score of the policy action of the first region at the first time, and states of the first region at each time within a preset time period, wherein the first region is used to represent a region in the target city, and the first time is a time within the preset time period.
Optionally, the target city satisfies a consistency hypothesis, a sequence randomization hypothesis, a Markov hypothesis, and a conditional mean-independence hypothesis;
the consistency hypothesis is used to characterize that the state of the first region at the first time is related to the policy actions of the target city from a starting time to a second time, wherein the starting time is a preset time and the second time is a time before the first time;
the sequence randomization hypothesis is used to characterize that the policy action of the target city at time t is related to the policy action history of the target city and the current state of the target city;
the Markov hypothesis is used to characterize that the state of the target city at the first time depends on the state of the target city at the second time and a policy action;
the conditional mean-independence hypothesis is used to characterize that the policy score expectation corresponding to the first region is determined based on a preset expectation algorithm, wherein the preset expectation algorithm includes the policy actions and states corresponding to the first region.
Optionally, the target policy score module is specifically configured to:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
Optionally, the target policy score module is specifically configured to:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
Optionally, the score difference module is specifically configured to:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
Optionally, the apparatus further comprises:
and the average state module is used for determining the average state of each area in the target city based on the average function of the state of each area in the target city, and determining the average strategy action of each area in the target city based on the average function of the strategy action of each area in the target city.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is used to store one or more computer program instructions, where the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium on which computer program instructions are stored, which when executed by a processor implement the method according to the first aspect.
According to the embodiment of the invention, the server can determine the target policy score of the target policy based on the observation data of the target policy. Because this score can be used to evaluate the target policy, the server can judge the feasibility of the target policy through the target policy score; if the target policy is a beneficial policy, the server can execute the target policy for the target city. This improves the efficiency of policy adjustment, saves labor, and alleviates the problem of supply-demand mismatch.
Drawings
The above and other objects, features and advantages of the embodiments of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present invention;
fig. 2 is a schematic diagram of areas of a target city according to an embodiment of the present invention;
fig. 3 is a flowchart of a data processing method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
As shown in fig. 1, fig. 1 is a data processing system according to an embodiment of the present invention, where the system includes: a terminal device 11 and a server 12.
The terminal device 11 may be a mobile terminal (e.g., a smart phone), or may be a vehicle-mounted terminal installed in a vehicle, and the server 12 may be a single server, or may be a server cluster configured in a distributed manner.
In one implementation, the data processing system shown in fig. 1 may be used for policy selection of a target city, the terminal device 11 may be a smart phone used by each car booking driver in the target city, and the server 12 may be a server of a car booking platform.
It should be noted that the terminal device 11 shown in fig. 1 is used to represent a plurality of terminals, the number of which is not limited to 3, and the number of the terminal devices 11 is not limited in the embodiment of the present invention.
The target city includes: region 1, region 2, region 3, region 4, and region 5, where each region of the target city corresponds to the policy action for that region, the status of that region, and the region policy score, respectively.
The server 12 may configure a target policy for the target city, where the target policy is used to characterize interaction rules between users in the target city; in the embodiment of the present invention, the target policy is a set of policy actions for region 1 to region 5.
The server 12 may collect observation data of the target city, where the observation data may include the above policy action, status, and policy score, then the server 12 may determine a target policy score of the target policy according to the observation data, then the server 12 may determine a score difference between the target policy score and a preset policy score, and then determine whether the target policy is a beneficial policy according to the score difference.
In the embodiment of the present invention, the preset policy score is a score corresponding to a preset policy, and the preset policy may be a no-policy or a basic policy (the basic policy is a common policy).
It should be noted that the target city and the areas 1 to 5 only provide an example for the embodiment of the present invention, and the embodiment of the present invention does not limit the target city and the partition thereof.
As shown in fig. 2, fig. 2 is a schematic view of each area of a target city according to an embodiment of the present invention, where the target city includes: region 1, region 2, region 3, region 4, and region 5, each region of the target city corresponding to the policy action for that region, the status of that region, and the region policy score, respectively.
In conjunction with the content shown in fig. 1, the server 12 may configure a target policy for the target city, where the target policy is used to characterize interaction rules between users in the target city, and in this embodiment of the present invention, the target policy is a set of policy actions in the area 1 to the area 5.
The server 12 may further collect observation data of the target city, where the observation data may include the policy action, the state, and the policy score, and then the server 12 may determine a target policy score of the target policy according to the observation data, and then the server 12 may determine a score difference between the target policy score and a preset policy score, and further determine whether the target policy is a beneficial policy according to the score difference.
In the embodiment of the present invention, the preset policy score is a score corresponding to a preset policy, and the preset policy may be a no-policy or a basic policy (the basic policy is a common policy).
With reference to the contents shown in fig. 1 and fig. 2, an embodiment of the present invention provides an application scenario of a target policy. The target city includes 5 areas (area 1 to area 5) in which a number of car-hailing vehicles are working; the terminal device 11 is the driver-side device used by each car-hailing driver, and the server 12 is the car-hailing policy platform.
Specifically, area 3 is the city center, where the supply of the car-hailing service exceeds demand, and area 5 is a suburb, where the supply of the car-hailing service is less than demand. The car-hailing policy platform may therefore execute a target policy for the target city, for example, executing an empty policy (i.e., not executing any policy action) in area 3 and executing an incentive policy in area 5 (e.g., a car-hailing driver who takes orders in area 5 may obtain an additional reward).
In this way, some car-hailing drivers will actively go to area 5 to take orders in order to obtain the additional reward, bringing the supply-demand relationships of area 3 and area 5 closer to equilibrium.
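The scenario above can be sketched as a per-region action map; the dictionary layout and the function name below are illustrative assumptions, not taken from the patent:

```python
# Per-region policy actions for the 5-region example city, where
# 1 = execute the region's policy action (e.g. an incentive) and
# 0 = execute the empty policy.
target_policy = {
    "region_1": 0,
    "region_2": 0,
    "region_3": 0,  # city center: supply exceeds demand, so no incentive
    "region_4": 0,
    "region_5": 1,  # suburb: supply below demand, so the incentive is active
}

def policy_action(region: str) -> int:
    """Return the {0,1} policy action for a region under the target policy."""
    return target_policy[region]
```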
In order to select the most preferable policy, a data processing method provided by the embodiment of the present invention will be described in detail below with reference to a specific implementation, where the method is applied to a server, and as shown in fig. 3, the specific steps are as follows:
in step 100, a target policy is determined.
The target strategy is used for representing interaction rules between users in the target city.
At step 200, observation data is determined based on the target policy.
The observation data is at least used for representing the strategy action of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city.
At step 300, a target policy score for the target policy is determined based on the observed data.
At step 400, a score difference between the target policy score and a preset policy score is determined.
The preset strategy score is a strategy score determined based on a preset strategy.
In step 500, in response to the score difference being positive, the target policy is determined to be a beneficial policy.
According to the embodiment of the invention, the server can determine the target policy score of the target policy based on the observation data of the target policy. Because this score can be used to evaluate the target policy, the server can judge the feasibility of the target policy through the target policy score; if the target policy is a beneficial policy, the server can execute the target policy for the target city. This improves the efficiency of policy adjustment, saves labor, and alleviates the problem of supply-demand mismatch.
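As a minimal illustrative sketch (not the patent's implementation), steps 400 and 500 reduce to a subtraction and a sign check once the two policy scores are available:

```python
def score_difference(target_policy_score: float, preset_policy_score: float) -> float:
    """Step 400: score difference between the target and preset policy scores."""
    return target_policy_score - preset_policy_score

def is_beneficial(target_policy_score: float, preset_policy_score: float) -> bool:
    """Step 500: the target policy is beneficial iff the difference is positive."""
    return score_difference(target_policy_score, preset_policy_score) > 0
```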
It should be further noted that the policy action of each region is used to characterize whether the policy action is performed in that region; the state of each region is used to characterize at least the supply and demand quantities, the supply-demand equilibrium, and the weather conditions of that region; and the regional policy score of each region is used to evaluate the feasibility of the policy executed in that region.
In the embodiment of the present invention, based on a multi-agent reinforcement learning (MARL) framework, the server may treat each region in the target city as an agent in the MARL, thereby determining the target policy score.
MARL is a distributed computing technique that can operate on the policies of all agents in the MARL to determine an overall optimal solution.
Specifically, in the process of determining the policy score (target policy score or preset policy score), the policy action, the state, and the regional policy score may be defined.
From a spatial perspective, a policy action may be defined as $A_i$, where $i$ is used to index a region in the target city, $A_i = 1$ characterizes executing the region's policy action in region $i$, and $A_i = 0$ characterizes not executing it.
Meanwhile, from the spatial perspective, the state may be defined as $S_i$, used to characterize the state of region $i$.
Further, the policy action of the target city may be expressed as $A = A_1 \times A_2 \times \cdots \times A_N = \{0,1\}^N$, and the state of the target city may be expressed as $S = S_1 \times S_2 \times \cdots \times S_N$.
From the perspective of time, the policy action history of all areas in the target city from time 0 to time $t$ can be defined as
$\bar{a}_t = (a_0, a_1, \ldots, a_t)$,
wherein $a_0, a_1, \ldots, a_t \in \{0,1\}^N$ is a sequence of $N$-dimensional vectors.
Further, from a time perspective, for each region $i \in \{1, \ldots, N\}$ in the target city, define $S^{*}_{i,t+1}(\bar{a}_t)$ as the state of region $i$ at time $t+1$ when the target city follows the policy history $\bar{a}_t$, and define $R^{*}_{i,t}(\bar{a}_t)$ as the regional policy score of region $i$ at time $t$ when the target city follows $\bar{a}_t$.
Further, the target policy of the target city may be defined as $\pi = (\pi_1, \pi_2, \ldots, \pi_N)^{T}$, wherein each $\pi_i$ is a binary function of the current state with $\pi_i(S_t) \in \{0,1\}$; under the policy $\pi$, region $i$ executes the policy action $\pi_i(S_t)$ at time $t$.
For the target policy $\pi$, $a^{\pi}_{0}$ is the initial policy action of the target policy $\pi$, and $\bar{a}^{\pi}_{t}$ is the historical policy action from the initial time to time $t$.
In the embodiment of the present invention, an expression of the target policy score may be determined based on the above definitions, specifically as follows:
$V_i(\pi) = \sum_{j \ge 0} \gamma^{j}\, \mathbb{E}\big[ R^{*}_{i,j}(\bar{a}^{\pi}_{j}) \big]$,
wherein $\bar{a}^{\pi}_{j}$ is used to characterize the target policy, $R^{*}_{i,j}(\bar{a}^{\pi}_{j})$ is used to characterize the policy score of region $i$ at time $j$ when the target city executes $\bar{a}^{\pi}_{j}$, $\gamma$ is a discount factor, and $V_i(\pi)$ is used to characterize the target policy score.
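As a minimal sketch, assuming the target policy score takes the common discounted-sum form over expected regional scores (the discount factor `gamma` and the reward sequence are illustrative assumptions, not values specified by the patent):

```python
def policy_score(region_rewards, gamma=0.9):
    """Discounted sum of a region's expected policy scores over time.

    region_rewards: sequence of expected regional policy scores at
    times j = 0, 1, ..., T (illustrative inputs).
    gamma: hypothetical discount factor in (0, 1).
    """
    return sum((gamma ** j) * r for j, r in enumerate(region_rewards))
```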
In one possible implementation, the server may determine the observation data based on at least a policy action of the first region at the first time, a status of the first region at the first time, a policy score of the policy action of the first region at the first time, and a status of the first region at each time within a preset time period.
The first area is used for representing an area in a target city, and the first time is the time within a preset time period.
Based on the above definitions, an optional implementation manner is provided in the embodiments of the present invention. Specifically, the observation data may be determined based on the following formula:
$P\big(A_t \mid \{A_{i,j}, S_{i,j}, R_{i,j}\}_{1 \le i \le N,\, 0 \le j < t} \cup \{S_{i,t}\}_{1 \le i \le N}\big)$,
wherein this expression characterizes the probability of $A_t$ conditioned on $\{A_{i,j}, S_{i,j}, R_{i,j}\}_{1 \le i \le N,\, 0 \le j < t} \cup \{S_{i,t}\}_{1 \le i \le N}$; $A_t$ is used to characterize the policy action the target city executes at time $t$, $i$ is used to characterize the area in the target city, $j$ is used to characterize the time, $A_{i,j}$ is used to characterize the policy action of region $i$ at time $j$, $S_{i,j}$ is used to characterize the state of region $i$ at time $j$, $R_{i,j}$ is used to characterize the policy score of region $i$ at time $j$, and $b$ is used to characterize the observation data of the target city.
That is, region $i$ may be used to represent the first region, time $j$ may be used to represent the first time, and the observation data may be expressed as $\{A_{i,t}, S_{i,t}, R_{i,t}\}_{1 \le i \le N,\, 0 \le t \le T}$; i.e., the observation data includes policy actions, states, and policy scores.
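To make the shape of the observation data concrete, the sketch below stores the per-region, per-time triples $\{A_{i,t}, S_{i,t}, R_{i,t}\}$; the field names and the example state contents (supply, demand, a weather code) are illustrative assumptions drawn from the states described earlier, not the patent's data format:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Observation:
    region: int                   # i, with 1 <= i <= N
    time: int                     # t, with 0 <= t <= T
    action: int                   # A_{i,t} in {0, 1}
    state: Tuple[int, int, int]   # S_{i,t}: e.g. (supply, demand, weather code)
    score: float                  # R_{i,t}, the regional policy score

# Observation data b = {A_{i,t}, S_{i,t}, R_{i,t}} for 1 <= i <= N, 0 <= t <= T
observations = [
    Observation(region=5, time=0, action=1, state=(30, 80, 0), score=0.4),
    Observation(region=3, time=0, action=0, state=(90, 60, 0), score=0.7),
]
```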
According to the embodiment of the invention, the server can determine the target strategy score of the target strategy by combining the time dimension and the space dimension by defining the strategy action, the state and the regional strategy score from the aspects of time and space, so that the evaluation result of the target strategy by the server can better accord with the actual situation.
In one possible implementation, to make the target policy score of the target policy more accurate, the target city policy action, status, and regional policy score may be conditionally defined.
Specifically, the target city satisfies a Consistency Assumption (CA), a Sequence Randomization Assumption (SRA), a Markov Assumption (MA), and a Conditional Mean Independence Assumption (CMIA).
The consistency hypothesis is used for representing the state of the first area at a first time and is related to the strategy action of the target city from a starting time to a second time, wherein the starting time is a preset time, and the second time is a time before the first time.
In conjunction with the above definitions, the consistency assumption in the embodiments of the present invention may be used to characterize that
$S_{i,t} = S^{*}_{i,t}(\bar{A}_{t-1})$
holds, wherein $\bar{A}_{t-1}$ is used to characterize the policy action history of the target city from time 0 to time $t-1$ in the observation data.
Wherein, time $t-1$ can be used to characterize the second time.
The sequence randomization assumption is used to characterize that the policy action of the target city at time $t$ is related only to the policy action history of the target city and the current state of the target city.
The Markov assumption is used to characterize that the state of the target city at the first time depends on the state of the target city at the second time and the policy action; that is, the state of the target city at time $t$ depends on the state of the target city at time $t-1$ and the policy action.
The conditional mean-independent hypothesis is used to characterize the policy score expectation corresponding to the first region based on a predetermined expectation algorithm.
The preset expectation algorithm includes the policy action and state corresponding to the first region; in combination with the above definitions, the preset expectation algorithm may be expressed as:
$\mathbb{E}\big[ R_{i,t} \mid \bar{A}_t, \bar{S}_t \big] = r_i(S_t, A_t)$,
wherein the left-hand side is used to characterize the policy score expectation, $\bar{A}_t$ is used to characterize the policy action history of the target city from time 0 to time $t$, and $r_i$ is used to characterize the expected policy score after the target city executes $A_t$ at time $t$.
According to the embodiment of the invention, the strategy action, the state and the regional strategy score of the target city meet the assumptions, so that the target strategy score determined by the server can be more targeted, and the target strategy score is more accurate.
After the server determines the observation data, it may proceed to determine the target policy score. In practical applications, however, because the target city may contain too many regions, the server may encounter the curse of dimensionality in the process of determining the policy score, causing a drastic increase in computational load.
In the embodiment of the present invention, the observation data of each region may be regarded as one dimension of a vector; in the process of calculating the target policy score, as the number of regions (i.e., the dimensionality) increases, the amount of computation required of the server grows exponentially, which causes the curse of dimensionality.
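The exponential growth is easy to see from the action space alone (a back-of-the-envelope illustration, not part of the patent):

```python
# Each of the N regions has a {0,1} policy action, so the joint action
# space of the target city is {0,1}^N; enumerating it grows as 2**N,
# which is the blow-up behind the curse of dimensionality.
def joint_action_count(n_regions: int) -> int:
    return 2 ** n_regions
```

For the 5-region example city this is 32 joint actions, but it already exceeds 10^15 at 50 regions.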
Therefore, in one possible implementation, to solve the problem of the dimensional disaster, the server may define, for each area, the area to be affected only by its neighboring areas in the calculation process, so as to solve the dimensional disaster.
In one possible implementation manner, the average state of each region in the target city may be determined based on an average function of the states of the regions in the target city. Specifically, denoting by N_i the neighborhood of region i (region i together with its neighboring regions), the average function of the states may be expressed as:

s̄_t^i = (1/|N_i|) · Σ_{j∈N_i} S_t^j

wherein s̄_t^i is the average function used for characterizing the states of the regions in the target city.
In another possible implementation manner, the average policy action of each region in the target city may be determined based on an average function of the policy actions of the regions in the target city. Specifically, the average function of the policy actions may be expressed as:

ā_t^i = (1/|N_i|) · Σ_{j∈N_i} A_t^j

wherein ā_t^i is the average function used for characterizing the policy actions of the regions in the target city.
Through the embodiment of the invention, the curse of dimensionality can be overcome by means of these average functions, after which the server can determine the target policy score of the target policy more effectively.
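The neighbor-averaging described above can be sketched as follows (the line-graph neighborhood and all values are hypothetical; the point is that each region's input shrinks to a fixed-size average regardless of how many regions the city has):

```python
def neighbor_average(values, neighbors, i):
    """Average of per-region values over region i's neighborhood,
    taken here to include region i itself.  Each region then interacts
    with the rest of the city only through this one-dimensional
    average, so the per-region input no longer grows with N."""
    idx = [i] + neighbors[i]
    return sum(values[j] for j in idx) / len(idx)

# Hypothetical 4-region line graph: 0 - 1 - 2 - 3
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
states = [10.0, 20.0, 30.0, 40.0]   # S_t^i per region
actions = [1, 0, 1, 1]              # A_t^i per region

avg_state_1 = neighbor_average(states, neighbors, 1)    # mean of regions 1, 0, 2
avg_action_1 = neighbor_average(actions, neighbors, 1)
```

The same helper serves for both the state average s̄_t^i and the policy-action average ā_t^i, so adding regions to the city only adds more calls, not more dimensions per call.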
Wherein, for each region i ∈ {1, …, N} in the target city, there is a region policy score R_t^i such that, for any state s and any policy action a ∈ {0, 1}^N, the region policy score depends on s and a only through the region's own state and action and the corresponding average functions.
Let p_b(·) and p_π(·) denote the distribution of S_t under policy b and policy π respectively, wherein policy b and policy π may be any policies.
Then let p_{i,π}(·) denote the marginal distribution, under p_π(·), of the state of region i and its neighboring regions, and let p_{i,b}(·) denote the corresponding marginal distribution under p_b(·), wherein a marginal distribution refers to, in probability theory and statistics, the probability distribution of only a subset of the variables of a multidimensional random variable.
Further, the weight

w_i(·) = p_{i,π}(·) / p_{i,b}(·)

can be determined, wherein w_i can be used to characterize the weight with which region i affects the state of the target city.
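A minimal illustration of such a weight as a ratio of marginal distributions, assuming a discretized state space and a crude histogram estimator in place of the deep learning fit used later in the text (all names and samples are hypothetical):

```python
from collections import Counter

def density_ratio(samples_pi, samples_b):
    """Estimate w(s) = p_pi(s) / p_b(s) on a discrete state space from
    samples drawn under the two policies.  This is a toy histogram
    estimator; ratios are formed only for states observed under b."""
    n_pi, n_b = len(samples_pi), len(samples_b)
    c_pi, c_b = Counter(samples_pi), Counter(samples_b)
    return {s: (c_pi[s] / n_pi) / (c_b[s] / n_b) for s in c_b}

# Hypothetical discretized marginal states of region i and its neighbors
w = density_ratio(samples_pi=["lo", "hi", "hi", "hi"],
                  samples_b=["lo", "lo", "hi", "hi"])
```

States that are more frequent under π than under b receive a weight above 1, which is how re-weighting logged data mimics the target policy's state distribution.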
Furthermore, the embodiment of the present invention provides two ways of determining the score of the target policy, that is, the present invention provides two ways of performing simulation on the target city, which are specifically as follows:
in one possible implementation, when the server determines the observation data, the target policy score may be determined based on the importance sampling model.
Specifically, the server may determine a target policy score of the target policy based on a preset first policy score algorithm, where the first policy score algorithm is constructed based on the importance sampling model.
Further, the process of determining, by the server, the target policy score of the target policy based on the first policy score algorithm may specifically be:
calculating a target strategy score of the target strategy based on the observation data and a first strategy score algorithm;
V̂(π) = (1/(N·T)) · Σ_{i=1..N} Σ_{t=1..T} ŵ_i · 1{(A_t^i, ā_t^i) is consistent with π} · R_t^i

wherein V̂(π) is used to characterize the target policy score, w is used to characterize the weight, ŵ_i is the estimated value of w_i and is determined by a preset deep learning algorithm, 1{·} is the indicator function, π is used to characterize the target policy, and ā_t^i is the policy action average function of region i in the target city.
Importance sampling is a method used in statistics for estimating properties of a distribution: samples are drawn from another distribution, different from the original one, and re-weighted so as to estimate the properties of the original distribution.
In the embodiment of the invention, the environment simulation can be carried out on the target city through the importance sampling model, and meanwhile, the environment simulation is the simulation of two dimensions of time and space, so that the target strategy score can be more real and accurate.
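The importance-sampling evaluation can be sketched in a toy, assumption-laden form (records, state keys, the deterministic target action, and the weights standing in for the estimated ŵ_i are all hypothetical):

```python
def is_policy_score(records, target_action, w_hat):
    """Importance-sampling sketch of a target policy score.

    records: (region, state_key, action, score) tuples logged under the
    behaviour policy.  Each regional score is kept only when the logged
    action matches the (deterministic) target policy — the indicator
    function — and is re-weighted by the state-density-ratio w_hat.
    """
    total = 0.0
    for _region, state_key, action, score in records:
        indicator = 1.0 if action == target_action else 0.0
        total += w_hat[state_key] * indicator * score
    return total / len(records)

records = [(0, "lo", 1, 4.0), (0, "hi", 0, 2.0), (1, "hi", 1, 6.0)]
v_hat = is_policy_score(records, target_action=1,
                        w_hat={"lo": 0.5, "hi": 1.5})
```

Only logged actions that agree with the target policy contribute, and the weights correct for the mismatch between the two policies' state distributions, giving an off-policy estimate without running the target policy in the real city.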
In another possible implementation, after the server determines the observation data, the target policy score may also be determined based on a doubly robust model.
Specifically, the server may determine the target policy score of the target policy based on a preset second policy score algorithm, wherein the second policy score algorithm is constructed based on the doubly robust model.
Further, the process of determining, by the server, the target policy score of the target policy based on the second policy score algorithm may specifically be:
calculating a target strategy score of the target strategy based on the observation data and a second strategy score algorithm;
V̂_DR(π) = (1/(N·T)) · Σ_{i=1..N} Σ_{t=1..T} [ Q(S_t^i, π) + ŵ_i · 1{A_t^i = π(S_t^i)} · (R_t^i − Q(S_t^i, A_t^i)) ]

wherein V is used to characterize the target policy score, Q is used to characterize the state-action value function (Q-function), w is used to characterize the weight, 1{·} is the indicator function, and π is used to characterize the target policy.
In the embodiment of the invention, robustness can be used to characterize tolerance to changes in the data. The server can then perform environment simulation on the target city through the doubly robust model and, because the environment simulation covers the two dimensions of time and space, the target policy score can be more realistic and accurate.
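A doubly robust evaluation step can be sketched in the same toy setting (the Q-function table and all values are hypothetical; the point is a model term plus an importance-weighted correction of its residual, so the estimate stays consistent if either the weights or the Q-function is accurate):

```python
def dr_policy_score(records, target_action, w_hat, q_hat):
    """Doubly robust sketch: the Q-function's model-based prediction
    under the target action, corrected by the importance-weighted
    residual between the observed score and the Q-function's
    prediction for the logged action."""
    total = 0.0
    for _region, state_key, action, score in records:
        model_term = q_hat[(state_key, target_action)]
        indicator = 1.0 if action == target_action else 0.0
        correction = w_hat[state_key] * indicator * (score - q_hat[(state_key, action)])
        total += model_term + correction
    return total / len(records)

records = [(0, "lo", 1, 4.0), (0, "hi", 0, 2.0), (1, "hi", 1, 6.0)]
q_hat = {("lo", 0): 1.0, ("lo", 1): 3.0, ("hi", 0): 2.0, ("hi", 1): 5.0}
v_dr = dr_policy_score(records, 1, {"lo": 0.5, "hi": 1.5}, q_hat)
```

When the Q-function is wrong, the weighted residual pulls the estimate back toward the data; when the weights are wrong, an accurate Q-function keeps the model term honest — hence "doubly" robust.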
When the server determines the target policy score, a score difference between the target policy score and a preset policy score may be determined, where the preset policy may be no policy or a policy being used by the target city.
Specifically, the server may determine a score difference between the target policy score and the preset policy score based on the policy score expectation corresponding to the target policy, the policy score expectation corresponding to the preset policy, and a preset score difference algorithm.
Wherein, the score difference may be determined based on the following formula:

ATE = (1/(N·T)) · Σ_{i=1..N} Σ_{j=1..T} ( E^{π1}[R_j^i] − E^{π0}[R_j^i] )

wherein ATE is used to characterize the score difference, π0 is used to characterize the preset policy, π1 is used to characterize the target policy, and E^{π}[R_j^i] is used to characterize the policy score of region i at time j when the target city executes policy π.
Further, in practical applications, after the server determines the score difference, it may determine, based on the score difference, whether the target policy is a beneficial policy.
If the target policy score is larger than the preset policy score, the target policy is a beneficial policy; that is, bringing the target policy online is beneficial to the supply and demand balance of the target city and helps solve the problem of supply-demand mismatch.
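The final decision reduces to a signed score difference, sketched below with hypothetical scores:

```python
def score_difference(target_score, preset_score):
    # ATE-style contrast: positive means the target policy improves
    # on the preset (or no-policy) baseline.
    return target_score - preset_score

def is_beneficial(target_score, preset_score):
    return score_difference(target_score, preset_score) > 0

diff = score_difference(5.0, 3.7)
beneficial = is_beneficial(5.0, 3.7)
```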
Based on the same technical concept, an embodiment of the present invention further provides a data processing apparatus, as shown in fig. 4, the apparatus includes: a target policy module 41, an observation data module 42, a target policy score module 43, a score difference module 44, and a determination module 45;
a target policy module 41, configured to determine a target policy, where the target policy is used to represent interaction rules between users in a target city;
an observation data module 42, configured to determine observation data based on the target policy, where the observation data is at least used to represent policy actions of each region in the target city, a state of each region in the target city, and a region policy score of each region in the target city;
a target policy score module 43, configured to determine a target policy score of the target policy based on the observation data;
a score difference module 44 configured to determine a score difference between the target policy score and a preset policy score, where the preset policy score is a policy score determined based on a preset policy; and
a determining module 45, configured to determine that the target policy is a beneficial policy in response to the score difference being a positive value.
Optionally, the observation data module 42 is specifically configured to:
and determining observation data at least based on the policy action of the first region at the first moment, the state of the first region at the first moment, the policy score of the policy action of the first region at the first moment and the state of the first region at each moment in a preset time period, wherein the first region is used for representing the region in the target city, and the first moment is the moment in the preset time period.
Optionally, the target city satisfies a consistency hypothesis, a sequence randomization hypothesis, a markov hypothesis, and a conditional mean-independent hypothesis;
the consistency hypothesis is used for representing the state of the first area at a first moment and is related to the strategy action of the target city from a starting moment to a second moment, wherein the starting moment is a preset moment, and the second moment is a moment before the first moment;
a sequence randomization assumption used for representing that the strategy action of the target city at the time t is related to the strategy action history of the target city and the current state of the target city;
a Markov assumption for characterizing the state of the target city at a first time instant, dependent on the state of the target city at a second time instant and the policy action;
and the conditional average independent hypothesis is used for representing that the strategy score expectation corresponding to the first region is determined based on a preset expectation algorithm, and the preset expectation algorithm comprises strategy actions and states corresponding to the first region.
Optionally, the target policy score module 43 is specifically configured to:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
Optionally, the target policy score module 43 is specifically configured to:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
Optionally, the score difference module 44 is specifically configured to:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
Optionally, the apparatus further comprises:
and the average state module is used for determining the average state of each area in the target city based on the average function of the state of each area in the target city, and determining the average strategy action of each area in the target city based on the average function of the strategy action of each area in the target city.
According to the embodiment of the invention, the server can determine the target strategy score of the target strategy based on the observation data of the target strategy, and the score can be used for evaluating the target strategy, so that the server can judge the feasibility of the target strategy through the target strategy score, and if the target strategy is a beneficial strategy, the server can execute the target strategy aiming at the target city, so that the strategy adjusting efficiency is improved, the manpower is saved, and the problem of supply and demand mismatch is solved.
Fig. 5 is a schematic diagram of an electronic device of an embodiment of the invention. As shown in fig. 5, the electronic device is a general-purpose computing device with a standard computer hardware structure, which includes at least a processor 51 and a memory 52. The processor 51 and the memory 52 are connected by a bus 53. The memory 52 is adapted to store instructions or programs executable by the processor 51. The processor 51 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 51 implements the processing of data and the control of other devices by executing the instructions stored in the memory 52, thereby performing the method flows of embodiments of the present invention as described above. The bus 53 connects the above components together and also connects them to a display controller 54, a display device, and an input/output (I/O) device 55. The input/output (I/O) device 55 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, or other device known in the art. Typically, the input/output device 55 is connected to the system through an input/output (I/O) controller 56.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device) or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps in the method of the above embodiments may be accomplished by specifying related hardware through a program, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps in the method of the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (16)

1. A method of data processing, the method comprising:
determining a target strategy, wherein the target strategy is used for representing interaction rules between users in a target city;
based on the target strategy, determining observation data, wherein the observation data is at least used for representing strategy actions of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city;
determining a target policy score for the target policy based on the observation data, the target city satisfying at least a Markov assumption, the Markov assumption being used to characterize that a state of the target city at a first time depends on a state of the target city at a second time and a policy action, the first time being a time within a preset time period and the second time being a time before the first time;
determining a score difference between the target policy score and a preset policy score, wherein the preset policy score is a policy score determined based on a preset policy; and
in response to the score difference being a positive value, determining the target policy to be a beneficial policy.
2. The method of claim 1, wherein determining observation data based on the target policy comprises:
determining the observation data at least based on a policy action of a first region at a first moment, a state of the first region at the first moment, a policy score of the policy action of the first region at the first moment and states of the first region at each moment in a preset time period, wherein the first region is used for representing a region in the target city.
3. The method of claim 2, wherein the target city further satisfies a consistency hypothesis, a sequence randomization hypothesis, and a conditional mean-independent hypothesis;
the consistency hypothesis is used for representing the state of the first area at the first time, and is related to the strategy action of the target city from a starting time to a second time, wherein the starting time is a preset time, and the second time is a time before the first time;
the sequence randomization hypothesis is used for representing that the strategy action of the target city at the time t is related to the strategy action history of the target city and the current state of the target city;
the conditional mean-independent hypothesis is used for representing that the strategy score expectation corresponding to the first region is determined based on a preset expectation algorithm, and the preset expectation algorithm comprises strategy actions and states corresponding to the first region.
4. The method of claim 3, wherein determining a target policy score for the target policy based on the observation data comprises:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
5. The method of claim 3, wherein determining a target policy score for the target policy based on the observation data comprises:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
6. The method of claim 3, wherein determining a score difference between the target policy score and a preset policy score comprises:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
7. The method of claim 2, further comprising:
and determining the average state of each region in the target city based on the average function of the state of each region in the target city, and determining the average strategy action of each region in the target city based on the average function of the strategy action of each region in the target city.
8. A data processing apparatus, characterized in that the apparatus comprises:
the target strategy module is used for determining a target strategy, and the target strategy is used for representing interaction rules between users in a target city;
the observation data module is used for determining observation data based on the target strategy, and the observation data is at least used for representing the strategy action of each region in the target city, the state of each region in the target city and the regional strategy score of each region in the target city;
a target policy score module, configured to determine a target policy score of the target policy based on the observation data, where the target city at least satisfies a Markov assumption, the Markov assumption being used to characterize that a state of the target city at a first time depends on a state of the target city at a second time and a policy action, where the first time is a time within a preset time period and the second time is a time before the first time;
the score difference module is used for determining the score difference between the target strategy score and a preset strategy score, wherein the preset strategy score is a strategy score determined based on a preset strategy; and
a determination module, configured to determine that the target policy is a beneficial policy in response to the score difference being a positive value.
9. The apparatus of claim 8, wherein the observation data module is specifically configured to:
determining the observation data at least based on a policy action of a first region at a first moment, a state of the first region at the first moment, a policy score of the policy action of the first region at the first moment and states of the first region at each moment in a preset time period, wherein the first region is used for representing a region in the target city.
10. The apparatus of claim 9, wherein the target city further satisfies a consistency hypothesis, a sequence randomization hypothesis, and a conditional mean-independent hypothesis;
the consistency hypothesis is used for representing the state of the first area at the first time, and is related to the strategy action of the target city from a starting time to a second time, wherein the starting time is a preset time, and the second time is a time before the first time;
the sequence randomization hypothesis is used for representing that the strategy action of the target city at the time t is related to the strategy action history of the target city and the current state of the target city;
the conditional mean-independent hypothesis is used for representing that the strategy score expectation corresponding to the first region is determined based on a preset expectation algorithm, and the preset expectation algorithm comprises strategy actions and states corresponding to the first region.
11. The apparatus according to claim 10, wherein the target policy score module is specifically configured to:
and determining a target strategy score of the target strategy based on a preset first strategy score algorithm, wherein the first strategy score algorithm is constructed based on an importance sampling model.
12. The apparatus according to claim 10, wherein the target policy score module is specifically configured to:
and determining a target strategy score of the target strategy based on a preset second strategy score algorithm, wherein the second strategy score algorithm is constructed based on a model with robustness.
13. The apparatus according to claim 10, wherein the score difference module is specifically configured to:
and determining the score difference between the target strategy score and the preset strategy score based on the strategy score expectation corresponding to the target strategy, the strategy score expectation corresponding to the preset strategy and a preset score difference algorithm.
14. The apparatus of claim 9, further comprising:
and the average state module is used for determining the average state of each area in the target city based on the average function of the state of each area in the target city, and determining the average strategy action of each area in the target city based on the average function of the strategy action of each area in the target city.
15. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-7.
16. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202011052216.4A 2020-09-29 2020-09-29 Data processing method and device, electronic equipment and readable storage medium Active CN112001570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011052216.4A CN112001570B (en) 2020-09-29 2020-09-29 Data processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011052216.4A CN112001570B (en) 2020-09-29 2020-09-29 Data processing method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112001570A CN112001570A (en) 2020-11-27
CN112001570B true CN112001570B (en) 2021-07-09

Family

ID=73475683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011052216.4A Active CN112001570B (en) 2020-09-29 2020-09-29 Data processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112001570B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112631517B (en) * 2020-12-24 2021-09-03 北京百度网讯科技有限公司 Data storage method and device, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101459965B (en) * 2007-12-12 2010-07-14 中国移动通信集团公司 Method, device and communication system for resource scheduling
CN101287157A (en) * 2008-05-27 2008-10-15 黄国灿 Bi-end satellite positioning communication and cab scheduling method and system by centralized scheduling
CN101969696B (en) * 2010-11-24 2012-09-26 武汉大学 Multi-data source resource distribution method for wireless Ad Hoc network
CN104537838B (en) * 2014-12-31 2017-02-01 哈尔滨工业大学 Link time delay dynamic prediction method which is used for highway and takes V2V in VANETs of intersection into consideration
CN104765643A (en) * 2015-03-25 2015-07-08 华迪计算机集团有限公司 Method and system for achieving hybrid scheduling of cloud computing resources
CN105825297A (en) * 2016-03-11 2016-08-03 山东大学 Markov-model-based position prediction method
US11599833B2 (en) * 2016-08-03 2023-03-07 Ford Global Technologies, Llc Vehicle ride sharing system and method using smart modules
WO2019113875A1 (en) * 2017-12-14 2019-06-20 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for optimizing order allocation
CN108389069A (en) * 2018-01-11 2018-08-10 国网山东省电力公司 Top-tier customer recognition methods based on random forest and logistic regression and device
CN111189471A (en) * 2018-11-14 2020-05-22 中移物联网有限公司 Correction method, correction device and computer storage medium
CN111310956A (en) * 2018-12-11 2020-06-19 北京嘀嘀无限科技发展有限公司 Method and device for determining scheduling strategy and electronic equipment
CN109948854B (en) * 2019-03-21 2022-07-01 华侨大学 Intercity network taxi booking order distribution method based on multi-objective optimization
CN110443517A (en) * 2019-08-12 2019-11-12 首约科技(北京)有限公司 It is a kind of to influence net Yue Che driver and go out the key index of vehicle enthusiasm to determine method

Also Published As

Publication number Publication date
CN112001570A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
US20200302322A1 (en) Machine learning system
Jin et al. Dynamic task pricing in multi-requester mobile crowd sensing with markov correlated equilibrium
US11157316B1 (en) Determining action selection policies of an execution device
CN111459993B (en) Configuration updating method, device, equipment and storage medium based on behavior analysis
US11700302B2 (en) Using reinforcement learning to scale queue-based services
CN103971170A (en) Method and device for forecasting changes of feature information
CN111461812A (en) Object recommendation method and device, electronic equipment and readable storage medium
CN111731326B (en) Obstacle avoidance strategy determination method and device and storage medium
CN113689699B (en) Traffic flow prediction method and device, electronic equipment and storage medium
CN112001570B (en) Data processing method and device, electronic equipment and readable storage medium
CN111061564A (en) Server capacity adjusting method and device and electronic equipment
CN114261400A (en) Automatic driving decision-making method, device, equipment and storage medium
CN108139930B (en) Resource scheduling method and device based on Q learning
CN111488527A (en) Position recommendation method and device, electronic equipment and computer-readable storage medium
Schuller et al. Towards heuristic optimization of complex service-based workflows for stochastic QoS attributes
Zhang et al. Home health care routing problem via off-line learning and neural network
CN110826695A (en) Data processing method, device and computer readable storage medium
CN108170404B (en) Web service combination verification method based on parameterized model
CN110516872A (en) A kind of information processing method, device, storage medium and electronic equipment
JP6608731B2 (en) Price setting device and price setting method
CN114358692A (en) Distribution time length adjusting method and device and electronic equipment
CN114489966A (en) Job scheduling method and device
CN112819507A (en) Service pushing method and device, electronic equipment and readable storage medium
CN108471362B (en) Resource allocation prediction technique and device
CN112991008A (en) Position recommendation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant