CN111563662A - Service quality evaluation system and method based on time-window-based deep reinforcement learning - Google Patents

Publication number: CN111563662A (application CN202010298848.2A; granted as CN111563662B)
Authority: CN (China)
Prior art keywords: model, service, learning, reinforcement learning, service quality
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202010298848.2A
Other languages: Chinese (zh)
Other versions: CN111563662B (en)
Inventors: 孙雁飞, 陈根鑫, 亓晋, 许斌, 王堃
Current Assignee: Nanjing University of Posts and Telecommunications (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications
Priority: CN202010298848.2A
Publications: CN111563662A (application), CN111563662B (grant)
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395 Quality analysis or management
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a service quality evaluation system based on time-window-based deep reinforcement learning, which comprises a data acquisition module, a model adjustment module, a reward feedback module, a parallel learning module, a Q table updating module, a periodic iteration module, a prediction learning module and a time window adjustment module.

Description

Service quality evaluation system and method based on time-window-based deep reinforcement learning
Technical Field
The invention relates to service quality evaluation, and in particular to a service quality evaluation system and evaluation method based on time-window deep reinforcement learning.
Background
With the continuous development of China's economy, the service industry is booming. Whether in the traditional physical economy, such as supermarkets and hospitals, or in the online internet economy, such as e-commerce platforms (Taobao, Jingdong, etc.) and digital media platforms (Weibo, Douyin, etc.), the service ecosystem rests on an enormous number of service-providing objects: supermarket outlets, medical networks, Taobao merchants, video producers, program producers and so on. For such a huge number of service-providing objects, the urgent problem is how an operator should evaluate their service quality so that the evaluation result reflects the real situation. Such an evaluation distinguishes service-providing objects by their service quality; on the one hand it forms a good feedback mechanism, letting providers adjust according to the evaluation result, and on the other hand it gives consumers a reference so that they can obtain better services.
In conventional service quality assessment, one approach is experience-based: according to practitioners' experience, a subset of supporting indexes is selected and weights are assigned to fill in an empirical model, which is then applied to service quality assessment, with subsequent services provided according to the result. However, experience-based evaluation cannot guarantee a match with the actual situation, so evaluation deviation is possible; for large-scale service-providing objects, such deviation causes great loss and misleading results, disturbing the normal service order. The other approach is mathematical modeling, in which a model is constructed through large-scale data analysis and then applied to service evaluation. A mathematical model, however, cannot dynamically track changes in the actual situation. Since the evaluation should change dynamically with reality and customer demand, the fixed evaluation system produced by mathematical modeling cannot be updated dynamically and cannot keep up with the actual service quality, so this method also has shortcomings. Service quality evaluation therefore still needs improvement.
The prior art discloses a method and device for training a service quality assessment model (application number 201110436065.7) for evaluating the service quality of sellers in an online mall. The training method grades the quality evaluation objects according to parameter information, collects the corresponding features in each grade, obtains the corresponding feature weights, extracts the feature data of the sellers in the online mall, establishes different service quality evaluation models, and uses the models to score the sellers' service quality, the score reflecting the comprehensive quality of the evaluation objects. Its disadvantage is that the relatively fixed assessment model cannot update its parameters dynamically and accurately, which easily leads to poor timeliness and low evaluation accuracy and hinders dynamic model updating and service quality maintenance.
Disclosure of Invention
The invention aims to provide a service quality evaluation system based on time-window deep reinforcement learning, which uses the interactivity and decision-making capability of reinforcement learning together with the perception capability of deep learning to solve the low accuracy and poor timeliness of existing quality evaluation methods.
The purpose of the invention is realized as follows: a service quality assessment system based on time-window deep reinforcement learning comprises:
a data acquisition module: the system is used for collecting various support data of a service evaluation object, including multi-dimensional support data of the service, related data of a service provider and evaluation data of a consumer, and providing a data source for evaluating the service quality;
a model adjustment module: the method is used for constructing a basic quality evaluation model and designing a reinforcement learning model adjustment action set;
a parallel learning module: the system is used for realizing the parallel learning of multiple service providers;
a reward feedback module: for designing and aggregating the lagged rewards of evaluation model adjustments;
a Q table update module: for updating the Q table according to the update formula of the Q-learning algorithm of reinforcement learning:

Q(S, A) ← Q(S, A) + α [ R + γ · max_{A'} Q(S', A') − Q(S, A) ]

wherein α is the learning rate, γ is the discount factor, Q(S, A) is the expected reward obtained by taking model adjustment action A in the current evaluation model state S, and max_{A'} Q(S', A') is the maximized reward obtained by taking the reward-maximizing adjustment action in the next state S' reached after the current evaluation model takes action A;
a periodic iteration module: the method is used for controlling the iteration of the reinforcement learning period adjusted based on the service quality evaluation model;
a prediction learning module: rewards for predicting adjustment actions made under different service quality assessment model states;
a time window adjustment module: the method is used for predicting the comprehensive model evaluation effect based on the time window under different conditions and service quality evaluation objects.
As a further limitation of the present invention, the model adjustment module specifically functions as follows: first, a basic quality evaluation model is built from experience or a mathematical model and used as the starting point of reinforcement learning; the learning result of the previous period serves as the basic quality evaluation model at the starting point of the subsequent period. Each evaluation model adjustment result, together with the basic evaluation model, becomes part of the reinforcement learning state set. The service quality evaluation result of each service-providing object is dynamically updated based on the dynamic quality evaluation model applied to it. On the basis of the basic model, a reinforcement learning action set is designed, and the basic adjustment actions and their amplitudes are determined by the structure of the basic model, so that a model adjustment action cannot produce too large a deviation; this realizes progressive model adjustment and reduces the loss caused by adjusting the model too fast.
As a further limitation of the present invention, the parallel learning module function further comprises: adopting a traversing action-combination mode, i.e., when the data volume allows the universality of the reinforcement learning result to be guaranteed, within a certain model adjustment action specification, assigning a suitable number of evaluation model adjustment action combinations to portions of the service-providing objects, thereby obtaining generalized rewards under various actions and states and completing several learning action combinations, with their rewards, within one learning action period; after the first round of reward feedback, applying the remaining portions of service-providing objects to the previous round's model adjustment results, realizing the second round of multi-action parallel quality evaluation model adjustment; and repeating over multiple rounds, the parallel learning stopping near the end of the reinforcement learning, when the action combinations have multiplied to the point that each portion of service providers has few combinations left to execute from its own state, or when an iteration period is completed.
As a further limitation of the present invention, the periodic iteration module specifically functions as follows: the reinforcement learning period iteration based on service quality evaluation model adjustment is controlled by setting the learning time or number of learning rounds per period, and the optimal service quality evaluation model among all service providers is applied to all service-providing objects in time, improving reinforcement learning efficiency and ensuring the timeliness and uniformity with which the quality evaluation model adjustment effect is applied.
as a further limitation of the present invention, the function of the prediction learning module is specifically as follows: according to the reinforcement learning historical data of a certain period, taking contents such as model adjustment actions in different evaluation model states as input data, taking a corresponding Q value as a label, and adopting an automatic encoder to perform dimension reduction on the input data to train a deep neural network model; setting a suitable model adjustment dynamic time window based on prediction learning, stopping interactive reinforcement learning in the time window, adopting a predictive reinforcement learning model adjustment scheme based on an evaluation model before the time window, adjusting a service quality evaluation model based on a maximized Q value, and applying the adjusted model to service evaluation
As a further limitation of the present invention, the time window adjustment module specifically functions as follows: first, a suitable prediction-learning-based model adjustment dynamic time window is set manually; then a deep neural network model is trained on historical data and the comprehensive effect of the evaluation models under each set time window length, to predict the window-based comprehensive model evaluation effect for service quality evaluation objects under different conditions; the length of the prediction-learning time window is then dynamically adjusted on the principle of maximizing the model evaluation effect.
A service quality assessment method based on time-window deep reinforcement learning comprises the following steps:
the method comprises the following steps: step one: for a specific service industry, dynamically acquiring service quality evaluation support data in real time by the data acquisition module, such as the contents of service reliability indexes, security indexes and responsiveness indexes, and quantifying them;
step two: according to the collected service quality evaluation support data, a model adjusting module is utilized to design a basic service quality evaluation model according to experience or a mathematical model, the basic service quality evaluation model is applied to all service providing objects to evaluate the service quality of the service providing objects, and the dynamic service evaluation result is fed back to a service participant in real time;
step three: designing a model adjusting action based on the basic service quality evaluation model by using a model adjusting module, determining a reinforced learning state starting point, and updating a service quality evaluation result of the service provided by a service providing object in real time;
step four: dividing the service-providing objects into several equal portions by using the parallel learning module, and distributing the model adjustment action combinations generated in step three to one portion of them, so as to generate more finely adjusted model states;
step five: obtaining the model adjustment action reward of each service providing object in the previous step based on the original model state by utilizing a reward feedback module;
step six: utilizing a Q table updating module to calculate a Q value according to the original service quality evaluation model state and model adjusting action reward of each service providing object, and updating a corresponding reward expectation formed by combining the evaluation model state and the evaluation model adjusting action in a Q table;
step seven: judging by the period iteration module whether the reinforcement learning period has finished; if not, taking a suitable number of portions from the remaining service-providing objects and, combining them with the portions used in the previous step and the service quality evaluation model states those portions adopted, continuing the reinforcement learning process of steps four to six; if finished, going to the next step;
step eight: applying the learning result of the service quality evaluation model of the previous period to all service providing objects by using a period iteration module, and determining whether to start a reinforcement learning process of a new period from the step two;
step nine: training a deep neural network to predict different action rewards under the current service quality evaluation model state by using a prediction learning module under the condition of periodical data accumulation based on reinforcement learning, and realizing the evaluation model adjustment of maximizing prediction rewards based on the deep neural network model;
step ten: controlling the duration of step nine by using the time window adjustment module, i.e., controlling the alternation of the two model adjustment schemes based on reinforcement learning and deep reinforcement learning; training a deep neural network on historical data, the corresponding model evaluation effects and other related data under each set time window length, so as to predict the model evaluation effect corresponding to different time window lengths under different conditions; further setting, based on the deep neural network model, the deep reinforcement learning time window length that maximizes the evaluation effect; and after the time window ends, continuing with the interactive reinforcement learning scheme on the basis of the model adjustment results obtained within the window.
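The loop of steps three to seven can be sketched end-to-end in Python. This is a minimal illustration, not the patented method itself: the provider and action names are invented, and a seeded random number stands in for the lagged benefit reward that in practice arrives days or months after an adjustment.

```python
import random

def quality_learning_period(providers, actions, rounds=3, alpha=0.1, gamma=0.9):
    """Partition providers across rounds, assign one adjustment action to each
    provider in the round's batch, collect a (simulated) lagged reward, and
    apply the tabular Q-learning update for the (state, action) pair."""
    random.seed(0)  # deterministic for the sketch
    q = {}
    state = "base_model"
    per_round = max(1, len(providers) // rounds)
    for r in range(rounds):
        batch = providers[r * per_round:(r + 1) * per_round]
        for provider, action in zip(batch, actions):
            reward = random.uniform(-0.1, 0.2)  # stand-in lagged benefit reward
            best_next = max(q.get(("adjusted", a), 0.0) for a in actions)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q

q_table = quality_learning_period([f"p{i}" for i in range(9)],
                                  ["w_up", "w_down", "keep"])
```

Each round corresponds to one pass through steps four to six; the period iteration module of step seven would decide how many such rounds to run.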
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
(1) the invention uses a reinforcement learning algorithm to realize large-scale dynamic interaction between the service quality evaluation model and actual service benefits, ensuring the accuracy and timeliness of service quality evaluation;
(2) the invention uses a periodic micro-action parallel reinforcement learning algorithm: parallelism improves learning efficiency despite the long reward lag, micro-actions keep the trial-and-error cost of reinforcement learning under control, and the period ensures rapid and uniform application of the learning results. Overall, this widens the practical application scenarios of reinforcement learning;
(3) model adjustment is realized by fusing deep learning and reinforcement learning within a time window: based on reinforcement learning historical data, deep learning predicts the reinforcement learning reward, realizing a model adjustment scheme based on deep reinforcement learning that further reduces trial-and-error cost in real scenarios and improves efficiency. A deep neural network predicts the service quality assessment effect implied by each time window length, and the length is selected to maximize the predicted effect, ensuring the accuracy of the deep reinforcement learning;
(4) the accurate and timely service quality assessment model obtained through reinforcement learning helps improve the service quality of service-providing objects, the management efficiency of service industry operators, the cost performance for consumers and the maintenance of normal service industry order.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
A service quality evaluation system based on time window deep reinforcement learning comprises a data acquisition module, a model adjusting module, a reward feedback module, a parallel learning module, a Q table updating module, a periodic iteration module, a prediction learning module and a time window adjusting module, wherein the specific contents of the modules are as follows:
(1) a data acquisition module:
the system is used for collecting various support data of a service evaluation object, including multi-dimensional support data (such as cost performance, product quality, service attitude and the like) of the service, related data (such as service scale, integrity record and the like) of a service provider, evaluation data (such as satisfaction, poor evaluation rate and the like) of a consumer and the like, and providing a data source for evaluating the service quality.
(2) A model adjustment module:
The module is used for constructing a basic quality evaluation model and designing the reinforcement learning model adjustment action set. Since reinforcement learning in a real setting cannot start from zero, a basic quality evaluation model must first be constructed from experience or a mathematical model and used as the starting point of reinforcement learning; the learning result of the previous period serves as the basic quality evaluation model at the starting point of the subsequent period. Each evaluation model adjustment result, together with the basic evaluation model, becomes part of the reinforcement learning state set. The service quality evaluation result of each service-providing object is dynamically updated based on the dynamic quality evaluation model applied to it.
On the basis of the basic model, the reinforcement learning action set is designed, and the basic model adjustment actions and their amplitudes are determined by the basic model structure, such as support parameter weight adjustment (increasing or decreasing a certain parameter weight) and evaluation model structure adjustment, so that a model adjustment action cannot produce too large a deviation; this realizes progressive model adjustment and reduces the loss caused by adjusting the model too fast.
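As an illustrative, non-limiting sketch of such an action set, the following Python fragment builds one small increase and one small decrease action per support parameter of a simple linear evaluation model, then applies an action with renormalisation. The parameter names and the step amplitude are hypothetical, not taken from the patent.

```python
# Hypothetical linear quality-evaluation model: weighted support parameters.
BASE_WEIGHTS = {"reliability": 0.4, "security": 0.3, "responsiveness": 0.3}
STEP = 0.05  # small bounded amplitude keeps each adjustment progressive

def build_action_set(weights, step=STEP):
    """One 'increase' and one 'decrease' action per supporting parameter."""
    actions = []
    for name in weights:
        actions.append((name, +step))
        actions.append((name, -step))
    return actions

def apply_action(weights, action):
    """Return a new weight dict with one parameter nudged, then renormalised."""
    name, delta = action
    adjusted = dict(weights)
    adjusted[name] = max(0.0, adjusted[name] + delta)
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()}

actions = build_action_set(BASE_WEIGHTS)
new_w = apply_action(BASE_WEIGHTS, ("reliability", +STEP))
```

Bounding the step amplitude is what prevents a single adjustment action from producing too large a deviation, as the module requires.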
(3) A parallel learning module:
The module realizes parallel learning across multiple service providers. Traditional single-agent learning requires rapid interaction with the environment to obtain a good learning effect, but the reward for a service quality evaluation model adjustment action takes a long time to materialize, days or even months, so the efficiency of reinforcement learning must be improved by exploiting the sheer data magnitude of the service providers. Parallel learning is therefore adopted: within one learning action period, the service-providing objects are divided into several portions, and quality evaluation model adjustment actions of different types or different stages (for example, directly distributing several consecutive actions) are assigned to different portions, making up for the long reward feedback time by the advantage of quantity.
in consideration of the huge magnitude of data in the early stage of reinforcement learning, a mode of traversing action combinations is adopted, namely under the condition that the data volume allows the universality of reinforcement learning results to be ensured, in a certain model adjusting action specification, a proper number of evaluation model adjusting action combinations (including one or a proper number of action combinations executable under the current evaluation model) of part of service providing objects are allocated, so that the generalization rewards under various actions and states are obtained, and a plurality of learning action combinations are completed and the rewards are obtained in one learning action cycle.
After the first round of reward feedback, applying the rest part of service providing objects to the model adjusting result of the previous round to realize the multi-action parallel quality evaluation model adjustment of the second round; and repeating multiple rounds, and stopping the parallel learning when the action combination explodes to the end of the reinforcement learning until less action combinations are executed by all parts of service providers based on the states of the parts of service providers or an iteration cycle is completed.
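A minimal sketch of the partition-and-assign step described above; the provider names, the round-robin split and the action-combination labels are assumptions made for illustration only.

```python
def partition(providers, n_parts):
    """Split providers into n_parts near-equal groups (round-robin)."""
    groups = [[] for _ in range(n_parts)]
    for i, p in enumerate(providers):
        groups[i % n_parts].append(p)
    return groups

def assign_actions(groups, action_combos):
    """Pair each group of providers with one candidate action combination."""
    return {tuple(g): combo for g, combo in zip(groups, action_combos)}

providers = [f"provider_{i}" for i in range(10)]
groups = partition(providers, 3)
assignment = assign_actions(groups, [("w_up",), ("w_down",), ("restructure",)])
```

Each group then runs its combination for one learning action period, so several rewards are gathered in parallel instead of sequentially.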
(4) A reward feedback module:
The module designs and aggregates the lagged rewards of evaluation model adjustments. The reinforcement learning reward must be designed to steer toward the learning goal, so that the sequence of evaluation model adjustment actions continuously trends toward the real service quality. The service benefits after an evaluation model adjustment (such as turnover, sales volume, number of services and service frequency) are therefore used as the reward. Because benefit rewards cannot be fed back in real time after an adjustment, a delayed reward time is set according to how long the adjustment takes to produce an actual effect, and after each adjustment the change in service benefit over this period is counted. When the service benefit grows, the evaluation result after the adjustment action is closer to the real service quality, and a positive reward proportional to the benefit growth is assigned to that adjustment action in that evaluation model state; conversely, when the service benefit decays, the gap between the evaluation result and the real service quality has widened, and a negative reward proportional to the benefit decay is assigned.
Aggregating the lagged rewards requires collecting benefit change data from multiple portions of service-providing objects. According to the evaluation model adjustment action combinations assigned to the different portions, the average benefit change of the service-providing objects in each portion is collected, and from it the generalized reward for each action combination taken in the previous model state is obtained.
(5) Q table updating module
According to the update formula of the Q table in the Q-learning algorithm of reinforcement learning:

Q(S, A) ← Q(S, A) + α [ R + γ · max_{A'} Q(S', A') − Q(S, A) ]

where α is the learning rate, γ is the discount factor, Q(S, A) is the expected reward value achieved by taking the model adjustment action A in the current evaluation model state S, and max_{A'} Q(S', A') is the maximized reward gained by taking the reward-maximizing adjustment action in the next state S' reached after action A. The reinforcement learning parameters are designed according to the actual situation, and the Q table is dynamically updated based on this formula, thereby obtaining the model adjustment action combination that maximizes the expected reward within a learning period.
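The update formula can be exercised directly as tabular code. In this sketch the state and action labels ("base_model", "w_up", etc.) are hypothetical; the update line implements exactly the formula above.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_A' Q(S',A') - Q(S,A))."""
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
v = q_update(q, "base_model", "w_up", 0.12, "model_v1", ["w_up", "w_down"])
```

With an empty table, the first update yields 0.1 * 0.12 = 0.012 for the pair ("base_model", "w_up").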
(6) A periodic iteration module:
By setting the learning time or number of learning rounds per period, the module controls the reinforcement learning period iteration based on service quality evaluation model adjustment, and applies the optimal service quality evaluation model among all service providers to all service-providing objects in time, improving reinforcement learning efficiency and ensuring the timeliness and uniformity with which the quality evaluation model adjustment effect is applied.
(7) A prediction learning module:
According to the reinforcement learning historical data of a certain period, the model adjustment actions under different evaluation model states are taken as input data and the corresponding Q values as labels; an autoencoder reduces the dimensionality of the input data, and a deep neural network model is trained to predict the reward of adjustment actions under different service quality evaluation model states. A suitable prediction-learning-based model adjustment dynamic time window is set; within the window, interactive reinforcement learning stops, a predictive model adjustment scheme based on the evaluation model preceding the window is adopted, the service quality evaluation model is adjusted to maximize the predicted Q value, and the adjusted model is applied to service evaluation.
(8) A time window adjustment module:
First, a suitable prediction-learning-based model adjustment dynamic time window is set manually; then a deep neural network model is trained on historical data and the comprehensive effect of the evaluation models under each set time window length, to predict the window-based comprehensive model evaluation effect for service quality evaluation objects under different conditions; the length of the prediction-learning time window is then dynamically adjusted on the principle of maximizing the model evaluation effect.
As shown in fig. 1, a service quality assessment method based on time-window-based deep reinforcement learning specifically includes the following steps:
Step one: for a specific service industry, the data acquisition module is used to dynamically acquire service quality assessment support data in real time, such as the contents of service reliability indexes, security indexes and responsiveness indexes, and to quantify them.
Step two: according to the collected service quality evaluation support data, a model adjusting module is utilized to design a basic service quality evaluation model according to experience or mathematical models, the basic service quality evaluation model is applied to all service providing objects to evaluate the service quality of the service providing objects, and the dynamic service evaluation result is fed back to service participants such as service providers, service enjoyers, service supervisors and the like in real time.
Step three: and designing a model adjusting action based on the basic service quality evaluation model by using a model adjusting module, determining a starting point of a reinforcement learning state, and updating a service quality evaluation result of the service provided by the service providing object in real time.
Step four: and dividing the service providing object into a plurality of equal parts by using a parallel learning module, and distributing the model adjusting action combination generated in the step three to one part of the service providing object so as to generate more model micro-adjusting states.
Step five: and acquiring the model adjustment action reward of each part of service providing object in the previous step based on the original model state by using a reward feedback module.
Step six: and utilizing a Q table updating module to calculate a Q value according to the original service quality evaluation model state and the model adjusting action reward of each service providing object, and updating the corresponding reward expectation formed by combining the evaluation model state and the evaluation model adjusting action in the Q table.
Step seven: and judging whether the reinforcement learning period is finished or not by using a period iteration module, if not, taking a proper amount of parts in the rest service providing objects, and continuing the reinforcement learning process from the fourth step to the sixth step by combining the service object providing part of the previous step and the service quality evaluation model state adopted by the service object providing part. If the process is finished, the next step is carried out.
Step eight: and applying the learning result of the service quality evaluation model of the previous period to all service providing objects by using a period iteration module, and determining whether to start a reinforcement learning process of a new period from the step two.
Step nine: and training the deep neural network to predict different action rewards in the current service quality evaluation model state by using a prediction learning module under the condition of periodical data accumulation based on reinforcement learning, and realizing the evaluation model adjustment of maximizing the prediction rewards based on the deep neural network model.
Step ten: and controlling the time length of the step nine, namely controlling the iteration of two model adjusting schemes based on reinforcement learning and deep reinforcement learning by using a time window adjusting module. And training the deep neural network under the condition of setting historical data and corresponding model evaluation effects and other related data based on the time window length so as to predict the model evaluation effects corresponding to different time window lengths under different conditions. And further realizing the length setting of the deep reinforcement learning time window for maximizing the evaluation effect based on the deep neural network model. And after the time window is finished, continuously adopting an interactive reinforcement learning scheme on the basis of the model adjustment result in the time window.
The technical scheme of the invention is explained in further detail below with reference to the accompanying drawings:
Take e-commerce service quality assessment as an example. Suppose there are one million merchants, that is, service providing objects, whose service quality assessment is supported by the indexes X1, X2, and X3, and whose basic service quality assessment model, built from experience or mathematical modeling, is Y = 0.5·X1 + 0.2·X2 − 6·X3. This model is to be adjusted dynamically by reinforcement learning, with the learning period set to four weeks.
With Y = 0.5·X1 + 0.2·X2 − 6·X3 as the initial evaluation model state, the model adjustment action set is {increase the X1 weight by 0.001; decrease the X1 weight by 0.001; increase the X2 weight by 0.001; decrease the X2 weight by 0.001; increase the X3 weight by 0.001; decrease the X3 weight by 0.001}. The lagged reward of a model adjustment is defined as follows: if, one week after the adjustment, turnover has risen by more than 0.001%, the percentage increase is taken as the reward; if turnover has fallen by more than 0.001%, the percentage decrease is taken as a negative reward.
Suppose the one million merchants are divided into equal-sized groups. First, twenty groups execute action combinations from the model adjustment action set: six groups execute the six basic actions, and the remaining fourteen groups execute superpositions of those basic actions, for example decreasing the X2 weight while increasing the X1 weight by 0.001, so that the evaluation model parameters are adjusted in parallel.
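A minimal sketch of this parallel assignment: the base weights and the ±0.001 step come from the example, while the bookkeeping names and the particular two-action superpositions chosen are illustrative assumptions:

```python
import itertools

# Base model Y = 0.5*X1 + 0.2*X2 - 6*X3 and the six basic adjustments.
BASE = {"x1": 0.5, "x2": 0.2, "x3": -6.0}
STEP = 0.001
BASIC_ACTIONS = [(k, s) for k in BASE for s in (+STEP, -STEP)]   # 6 actions

def apply_actions(weights, actions):
    """Return a new weight dict with each (index, delta) action applied."""
    w = dict(weights)
    for key, delta in actions:
        w[key] = round(w[key] + delta, 6)
    return w

# Six groups try the basic actions; fourteen further groups try two-action
# superpositions, e.g. raise the X1 weight while lowering the X2 weight.
assignments = [[a] for a in BASIC_ACTIONS]
assignments += [list(p) for p in itertools.combinations(BASIC_ACTIONS, 2)][:14]

# One adjusted evaluation model per merchant group, explored in parallel.
group_models = [apply_actions(BASE, acts) for acts in assignments]
```

Each group then applies its adjusted model for a week, after which the lagged turnover reward of each assignment is collected.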
After one week, the reward earned by each merchant group's executed model adjustment action combination, relative to the evaluation model it was applied to, is computed, and the Q table is updated using the Q-table update formula.
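The Q-table update used here follows the standard Q-learning rule quoted in the claims; the tuple-of-weights state encoding and the action names below are assumptions for illustration:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9          # learning rate, discount factor
Q = defaultdict(float)           # (state, action) -> expected reward

ACTIONS = ("x1+", "x1-", "x2+", "x2-", "x3+", "x3-")

def update_q(state, action, reward, next_state):
    """Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

s0 = (0.5, 0.2, -6.0)            # initial evaluation model weights
s1 = (0.501, 0.2, -6.0)          # after increasing the X1 weight
update_q(s0, "x1+", 0.002, s1)   # lagged turnover reward of +0.002%
```

Starting from an empty table, this single update stores ALPHA × reward = 0.0002 for the (s0, "x1+") pair; later visits blend in the discounted best Q value of the successor state.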
Based on the adjusted models, actions from the model adjustment action set continue to be applied to the service quality evaluation models for interactive learning, and a further hundred groups of the remaining merchants join the learning process from the latest evaluation model states, realizing more model adjustment schemes.
After four weeks, the service quality assessment model reached by the action sequence with the largest reward is selected according to the learned Q table. If that sequence has changed the model from Y1 = 0.5·X1 + 0.2·X2 − 6·X3 to Y2 = 0.501·X1 + 0.203·X2 − 5.008·X3, the timelier and more accurate post-learning assessment model Y2 is applied to the service quality assessment of all merchants. With Y2 as the initial state of the next round of learning, that is, as the basic assessment model, the multi-period dynamic learning, adjustment, and assessment process based on the evaluation model is started.
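For concreteness, the sketch below evaluates one merchant under the pre-learning model Y1 and the post-learning model Y2 from this example; the merchant's index values are invented:

```python
# Linear quality model of the example: Y = w1*X1 + w2*X2 + w3*X3.
def quality(w1, w2, w3, x1, x2, x3):
    return w1 * x1 + w2 * x2 + w3 * x3

x1, x2, x3 = 80.0, 90.0, 2.5                     # hypothetical merchant indexes

y1 = quality(0.5, 0.2, -6.0, x1, x2, x3)         # base model Y1
y2 = quality(0.501, 0.203, -5.008, x1, x2, x3)   # learned model Y2
```

For these index values, Y1 gives 40 + 18 − 15 = 43.0 while Y2 gives 40.08 + 18.27 − 12.52 = 45.83, showing how the re-weighted model shifts a merchant's score.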
Based on these model adjustment behaviors, the action and reward sets corresponding to each model state are extracted. With the rewards as labels, and the model states, the corresponding action combinations, and data characterizing external influences on the service (such as market activity) as inputs, a deep neural network is trained to predict the maximum reward of the different model adjustment actions in the different model states under different conditions; within a certain time window (for example, one week), the service quality evaluation model is then adjusted by taking the action with the maximum predicted reward.
After the time window ends, the reinforcement learning strategy resumes from the evaluation model reached in the last window, and a time window adjustment model is trained from the historical window-length settings, related service data, and the corresponding service quality evaluation effects, realizing deep-learning-based dynamic adjustment of the time window and safeguarding the accuracy of the predictive deep reinforcement learning strategy.
The above description is only one embodiment of the present invention, but the scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein falls within the scope of the present invention, which is therefore defined by the protection scope of the claims.

Claims (7)

1. A service quality assessment system based on time-window deep reinforcement learning, comprising:
a data acquisition module: the system is used for collecting various support data of a service evaluation object, including multi-dimensional support data of the service, related data of a service provider and evaluation data of a consumer, and providing a data source for evaluating the service quality;
a model adjustment module: the method is used for constructing a basic quality evaluation model and designing a reinforcement learning model adjustment action set;
a parallel learning module: the system is used for realizing the parallel learning of multiple service providers;
a reward feedback module: for designing and aggregating the lagged rewards of evaluation model adjustments;
a Q table update module: for updating the Q table according to the update formula of the Q-learning algorithm of reinforcement learning:
Q(S, A) ← Q(S, A) + α[R + γ·max_{A'} Q(S', A') − Q(S, A)]
where α is the learning rate, γ is the discount factor, R is the lagged reward of the adjustment, Q(S, A) is the expected reward obtained by taking model adjustment action A in the current assessment model state S, and max_{A'} Q(S', A') is the maximized reward obtained, in the next state S' reached from the current assessment model by action A, by taking the reward-maximizing model adjustment action;
a periodic iteration module: the method is used for controlling the iteration of the reinforcement learning period adjusted based on the service quality evaluation model;
a prediction learning module: rewards for predicting adjustment actions made under different service quality assessment model states;
a time window adjustment module: for predicting the windowed composite model evaluation effect of service quality evaluation objects under different conditions.
2. The service quality evaluation system based on time-windowed deep reinforcement learning of claim 1, wherein the model adjustment module specifically functions as follows: first, a basic quality evaluation model is built from experience or a mathematical model and used as the starting point of reinforcement learning, with the learning result of each period serving as the basic quality evaluation model at the starting point of the next period; each evaluation model reached by a model adjustment action, together with the basic evaluation model, forms part of the reinforcement learning state set; the service quality evaluation result of each service providing object is dynamically updated by the dynamic quality evaluation model applied to it; and a reinforcement learning action set is designed on the basis of the basic model, with the basic adjustment actions and their amplitudes determined by the structure of the basic model, so that model adjustment actions do not deviate too far, progressive model adjustment is achieved, and losses caused by adjusting the model too quickly are reduced.
3. The service quality evaluation system based on time-windowed deep reinforcement learning of claim 2, wherein the parallel learning module further functions as follows: a traversing action-combination scheme is adopted, that is, where the data volume suffices to guarantee the generality of the reinforcement learning results, a moderate number of service providing objects are each assigned an evaluation model adjustment action combination within a given model adjustment action specification, so that generalized rewards under a variety of actions and states are obtained and multiple learning action combinations are completed and rewarded within one learning period; after the first round of reward feedback, the remaining service providing objects are applied to the previous round's model adjustment results, realizing a second round of multi-action parallel adjustment of the quality evaluation model; this repeats over multiple rounds, and parallel learning stops when the action combinations have multiplied through to the end of the reinforcement learning, that is, when all service providing object parts have few action combinations left to execute from their states, or when an iteration period completes.
4. The service quality assessment system based on time-windowed deep reinforcement learning of claim 3, wherein the periodic iteration module specifically functions as: by setting the learning time or times of the period control model, the reinforcement learning period iteration based on the service quality evaluation model adjustment is controlled, and the optimal service quality evaluation model in all service providers is applied to all service providing objects in time, so that the reinforcement learning efficiency is improved, and the instantaneity and the uniformity of the quality evaluation model adjustment effect application are ensured.
5. The service quality evaluation system based on time-windowed deep reinforcement learning of claim 4, wherein the prediction learning module specifically functions as follows: according to the reinforcement learning history of a given period, model adjustment actions in the various evaluation model states are taken as input data and the corresponding Q values as labels, an autoencoder reduces the dimensionality of the input data, and a deep neural network model is trained; a suitable dynamic time window for prediction-based model adjustment is set, within which interactive reinforcement learning is suspended and, starting from the evaluation model in force before the window, the predictive model adjustment scheme adjusts the service quality evaluation model so as to maximize the predicted Q value, the adjusted model being applied to service evaluation.
6. The service quality evaluation system based on time-windowed deep reinforcement learning of claim 5, wherein the time window adjustment module specifically functions as follows: a suitable dynamic time window for prediction-based model adjustment is first set manually; a deep neural network model is then trained on historical window-length settings, the composite evaluation effects achieved under those settings, and related data, to predict the windowed composite evaluation effect of a service quality evaluation object under different conditions; and the window length is dynamically adjusted on the principle of maximizing the predicted evaluation effect.
7. A service quality assessment method based on time-window deep reinforcement learning is characterized by comprising the following steps:
the method comprises the following steps: aiming at a specific service industry, a data acquisition module is used for dynamically acquiring service quality evaluation support data in real time, such as relevant contents of service reliability indexes, safety indexes, responsiveness indexes and the like, and quantifying the relevant contents;
step two: according to the collected service quality evaluation support data, a model adjusting module is utilized to design a basic service quality evaluation model according to experience or a mathematical model, the basic service quality evaluation model is applied to all service providing objects to evaluate the service quality of the service providing objects, and the dynamic service evaluation result is fed back to a service participant in real time;
step three: designing a model adjusting action based on the basic service quality evaluation model by using a model adjusting module, determining a reinforced learning state starting point, and updating a service quality evaluation result of the service provided by a service providing object in real time;
step four: dividing the service providing object into a plurality of equal parts by utilizing a parallel learning module, and distributing the model adjusting action combination generated in the step three to one part of the service providing object so as to generate more model micro-adjusting states;
step five: obtaining the model adjustment action reward of each service providing object in the previous step based on the original model state by utilizing a reward feedback module;
step six: utilizing a Q table updating module to calculate a Q value according to the original service quality evaluation model state and model adjusting action reward of each service providing object, and updating a corresponding reward expectation formed by combining the evaluation model state and the evaluation model adjusting action in a Q table;
step seven: judging, by the period iteration module, whether the reinforcement learning period has ended; if not, drawing an appropriate number of parts from the remaining service providing objects and, together with the parts of the previous step and the evaluation model states those parts adopted, continuing the reinforcement learning process of steps four to six; if the period has ended, going to the next step;
step eight: applying the learning result of the service quality evaluation model of the previous period to all service providing objects by using a period iteration module, and determining whether to start a reinforcement learning process of a new period from the step two;
step nine: training a deep neural network to predict different action rewards under the current service quality evaluation model state by using a prediction learning module under the condition of periodical data accumulation based on reinforcement learning, and realizing the evaluation model adjustment of maximizing prediction rewards based on the deep neural network model;
step ten: controlling the time length of the step nine by using a time window adjusting module, namely controlling the iteration of two model adjusting schemes based on reinforcement learning and depth reinforcement learning; training a deep neural network under the condition that historical data and corresponding model evaluation effects and other related data are set based on the time window length, so as to predict the model evaluation effects corresponding to different time window lengths under different conditions; further realizing the length setting of a deep reinforcement learning time window for maximizing the evaluation effect based on a deep neural network model; and after the time window is finished, continuously adopting an interactive reinforcement learning scheme on the basis of the model adjustment result in the time window.
CN202010298848.2A 2020-04-16 2020-04-16 Service quality evaluation system and method based on time-window-based deep reinforcement learning Active CN111563662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010298848.2A CN111563662B (en) 2020-04-16 2020-04-16 Service quality evaluation system and method based on time-window-based deep reinforcement learning


Publications (2)

Publication Number Publication Date
CN111563662A true CN111563662A (en) 2020-08-21
CN111563662B CN111563662B (en) 2022-09-06

Family

ID=72074348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010298848.2A Active CN111563662B (en) 2020-04-16 2020-04-16 Service quality evaluation system and method based on time-window-based deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111563662B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177601A (en) * 2021-05-11 2021-07-27 中国科学技术大学 Method for judging setting rationality of deep reinforcement learning network
CN114374989A (en) * 2021-11-30 2022-04-19 深圳市联洲国际技术有限公司 Networking optimization method, device and system based on reinforcement learning
US20230169148A1 (en) * 2021-11-30 2023-06-01 International Business Machines Corporation Providing reduced training data for training a machine learning model
CN117952185A (en) * 2024-03-15 2024-04-30 中国科学技术大学 Financial field large model training method and system based on multidimensional data evaluation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107241213A (en) * 2017-04-28 2017-10-10 东南大学 A kind of web service composition method learnt based on deeply
CN108304994A (en) * 2018-01-10 2018-07-20 链家网(北京)科技有限公司 A kind of source of houses method for evaluating quality on sale and server
CN110427633A (en) * 2019-05-05 2019-11-08 东南大学 A kind of cement mixing pile method for evaluating quality based on deeply study
CN110832538A (en) * 2017-04-27 2020-02-21 斯纳普公司 Map-based graphical user interface indicating geospatial activity metrics



Also Published As

Publication number Publication date
CN111563662B (en) 2022-09-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant