CN108304586A - A task-oriented method for improving data availability

Info

Publication number: CN108304586A
Application number: CN201810186852.2A
Authority: CN
Inventors: 李保珍; 韩占校; 张亭亭; 余臻
Original assignee: NANJING AUDIT UNIVERSITY
Current assignee: NANJING AUDIT UNIVERSITY
Priority date: 2018-03-07
Filing date: 2018-03-07
Publication date: 2018-07-20

Abstract

The invention discloses a task-oriented method for improving data availability. Based on bipartite graph theory and the correlation between data attributes and task attributes, a task-oriented potentially useful attribute mining model is built; based on bipartite graph theory and the task-oriented correlation among data attributes, a task-oriented mining model for multi-source data with complementary attributes is built. The potentially useful attributes of the existing data sets and complementary multi-source data are then mined with the constructed potentially useful attribute mining model, and other multi-source data sets whose attributes complement those of the existing data sets are mined with the constructed complementary multi-source data mining model. For a particular task, the latent attribute mining model and the complementary multi-source data mining model proposed by the invention can screen out useful attributes and multi-source data beyond what the user expects, and can thereby improve the efficiency with which the particular task is accomplished.

Description

A task-oriented method for improving data availability

Technical field

The present invention relates to the field of data processing, and in particular to a task-oriented method for improving data availability.

Background art

With the development of information technology, data acquisition capabilities have improved greatly, and massive, multi-source, heterogeneous data can now be obtained in a timely manner. However, for a specific decision or prediction task, the related data contain a great deal of noise; that is, many of the data attributes currently obtainable are irrelevant to the goal of the particular task. On the other hand, because of information silos, data privacy and security, and similar reasons, many data attributes that are relevant to a particular prediction or decision task cannot be obtained in time.

There is thus an inherent contradiction in analyzing data availability for a particular task: the task needs specific data attributes, but these attributes cannot be obtained from the existing data; the available data have many attributes, but these attributes are not directly related to the task. The former problem is "task demand exceeds data supply", i.e., the requirements of the particular task cannot be matched by the attributes of the available data, however numerous those attributes are. The latter problem is "data supply exceeds task demand", i.e., many data attributes are available, but they are unrelated to the attributes the task requires. The availability of existing data, that is, the relevance and completeness of data with respect to a particular task, has long been a hot and difficult issue for both the research and application communities.

At present, the shortcomings of the prior art mainly include the following:

(1) Research on data quality concentrates mainly on the accuracy and relevance of data; there is little research on data availability for specific tasks;

(2) Research on data set relevance is concentrated mainly in the field of information retrieval, with applications chiefly in precision marketing and personalized recommendation for e-commerce. Current research on data relevance focuses mostly on the correlation between existing data attributes and task requirements, and still lacks relevance mining of potentially valuable data attributes;

(3) Research on data set completeness is concentrated mainly in the field of data fusion, with applications chiefly in the field of data trading. Current research on data completeness focuses mostly on the completeness of existing data attributes, and still lacks mining of the complementarity of multi-source heterogeneous data attributes with respect to specific task requirements.

Summary of the invention

To solve the above problems, the present invention provides a task-oriented method for improving data availability.

To achieve the above object, the present invention adopts the following technical solution:

A task-oriented method for improving data availability, comprising the following steps:

S1. Based on the correlation and completeness between data attributes and task attributes, formulate a task-oriented quantitative evaluation index system for data availability;

S2. Based on bipartite graph theory and the correlation between data attributes and task attributes, build a task-oriented potentially useful attribute mining model;

S3. Based on bipartite graph theory and the task-oriented correlation among data attributes, build a task-oriented mining model for multi-source data with complementary attributes;

S4. Use the constructed task-oriented potentially useful attribute mining model to mine the potentially useful attributes of the existing data sets and complementary multi-source data;

S5. Use the constructed task-oriented mining model for multi-source data with complementary attributes to mine other multi-source data sets whose attributes complement those of the existing data sets.

The task-oriented potentially useful attribute mining model is built by the following steps:

Input: data attribute matrix M_DF, task attribute matrix M_TF;

Output: matching matrix of data sources D_j and tasks T_i, with potentially useful attribute matching values;

Step 1: Based on bipartite graph theory, compute the data-task matrix M_DT;

Step 2: Among the entries of the data-task matrix M_DT for a particular task T_i, select the data source D_j with the maximum matching value;

Step 3: For the particular task T_i and the particular data source D_j, compute the potential usefulness of D_j based on the data attribute matrix M_DF and the task attribute matrix M_TF;

Step 3.1: Compute the degree of correlation C_F between each attribute of the data source D_j and each attribute of the task T_i;

Step 3.2: Based on a given correlation threshold, select the attributes with a higher degree of correlation and add them to the attribute set of the task T_i;

Step 3.3: Based on the new attributes of the task T_i, compute the potential usefulness of the data source D_j from the data attribute matrix M_DF and the new task attribute matrix M_TF;

Step 4: Repeat the above steps until all tasks have been traversed.
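The steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the formulation of the invention: the formula for the data-task matrix is not reproduced in this text, so M_DT is assumed here to be the bipartite projection M_DF · M_TFᵀ over binary matrices; the degree of correlation C_F is assumed to be the cosine similarity between attribute columns of M_DF; and potential usefulness is assumed to be the number of attributes of D_j matched by the task's attribute set. The function names and the default threshold are hypothetical.

```python
import numpy as np


def data_task_matrix(M_DF, M_TF):
    """Assumed Step 1: entry (j, i) counts the attributes shared by D_j and T_i."""
    return M_DF @ M_TF.T


def attribute_correlation(M_DF):
    """Assumed C_F: cosine similarity between attribute columns of M_DF."""
    norms = np.linalg.norm(M_DF, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unused attributes
    X = M_DF / norms
    return X.T @ X


def mine_potential_attributes(M_DF, M_TF, threshold=0.5):
    """Return, per task, the best source and the attributes added to the task."""
    M_DF = np.asarray(M_DF, dtype=float)
    M_TF = np.asarray(M_TF, dtype=float)
    M_DT = data_task_matrix(M_DF, M_TF)            # Step 1
    C_F = attribute_correlation(M_DF)
    result = {}
    for i in range(M_TF.shape[0]):                 # Step 4: traverse all tasks
        j = int(np.argmax(M_DT[:, i]))             # Step 2: best-matching source
        has = M_DF[j] > 0                          # attributes offered by D_j
        needs = M_TF[i] > 0                        # attributes required by T_i
        # Step 3.1: strongest correlation of each attribute with any task attribute
        if needs.any():
            best = C_F[:, needs].max(axis=1)
        else:
            best = np.zeros(M_DF.shape[1])
        # Step 3.2: add sufficiently correlated attributes of D_j to the task
        added = has & ~needs & (best >= threshold)
        new_needs = needs | added
        # Step 3.3: recompute the match with the enlarged task attribute set
        result[i] = {
            "source": j,
            "added": np.flatnonzero(added).tolist(),
            "match_before": int((has & needs).sum()),
            "match_after": int((has & new_needs).sum()),
        }
    return result


if __name__ == "__main__":
    M_DF = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]]  # 3 sources x 4 attributes
    M_TF = [[1, 0, 0, 0], [0, 0, 0, 1]]                # 2 tasks x 4 attributes
    print(mine_potential_attributes(M_DF, M_TF))
```

Other correlation measures or matching scores could be substituted for the assumed ones without changing the overall flow of Steps 1 to 4.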

Step S4 specifically comprises the following steps:

Input: data attribute matrix M_DF, task attribute matrix M_TF;

Output: matching matrix of data sources D_j and tasks T_i, with complementary availability matching values;

Step 1: Based on bipartite graph theory, compute the data-task matrix M_DT;

Step 2: Among the entries of the data-task matrix M_DT for a particular task T_i, select the data source D_j with the maximum matching value;

Step 3: For the particular task T_i and the data sources D (including D_j), compute, from the data attribute matrix M_DF and the task attribute matrix M_TF, the complementarity of availability between the particular data source D_j and the other data sources in D;

Step 3.1: Compute the similarity S_D between each data source and the particular data source D_j;

Step 3.2: Based on a given similarity threshold, select the data sources with lower similarity;

Step 3.3: Aggregate the selected data sources (including the particular data source D_j) into an overall data source D;

Step 3.4: Based on the new data source D, the data attribute matrix M_DF and the new task attribute matrix M_TF, compute the complementarity of availability of the data source D_j;

Step 4: Repeat the above steps until all tasks have been traversed.
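A minimal Python sketch of these steps is given below for illustration. It rests on several assumptions that are not stated in the text: the data-task matrix is taken as the bipartite projection M_DF · M_TFᵀ over binary matrices, the similarity S_D is taken as the Jaccard similarity between the attribute sets of two data sources, aggregation is taken as the union of attributes, and complementarity is expressed as the gain in coverage of the task's attributes. The function names and the default threshold are hypothetical.

```python
import numpy as np


def jaccard(a, b):
    """Assumed S_D: Jaccard similarity between two boolean attribute vectors."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def mine_complementary_sources(M_DF, M_TF, sim_threshold=0.5):
    """Return, per task, the best source, its low-similarity partners and coverage."""
    M_DF = np.asarray(M_DF) > 0
    M_TF = np.asarray(M_TF) > 0
    M_DT = M_DF.astype(int) @ M_TF.astype(int).T       # Step 1 (assumed form)
    out = {}
    for i in range(M_TF.shape[0]):                     # Step 4: traverse all tasks
        j = int(np.argmax(M_DT[:, i]))                 # Step 2: best-matching source
        # Step 3.1: similarity of every source to the selected source D_j
        sims = [jaccard(M_DF[j], M_DF[k]) for k in range(M_DF.shape[0])]
        # Step 3.2: keep the sources whose similarity is below the threshold
        partners = [k for k, s in enumerate(sims) if k != j and s < sim_threshold]
        # Step 3.3: aggregate D_j and its partners into one overall source D
        pooled = M_DF[[j] + partners].any(axis=0)
        # Step 3.4: coverage of the task attributes before and after aggregation
        need = int(M_TF[i].sum())
        single = float((M_DF[j] & M_TF[i]).sum() / need) if need else 0.0
        merged = float((pooled & M_TF[i]).sum() / need) if need else 0.0
        out[i] = {
            "source": j,
            "partners": partners,
            "coverage_single": single,
            "coverage_pooled": merged,
        }
    return out


if __name__ == "__main__":
    M_DF = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]  # 3 sources x 4 attributes
    M_TF = [[1, 1, 1, 1]]                              # 1 task x 4 attributes
    print(mine_complementary_sources(M_DF, M_TF))
```

Selecting by low similarity alone may also admit sources that contribute nothing to the task; a practical variant might additionally require that a partner cover at least one task attribute that D_j lacks.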

According to the above scheme, for a given particular task, the potentially useful attributes of the available data are mined based on a given correlation threshold, and multi-source data that can potentially supplement the existing data sets are mined. Specifically:

(1) With the potentially useful attribute mining model, suitable latent attributes of the available data sets can be selected for a particular task, which effectively improves the availability of the existing data. This not only increases the usable value of the existing data but also reduces its sunk cost.

(2) The multi-source complementary data mining model can be applied to select multi-source complementary data sets for a particular task, so that the task can be accomplished more effectively. This not only increases the realization value of the particular task but also reduces the opportunity cost incurred when missing data leave some of the task's attribute requirements unmet.

(3) Considering the existing data and the particular task together, the above approaches all improve the validity of the existing data for accomplishing the particular task.

In short, for a particular task, the latent attribute mining model and the complementary multi-source data mining model proposed by the invention can screen out useful attributes and multi-source data beyond what the user expects, and can thereby improve the efficiency with which the particular task is accomplished.

Description of the drawings

Fig. 1 is a schematic diagram of the usable value and sunk cost of the existing data sets under different approaches to improving data set availability;

In the figure: (a) usable value of the data set attributes; (b) sunk cost of the data set attributes.

Fig. 2 shows the realization value and opportunity cost of the particular tasks under different approaches to improving data availability;

In the figure: (a) realization value of the particular tasks; (b) opportunity cost of the particular tasks.

Fig. 3 shows the validity of the existing data for the particular tasks under different approaches to improving data availability.

Detailed description

To make the objects and advantages of the present invention clearer, the present invention is described in further detail below with reference to an embodiment. It should be understood that the specific embodiment described here merely illustrates the present invention and is not intended to limit it.

Embodiment

Table 1. Matching matrix of task attributes and data attributes

(1) Potential value of the data set: PV = 13 / (13 + 36) = 26.53%

(2) Sunk cost of the data set: SC = 36 / (13 + 36) = 73.47%

(3) Realization value of the particular task: RV = 13 / (13 + 28) = 31.71%

(4) Opportunity cost of the particular task: OC = 28 / (13 + 28) = 68.29%

(5) Validity of the data set for the particular task: V = 13 / (36 + 13 + 28) = 16.88%
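The five indicators above can be reproduced with a short Python sketch. Because Table 1 is not reproduced in this text, the reading of the three counts is an assumption: 13 is taken as the number of matched attribute entries, 36 as the data attribute entries not matched to the task, and 28 as the task attribute entries not matched by the data. The function and argument names are hypothetical.

```python
def availability_indicators(matched, data_only, task_only):
    """Compute the five task-oriented availability indicators as fractions.

    matched   -- attribute entries shared by data and task (assumed meaning of 13)
    data_only -- data attribute entries unmatched by the task (assumed meaning of 36)
    task_only -- task attribute entries unmatched by the data (assumed meaning of 28)
    """
    data_total = matched + data_only
    task_total = matched + task_only
    return {
        "PV": matched / data_total,      # potential value of the data set
        "SC": data_only / data_total,    # sunk cost of the data set
        "RV": matched / task_total,      # realization value of the task
        "OC": task_only / task_total,    # opportunity cost of the task
        "V": matched / (data_only + matched + task_only),  # validity
    }


if __name__ == "__main__":
    for name, value in availability_indicators(13, 36, 28).items():
        print(f"{name} = {value:.2%}")
```

Under this reading, PV and SC sum to 1, as do RV and OC, so raising the matched count while the totals stay fixed increases value and lowers cost together.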

Example: user travel demand forecasting

(1) Description of the particular task

In relieving traffic congestion and meeting travel demand, customized public transport can provide fast and efficient service for passengers. At present, customized bus systems in China include "Panda Bus" in Dalian, "Heart Enjoys Bus" in Hangzhou and "Real-time Bus" in Nanjing.

Customized public transport is a demand-oriented public transport service that, according to users' travel needs, provides flexible bus service with a "specific time", a "specific place" and "one seat per person". User travel demand forecasting is the prerequisite and key to effective customized bus service.

In customized bus service, the particular task of user travel demand forecasting requires many relevant data attributes, but the existing multi-source data contain only some of the relevant attributes and also contain redundant attributes that do not match the attributes the task requires.

(2) Description of the existing data:

The related data have the following characteristics:

(1) Real-time: obtained via the Internet, telephones, mobile phones and smartphones;

(2) Multi-source: social media, smart cards, point-of-interest maps, GPS, location-based services, video surveillance, RFID;

(3) Heterogeneous: transfer and travel information across different modes of transport, such as IC card records for buses, taxis, subways and bicycles;

(4) High-dimensional: ID card, card type, travel permit, departure time, arrival time, origin station and destination;

(5) Hierarchical: district, street, etc.

Table 2 outlines the relevant data sets and their attributes.

Table 2. Data sets and attributes for user travel demand forecasting

The above data can be shared with partners under confidentiality agreements; we cooperate with Nanjing Ya Gao Bus Group and have a data confidentiality agreement with it. We collected the relevant user travel demand data through a data interface. In addition, we obtained historical user travel data, comprising real-time GPS data and passenger IC card data, from a public transport company in Nanjing. In short, for the 7 particular tasks, 13 data sets and 23 related attributes in Table 3, the latent attribute mining model and the complementary multi-source data mining model of this study can improve the efficiency with which the user travel demand forecasting task is accomplished.

Table 3. Available data sets, particular tasks and related attributes

(3) Evaluation of the task-oriented latent attribute mining model and complementary data mining model with respect to the available data sets

For different task requirements, the attribute set of a given data set has a different usable value and sunk cost. The model of this study was used to mine the potentially useful attributes of the existing data sets and complementary multi-source data, and the result was compared with the initial state: the usable value of the existing data attributes increases markedly, and their sunk cost decreases markedly. The experimental results are shown in Fig. 1.

(4) Evaluation of the task-oriented latent attribute mining model and complementary data mining model with respect to the particular tasks

For different existing data sets and their attributes, a particular task has a different realization value and a different opportunity cost caused by missing data attributes. The model of this study was used to mine the potentially useful attributes of the existing data sets and complementary multi-source data, and the result was compared with the initial state: the realization value of the particular tasks increases markedly, and their opportunity cost decreases markedly. The experimental results are shown in Fig. 2.

(5) Evaluation of the task-oriented latent attribute mining model and complementary data mining model with respect to both the particular tasks and the existing data

Taking both the attribute set of the existing data and the attribute set of the particular task into account, the model of this study was used to mine the potentially useful attributes of the existing data sets and complementary multi-source data, and the result was compared with the initial state: the validity of the existing data sets for the particular tasks increases markedly. The experimental results are shown in Fig. 3.

(6) Screening latent attributes and complementary multi-source data for the particular tasks and the existing data sets

For a particular task, we can find the data attributes best suited to the task in the initial state. Based on the latent attribute mining model, we can further screen out its potentially useful attributes. Based on the complementary multi-source data mining model, we can further screen out other multi-source data sets whose attributes complement those of the existing data sets. Our experimental results are shown in Table 4.

Table 4. Potentially useful attributes and complementary data sets for the particular tasks

In summary, the embodiment of the present invention considers potentially useful attributes and multi-source complementary data according to the particular task, which helps realize added value for the particular task. For example, in the decision-making process, the task-related associations among data attributes should be used to offset the sparsity of data value; meanwhile, in the data acquisition process, other multi-source data sets whose attributes complement those of the existing data sets should also be considered, which reduces the problem of missing data attributes caused by constraints such as information silos and data privacy and security. In short, for a particular task, the latent attribute mining model and the complementary multi-source data mining model proposed by the invention can screen out useful attributes and multi-source data beyond what the user expects, and can thereby improve the efficiency with which the particular task is accomplished.

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A task-oriented method for improving data availability, characterized in that it comprises the following steps:

S1. Based on the correlation and completeness between data attributes and task attributes, formulate a task-oriented quantitative evaluation index system for data availability;

S2. Based on bipartite graph theory and the correlation between data attributes and task attributes, build a task-oriented potentially useful attribute mining model;

S3. Based on bipartite graph theory and the task-oriented correlation among data attributes, build a task-oriented mining model for multi-source data with complementary attributes;

S4. Use the constructed task-oriented potentially useful attribute mining model to mine the potentially useful attributes of the existing data sets and complementary multi-source data;

S5. Use the constructed task-oriented mining model for multi-source data with complementary attributes to mine other multi-source data sets whose attributes complement those of the existing data sets.

2. The task-oriented method for improving data availability according to claim 1, characterized in that the task-oriented potentially useful attribute mining model is built by the following steps:

Input: data attribute matrix M_DF, task attribute matrix M_TF;

Step 1: Based on bipartite graph theory, compute the data-task matrix

Step 3: For a particular task T_i and a particular data source D_j, compute the potential usefulness of D_j based on the data attribute matrix M_DF and the task attribute matrix M_TF;

Step 3.2: Based on a given correlation threshold, select the attributes with a higher degree of correlation and add them to the attribute set of the task T_i;

Step 3.3: Based on the new attributes of the task T_i, compute the potential usefulness of the data source D_j from the data attribute matrix M_DF and the new task attribute matrix M_TF;

Step 4: Repeat the above steps until all tasks have been traversed.

3. The task-oriented method for improving data availability according to claim 1, characterized in that step S4 specifically comprises the following steps:

Input: data attribute matrix M_DF, task attribute matrix M_TF;

Step 1: Based on bipartite graph theory, compute the data-task matrix

Step 3: For a particular task T_i and the data sources D (including D_j), compute, from the data attribute matrix M_DF and the task attribute matrix M_TF, the complementarity of availability between the particular data source D_j and the other data sources in D;

Step 4: Repeat the above steps until all tasks have been traversed.