CN105975631A

CN105975631A - Assessment method of data use quality of data sets

Info

Publication number: CN105975631A
Application number: CN201610389829.4A
Authority: CN
Inventors: 阮彤; 甘似禹; 叶琪; 李阳; 赵亮
Original assignee: Shanghai Yitong International Ltd By Share Ltd; East China University of Science and Technology
Current assignee: Shanghai Yitong International Ltd By Share Ltd; East China University of Science and Technology
Priority date: 2016-06-03
Filing date: 2016-06-03
Publication date: 2016-09-28

Abstract

The invention provides a method for assessing the data use quality of a data set. The method comprises: obtaining a question evaluation set produced when natural language questions are answered on the data set; summarizing and generalizing the questions in the question evaluation set to form a plurality of question templates; and, according to the question templates and a use quality metric, comparing the final query results with the correct answers and calculating the precision, recall and comprehensive informativeness of the query results, so that a user can assess the data use quality of the data set. Compared with the prior art, the method treats the questions that arise when the data set is applied to a question-answering system as use scenarios, each query question corresponding to one use scenario. The queryability dimension measures how difficult it is to construct a query on the data set, and the informativeness dimension measures the amount of information contained in the query results in a specific use scenario, so that the data use quality of the data set can be assessed operationally.

Description

A method for assessing the data use quality of a data set

Technical field

The present invention relates to data quality assessment technology, and in particular to a method for assessing the data use quality of a data set.

Background

In recent years, a large number of data sources have been published online, and instances in different data sources may refer to the same real-world entity, so that different data sources are linked to one another. These data sources include not only general-purpose, encyclopedia-style data sets but also data sets for specific domains (for example, the medical or financial domain). However, the data in these sources often suffer from quality problems of one kind or another, such as inconsistency, incompleteness or inaccuracy. Understanding the data quality of a data set is therefore an important prerequisite for using it. A large body of literature has proposed different metrics for the data quality of data sets, such as data complexity, link quality and label quality. One survey of the data quality literature summarizes the existing metrics into 68 metrics and groups them into several dimensions, such as the availability of the data, the intrinsic characteristics of the data, and the representation of the data. However, the metrics summarized there do not, in practice, measure the usability of a data set from the user's perspective.

In addition, although much existing data quality research acknowledges that data quality means fitness for use in a specific application scenario, existing work defines no metric or model for it. In view of this, how to design a solution that can effectively measure and assess the data use quality of a data set, so as to reflect the characteristics the data exhibit while being used and thereby express the data quality of the data set from the standpoint of use, is a problem faced by those skilled in the art.

Summary of the invention

According to one aspect of the present invention, a method for assessing the data use quality of a data set is provided, comprising the following steps:

Obtaining a question evaluation set produced when natural language questions are answered on the data set;

Summarizing and generalizing the questions in the question evaluation set to form a plurality of question templates; and

According to the question templates and a use quality metric, comparing the final query results with the correct answers, and calculating the precision, recall and comprehensive informativeness of the query results, so that a user can assess the data use quality of the data set.

In one embodiment, the use quality metric comprises two dimensions: queryability and informativeness. Queryability measures how easy or difficult it is for a user to construct a correct query on the data set for a natural language question; informativeness measures the amount of information contained in the query result for the natural language question.

In one embodiment, queryability comprises the difficulty level of constructing the query, the time spent constructing the query, the time spent constructing the query on the domain, the time spent constructing the query on the attribute constraints, and the number of attempts at constructing the query.

In one embodiment, informativeness comprises an informativeness level, precision, recall and comprehensive informativeness.

In one embodiment, the comprehensive informativeness satisfies the following equation:

CI = \frac{NCA}{NA} \times \left(\frac{NCA}{A}\right)^{2} \times \alpha \times \beta

Here, CI denotes the comprehensive informativeness, NCA the number of correct answers in the query result, NA the number of standard answers to the question, A the total number of query results, α the data correctness of the data set, and β the data understandability of the data set. NCA/A is the precision of the query result (correct answers among all returned results), and NCA/NA is its recall (correct answers among all standard answers).

In one embodiment, the data correctness α of the data set is 0.8, and the data understandability β of the data set is 0.8.

In one embodiment, the step of obtaining the question evaluation set is realized in any one of the following ways:

- obtaining a set of standard questions from applications of the data set;

- obtaining questions from a network platform related to the data set;

- having the evaluators of data use quality define the questions themselves.

In one embodiment, the step of summarizing and generalizing the questions in the question evaluation set further comprises: converting each question into a query executable on the data set; classifying the queries according to their structure to obtain a classification result; and forming the question templates according to the classification result.

In one embodiment, converting a question into a query executable on the data set comprises: setting the domain to which the question belongs, which defines the time T_a for constructing the query on the domain; adding the attribute constraints of the question, which defines the time T_b for constructing the query on the attribute constraints; and, according to the domain and attribute constraints of the question, automatically constructing the query corresponding to the question and executing it on the data set, wherein the time T for constructing the query satisfies the following equation:

T = NOA * (T_a + T_b)

Here, NOA denotes the number of attempts at constructing the query.

In one embodiment, when the constructed query is executed on the data set and the query result does not exist or is incorrect, the domain to which the question belongs and the attribute constraints are reset in turn.

Compared with the prior art, when assessing the data use quality of a data set, the present invention obtains a question evaluation set produced when natural language questions are answered on the data set, summarizes and generalizes the questions in that set to form a plurality of question templates, and finally uses the question templates and the use quality metric to compare the final query results with the correct answers and to calculate the precision, recall and comprehensive informativeness of the query results, so that a user can assess the data use quality of the data set. In this way, the invention treats the questions that arise when the data set is applied to a question-answering system as use scenarios, each query question corresponding to one use scenario. Queryability, one dimension of the use quality metric, measures how difficult it is to construct a query on the data set; informativeness, the other dimension, measures the amount of information contained in the query result in a specific use scenario. Together, queryability and informativeness allow the data use quality of the data set to be assessed operationally.

Brief description of the drawings

After reading the detailed description of the present invention with reference to the accompanying drawings, the reader will understand the various aspects of the invention more clearly. In the drawings:

Fig. 1 is a flow block diagram of the method for assessing the data use quality of a data set according to one embodiment of the present invention.

Detailed description of the invention

To make the technical content disclosed in this application more detailed and complete, reference may be made to the accompanying drawings and the following specific embodiments of the invention, in which the same reference numerals denote the same or similar components. However, those of ordinary skill in the art will understand that the embodiments provided below are not intended to limit the scope of the invention. In addition, the drawings are merely schematic and are not drawn to scale.

The specific embodiments of the various aspects of the present invention are described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, in this embodiment the method for assessing data use quality is implemented through steps S1 to S3. First, in step S1, a question evaluation set produced when natural language questions are answered on the data set is obtained. Next, in step S2, the questions in the obtained question evaluation set are summarized and generalized to form a plurality of question templates. Finally, in step S3, according to the question templates and the use quality metric, the final query results are compared with the correct answers, and the precision, recall and comprehensive informativeness of the query results are calculated so that a user can assess the data use quality of the data set.

Obtaining the question evaluation set

In the prior art, the data sets available for use include general-purpose data sets that are independent of any domain and domain-specific data sets. In general, a general-purpose data set is a comprehensive data set, such as the data on Baidupedia. A domain-specific data set is a data set for a particular field, such as the marine or medical field. General-purpose data sets usually cover a broad scope, but the granularity of their knowledge is relatively coarse. Domain-specific data sets focus on one professional field, so their knowledge is not as broad as that of general-purpose data sets, but its granularity is much finer. Existing research on data quality and on data use mostly focuses on the data quality of general-purpose data sets, so many question sets targeting general-purpose data sets are available. Two examples of question test sets from the question-answering field for general-purpose data sets are the one from Question Answering over Linked Data (QALD) and WebQuestions from the Stanford NLP laboratory. Both are standard question sets for data set use. In addition, the questions in a question test set can also be obtained from a network platform related to the data set (for example, a forum or community related to the data set), or can be defined by the use quality evaluators themselves.

Obtaining question templates from the question evaluation set

After the question evaluation set is obtained, the questions in it need to be summarized and generalized to form basic question templates. In the prior art, many data quality evaluators or users are not familiar with the SQL query language. To improve the usability of the data use quality assessment method, the present invention generalizes the questions in the question evaluation set into specific templates, each template corresponding to one class of SQL queries. When a user needs to construct a query on a data set, the user only needs to fill in the specific parameters of the generalized template according to the concrete data set to obtain a query executable on that data set; the evaluator no longer has to construct the query from scratch.
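The grouping of converted queries into templates can be sketched as follows. This is only an illustration under stated assumptions, not the method as claimed: the `StructuredQuery` type, the rule in `template_of` (classifying by the number of attribute constraints), the template labels and the table and field names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredQuery:
    """A question already converted to a query: a domain (table) plus constraints."""
    question: str
    domain: str
    constraints: tuple = ()  # (attribute, value) pairs; empty for a whole-table query


def template_of(query: StructuredQuery) -> str:
    """Classify a query by its structure (hypothetical rule: count the constraints)."""
    if not query.constraints:
        return "domain"
    if len(query.constraints) == 1:
        return "attribute_value"
    return "combined"


def group_by_template(queries):
    """Group the questions whose queries share the same structure."""
    groups = defaultdict(list)
    for query in queries:
        groups[template_of(query)].append(query.question)
    return dict(groups)


queries = [
    StructuredQuery("Please provide the relevant information of all enterprises", "enterprise"),
    StructuredQuery("Please provide the presidents born in 1945", "president",
                    (("birth_year", 1945),)),
    StructuredQuery("Please give all Russian female astronauts", "astronaut",
                    (("sex", "female"), ("nationality", "Russian"))),
]
print(group_by_template(queries))
```

Each group then corresponds to one candidate template; a real classification would likely inspect more of the query structure than the constraint count.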

Table 1 below describes several basic templates generalized from the questions in the question evaluation set.

Table 1

For example, if the question is "Please provide the relevant information of all enterprises", the question can be generalized into the domain template, whose description is: query all information in a given table. As another example, if the question is "Please provide the presidents born in 1945", the question can be generalized into the specific attribute value template, whose description is: query the entities in a given table for which a certain field equals a given value (that is, all presidents whose year-of-birth field equals 1945). Those skilled in the art will understand that the templates in Table 1 are merely some illustrative basic question templates, and that these basic templates can be combined to obtain more complex templates. When an evaluator instantiates these templates according to the actual data in the data set (that is, fills in the relevant parameters), an executable SQL query is obtained.
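Instantiating the two templates described above might look like the following sketch, which uses Python's standard `sqlite3` module on an in-memory database. The SQL text of the templates, the `president` table, its columns and its rows are made up for illustration; the source does not give the actual template text.

```python
import re
import sqlite3
from typing import Optional

# Hypothetical template texts; the evaluator supplies the table and field names.
TEMPLATES = {
    "domain": "SELECT * FROM {table}",
    "attribute_value": "SELECT * FROM {table} WHERE {field} = ?",
}


def check_identifier(name: str) -> str:
    """Allow only plain identifiers, since they are spliced into the SQL text."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def instantiate(template: str, table: str, field: Optional[str] = None) -> str:
    """Fill in a template's structural parameters; values stay as '?' placeholders."""
    params = {"table": check_identifier(table)}
    if field is not None:
        params["field"] = check_identifier(field)
    return TEMPLATES[template].format(**params)


# Made-up data standing in for a real data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE president (name TEXT, birth_year INTEGER)")
conn.executemany(
    "INSERT INTO president VALUES (?, ?)",
    [("A", 1945), ("B", 1950), ("C", 1945)],
)

sql = instantiate("attribute_value", "president", "birth_year")
rows = conn.execute(sql, (1945,)).fetchall()
print(sql)
print(rows)
```

Table and field names cannot be bound as SQL parameters, which is why the sketch validates them before splicing them into the text, while the compared value is passed as a bound parameter.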

In the data use quality assessment of the present invention, once the question evaluation set and the corresponding question templates have been obtained, the data use quality of the target data set can be assessed.

Defining the metric for data use quality

In the present invention, the applicant has pioneered a new metric for data use quality that comprises two dimensions: queryability and informativeness. Queryability measures how easy or difficult it is for a user to construct a correct query on the data set for a natural language question. Informativeness measures the amount of information contained in the query result for the natural language question. Data use quality reflects the characteristics that the data set itself exhibits when an evaluator or user uses it. Data use quality therefore corresponds to different use scenarios. However, existing data quality models do not define what a use scenario is.

An important application scenario of a data set is the question-answering system, that is, searching the data set for answers to real-world questions. The present invention treats these questions as use scenarios: one query question is one use scenario. The two key processes in a question-answering system are querying and answering. For querying, the invention uses queryability to measure how difficult it is to construct a query on the data set; for answering, it uses informativeness to measure the amount of information contained in the query result for the natural language question, and the evaluator or user judges how satisfactory the query result is according to that amount. The queryability and informativeness metrics thus reflect the data use quality of the data set as experienced by the user. They represent data quality in a specific scenario and focus on the process of constructing the query and on the result of the query.

Answering a natural language question on a data set essentially means constructing the corresponding SQL query from the natural language question, which involves three main steps. First, understand the question and find its template. For example, the question "Who is Abe Lincoln's wife?" contains the subject and predicate of the query, and the answer is the object. The question "Please give all Russian female astronauts" is more complex: the answer should be astronauts (astronaut should be a table of the data set; the table on which the query is to be executed is called the domain), and attribute constraints must be added, namely that the sex is female and the nationality is Russian. Second, find the corresponding vocabulary in the data set. For example, the attribute corresponding to "wife" in the data set is not necessarily "wife"; it may be "spouse". Different data sets may express their data differently. For the question "Please give all Russian female astronauts", the corresponding domain (the table to be queried) may be "astronauts" or "Russian astronauts"; the complexity of the table classification may also differ between data sets. Finally, construct an SQL query from the results of the first two steps and execute it on the data set to obtain the answer to the question. Based on this analysis of the question-answering process, the two dimensions of data use quality in the present invention, queryability and informativeness, are explained further below.
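The second step, finding the data set's own vocabulary for a term used in the question, can be illustrated with a small lookup. This is a sketch under assumptions: the `resolve_term` helper, the attribute names and the synonym list are hypothetical, and the source does not prescribe how the mapping is found.

```python
def resolve_term(term, schema_names, synonyms):
    """Return the data set's name for a question term, or None if nothing matches."""
    candidates = [term, *synonyms.get(term, [])]
    for candidate in candidates:
        if candidate in schema_names:
            return candidate
    return None


# Hypothetical attribute names of a data set and a hand-written synonym list.
attributes = {"name", "spouse", "birth_year"}
synonyms = {"wife": ["spouse", "partner"], "husband": ["spouse", "partner"]}

print(resolve_term("wife", attributes, synonyms))  # spouse
print(resolve_term("age", attributes, synonyms))   # None
```

The more candidates an evaluator has to try before one matches, the longer the construction time, which is what the queryability dimension is meant to capture.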

Queryability: queryability measures how easy or difficult it is for a user to construct a correct query on the data set for a use scenario. Preferably, the queryability metrics include the difficulty level of constructing the query, the time spent constructing the query (the time spent constructing the query on the domain and the time spent constructing the query on the attribute constraints), and the number of attempts at constructing the query.

In terms of subjective versus objective measures, the subjective measure is the difficulty level of constructing the SQL query, for which the evaluator gives feedback based on his or her own evaluation process. After completing the construction of the query, the evaluator gives a score that measures the difficulty of the construction process. For example, difficulty is expressed on five levels: 1) very easy; 2) easy; 3) average; 4) difficult; 5) very difficult. The objective measures are the time spent constructing the query and the number of attempts at constructing the query.

Specifically, the present invention uses the time T for constructing the query, the time T_a for constructing the query on the domain and the time T_b for constructing the query on the attribute constraints to measure the time spent constructing the query, and uses the number of attempts NOA to measure how many times construction was attempted. The time T for constructing the query equals NOA * (T_a + T_b). Here T_a and T_b are the average times spent constructing the query on the domain and on the attribute constraints, respectively. That is, if the evaluator constructs the query several times for one question, T_a and T_b are the averages over those attempts, which measure well how the time is spent on the domain and on the attribute constraints, while T is the total time spent over all those attempts and measures the cost of constructing the query as a whole.
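The relationship between the per-attempt times and T can be checked with a short calculation. The sketch below assumes each attempt is recorded as two durations in seconds; the `Attempt` record and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Attempt:
    domain_seconds: float      # time spent setting the domain in this attempt
    constraint_seconds: float  # time spent adding attribute constraints in this attempt


def construction_time(attempts):
    """Return (NOA, T_a, T_b, T) for one question, with T = NOA * (T_a + T_b)."""
    if not attempts:
        raise ValueError("at least one attempt is required")
    noa = len(attempts)
    t_a = mean(a.domain_seconds for a in attempts)
    t_b = mean(a.constraint_seconds for a in attempts)
    return noa, t_a, t_b, noa * (t_a + t_b)


attempts = [Attempt(30.0, 50.0), Attempt(10.0, 30.0)]
noa, t_a, t_b, total = construction_time(attempts)
print(noa, t_a, t_b, total)  # 2 20.0 40.0 120.0
```

Because T_a and T_b are averages over the NOA attempts, T equals the sum of all recorded durations (30 + 50 + 10 + 30 = 120 seconds here), which matches the description of T as the total time spent.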

Based on the above analysis, converting a question into a query executable on the data set may comprise the steps of: setting the domain to which the question belongs, which defines the time T_a for constructing the query on the domain; adding the attribute constraints of the question, which defines the time T_b for constructing the query on the attribute constraints; and, according to the domain and attribute constraints of the question, automatically constructing the query corresponding to the question and executing the constructed query on the data set. T_a is related to the complexity of the data set's classification of tables: a complex classification system or highly specialized vocabulary for describing classes makes T_a larger. T_b is closely related to the attributes in the data set: redundant attributes and attributes with ambiguous or unclear meaning make T_b larger. In general, the larger the time T for constructing the query, the more difficult the construction process. Those skilled in the art will understand that setting the domain and setting the attribute constraints do not both have to be performed; this depends on the question itself. For example, some questions in a question set need no attribute constraints (for example, obtaining the relevant information of all enterprises on a general-purpose data set simply returns all information in the enterprise table); in this case no attribute constraint needs to be added, and only the enterprise domain needs to be set. As another example, some questions need no domain to be set (for example, finding the number of employees of a particular enterprise on an enterprise information data set only requires adding the attribute constraint corresponding to the number of employees).

It should also be noted that even after the user has constructed a query on the data set, executing it may return no result or an incorrect result. For example, the constructed query may be faulty and return nothing at all; or the query may be correct while the data set itself contains no answer. In such cases the user has to reconstruct the query until a query result is returned or until a certain number of attempts has been made. In this sense, the number of attempts at constructing the query also reflects how difficult construction is: the more attempts, the more difficult the construction.
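The rebuild-and-retry behaviour described above can be sketched as a bounded loop that also yields NOA. The function `query_with_retries`, its callback parameters and the toy data are assumptions made for illustration; in practice the rebuilding is done by the evaluator.

```python
def query_with_retries(build_query, execute, is_acceptable, max_attempts=3):
    """Rebuild and rerun the query until the result is acceptable or attempts run out.

    Returns (result, NOA); result is None if no attempt was accepted.
    """
    for attempt in range(1, max_attempts + 1):
        query = build_query(attempt)  # the evaluator resets domain and constraints here
        result = execute(query)
        if result and is_acceptable(result):
            return result, attempt
    return None, max_attempts


# Toy stand-ins: a "data set" that only answers one query text.
fake_dataset = {"SELECT name FROM astronauts": ["A", "B"]}
candidates = ["SELECT name FROM spacemen", "SELECT name FROM astronauts"]

result, noa = query_with_retries(
    build_query=lambda attempt: candidates[attempt - 1],
    execute=lambda q: fake_dataset.get(q, []),
    is_acceptable=lambda rows: len(rows) > 0,
    max_attempts=len(candidates),
)
print(result, noa)  # ['A', 'B'] 2
```

The attempt count returned by the loop is the NOA used in the time formula, so a question that needs many rebuilds scores as harder to query.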

Informativeness: after a query is executed on the data set, a query result is obtained. The correctness of the query result reflects the informativeness of the data in the data set. Informativeness therefore measures whether the query result is useful to the user and how much valuable information it contains. Preferably, the informativeness metrics include the informativeness level, precision, recall and comprehensive informativeness.

In terms of subjective versus objective measures, the subjective measure is the informativeness level, which is the evaluator's score for the amount of information contained in the query result. For example, the score again has five levels: 1) very little information; 2) a small amount of information; 3) some information; 4) a lot of information; 5) a large amount of information. The amount of information represented increases from level to level. The objective measures are precision, recall and comprehensive informativeness, which are calculated against the standard answers to the question in order to measure the query result.

Specifically, precision is the proportion of the query results that are correct, and it measures the accuracy of the query result. Recall is the proportion of all correct results that appear among the query results, and it measures the coverage of the query result. Precision and recall of query results are well known to those skilled in the art; the comprehensive informativeness metric is described in particular below.
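As a concrete reading of these two definitions, the following sketch computes precision and recall from a list of returned results and a list of standard answers. The function name, the sample values and the convention of returning 0.0 for an empty set are assumptions made here.

```python
def precision_recall(returned, standard):
    """Compute precision and recall of a query result against the standard answers."""
    returned, standard = set(returned), set(standard)
    correct = returned & standard
    # Assumed convention: an empty result or empty answer set yields 0.0.
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(standard) if standard else 0.0
    return precision, recall


p, r = precision_recall(returned=["a", "b", "c", "d"], standard=["a", "b", "e"])
print(p, r)
```

Here two of the four returned results are correct, so precision is 2/4, and two of the three standard answers were found, so recall is 2/3.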

Comprehensive informativeness (CI) is a composite metric that integrates several different factors affecting an evaluator's understanding of the query result. These factors include not only the precision and recall of the query result but also the correctness and the understandability of the data in the data set. Comprehensive informativeness satisfies the following equation:

CI = \frac{NCA}{NA} \times \left(\frac{NCA}{A}\right)^{2} \times \alpha \times \beta

Here, CI denotes the comprehensive informativeness, NCA the number of correct answers in the query result, NA the number of standard answers to the question, A the total number of query results, α the data correctness of the data set, and β the data understandability of the data set. NCA/A is the precision of the query result (correct answers among all returned results), and NCA/NA is its recall (correct answers among all standard answers). The squared term is used to penalize irrelevant query results (that is, wrong results). For example, the data correctness α of the data set may be set to 0.8. β is the understandability of the data in the data set, which reflects whether the data are readable; it may also be set to the constant 0.8.
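The CI formula can be evaluated directly from the three counts. The sketch below uses the default values α = β = 0.8 mentioned above; the function name, the input checks and the sample counts are assumptions added for illustration.

```python
def comprehensive_informativeness(nca, na, a, alpha=0.8, beta=0.8):
    """CI = (NCA / NA) * (NCA / A)**2 * alpha * beta."""
    if na <= 0 or a <= 0:
        return 0.0  # assumed convention: no standard answers or no results gives 0
    if not 0 <= nca <= min(na, a):
        raise ValueError("NCA must lie between 0 and min(NA, A)")
    return (nca / na) * (nca / a) ** 2 * alpha * beta


# 8 correct answers among 16 returned results, with 10 standard answers.
ci = comprehensive_informativeness(nca=8, na=10, a=16)
print(round(ci, 3))  # 0.128
```

With these counts the factors are 0.8 for NCA/NA and 0.5 for NCA/A, so CI = 0.8 × 0.25 × 0.8 × 0.8 = 0.128. Halving the share of correct results among those returned quarters the score, which is the penalizing effect of the squared term.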

The specific embodiments of the present invention have been described above with reference to the accompanying drawings. However, those of ordinary skill in the art will understand that various changes and substitutions may be made to the specific embodiments without departing from the spirit and scope of the invention. All such changes and substitutions fall within the scope defined by the claims of the present invention.

Claims

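1. A method for assessing the data use quality of a data set, characterized in that the method comprises the following steps:

Obtaining a question evaluation set produced when natural language questions are answered on the data set;

Summarizing and generalizing the questions in the question evaluation set to form a plurality of question templates; and

According to the question templates and a use quality metric, comparing the final query results with the correct answers, and calculating the precision, recall and comprehensive informativeness of the query results, so that a user can assess the data use quality of the data set.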

2. The method as claimed in claim 1, characterized in that the use quality metric comprises two dimensions: queryability and informativeness, wherein queryability measures how easy or difficult it is for a user to construct a correct query on the data set for the natural language question, and informativeness measures the amount of information contained in the query result for the natural language question.

3. The method as claimed in claim 2, characterized in that queryability comprises the difficulty level of constructing the query, the time spent constructing the query, the time spent constructing the query on the domain, the time spent constructing the query on the attribute constraints, and the number of attempts at constructing the query.

4. The method as claimed in claim 2, characterized in that informativeness comprises an informativeness level, precision, recall and comprehensive informativeness.

5. The method as claimed in claim 4, characterized in that the comprehensive informativeness satisfies the following equation:

CI = \frac{NCA}{NA} \times \left(\frac{NCA}{A}\right)^{2} \times \alpha \times \beta

Here, CI denotes the comprehensive informativeness, NCA the number of correct answers in the query result, NA the number of standard answers to the question, A the total number of query results, α the data correctness of the data set, and β the data understandability of the data set. NCA/A is the precision of the query result (correct answers among all returned results), and NCA/NA is its recall (correct answers among all standard answers).

6. The method as claimed in claim 5, characterized in that the data correctness α of the data set is 0.8 and the data understandability β of the data set is 0.8.

7. The method as claimed in claim 1, characterized in that the step of obtaining the question evaluation set is realized in any one of the following ways:

- obtaining a set of standard questions from applications of the data set;

- obtaining questions from a network platform related to the data set;

- having the evaluators of data use quality define the questions themselves.

8. The method as claimed in claim 1, characterized in that the step of summarizing and generalizing the questions in the question evaluation set further comprises:

Converting each question into a query executable on the data set;

Classifying the queries according to their structure to obtain a classification result; and

Forming the question templates according to the classification result.

9. The method as claimed in claim 8, characterized in that converting a question into a query executable on the data set comprises:

Setting the domain to which the question belongs, which defines the time T_a for constructing the query on the domain;

Adding the attribute constraints of the question, which defines the time T_b for constructing the query on the attribute constraints; and

According to the domain and attribute constraints of the question, automatically constructing the query corresponding to the question and executing the query on the data set, wherein the time T for constructing the query satisfies the following equation:

T = NOA * (T_a + T_b)

Here, NOA denotes the number of attempts at constructing the query.

10. The method as claimed in claim 9, characterized in that, when the constructed query is executed on the data set and the query result does not exist or is incorrect, the domain to which the question belongs and the attribute constraints are reset in turn.