CN115410718B

CN115410718B - Method for evaluating error of investigator in large-scale face-to-face investigation

Info

Publication number: CN115410718B
Application number: CN202110593435.1A
Authority: CN
Inventors: 赵星; 孙承媛; 刘祥; 郭冰; 肖雄
Original assignee: Sichuan University
Current assignee: Sichuan University
Filing date: 2021-05-28
Publication date: 2023-04-18
Anticipated expiration: 2041-05-28

Abstract

The invention discloses a method for evaluating investigator error in large-scale face-to-face investigation, comprising: obtaining questionnaire data and recording data through a baseline survey; preprocessing the questionnaire data and then identifying outlier survey objects with the Fast-MCD algorithm; checking the recordings of the outlier survey objects according to the investigator error evaluation rules and recording the check result, which is classified into five types: correct, incorrect questioning manner, question missed/not asked, entry error, and cannot be verified, of which incorrect questioning manner, question missed/not asked, and entry error constitute investigator error; and constructing an error occurrence rate index and an error contribution rate index from the recording check data to evaluate the occurrence of investigator error. By introducing an outlier detection algorithm and checking recordings only for the abnormal data it identifies, the method finds and corrects as much investigator error as possible at low cost; it also quantifies the contribution of each investigator to investigator error, which improves data quality.

Description

Method for evaluating error of investigator in large-scale face-to-face investigation

Technical Field

The invention relates to the technical field of data quality control, in particular to a method for evaluating error of an investigator in large-scale face-to-face investigation.

Background

In large epidemiological surveys, information is often collected by means of face-to-face surveys. However, the data collection method of face-to-face investigation inevitably introduces investigator errors, and further influences the data quality and the reliability of research results. Conventional epidemiological surveys focus on data quality control through improvements in survey design, enhanced training of investigators, and the like, but the conventional data quality control measures cannot guarantee data quality due to lack of feasible data quality assessment means and limited manpower and material resources.

Disclosure of Invention

The invention aims to provide a method for evaluating investigator error in large-scale face-to-face investigation, so as to solve the problems that data quality control in the prior art lacks a feasible means of data quality evaluation and that, with limited manpower and material resources, existing quality control measures cannot guarantee data quality.

The invention solves the problems through the following technical scheme:

a method of assessing investigator error in a large face-to-face investigation, comprising:

step S1: acquiring questionnaire data and recording data of a baseline survey through an electronic information platform, and generating indexes of the questionnaire data and the recording data according to survey objects;

step S2: after the baseline survey is finished, exporting the questionnaire data and, after preprocessing, identifying outlier survey objects with the Fast-MCD algorithm, which specifically comprises the following steps:

step S21: the questionnaire data comprise n rows and p columns, i.e. n survey objects each described by p variables; h sample data are drawn from the n survey objects, wherein the value of h must satisfy

$$\frac{n+p+1}{2} \le h \le n$$

To balance robustness and computational efficiency, h takes the value 0.8n;

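step S22: calculating, for the h sample data, the sample mean $\bar{x}_1$, the covariance matrix $S_1$, and the covariance determinant $\det(S_1)$; based on $\bar{x}_1$ and $S_1$, the Mahalanobis distances of the n survey objects are then calculated:

$$d_1(i) = \sqrt{(x_i - \bar{x}_1)^{\mathsf{T}} S_1^{-1} (x_i - \bar{x}_1)}, \quad i = 1, \dots, n$$

where $x_i$ is the p-dimensional data vector of the i-th survey object;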

step S23: sorting the Mahalanobis distances of the n survey objects from smallest to largest, selecting the h survey objects with the smallest distances, and calculating for these h survey objects the sample mean $\bar{x}_2$, the covariance matrix $S_2$, the covariance determinant $\det(S_2)$, and the Mahalanobis distances of the h survey objects;

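step S24: iterating according to steps S21 to S23; if at the m-th iteration the covariance determinant equals that of the previous iteration, $\det(S_m) = \det(S_{m-1})$, the iteration stops, and the mean and covariance calculated from the m-th sample are taken as the final robust estimates of the mean and covariance, denoted $\bar{x}_{MCD}$ and $S_{MCD}$;

step S25: based on the robust estimates, calculating the Mahalanobis distances of all survey objects:

$$MD_{iMCD} = \sqrt{(x_i - \bar{x}_{MCD})^{\mathsf{T}} S_{MCD}^{-1} (x_i - \bar{x}_{MCD})}$$

where $x_i$ is the data vector of the i-th survey object;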

step S26: judging the surveyed objects with the Mahalanobis distance larger than a preset value as outliers;
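Steps S21 to S25 can be illustrated with a minimal sketch. It is not the full Fast-MCD algorithm, which uses many random starts and nested subsampling; it runs the concentration iteration from a single random start, using only the Python standard library. The function names (`mean_cov`, `solve_det`, `mahalanobis`, `c_step_mcd`), the `seed` and `max_iter` parameters, and the relative tolerance used to decide that the determinant no longer changes are all assumptions made for illustration. For real data, an established implementation such as scikit-learn's `MinCovDet` is the safer choice.

```python
import math
import random


def mean_cov(rows):
    """Sample mean vector and covariance matrix (n - 1 denominator)."""
    n, p = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1)
            for b in range(p)] for a in range(p)]
    return mu, cov


def solve_det(a, b):
    """Solve a x = b by Gaussian elimination; return (x, det(a))."""
    p = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    det = 1.0
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))  # partial pivoting
        if abs(m[piv][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        if piv != c:
            m[c], m[piv] = m[piv], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, p):
            f = m[r][c] / m[c][c]
            for k in range(c, p + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * p
    for c in reversed(range(p)):  # back substitution
        x[c] = (m[c][p] - sum(m[c][k] * x[k] for k in range(c + 1, p))) / m[c][c]
    return x, det


def mahalanobis(rows, mu, cov):
    """Mahalanobis distance of every row from the centre mu under cov."""
    out = []
    for r in rows:
        d = [r[j] - mu[j] for j in range(len(mu))]
        z, _ = solve_det(cov, d)  # z = cov^{-1} d
        out.append(math.sqrt(max(sum(di * zi for di, zi in zip(d, z)), 0.0)))
    return out


def c_step_mcd(rows, h, seed=0, max_iter=100):
    """Single-start concentration iteration; returns (mean, covariance, distances)."""
    n = len(rows)
    subset = random.Random(seed).sample(range(n), h)   # S21: draw h observations
    prev_det = None
    for _ in range(max_iter):
        mu, cov = mean_cov([rows[i] for i in subset])  # S22/S23: mean and covariance
        _, det = solve_det(cov, [0.0] * len(mu))       # covariance determinant
        dist = mahalanobis(rows, mu, cov)              # distances of all n rows
        if prev_det is not None and abs(det - prev_det) <= 1e-12 * abs(prev_det):
            break                                      # S24: determinant unchanged
        prev_det = det
        subset = sorted(range(n), key=dist.__getitem__)[:h]  # S23: h smallest distances
    return mu, cov, dist  # S25: distances computed from the final estimates
```

Calling `c_step_mcd(rows, h)` with `rows` as a list of equal-length numeric tuples returns the robust centre, the robust covariance, and one distance per survey object; step S26 then compares those distances with the preset cutoff.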

step S3: performing a recording check on the outlier survey objects according to the investigator error evaluation rules:

a quality controller logs in to the electronic information platform, retrieves the questionnaire data and recording file of the survey object corresponding to each outlier by its unique index, and judges whether the questionnaire data and the recording are consistent; if they are not consistent, the investigator failed to accurately capture and record the answers of the survey object, i.e. investigator error exists; the check result is recorded and classified into five types: correct, incorrect questioning manner, question missed/not asked, entry error, and cannot be verified, of which incorrect questioning manner, question missed/not asked, and entry error constitute investigator error;

step S4: based on the recording check data, constructing an error occurrence rate index and an error contribution rate index to evaluate the occurrence of investigator error:

each investigator may survey one or more survey objects, and the survey performance of the investigators is evaluated by calculating the investigator error occurrence rate of each investigator, in which:

the number of questions with investigator error = the number of questions with an incorrect questioning manner + the number of questions missed/not asked + the number of questions with an entry error.
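As a hedged illustration of the count defined above, the sketch below tallies the checked questions per investigator and derives an error occurrence rate. The source text does not preserve the denominator of the rate, so dividing by every checked question (including those that cannot be verified) is an assumption here, as are the string codes for the check results and the function name `error_rates`.

```python
from collections import Counter

# Check-result codes that count as investigator error (hypothetical labels).
ERROR_TYPES = {"wrong_questioning", "missed_or_not_asked", "entry_error"}


def error_rates(records):
    """Map investigator id -> share of checked questions with investigator error.

    `records` is an iterable of (investigator_id, check_result) pairs, one pair
    per checked question. ASSUMPTION: the denominator is every checked question.
    """
    checked = Counter()
    errors = Counter()
    for investigator, result in records:
        checked[investigator] += 1
        if result in ERROR_TYPES:
            errors[investigator] += 1
    return {inv: errors[inv] / total for inv, total in checked.items()}


records = [
    ("A", "correct"), ("A", "entry_error"), ("A", "wrong_questioning"), ("A", "correct"),
    ("B", "correct"), ("B", "correct"), ("B", "missed_or_not_asked"), ("B", "unverifiable"),
]
rates = error_rates(records)  # A: 2 of 4 questions, B: 1 of 4 questions
```

If unverifiable questions should be excluded from the denominator instead, they can be filtered out of `records` before the call.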

Further comprising step S5: based on the calculated investigator error occurrence rates, further analyzing the prevalence characteristics of investigator error across investigators by exploring its distribution pattern and aggregation pattern; the distribution pattern is shown by a probability density plot; the aggregation pattern is used to explore whether investigator error is concentrated in a subset of the investigators.

Further comprising step S6: according to the calculated investigator error occurrence rate $ER_i$ of each investigator, further calculating the investigator error contribution rate of each investigator as

$$\frac{ER_i}{\sum_{j=1}^{k} ER_j}$$

where k denotes the number of investigators, and $ER_i$ and $ER_j$ denote the investigator error occurrence rates of the i-th and j-th investigators, respectively.

The larger this value, the greater the risk that the investigator introduces investigator error.
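A minimal sketch of the standardization in step S6 follows. The formula itself is not preserved in the source, so the code assumes that the contribution rate of an investigator is that investigator's error occurrence rate divided by the sum of the rates of all k investigators; the function name and the handling of an all-zero input are also assumptions.

```python
def error_contribution_rates(er):
    """Normalize per-investigator error rates so that they sum to 1.

    `er` maps investigator id -> error occurrence rate (ER_i).
    ASSUMPTION: contribution rate of i = ER_i / sum of ER_j over all investigators.
    """
    total = sum(er.values())
    if total == 0:
        # No investigator error was found, so no one contributes.
        return {inv: 0.0 for inv in er}
    return {inv: rate / total for inv, rate in er.items()}


shares = error_contribution_rates({"A": 0.5, "B": 0.25, "C": 0.25})
```

Because rates rather than raw error counts are normalized, an investigator who surveyed more survey objects is not penalized simply for having had more questions checked.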

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) Compared with the conventional approach of re-surveying and comparing the consistency of the two sets of survey data, the recording check method saves the manpower and material resources consumed by a second survey and avoids data differences caused by the two surveys being conducted at different times.

(2) An outlier detection algorithm is introduced for the first time, and recording checks are carried out on the abnormal data it identifies, so that as much investigator error as possible is found and corrected at low cost; this is of considerable practical value in large-scale epidemiological surveys.

(3) The method quantifies the contribution of each investigator to investigator error, which helps target future measures to reduce investigator error and improves data quality.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.

Example 1:

Referring to FIG. 1, a method for evaluating investigator error in large-scale face-to-face investigation comprises four stages: baseline survey, outlier detection, recording check, and post-check analysis. The specific steps are as follows:

Step one: during the baseline survey, data are collected face to face and the whole interview is audio-recorded. Specifically, an electronic information platform is built to digitize the entire survey process, and the platform needs to include the following functional modules:

1) Data acquisition module (PAD side): acquires questionnaire data and recording data through the PAD and uploads them when a network connection is available.

2) Data management module (computer side): generates a unique index for each survey object, through which the questionnaire data and recording file of that survey object can be retrieved; survey objects meeting given conditions can be queried by keyword, and questionnaire and recording data can be exported in batches.

3) Quality control module (computer side): retrieves the questionnaire data and recording file of a specific survey object, so that the questionnaire can be checked while listening to the recording and a quality control report can be filled in.

Step two: after the baseline survey is finished, the questionnaire data are exported. After duplicate survey objects and survey objects with missing values are deleted, a multivariate outlier detection method, the Minimum Covariance Determinant (MCD) computed with the Fast-MCD algorithm, is used to identify abnormal survey objects in the questionnaire data, specifically as follows:

The larger the value of h, the more efficient the MCD method, but the lower the robustness of the estimator; to balance robustness and computational efficiency, h takes the value 0.8n;

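step S22: calculating, for the h sample data, the sample mean $\bar{x}_1$, the covariance matrix $S_1$, and the covariance determinant $\det(S_1)$; based on $\bar{x}_1$ and $S_1$, the Mahalanobis distances of the n survey objects are then calculated:

$$d_1(i) = \sqrt{(x_i - \bar{x}_1)^{\mathsf{T}} S_1^{-1} (x_i - \bar{x}_1)}, \quad i = 1, \dots, n$$

where $x_i$ is the p-dimensional data vector of the i-th survey object;

step S23: sorting the Mahalanobis distances of the n survey objects from smallest to largest, selecting the h survey objects with the smallest distances, and calculating for these h survey objects the sample mean $\bar{x}_2$, the covariance matrix $S_2$, the covariance determinant $\det(S_2)$, and the Mahalanobis distances of the h survey objects;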

The mean and covariance calculated from the m-th sample are taken as the final robust estimates of the mean and covariance, denoted $\bar{x}_{MCD}$ and $S_{MCD}$;

step S26: the larger the value of $MD_{iMCD}$ for a survey object, the more reasonable it is to judge that survey object an outlier. Since the squared robust Mahalanobis distance approximately follows a chi-square distribution with p degrees of freedom,

$$MD_{iMCD}^2 \sim \chi^2_p$$

survey objects whose Mahalanobis distance exceeds a preset value, such as $\sqrt{\chi^2_{p,\,1-\alpha}}$ for a chosen level $\alpha$, are determined to be outliers;
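The cutoff rule in step S26 can be sketched with the standard library alone. The quantile level is not preserved in the source, so the default of 0.975 below is a hypothetical choice, and the chi-square quantile is obtained from the Wilson-Hilferty approximation rather than an exact inverse; where SciPy is available, `scipy.stats.chi2.ppf` gives the exact value. The function names are illustrative.

```python
from statistics import NormalDist


def chi2_quantile_wh(p, q):
    """Approximate the q-quantile of a chi-square distribution with p degrees
    of freedom using the Wilson-Hilferty cube-root approximation."""
    z = NormalDist().inv_cdf(q)   # standard normal quantile
    a = 2.0 / (9.0 * p)
    return p * (1.0 - a + z * a ** 0.5) ** 3


def flag_outliers(distances, p, q=0.975):
    """Flag robust Mahalanobis distances above sqrt(chi-square quantile).

    ASSUMPTION: q = 0.975 is a hypothetical default, not a value from the source.
    """
    cutoff = chi2_quantile_wh(p, q) ** 0.5
    return [d > cutoff for d in distances]


flags = flag_outliers([1.0, 3.2, 6.0], p=10)  # cutoff is roughly 4.5 for p = 10
```

The approximation is close for moderate p and becomes less accurate for very small p, so an exact quantile is preferable when only a few variables are used.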

A quality controller logs in to the electronic information platform, retrieves the questionnaire data and recording file of the survey object corresponding to each outlier by its unique index, and judges whether the questionnaire data and the recording are consistent; if they are not consistent, the investigator failed to accurately capture and record the answers of the survey object, i.e. investigator error exists. The check result is recorded and classified into five types: correct, incorrect questioning manner, question missed/not asked, entry error, and cannot be verified, of which incorrect questioning manner, question missed/not asked, and entry error constitute investigator error. The specific investigator error evaluation rules are shown in Table 1 below:

TABLE 1 investigator error assessment rules
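The five check-result categories named in the text can be encoded so that downstream counts treat them consistently. The sketch below is only one possible encoding; the class name, member names, and helper function are hypothetical, and it does not reproduce the detailed rules of Table 1.

```python
from enum import Enum


class CheckResult(Enum):
    """Outcome of comparing one questionnaire answer with the recording."""
    CORRECT = "correct"
    WRONG_QUESTIONING = "incorrect questioning manner"
    MISSED_OR_NOT_ASKED = "question missed / not asked"
    ENTRY_ERROR = "entry error"
    UNVERIFIABLE = "cannot be verified"


# The three categories that the text counts as investigator error.
INVESTIGATOR_ERRORS = frozenset({
    CheckResult.WRONG_QUESTIONING,
    CheckResult.MISSED_OR_NOT_ASKED,
    CheckResult.ENTRY_ERROR,
})


def is_investigator_error(result):
    """Return True when the check result counts as investigator error."""
    return result in INVESTIGATOR_ERRORS
```

Keeping "cannot be verified" as its own category, rather than folding it into "correct", leaves the choice of how to treat such questions in later rate calculations explicit.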

Step S4: based on the recording check data, an error occurrence rate index and an error contribution rate index are constructed to evaluate the occurrence of investigator error. The overall occurrence of investigator error and the probability of each type of investigator error are reflected by the error rate (ER) index, in which the number of questions with investigator error = the number of questions with an incorrect questioning manner + the number of questions missed/not asked + the number of questions with an entry error.

Further comprising step S5: based on the calculated investigator error occurrence rates, further analyzing the prevalence characteristics of investigator error across investigators by exploring its distribution pattern and aggregation pattern; the distribution pattern is shown by a probability density plot; the aggregation pattern is used to explore whether investigator error is concentrated in a subset of the investigators.

This study is the first to propose an error contribution rate index to reflect the aggregation tendency of investigator error. The number of survey objects differs between investigators: the more survey objects an investigator surveys, the more questions with investigator error that investigator accumulates, and the greater that investigator's apparent contribution to investigator error. A standardized procedure is therefore proposed to estimate the error contribution rate of each investigator, specifically as follows:

According to the calculated investigator error occurrence rate $ER_i$ of each investigator, the investigator error contribution rate of each investigator is further calculated as

$$\frac{ER_i}{\sum_{j=1}^{k} ER_j}$$

where k denotes the number of investigators, and $ER_i$ and $ER_j$ denote the investigator error occurrence rates of the i-th and j-th investigators, respectively.

The larger this value, the greater the risk that the investigator introduces investigator error.

Sensitivity analysis is carried out to ensure the robustness of the data quality evaluation results. Specifically, a small number of non-outlier individuals are randomly sampled and subjected to the same recording check and analysis. Finally, the recording check results of the outlier sample and the non-outlier sample are compared to evaluate the robustness of the findings.
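The sampling and comparison in the sensitivity analysis can be sketched as follows. The sample size, the random seed, and the use of a simple difference in error shares as the comparison are assumptions for illustration; the source does not specify how the two groups are compared. All function names are hypothetical.

```python
import random


def sample_non_outliers(all_ids, outlier_ids, k, seed=0):
    """Randomly draw up to k survey objects that were not flagged as outliers."""
    pool = sorted(set(all_ids) - set(outlier_ids))
    return random.Random(seed).sample(pool, min(k, len(pool)))


def error_share(flags):
    """Share of checked questions marked as investigator error (True)."""
    return sum(flags) / len(flags) if flags else float("nan")


def compare_groups(outlier_flags, non_outlier_flags):
    """Error shares in the outlier and non-outlier samples and their difference."""
    out = error_share(outlier_flags)
    non = error_share(non_outlier_flags)
    return {"outlier": out, "non_outlier": non, "difference": out - non}


picked = sample_non_outliers(range(100), {3, 7, 42}, k=10)
summary = compare_groups([True, False, False, False], [False] * 8)
```

A markedly higher error share among the outlier sample would support the premise that outlier detection directs the recording checks toward the records most likely to contain investigator error.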

Although the invention has been described herein with reference to illustrative embodiments, these are merely preferred embodiments and the invention is not limited to them; many other modifications and embodiments can be devised by those skilled in the art that fall within the spirit and scope of the principles of this disclosure.

Claims

1. A method of assessing investigator error in a large face-to-face investigation, comprising:

step S2: after the baseline survey is finished, exporting the questionnaire data and, after preprocessing, identifying outlier survey objects with the Fast-MCD algorithm, which specifically comprises the following steps:

step S21: the questionnaire data comprise n rows and p columns, i.e. n survey objects each described by p variables; h sample data are drawn from the n survey objects, wherein the value of h must satisfy

$$\frac{n+p+1}{2} \le h \le n$$

and h takes the value 0.8n;

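step S22: calculating, for the h sample data, the sample mean $\bar{x}_1$, the covariance matrix $S_1$, and the covariance determinant $\det(S_1)$; based on $\bar{x}_1$ and $S_1$, the Mahalanobis distances of the n survey objects are then calculated:

$$d_1(i) = \sqrt{(x_i - \bar{x}_1)^{\mathsf{T}} S_1^{-1} (x_i - \bar{x}_1)}, \quad i = 1, \dots, n$$

where $x_i$ is the p-dimensional data vector of the i-th survey object;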

step S23: sorting the Mahalanobis distances of the n survey objects from smallest to largest, selecting the h survey objects with the smallest distances, and calculating for these h survey objects the sample mean $\bar{x}_2$, the covariance matrix $S_2$, the covariance determinant $\det(S_2)$, and the Mahalanobis distances of the h survey objects;

the mean and covariance calculated from the m-th sample are taken as the final robust estimates of the mean and covariance, denoted $\bar{x}_{MCD}$ and $S_{MCD}$;

a quality controller logs in to the electronic information platform, retrieves the questionnaire data and recording file of the survey object corresponding to each outlier by its unique index, and judges whether the questionnaire data and the recording are consistent; if they are not consistent, the investigator failed to accurately capture and record the answers of the survey object, i.e. investigator error exists; the check result is recorded and classified into five types: correct, incorrect questioning manner, question missed/not asked, entry error, and cannot be verified, of which incorrect questioning manner, question missed/not asked, and entry error constitute investigator error;


the number of questions with investigator error = the number of questions with an incorrect questioning manner + the number of questions missed/not asked + the number of questions with an entry error.

2. The method of assessing investigator error in a large face-to-face investigation according to claim 1, further comprising:

step S5: based on the calculated investigator error occurrence rates, further analyzing the prevalence characteristics of investigator error across investigators by exploring its distribution pattern and aggregation pattern; the distribution pattern is shown by a probability density plot; the aggregation pattern is used to explore whether investigator error is concentrated in a subset of the investigators.

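3. The method of assessing investigator error in a large face-to-face investigation according to claim 2, further comprising:

step S6: according to the calculated investigator error occurrence rate $ER_i$ of each investigator, further calculating the investigator error contribution rate of each investigator as

$$\frac{ER_i}{\sum_{j=1}^{k} ER_j}$$

where k denotes the number of investigators, and $ER_i$ and $ER_j$ denote the investigator error occurrence rates of the i-th and j-th investigators, respectively.

The larger this value, the greater the risk that the investigator introduces investigator error.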