CN107562725B

CN107562725B - Index extraction verification method and device

Info

Publication number: CN107562725B
Application number: CN201710774876.5A
Authority: CN
Inventors: 黄晓
Original assignee: New H3C Big Data Technologies Co Ltd
Current assignee: New H3C Big Data Technologies Co Ltd
Priority date: 2017-08-31
Filing date: 2017-08-31
Publication date: 2020-10-09
Anticipated expiration: 2037-08-31
Also published as: CN107562725A

Abstract

The disclosure relates to a verification method and device for index extraction. The method comprises: performing index extraction on each electronic text to obtain a first index of the electronic text and a corresponding first index value; extracting electronic texts as samples; for each sample, acquiring a second index and a corresponding second index value of the sample; verifying the first index and the first index value of the sample by using the second index and the second index value of the sample to obtain a verification result; and evaluating the accuracy of the index extraction based on the verification result. The verification method and device for index extraction according to the embodiments of the disclosure can verify and reliably evaluate the accuracy of the extracted indexes.

Description

Index extraction verification method and device

Technical Field

The present disclosure relates to the field of data processing, and in particular, to a verification method and apparatus for index extraction.

Background

Data extraction from unstructured text is a widely studied problem. For example, in medical big data, medical indexes need to be extracted from unstructured free text (e.g., medical records written by doctors) for data mining and analysis. Doctors write in a free-form manner, so the process of extracting and analyzing unstructured electronic medical records is complex, and the extracted indexes may suffer from problems such as missing values and incorrect total data counts. The quality of the extracted indexes determines the quality of the upper-layer analysis. In actual projects, the accuracy of the extracted indexes can be checked after each extraction, and the extraction method can be optimized by using the check results.

In the related technology, index extraction can be checked by comparing the number of data items, but this cannot verify the accuracy of the extracted indexes themselves, for example, whether an index was omitted during extraction or whether an extracted index value is consistent with the original data.

Disclosure of Invention

In view of this, the present disclosure provides a method and an apparatus for checking index extraction, which can check and reliably evaluate the accuracy of an extracted index from an electronic text.

According to an aspect of the present disclosure, there is provided a verification method for index extraction, the method including: extracting indexes of each electronic text to obtain a first index of the electronic text and a corresponding first index value; extracting electronic text as a sample; for each sample, acquiring a second index and a corresponding second index value of the sample; verifying the first index and the first index value of the sample by using the second index and the second index value of the sample to obtain a verification result; and evaluating the accuracy of index extraction based on the verification result.

According to another aspect of the present disclosure, there is provided a verification apparatus for index extraction, the apparatus including: an extraction module, used for performing index extraction on each electronic text to obtain a first index of the electronic text and a corresponding first index value; a sampling module, used for extracting the electronic texts serving as samples; an acquisition module, used for acquiring a second index and a corresponding second index value of each sample; a verification module, used for verifying the first index and the first index value of the sample by using the second index and the second index value of the sample to obtain a verification result; and an evaluation module, used for evaluating the accuracy of index extraction based on the verification result.

According to the above aspects of the present disclosure, a first index and a corresponding first index value are extracted from each electronic text; electronic texts are extracted as samples, and a second index and a corresponding second index value are collected for each sample; the first index and the first index value of the sample are verified by using the second index and the second index value of the sample; and the accuracy of extraction is evaluated based on the verification result. In this way, the verification method and device for index extraction can verify and reliably evaluate the accuracy of the extracted indexes. Meanwhile, because the accuracy of index extraction is evaluated through the verification results of samples, the verification time and the verification workload can be reduced.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.

FIG. 1 shows a flow diagram of a verification method of index extraction according to an embodiment of the present disclosure;

FIG. 2 shows a flow diagram of a verification method of index extraction according to an embodiment of the present disclosure;

FIG. 3 shows a flow diagram of a verification method of index extraction according to an embodiment of the present disclosure;

FIG. 4a shows a histogram of the number of types of questions according to one example of the present disclosure;

FIG. 4b shows a histogram of the number of types of questions according to one example of the present disclosure;

FIG. 4c shows a histogram of the number of types of questions according to one example of the present disclosure;

FIG. 5 shows a flow diagram of a verification method of index extraction according to an embodiment of the present disclosure;

FIG. 6 shows a block diagram of a verification device for index extraction according to an embodiment of the present disclosure;

FIG. 7 shows a block diagram of a verification device for index extraction according to an embodiment of the present disclosure;

FIG. 8 shows a block diagram of a verification apparatus for index extraction according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Fig. 1 shows a flowchart of a verification method of index extraction according to an embodiment of the present disclosure. The method can be applied to terminal devices such as computers and mobile terminals (e.g., mobile phones and tablet computers). As shown in fig. 1, the verification method for index extraction includes:

Step S11, performing index extraction on each electronic text to obtain a first index of the electronic text and a corresponding first index value.

The electronic text can be a text written in an unstructured form such as natural language, and its content may be free-form and disorganized. The information in the electronic text can be automatically extracted by information extraction techniques in the related art, so that the information contained in the electronic text is structured into an organized form such as a table. An information extraction system takes electronic texts as input and outputs structured data in a specific format; the information is integrated in a uniform format, so that the relationships among the pieces of information can be managed and mined more conveniently. Structured data, also called row data, can be stored in a database and can be logically expressed by a two-dimensional table structure.

The first index and the corresponding first index value of the electronic text may represent an index and an index value extracted from the electronic text by an information extraction technique.

Taking an electronic text as an example of an electronic medical record, table 1 shows a first index and a corresponding first index value obtained by the terminal device performing index extraction on two electronic medical records with a patient ID of 0001 and a patient ID of 0003 respectively. As shown in table 1, index extraction is performed on the electronic medical record with the patient ID of 0001 to obtain two first indexes of ascites and differentiation of the electronic text, the first index value corresponding to ascites is 100, and the first index value corresponding to differentiation is high; index extraction is carried out on the electronic medical record with the patient ID of 0003, three first indexes of ascites, stage and differentiation of the electronic text are obtained, the first index value corresponding to the ascites is 156, the first index value corresponding to the stage is III, and the first index value corresponding to the differentiation is low.

TABLE 1 First index and corresponding first index value

Patient ID	Ascites	Stage	Differentiation
0001	100	-	High
0003	156	III	Low
……

Step S12, extracting electronic texts as samples.

Because verifying all the electronic texts is costly, sampling can be used: part of the electronic texts are extracted as samples for verification, and the index extraction results for the samples approximately reflect the index extraction results for all the electronic texts. The samples may be drawn randomly. The terminal device may extract the electronic texts serving as samples by using database techniques in the related art.

Verifying only part of the electronic texts sometimes cannot fully represent all the electronic texts, so the obtained result may contain an error, which is called the estimation error. The number of electronic texts extracted at one time as samples is called the sample size. The larger the sample size, the smaller the estimation error and the better the samples reflect the overall situation, but the higher the verification cost. In one possible implementation, the sample size may be determined using the following equation:

$$\text{sample size} = \frac{(\text{confidence level threshold})^{2} \times \text{expected accuracy} \times (1 - \text{expected accuracy})}{(\text{estimation error})^{2}}$$

wherein the confidence level threshold is the critical value of the standard normal distribution corresponding to the confidence level. The confidence level refers to the probability that the population is correctly estimated when it is estimated from samples. For example, when the confidence level is 95%, the probability of a correct estimate is 95%, and the corresponding confidence level threshold is 1.96. The expected accuracy may not be known in advance; in that case it may be chosen to be 0.5, at which expected accuracy × (1 − expected accuracy) takes its largest value, resulting in a more conservative sample size. In one example, with a confidence level of 95% (corresponding to a confidence level threshold of 1.96), an expected accuracy of 0.5, and an estimation error of 0.1, the calculated sample size is 96. The sample size may be determined according to the accuracy requirement, which is not limited by this disclosure.
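The sample-size calculation can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the sample size equals threshold² × expected accuracy × (1 − expected accuracy) / estimation error², which reproduces the sample size of 96 in the example above. The function name `sample_size` and its parameter names are hypothetical, and rounding to the nearest integer is an assumption chosen to match that example.

```python
def sample_size(threshold, expected_accuracy, estimation_error):
    """Return the sample size for a given confidence level threshold.

    threshold         -- confidence level threshold, e.g. 1.96 for 95%
    expected_accuracy -- expected accuracy; 0.5 is the conservative choice
    estimation_error  -- tolerated estimation error, e.g. 0.1
    """
    if not 0 < expected_accuracy < 1:
        raise ValueError("expected_accuracy must lie strictly between 0 and 1")
    if estimation_error <= 0:
        raise ValueError("estimation_error must be positive")
    n = (threshold ** 2) * expected_accuracy * (1 - expected_accuracy)
    n /= estimation_error ** 2
    # Assumption: round to the nearest integer, as in the worked example.
    return round(n)


# Values from the example: 95% confidence, expected accuracy 0.5, error 0.1.
example_size = sample_size(1.96, 0.5, 0.1)
```

A stricter variant could round up instead (for example with `math.ceil`), which yields a slightly larger sample for the same inputs.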

Step S13, for each sample, a second index and a corresponding second index value of the sample are collected.

Wherein the second index refers to an index actually existing in the sample. The second index value refers to an index value corresponding to an index actually existing in the sample. Since the second index and the second index value are actually present in the sample, the second index and the second index value may be used to verify the first index and the corresponding first index value.

In a possible implementation manner, the terminal device may provide an input interface for acquiring the second index and the corresponding second index value. In one example, the terminal device may generate a data table as the input interface. The input interface (e.g., data table) may have the same structure format as extracted in step S11 to facilitate subsequent verification.

In one possible implementation, the terminal device may display the extracted samples in sequence and mark each displayed sample based on index keywords. For example, the terminal device may highlight the index keywords matched in the sample, so that the verifier can quickly locate the position of each index. The verifier may determine the index value corresponding to a highlighted index keyword, enter the index keyword as a second index in an input interface such as a data table, and enter the index value corresponding to the index keyword as the second index value. The index keywords can be preset or obtained from user input.
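As a rough illustration of the keyword marking described above, the following Python sketch wraps each matched index keyword in visible markers. The function name `highlight_keywords`, the `[[` and `]]` markers, and the sample sentence are all hypothetical; an actual terminal device would more likely use colour or styling in its display.

```python
import re


def highlight_keywords(text, keywords, left="[[", right="]]"):
    """Wrap every occurrence of an index keyword in marker strings."""
    if not keywords:
        return text
    # Try longer keywords first so they win over keywords that are their prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = "|".join(re.escape(k) for k in ordered)
    return re.sub(pattern, lambda m: f"{left}{m.group(0)}{right}", text)


# Hypothetical sample text with two preset index keywords.
marked = highlight_keywords("ascites 90 ml, stage I", ["ascites", "stage"])
```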

Taking the case where the sample is an electronic medical record as an example, table 2 shows the second indexes and corresponding second index values acquired by the terminal device for two samples with patient IDs 0001 and 0003, respectively. As shown in table 2, for the sample with patient ID 0001, three second indexes, ascites, stage and differentiation, are collected; the second index value corresponding to ascites is 90, the second index value corresponding to stage is I, and the second index value corresponding to differentiation is high. For the sample with patient ID 0003, two second indexes, ascites and stage, are collected; the second index value corresponding to ascites is 155, and the second index value corresponding to stage is II.

TABLE 2 Second index and corresponding second index value

Patient ID	Ascites	Stage	Differentiation
0001	90	I	High
0003	155	II	-
……

Step S14, the first index and the first index value of the sample are verified by the second index and the second index value of the sample, and a verification result is obtained.

A sample may have one or more first indexes and one or more second indexes, where the first indexes are different from one another and the second indexes are different from one another. Take the first indexes and corresponding first index values shown in table 1 and the second indexes and corresponding second index values shown in table 2 as an example. As shown in table 1, the sample with patient ID 0001 has two first indexes, ascites and differentiation; the sample with patient ID 0003 has three first indexes, ascites, stage and differentiation. As shown in table 2, the sample with patient ID 0001 has three second indexes, ascites, stage and differentiation; the sample with patient ID 0003 has two second indexes, ascites and stage. In one example, the terminal device may use the second indexes and second index values of a sample shown in table 2 to verify the first indexes and first index values of the corresponding sample shown in table 1. For example, ascites and 90, stage and I, and differentiation and high of the sample with patient ID 0001 shown in table 2 are used to verify ascites and 100, and differentiation and high of the sample with patient ID 0001 shown in table 1.

The verification result may be used to represent the difference between the second index and second index value of the sample and the first index and first index value of the sample. In a possible implementation manner, the terminal device may compare whether the second index of the sample is the same as the first index and whether the corresponding second index value is the same as the first index value, and use the comparison result (same or different) as the verification result of the first index and the first index value of the sample. The terminal device may also verify the first index and the first index value of the sample in other manners, and accordingly may use other content as the verification result, for example, correct, incorrect, and the like.

Step S15, evaluating the accuracy of index extraction based on the verification result.

In one possible implementation, the terminal device may evaluate the accuracy of the index extraction in terms of qualification rate, completeness, and the like. In a possible implementation manner, when the terminal device extracts the electronic texts serving as samples, M electronic texts may be extracted as samples each time, and N sample sets are obtained by extracting N times, where each sample set includes M samples and N and M are both positive integers. In one example, M may be 96, and N may be determined based on the total number of electronic texts. In this way, the index extraction results for all the electronic texts are approximately reflected by the index extraction results for the N sample sets, which reduces the evaluation deviation caused by a single sample set and improves the evaluation accuracy.
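Drawing N sample sets of M electronic texts each can be sketched as follows. This is an illustrative sketch only: the function name `draw_sample_sets`, the use of integer text IDs, and the choice to draw each set independently (so that different sets may share texts) are assumptions not stated in the description.

```python
import random


def draw_sample_sets(text_ids, m, n, seed=None):
    """Draw n sample sets, each holding m distinct electronic-text IDs."""
    if m > len(text_ids):
        raise ValueError("sample size m exceeds the number of electronic texts")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    # Each set is drawn without replacement; sets are drawn independently.
    return [rng.sample(text_ids, m) for _ in range(n)]


# Hypothetical corpus of 1000 electronic texts, M = 96, N = 5.
sample_sets = draw_sample_sets(list(range(1000)), m=96, n=5, seed=0)
```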

In the verification method for index extraction according to the embodiment of the disclosure, a first index and a corresponding first index value are extracted from each electronic text; electronic texts are extracted as samples, and a second index and a corresponding second index value are collected for each sample; the first index and the first index value of the sample are verified by using the second index and the second index value of the sample; and the accuracy of the extracted indexes is evaluated based on the verification result. The method can therefore verify and reliably evaluate the accuracy of the extracted indexes. Meanwhile, because the accuracy of index extraction is evaluated through the verification results of samples, the verification time and the verification workload can be reduced.

Fig. 2 shows a flowchart of a verification method of index extraction according to an embodiment of the present disclosure. As shown in fig. 2, step S14, verifying the first index and the first index value of the sample by using the second index and the second index value of the sample to obtain a verification result, may be implemented as:

Step S141, for any first index of the sample, if there is a second index identical to the first index and the corresponding first index value and second index value are identical, recording a verification result of the first index, where the type of the verification result is index extraction correct.

In a possible implementation manner, the terminal device may sequentially compare any first index in the sample with one or more second indexes of the sample, and if there is a second index that is the same as the first index, determine whether the corresponding first index value and the corresponding second index value are the same. If the corresponding first index value and second index value are the same, the first index is correctly extracted, and the terminal device can record the type of the verification result of the first index as index extraction correct.

In one example, the first indexes and corresponding first index values shown in table 1 and the second indexes and corresponding second index values shown in table 2 are used as examples. The sample with patient ID 0001 has the first index differentiation (as shown in table 1) and also has the second index differentiation (as shown in table 2); the corresponding first index value is high and the second index value is high, so the first index value and the second index value are the same. The terminal device may record the type of the verification result of differentiation for the sample with patient ID 0001 as index extraction correct, as shown in table 3.

TABLE 3 Verification result types of indexes

Patient ID	Ascites	Stage	Differentiation
0001	Index value extraction error	Index missing	Index extraction correct
0003	Index value extraction error	Index value extraction error	Index false extraction
……

Step S142, for any first index of the sample, if a second index identical to the first index exists and the corresponding first index value and second index value are different, recording a verification result of the first index, where the type of the verification result is index value extraction error.

In a possible implementation manner, the terminal device may sequentially compare any first index in the sample with one or more second indexes of the sample, and if there is a second index that is the same as the first index, determine whether the corresponding first index value and the corresponding second index value are the same. If the corresponding first index value and second index value are different, the first index value corresponding to the first index was extracted incorrectly, and the terminal device may record the type of the verification result of the first index as index value extraction error.

In one example, the first indexes and corresponding first index values shown in table 1 and the second indexes and corresponding second index values shown in table 2 are used as examples. The sample with patient ID 0001 has the first index ascites (as shown in table 1) and also has the second index ascites (as shown in table 2); the corresponding first index value is 100 and the second index value is 90, so the first index value and the second index value are different. The terminal device may record the type of the verification result of ascites for the sample with patient ID 0001 as index value extraction error, as shown in table 3.

Step S143, for any first index of the sample, if there is no second index identical to the first index, recording a verification result of the first index, where the type of the verification result is index false extraction.

In a possible implementation manner, the terminal device may sequentially compare any first index in the sample with one or more second indexes of the sample. If there is no second index that is the same as the first index, the first index should not have been extracted but was actually extracted, and the terminal device may record the type of the verification result of the first index as index false extraction.

In one example, the first indexes and corresponding first index values shown in table 1 and the second indexes and corresponding second index values shown in table 2 are used as examples. The sample with patient ID 0003 has the first index differentiation (as shown in table 1), but none of the second indexes of the sample with patient ID 0003 is differentiation (as shown in table 2). The terminal device may record the type of the verification result of differentiation for the sample with patient ID 0003 as index false extraction, as shown in table 3.

Step S144, for any second index of the sample, if there is no first index identical to the second index, recording a verification result of the second index, where the type of the verification result is index missing.

In a possible implementation manner, the terminal device may sequentially compare any second index in the sample with one or more first indexes of the sample, and if there is no first index that is the same as the second index, it indicates that the second index should be extracted, but is not actually extracted, and the terminal device may record the type of the verification result of the second index as index missing.

In one example, the first indexes and corresponding first index values shown in table 1 and the second indexes and corresponding second index values shown in table 2 are used as examples. The sample with patient ID 0001 has the second index stage (as shown in table 2), but none of the first indexes of the sample with patient ID 0001 is stage (as shown in table 1). The terminal device may record the type of the verification result of stage for the sample with patient ID 0001 as index missing, as shown in table 3.
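Steps S141 to S144 can be sketched together in Python as a single comparison routine. This is a minimal sketch under stated assumptions: each sample's indexes are held in a dictionary mapping index name to index value, values are compared with plain equality unless another comparison function is supplied, and the names `verify_sample`, `table1`, `table2` and the result-type strings are hypothetical. The data are the values from tables 1 and 2.

```python
CORRECT = "index extraction correct"
VALUE_ERROR = "index value extraction error"
FALSE_EXTRACTION = "index false extraction"
MISSING = "index missing"


def verify_sample(first, second, same_value=lambda a, b: a == b):
    """Classify every index of one sample.

    first  -- extracted indexes (first index -> first index value)
    second -- collected indexes (second index -> second index value)
    """
    results = {}
    for index, first_value in first.items():
        if index not in second:
            results[index] = FALSE_EXTRACTION  # step S143
        elif same_value(first_value, second[index]):
            results[index] = CORRECT           # step S141
        else:
            results[index] = VALUE_ERROR       # step S142
    for index in second:
        if index not in first:
            results[index] = MISSING           # step S144
    return results


# First indexes (table 1) and second indexes (table 2), keyed by patient ID.
table1 = {
    "0001": {"ascites": 100, "differentiation": "high"},
    "0003": {"ascites": 156, "stage": "III", "differentiation": "low"},
}
table2 = {
    "0001": {"ascites": 90, "stage": "I", "differentiation": "high"},
    "0003": {"ascites": 155, "stage": "II"},
}

table3 = {pid: verify_sample(table1[pid], table2[pid]) for pid in table1}
```

With these inputs, `table3` holds the same verification result types as table 3. The result for each sample covers the union of its first and second indexes, so false extractions and missing indexes are both reported.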

In a possible implementation manner, in a case that there is a second index that is the same as one first index, the terminal device may first determine types of a first index value and a second index value corresponding to the same first index and second index, respectively, and determine whether the second index value and the first index value are the same based on the types of the first index value and the second index value.

In a possible implementation manner, when the types of the first index value and the second index value are both numerical, the terminal device may determine that the first index value and the second index value are the same if they are equal, and different otherwise. For example, the first index value corresponding to ascites of the sample with patient ID 0001 in table 1 is 100, the second index value corresponding to ascites of the sample with patient ID 0001 in table 2 is 90, and the two values are not equal. In this case, the terminal device may record the type of the verification result of ascites for the sample with patient ID 0001 as index value extraction error.

In a possible implementation manner, when at least one of the first index value and the second index value is of a character type, the terminal device may determine that the two values are the same if both contain a specified keyword, or if the characters they contain are completely the same; the terminal device may determine that the two values are different if at least one of them does not contain the specified keyword, or if their characters differ.

In one example, the first index and the second index of the sample are both pathological type, the corresponding first index value is "ovarian serous carcinoma", and the corresponding second index value is "left ovarian serous papillary adenocarcinoma".

If the rule chosen is that the first index value and the second index value are the same when the characters they contain are completely the same, and different otherwise, the terminal device may determine that the first index value and the second index value are different, and may record the type of the verification result of the pathological type of the sample as index value extraction error.

If the rule chosen is that the first index value and the second index value are the same when both contain the specified keyword, and different when at least one of them does not, then, with the specified keyword being "ovarian", the terminal device may determine that the first index value and the second index value are the same, and may record the type of the verification result of the pathological type of the sample as index extraction correct.

The above are only examples of how to determine whether the first index value and the second index value are the same; the present disclosure does not limit the determination method.
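The type-dependent value comparison can be sketched in Python as follows. This is one possible reading, not a definitive implementation: the function name `values_match` and the `keyword` parameter are hypothetical, numeric values are compared by equality, and character values are compared either by exact match or by shared keyword. The keyword is written as "ovarian" here so that it is a literal substring of both English example strings.

```python
from numbers import Number


def is_numeric(value):
    """True for int/float-like values; bool is excluded on purpose."""
    return isinstance(value, Number) and not isinstance(value, bool)


def values_match(first_value, second_value, keyword=None):
    """Decide whether a first index value and a second index value are the same."""
    if is_numeric(first_value) and is_numeric(second_value):
        return first_value == second_value  # numerical rule: equality
    a, b = str(first_value), str(second_value)
    if keyword is not None:
        # Keyword rule: both values must contain the specified keyword.
        return keyword in a and keyword in b
    # Exact rule: the characters must be completely the same.
    return a == b


first_value = "ovarian serous carcinoma"
second_value = "left ovarian serous papillary adenocarcinoma"
exact_rule = values_match(first_value, second_value)
keyword_rule = values_match(first_value, second_value, keyword="ovarian")
```

Under the exact rule the two pathological-type values differ, while under the keyword rule they are treated as the same, which mirrors the two outcomes in the example.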

Fig. 3 shows a flowchart of a verification method of index extraction according to an embodiment of the present disclosure. As shown in fig. 3, the step S15 of evaluating the accuracy of the index extraction based on the verification result may be implemented as:

Step S151, for the M samples in any sample set, counting the verification results of the same index and determining the proportion of each type of verification result corresponding to the index; the proportion of any type of verification result corresponding to the index is the ratio of the number of verification results of that type to the number of all verification results of the index.

Through steps S141 to S144, all indexes of a sample and the type of the verification result of each index may be recorded, where all indexes of the sample are the union of all first indexes and all second indexes of the sample. For example, if all first indexes of a sample include A and C and all second indexes of the sample include B, C and D, then all indexes of the sample include A, B, C and D. The types of verification result comprise four types: index extraction correct, index value extraction error, index false extraction, and index missing. In one example, as shown in table 3, the sample with patient ID 0001 has three indexes, ascites, stage and differentiation; the verification result type of ascites is index value extraction error, the verification result type of stage is index missing, and the verification result type of differentiation is index extraction correct.

In one example, as shown in table 3, a sample set includes M samples, such as the sample with patient ID 0001 and the sample with patient ID 0003, and the verification results corresponding to ascites for all samples in the sample set are counted. The number of verification results of each type (index extraction correct, index value extraction error, index false extraction, and index missing) corresponding to ascites is determined, together with the total number of verification results of all types corresponding to ascites. The ratio of the number of verification results of each type to the total number is then taken as the proportion of that type of verification result for ascites. In the same way, the proportions of the four types of verification results can be determined for stage and for differentiation.

In a possible implementation manner, the proportions of the index value extraction error, index false extraction and index missing types of verification results corresponding to an index can be compared to determine the main problem of the index in the extraction process. The extraction method of the index can then be optimized in a targeted manner.

In a possible implementation manner, the proportion of verification results of the index extraction correct type corresponding to an index may be determined as the qualification rate of the index. The reliability of the index can be determined from its qualification rate. A high qualification rate indicates that the index is reliable and can be used for further analysis and processing; a low qualification rate indicates that the index is unreliable, should not be used for further analysis and processing, and may need to be re-extracted.

In one example, the sample set contains 96 samples. For ascites, the index is missing in 3 samples, the index value is extracted incorrectly in 5 samples, and the index is extracted correctly in the remaining 88 samples. The terminal equipment determines that the index missing proportion corresponding to ascites is 3.1% (3/96), the index value extraction error proportion is 5.2% (5/96), and the index false extraction proportion is 0. Because the index value extraction error proportion is the largest of the three, the terminal equipment can determine that the main problem of ascites in the extraction process is index value extraction error, so optimization for ascites can focus on index value extraction errors. The index extraction correct proportion corresponding to ascites is 91.7% (88/96), so the qualification rate of ascites can be determined to be 91.7%.
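The arithmetic of this example can be reproduced with a short sketch. The type labels and the function name `proportions` are assumptions made for illustration; only the counts (3 missing, 5 value errors, 88 correct, 96 in total) come from the example above.

```python
from collections import Counter

# Short labels for the four verification result types (labels are assumed).
TYPES = ("correct", "value_error", "false_extraction", "missing")


def proportions(result_types):
    """Return the proportion of each result type among all results of one index."""
    counts = Counter(result_types)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verification results recorded for this index")
    return {t: counts[t] / total for t in TYPES}


# Ascites in a sample set of 96 samples.
ascites = ["missing"] * 3 + ["value_error"] * 5 + ["correct"] * 88
p = proportions(ascites)
for t in TYPES:
    print(f"{t}: {p[t]:.1%}")

# The main problem is the problem type with the largest proportion.
problems = {t: p[t] for t in TYPES if t != "correct"}
main_problem = max(problems, key=problems.get)
qualification_rate = p["correct"]
print("main problem:", main_problem)
```

Here `qualification_rate` is simply the proportion of the correct type, matching the definition of the qualification rate given above.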

In a possible implementation manner, for each index, the average over the N sample sets of the proportion of each type of verification result may be determined as the final proportion of that type of verification result for the index. This reduces the evaluation deviation caused by a single sample set and improves the evaluation accuracy. For example, two sample sets are extracted; if the qualification rate of an index is A% in one sample set and B% in the other, the final qualification rate of the index is (A% + B%)/2.
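The averaging over N sample sets can be sketched as below. The function name `final_proportions` and the numbers in the two sample sets are hypothetical and chosen only to show the calculation.

```python
from statistics import mean


def final_proportions(per_set):
    """Average each result type's proportion over the N sample sets.

    per_set: list of {result type: proportion}, one dict per sample set,
             all for the same index.
    """
    types = per_set[0].keys()
    return {t: mean(s[t] for s in per_set) for t in types}


# Two sample sets for one index (illustrative proportions, not from the source).
set_1 = {"correct": 0.90, "value_error": 0.06, "false_extraction": 0.00, "missing": 0.04}
set_2 = {"correct": 0.94, "value_error": 0.02, "false_extraction": 0.02, "missing": 0.02}
final = final_proportions([set_1, set_2])
print(f"final qualification rate: {final['correct']:.1%}")
```

With N = 2 this reduces to the (A% + B%)/2 formula of the example.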

In one possible implementation, the qualification rate of the index and various intermediate data can be displayed in an intuitive manner such as a histogram.

For example, Fig. 4a shows a histogram of the number of each problem type (index value extraction error, index false extraction, and index missing) corresponding to each index (ascites, stage, and differentiation) of a sample set according to one example. From the histogram shown in Fig. 4a, the main problem in extracting each index can be intuitively found, and the extraction method can be optimized for the main problem of each index.

Fig. 4b shows a histogram of the number of each problem type over all indexes of a sample set according to one example. From the histogram shown in Fig. 4b, it can be intuitively seen which problem types occur more often in index extraction, so that the extraction method can be optimized for those problem types.

FIG. 4c illustrates a histogram of the number of problem types for each indicator of samples extracted from electronic medical records at different stages, according to an example. According to the histogram shown in fig. 4c, the types of the problems mainly occurring at each stage can be intuitively found, so that the electronic medical records are classified according to the stage, and the extraction method is optimized in different aspects for the electronic medical records of each category.

By determining the proportion of each type of verification result corresponding to each index in a sample set, the verification method for index extraction of the present disclosure can identify the main problem in extracting each index, so that the extraction method can be optimized in a targeted manner, and can determine the reliability of each index from its qualification rate.

Fig. 5 shows a flowchart of a verification method of index extraction according to an embodiment of the present disclosure. As shown in fig. 5, the method further includes step S16, and the step S15 of evaluating the accuracy of the extraction based on the verification result may be implemented as step S152.

Step S16: when the verification result of a first index or a second index is recorded, record the identifier of the electronic text corresponding to the first index or the second index.

The electronic text identifier may be used to distinguish different electronic texts. The electronic text identifier may be an ID, a number, a name, and the like. The present disclosure does not limit the form of the electronic text identifier.

Each first index or second index corresponds to one sample, and each sample corresponds to one electronic text. The electronic text identifier uniquely identifies an electronic text. When the verification result of a first index or a second index is recorded, the identifier of the corresponding electronic text is recorded as well, so that the verification result of an index includes the index, the type of the verification result, and the identifier of the corresponding electronic text.

Step S152: for the N sample sets, count the verification results of the same electronic text according to the electronic text identifiers recorded in the verification results, and determine the proportion of each type of verification result corresponding to the electronic text; the proportion of any type of verification result corresponding to the electronic text is the ratio of the number of verification results of that type to the number of all verification results of the electronic text.

According to the steps S141 to S144 and S16, all indexes of any sample set and the verification result type of each index, and the electronic text identifier corresponding to each index, may be recorded.

In one example, for the electronic text corresponding to each electronic text identifier in the N sample sets, the verification results corresponding to ascites, to stage, and to differentiation are counted. For each electronic text, the number of verification results of each type (index extraction correct, index value extraction error, index false extraction, and index missing) is determined, together with the total number of verification results of all types for the electronic text. The ratio of the number of verification results of each type to the total number is then determined as the proportion of that type of verification result corresponding to the electronic text.

In a possible implementation manner, the proportion of verification results of the index extraction correct type corresponding to an electronic text may be determined as the index integrity of the electronic text.
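A minimal sketch of the per-text statistic follows, assuming each recorded verification result is a tuple of electronic text identifier, index name and result type. The record layout, the function name `index_integrity`, and the identifiers `T1` and `T2` are illustrative assumptions, not part of the disclosure.

```python
from collections import defaultdict


def index_integrity(records):
    """Return {text id: proportion of 'correct' results} for each electronic text.

    records: iterable of (text_id, index_name, result_type) tuples collected
             over all sample sets.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for text_id, _index_name, result_type in records:
        total[text_id] += 1
        if result_type == "correct":
            correct[text_id] += 1
    return {text_id: correct[text_id] / total[text_id] for text_id in total}


# Hypothetical records for two electronic texts.
records = [
    ("T1", "ascites", "value_error"),
    ("T1", "stage", "missing"),
    ("T1", "differentiation", "correct"),
    ("T2", "ascites", "correct"),
    ("T2", "stage", "correct"),
    ("T2", "differentiation", "correct"),
]
integrity = index_integrity(records)
for text_id, value in integrity.items():
    print(text_id, f"{value:.2f}")
```

Grouping by the recorded identifier is what makes step S16 necessary: without it, verification results could only be aggregated per index, not per electronic text.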

Whether the extraction for each electronic text is valid can be determined from the index integrity of the electronic text. For example, when the index integrity of an electronic text is low, the first indexes and corresponding first index values extracted for the electronic text may be considered unreliable, and they need to be extracted from the electronic text again. When the index integrity of an electronic text is high, the first indexes and corresponding first index values extracted for the electronic text may be considered reliable and can be used for subsequent data analysis and processing.

In a possible implementation manner, an average value of the index integrity of each electronic text may be determined as a first overall index integrity.

The first overall index integrity can be used to evaluate whether the extraction method needs to be optimized. For example, if the first overall index integrity is low, it may be determined that the accuracy of the first indexes and first index values is low and that the previously adopted extraction method is unreliable and may need to be optimized before further use. If the first overall index integrity is high, it may be determined that the accuracy of the first indexes and first index values is high and that the previously adopted extraction method is reliable and can be applied directly without optimization.

In a possible implementation manner, the ratio of the number of electronic texts whose proportion of index extraction correct verification results is 1 to the total number of electronic texts may be determined as the second overall index integrity.

When the index integrity of the electronic text is 1, each index of the electronic text has no problem, each index and the corresponding index value are accurately extracted, and the extraction of the indexes of the electronic text is complete and error-free.

The second overall index integrity can be used to evaluate the application range of the extraction method. For example, if the second overall index integrity is low, it can be determined that the previously adopted extraction method works well on only a small part of the electronic texts and poorly on most of them, so its application range is narrow. An extraction method with a low second overall index integrity can be used for index extraction from electronic texts of a specific type. If the second overall index integrity is high, it can be determined that the previously adopted extraction method extracts indexes correctly from most electronic texts, so its application range is wide.
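The two overall measures described above can be computed from the per-text index integrity as in the following sketch. The function name `overall_integrity` and the four integrity values are assumptions made for illustration.

```python
def overall_integrity(integrity_by_text):
    """Return (first, second) overall index integrity.

    first:  average index integrity over all electronic texts
    second: share of electronic texts whose index integrity equals 1
    """
    values = list(integrity_by_text.values())
    if not values:
        raise ValueError("no electronic texts to evaluate")
    first = sum(values) / len(values)
    second = sum(1 for v in values if v == 1) / len(values)
    return first, second


# Hypothetical per-text index integrity values.
integrity_by_text = {"T1": 1 / 3, "T2": 1.0, "T3": 1.0, "T4": 0.5}
first, second = overall_integrity(integrity_by_text)
print(f"first overall index integrity:  {first:.3f}")
print(f"second overall index integrity: {second:.3f}")
```

The two numbers answer different questions: the first reflects average accuracy across texts, while the second counts only texts extracted without any problem, which is why it is the one used to judge how widely the extraction method applies.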

Fig. 6 shows a block diagram of a verification apparatus for index extraction according to an embodiment of the present disclosure. As shown in Fig. 6, the verification apparatus 70 for index extraction includes:

An extraction module 71, configured to perform index extraction on each electronic text to obtain a first index of the electronic text and a corresponding first index value.

A sampling module 72, configured to extract electronic texts as samples.

An acquisition module 73, configured to acquire, for each sample, a second index of the sample and a corresponding second index value.

A verification module 74, configured to verify the first index and the first index value of the sample by using the second index and the second index value of the sample, so as to obtain a verification result.

An evaluation module 75, configured to evaluate the accuracy of the index extraction based on the verification result.

Fig. 7 shows a block diagram of a verification apparatus for index extraction according to an embodiment of the present disclosure. As shown in Fig. 7, in one possible implementation, the sampling module 72 includes:

A sampling unit 721, configured to extract M electronic texts as samples each time and to extract N times, obtaining N sample sets, where each sample set includes M samples.

In one possible implementation, the verification module 74 includes:

A first recording unit 741, configured to, for any first index of the sample, record a verification result of the first index when a second index identical to the first index exists and the corresponding first index value and second index value are the same, where the type of the verification result is index extraction correct.

A second recording unit 742, configured to, for any first index of the sample, record a verification result of the first index when a second index identical to the first index exists and the corresponding first index value and second index value are different, where the type of the verification result is index value extraction error.

A third recording unit 743, configured to, for any first index of the sample, record a verification result of the first index when there is no second index identical to the first index, where the type of the verification result is index false extraction.

A fourth recording unit 744, configured to, for any second index of the sample, record a verification result of the second index when there is no first index identical to the second index, where the type of the verification result is index missing.

In one possible implementation, the apparatus 70 further includes:

A type determining module 76, configured to determine the types of the first index value and the second index value that respectively correspond to the same first index and second index.

A first determining module 77, configured to, when the types of the first index value and the second index value are both numerical types, determine that the first index value and the second index value are the same when they are equal, and otherwise determine that the first index value and the second index value are different.

A second determining module 78, configured to, when the type of at least one of the first index value and the second index value is a character type, determine that the first index value and the second index value are the same when both contain the specified keyword or the characters they contain are completely the same, and determine that the first index value and the second index value are different when at least one of them does not contain the specified keyword or they contain different characters.
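The type-dependent comparison performed by modules 76 to 78 might look like the sketch below. It is written under assumptions: the function name `values_match`, the `keywords` parameter, and the example strings are hypothetical, and any case that does not satisfy the "same" condition is simply treated as different.

```python
from numbers import Number


def is_numeric(value):
    """Treat int, float and similar as numerical types; exclude bool."""
    return isinstance(value, Number) and not isinstance(value, bool)


def values_match(first_value, second_value, keywords=()):
    """Decide whether a first index value and a second index value are the same."""
    if is_numeric(first_value) and is_numeric(second_value):
        # Both numerical: the same only when equal.
        return first_value == second_value
    # At least one value is of character type: compare as text.
    a, b = str(first_value), str(second_value)
    if a == b:
        return True
    # The same when both values contain a specified keyword.
    return any(k in a and k in b for k in keywords)


print(values_match(3, 3.0))
print(values_match(3, 4))
print(values_match("moderate ascites", "ascites, moderate", keywords=("moderate",)))
print(values_match("none", "moderate ascites", keywords=("moderate",)))
```

Keyword matching lets free-text values that are worded differently still count as the same, at the cost of depending on a well-chosen keyword list for each index.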

In one possible implementation, the evaluation module 75 includes:

A first statistical unit 751, configured to count, for the M samples in any sample set, the verification results of the same index, and determine the proportion of each type of verification result corresponding to the index; the proportion of any type of verification result corresponding to the index is the ratio of the number of verification results of that type to the number of all verification results of the index.

In one possible implementation, the apparatus 70 further includes:

A recording module 79, configured to record, when the verification result of a first index or a second index is recorded, the identifier of the electronic text corresponding to the first index or the second index.

In one possible implementation, the evaluation module 75 includes:

A second statistical unit 752, configured to count, for the N sample sets, the verification results of the same electronic text according to the electronic text identifiers recorded in the verification results, and determine the proportion of each type of verification result corresponding to the electronic text; the proportion of any type of verification result corresponding to the electronic text is the ratio of the number of verification results of that type to the number of all verification results of the electronic text.

FIG. 8 is a block diagram illustrating an apparatus 900 for verification of index extraction, according to an example embodiment. Referring to fig. 8, the apparatus 900 may include a processor 901, a machine-readable storage medium 902 having stored thereon machine-executable instructions. The processor 901 and the machine-readable storage medium 902 may communicate via a system bus 903. Also, the processor 901 performs the above-described index extraction verification method by reading machine-executable instructions in the machine-readable storage medium 902 corresponding to the index extraction verification logic.

The machine-readable storage medium 902 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A verification method for index extraction is characterized by comprising the following steps:

extracting indexes of each electronic text to obtain a first index of the electronic text and a corresponding first index value;

extracting electronic text as a sample;

for each sample, acquiring a second index and a corresponding second index value of the sample;

verifying the first index and the first index value of the sample by using the second index and the second index value of the sample to obtain a verification result;

and evaluating the accuracy of index extraction based on the verification result.

2. The method of claim 1, wherein the extracting electronic text as a sample comprises:

and extracting M electronic texts as samples every time, and extracting N times to obtain N sample sets, wherein each sample set comprises M samples.

3. The method of claim 2, wherein the verifying the first index and the first index value of the sample by using the second index and the second index value of the sample to obtain a verification result comprises:

for any first index of the sample, if a second index identical to the first index exists and the corresponding first index value and the corresponding second index value are identical, recording a verification result of the first index, wherein the type of the verification result is that the index extraction is correct;

for any first index of the sample, if a second index identical to the first index exists and corresponding first index value and second index value are different, recording a verification result of the first index, wherein the type of the verification result is an index value extraction error;

for any first index of the sample, if a second index which is the same as the first index does not exist, recording a verification result of the first index, wherein the type of the verification result is index false extraction;

and for any second index of the sample, if the first index which is the same as the second index does not exist, recording a verification result of the second index, wherein the type of the verification result is index missing.

4. The method of claim 3, further comprising:

determining the types of a first index value and a second index value respectively corresponding to the same first index and second index;

when the types of the first index value and the second index value are numerical types, if the first index value is equal to the second index value, determining that the first index value is the same as the second index value, and otherwise, determining that the first index value is different from the second index value;

when the type of at least one of the first index value and the second index value is a character type, if the second index value and the first index value both contain the specified keyword or the characters contained in the first index value and the second index value are completely the same, the first index value and the second index value are determined to be the same, and if at least one of the second index value and the first index value does not contain the specified keyword or the characters of the first index value and the second index value are different, the first index value and the second index value are determined to be different.

5. The method of claim 3, wherein evaluating the accuracy of the index extraction based on the verification result comprises:

counting the check results of the same index aiming at M samples in any sample set, and determining the proportion of each type of check result corresponding to the index; the ratio of any type of check result corresponding to the index is the ratio of the number of times of the type of check result to the number of times of all the check results of the index.

6. The method of claim 3, further comprising:

and when the verification result of the first index or the second index is recorded, recording the identification of the electronic text corresponding to the first index or the second index.

7. The method of claim 6, wherein evaluating the accuracy of index extraction based on the verification result comprises:

counting the verification results of the same electronic texts according to the identifications of the electronic texts recorded in the verification results aiming at the N sample sets, and determining the proportion of each type of verification results corresponding to the electronic texts; the ratio of any type of verification result corresponding to the electronic text is the ratio of the number of times of the type of verification result to the number of times of all verification results of the electronic text.

8. A verification apparatus for index extraction, comprising:

the extraction module is used for extracting indexes of each electronic text to obtain a first index of the electronic text and a corresponding first index value;

the sampling module is used for extracting electronic texts as samples;

the acquisition module is used for acquiring a second index and a corresponding second index value of each sample;

the verification module is used for verifying the first index and the first index value of the sample by using the second index and the second index value of the sample to obtain a verification result;

and the evaluation module is used for evaluating the accuracy of index extraction based on the verification result.

9. The apparatus of claim 8, wherein the sampling module comprises:

and the sampling unit is used for extracting M electronic texts as samples every time, and extracting N times to obtain N sample sets, wherein each sample set comprises M samples.

10. The apparatus of claim 9, wherein the verification module comprises:

the first recording unit is used for recording a verification result of any first index of the sample when a second index which is the same as the first index exists and the corresponding first index value and the corresponding second index value are the same, wherein the type of the verification result is that the index extraction is correct;

the second recording unit is used for recording a verification result of any first index of the sample when a second index identical to the first index exists and the corresponding first index value and the corresponding second index value are different, wherein the type of the verification result is an index value extraction error;

the third recording unit is used for recording a verification result of any first index of the sample when a second index which is the same as the first index does not exist, wherein the type of the verification result is index false extraction;

and the fourth recording unit is used for recording the verification result of any second index of the sample when the first index identical to the second index does not exist, and the type of the verification result is index missing.

11. The apparatus of claim 10, further comprising:

the type determining module is used for determining the types of the first index value and the second index value corresponding to the same first index and second index respectively;

the first determination module is used for determining that the first index value and the second index value are the same when the first index value and the second index value are equal under the condition that the types of the first index value and the second index value are numerical types, or else, determining that the first index value and the second index value are different;

and the second determining module is used for determining that the first index value and the second index value are the same when the type of at least one of the first index value and the second index value is a character type and the second index value and the first index value both contain the specified keyword or the characters contained in the first index value and the second index value are completely the same, and determining that the first index value and the second index value are different when at least one of the second index value and the first index value does not contain the specified keyword or the characters contained in the first index value and the second index value have different characters.

12. The apparatus of claim 10, wherein the evaluation module comprises:

the first statistical unit is used for counting the verification results of the same index aiming at the M samples in any sample set and determining the proportion of each type of verification result corresponding to the index; the ratio of any type of check result corresponding to the index is the ratio of the number of times of the type of check result to the number of times of all the check results of the index.

13. The apparatus of claim 10, further comprising:

and the recording module is used for recording the identification of the electronic text corresponding to the first index or the second index when the verification result of the first index or the second index is recorded.

14. The apparatus of claim 13, wherein the evaluation module further comprises:

the second statistical unit is used for counting the verification results of the same electronic texts according to the identifications of the electronic texts recorded in the verification results aiming at the N sample sets, and determining the proportion of each type of verification results corresponding to the electronic texts; the ratio of any type of verification result corresponding to the electronic text is the ratio of the number of times of the type of verification result to the number of times of all verification results of the electronic text.