CN112256823A - Corpus data sampling method and system based on adjacency density - Google Patents

Corpus data sampling method and system based on adjacency density

Info

Publication number: CN112256823A (application number CN202011185039.7A)
Authority: CN (China)
Prior art keywords: corpus data, sampling, samples, density, sample
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112256823B
Inventors: 张伯政, 吴军, 樊昭磊, 何彬彬
Current Assignee: Shandong Msunhealth Technology Group Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Shandong Msunhealth Technology Group Co Ltd

Events:
    • Application filed by Shandong Msunhealth Technology Group Co Ltd
    • Priority to CN202011185039.7A
    • Publication of CN112256823A
    • Application granted
    • Publication of CN112256823B
    • Legal status: Active

Classifications

    • G06F16/313 Selection or weighting of terms for indexing
    • G06F16/316 Indexing structures
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/3347 Query execution using vector based model
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

The present disclosure provides a corpus data sampling method and system based on adjacency density, comprising the following steps: performing regularization processing on the corpus data to obtain standardized corpus data; calculating the adjacency density of the sample points in the standardized corpus data with a distance metric; calculating the approximate distribution of the corpus data samples based on the adjacency density; drawing samples according to the approximate distribution to obtain a temporary sampling result; and iterating according to a preset iteration rule to solve for the optimal hyper-parameter values, then obtaining the final corpus data sampling result from the determined optimal values. By measuring the neighboring area with density, the scheme samples less where the samples are dense and more where they are sparse; it suits the data screening process before a natural language corpus labeling task and avoids the problems of too many near-duplicate samples and too few sparse samples. Meanwhile, effective substitute samples of the original samples are found through repeated iterative search, which improves the comprehensiveness of the sampled samples.

Description

Corpus data sampling method and system based on adjacency density
Technical Field
The present disclosure relates to the field of data sampling technologies, and in particular to a corpus data sampling method and system based on adjacency density.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The traditional medical field has accumulated a huge amount of patient case text, and the application of supervised natural language processing algorithms in the medical field, including Named Entity Recognition (NER), Relation Extraction (RE), and syntactic analysis, depends heavily on the quality of labeled sample data. Data labeling, however, is usually done manually, so avoiding repeated and invalid labeling, reducing labeling time and manual effort, and improving labeling quality all matter. How to extract corpus data of wide coverage and acceptable quantity from the original data set, and to carry out effective data labeling, analysis, and mining on that basis, is a problem that current supervised training methods urgently need to solve.
Text data in the natural language field is often highly abstract, with mixed and repeated information. A sampling method therefore needs to eliminate samples carrying repeated information while keeping effective information as comprehensively and accurately as possible, providing effective samples for data labeling in tasks such as named entity recognition and relation extraction. Hospital case texts range from dozens of characters to thousands; case writing often follows specific format conventions and is full of homogeneous information, such as the "past history" entry: "history of hypertension and cerebral infarction." To save time, physicians usually write cases from templates, so the writing styles of case texts do not differ greatly; as a result, the sample points in the case-text sample space are relatively tightly packed, their differences are not obvious, and homogenization is serious.
The inventors find that current data sampling methods are numerous. Random sampling is simple and direct, but because of the complexity of text data, labeling requires a degree of manual intervention, and strictly following a random principle is unreasonable. Unequal-proportion stratified sampling can analyze textual differences to some degree, but how to properly pre-classify the texts is itself a problem. Existing traditional sampling methods cannot meet the data-extraction needs of the natural language field and cannot distinguish the homogeneous part of a sample, which in actual data labeling leads to rare samples going unlabeled and to repeated labeling.
Disclosure of Invention
The method measures the neighboring area by density, so that dense parts of the sample are sampled less and sparse parts are sampled more; it is suitable for the data screening process before a natural language corpus labeling task and effectively avoids the problems of too many near-duplicate samples and too few sparse samples.
According to a first aspect of the embodiments of the present disclosure, there is provided a corpus data sampling method based on adjacency density, including:
carrying out regularization processing on the corpus data to obtain standardized corpus data;
calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method;
calculating corpus data sample approximate distribution based on the adjacency density;
drawing samples according to the approximate distribution of the corpus data samples to obtain a temporary sampling result;
and carrying out iteration according to a preset iteration rule to solve the optimal hyper-parameter value, and obtaining a final corpus data sampling result according to the determined optimal hyper-parameter value.
Further, performing regularization processing on the corpus data means giving the corpus data a mathematical representation, specifically: first, define the corpus data as a text sequence set comprising a plurality of samples, each sample composed of a plurality of single characters; second, index the single characters in each sample with a coding index algorithm to obtain the vector representation of the text sequence set.
Further, the regularization process further includes setting a weight matrix and a word embedding matrix, and converting the vector representation of the text sequence set into a word embedding vector representation of a text training set.
Further, calculating the adjacency density comprises calculating a density matrix of the sample points in the standardized corpus data with a distance metric and, for each sample point, taking the mean of the first K density values sorted from large to small based on the density matrix; this mean is the adjacency density of the sample point.
Further, the approximate distribution of the corpus data samples is calculated by a contraction-and-translation method. The hyper-parameter B is a distribution bias coefficient, N is the total number of texts in the original data set, and n is the number of samples to be drawn; i and n do not exceed N and are positive integers. θ denotes the adjacency density of the sample points in the corpus data; after rescaling, it approximately obeys a uniform distribution U(a, b), where a and b are the parameters to be fitted, and y = {y_1, y_2, ..., y_n} denotes the approximate distribution of the original data set, which approximately obeys a uniform distribution U(0, 1).
Further, the sampling method specifically comprises: generating pseudo-random numbers ξ_1, ξ_2, ..., ξ_n uniformly distributed over [0, 1] and obeying U(0, 1) with a random number generator, and adopting the following sampling rule: define Mark; if ξ_i < y_i, set Mark_i = 1 and select the sample; otherwise set Mark_i = 0 and discard it, where i does not exceed n and is a positive integer, and n is the number of samples (a positive integer). The samples with Mark constantly equal to 1 form the temporary sampling result ŷ.
Further, the specific steps of iteratively solving for the hyper-parameters according to the preset iteration rule include:
initializing hyper-parameters K_l and B_l; iteratively calculating the frequency histogram (graph a) of the approximate distribution of the corpus data samples and the frequency histogram (graph b) of the temporary sampling result; calculating the similarity of graph a and graph b; and performing the following operations according to the similarity result:
(1) if the similarity difference between graph a and graph b does not exceed a preset threshold, stopping the iteration;
(2) if the frequency maximum in graph b exceeds the frequency maximum in graph a by more than the preset threshold, increasing K_l, denoted K_{l+1}, and vice versa; if the abscissa of the symmetry axis of the image in graph b is greater than that in graph a, decreasing B_l, denoted B_{l+1}, and vice versa, where K_l is an integer; then updating l, K_l, B_l to l+1, K_{l+1}, B_{l+1} respectively;
(3) if the iteration count l equals L, stopping the iteration.
According to a second aspect of the embodiments of the present disclosure, there is provided a corpus data sampling system based on adjacency density, including:
the data preprocessing module is used for carrying out regularization processing on the corpus data to obtain standardized corpus data;
the hyper-parameter optimization solving module is used for calculating the adjacency density of the sample points in the standardized corpus data with a distance metric, calculating the approximate distribution of the corpus data samples based on the adjacency density, drawing samples according to the approximate distribution to obtain a temporary sampling result, and iteratively solving for the hyper-parameters according to a preset iteration rule;
and the data sampling module is used for obtaining a final corpus data sampling result according to the determined optimal hyper-parameter value.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the corpus data sampling method based on adjacency density.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the corpus data sampling method based on adjacency density.
Compared with the prior art, the beneficial effect of this disclosure is:
(1) The sample sampling method of this scheme measures the neighboring area by density, so that dense parts of the sample are sampled less and sparse parts are sampled more; it suits the data screening process before a natural language corpus labeling task and avoids the problems of too many near-duplicate samples and too few sparse samples.
(2) The scheme searches for effective substitute samples of the original samples through repeated iterative search, improving the comprehensiveness of the sampled samples.
(3) The scheme calculates the adjacency density of the sample points in the standardized corpus data with a distance metric, effectively addressing the relatively compact sample points, indistinct differences, and serious homogenization in the sample space of conventional sampling methods.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a flowchart of the corpus data sampling method based on adjacency density according to the first embodiment of the present disclosure;
fig. 2(a) is a frequency histogram of an approximate distribution of original samples according to a first embodiment of the disclosure;
fig. 2(b) is a frequency histogram of the temporary iterative sampling result in the first embodiment of the disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The first embodiment is as follows:
the embodiment aims to provide a corpus data sampling method based on adjacent density.
Aiming at the problem that existing data sampling methods cannot handle homogeneous samples, the present disclosure provides a corpus data sampling method based on adjacency density; the concrete steps are as follows:
S101: Define the text sequence set D = {D_1, D_2, ..., D_n}, where n is the total number of samples (a positive integer), D_i denotes the i-th sample, D_i = {S_{i,1}, S_{i,2}, ..., S_{i,m}}, S_{i,j} denotes the j-th single character in the i-th sample, and m is the maximum number of single characters over all samples in D (i.e., the maximum sentence length, a positive integer); i and j are positive integers with i ≤ n and j ≤ m.
S102: Define the vector representation V = {V_1, V_2, ..., V_n} ∈ R^{n×m} of the text sequence set, where V_i ∈ R^m is the one-hot code index of the i-th text. The one-hot coding is based on a custom dictionary containing 4754 commonly used Chinese single characters; for example, a single character with one-hot index 3 is represented by a 4754-dimensional vector whose 3rd position is 1 and all other positions are 0.
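The indexing step above can be sketched as follows; a toy dictionary stands in for the patent's 4754-character one, and all names are illustrative:

```python
# Sketch of the S102 index representation (toy dictionary, not the
# patent's 4754-character one; function names are illustrative).
def build_index(dictionary):
    # Map each single character to its 1-based index in the dictionary.
    return {ch: i + 1 for i, ch in enumerate(dictionary)}

def encode(text, char2idx, m, pad=0):
    # Index each character; unknown characters and padding map to 0.
    ids = [char2idx.get(ch, pad) for ch in text[:m]]
    return ids + [pad] * (m - len(ids))

dictionary = ["高", "血", "压", "脑", "梗", "死", "史"]
char2idx = build_index(dictionary)
V1 = encode("高血压史", char2idx, m=6)
print(V1)  # [1, 2, 3, 7, 0, 0]
```

The index sequence can then be expanded into one-hot vectors of the dictionary length when needed.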
S103: word-embedded vector representation E defining a set of text sequences, E ═ E1,E2,...,EnIn which Ei∈Remb _dimEmd _ dim is the dimension of word embedding and is a positive integer, n is the number of samples and is a positive integer; in particular, the method comprises the following steps of,
Ei=Mean(W*Embedding(Vi))。
wherein, Embedding (V)i)∈Rm×emb_dimRepresenting a word embedding matrix, W ∈ Rm×emb_dimRepresenting a weight matrix, wherein the numerical value of the weight matrix can be selected from TF-IDF weight, m is the maximum number of sentence lengths, and operation represents dot multiplication; generating a Word embedding matrix, wherein the Word embedding matrix can be generated by using a pre-trained weight coefficient or by selecting a mature pre-trained model for retraining, such as a Word2Vec model, a BERT model, an ALBERT model and the like; operation of mean (S) epsilon Remd_dimExpressing the mean function operation, the formula is:
Figure BDA0002751175110000061
the mean operation here operates on the first dimension of the tensor, where i is a positive integer and i is ≦ m.
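The weighted mean of S103 can be sketched with NumPy; the matrices and dimensions below are random stand-ins, not trained values:

```python
import numpy as np

# Sketch of E_i = Mean(W * Embedding(V_i)) from S103, with random
# stand-in matrices (m, emb_dim, and the weights are illustrative).
rng = np.random.default_rng(0)
m, emb_dim = 5, 4
emb = rng.normal(size=(m, emb_dim))    # Embedding(V_i): one row per character
W = rng.uniform(size=(m, emb_dim))     # weight matrix, e.g. TF-IDF weights

E_i = (W * emb).mean(axis=0)           # element-wise product, mean over dim 0
print(E_i.shape)  # (4,)
```

Averaging over the first (character) dimension collapses a variable-length text into a single emb_dim-dimensional vector, which is what the later distance computations operate on.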
S104: data normalization, which takes the form:
Figure BDA0002751175110000062
where Mean represents the Mean function and σ (z) represents the standard deviation of the vector zAnd e represents an infinitesimal quantity; z here represents the result of the processing in step S103, and after data normalization, the data dimension remains unchanged, i.e. Rn×emb_dim(ii) a The reason for adopting data standardization is mainly to prevent the data from being too discrete and too large in variance after being transformed.
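The standardization step above can be sketched directly; ε guards against a zero standard deviation (the sample values are illustrative):

```python
import numpy as np

# Sketch of the S104 standardization z' = (z - Mean(z)) / (sigma(z) + eps);
# eps plays the role of the infinitesimal quantity in the text.
def standardize(z, eps=1e-8):
    return (z - z.mean()) / (z.std() + eps)

z = np.array([1.0, 2.0, 3.0, 4.0])
z_std = standardize(z)
print(z_std.mean().round(8), z_std.std().round(4))  # ~0.0 ~1.0
```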
S105: Define the density matrix DE of the text samples. Corpus data sampling follows this principle: sample as little as possible where the samples are dense (homogeneous samples) and as much as possible where they are sparse. Sample sparsity is measured with a distance metric, for example MSE (Mean Squared Error); by the sampling principle a tightly packed point should be drawn with lower probability than a sparse point, and the MSE values of a sample point indirectly measure the area around it, i.e., its probability. MSE is defined as follows:

MSE(y_i, y_j) = (1/n) Σ_{k=1}^{n} (y_{i,k} − y_{j,k})²

where n denotes the sample dimension, y denotes a sample, and i, j are positive integers with i ≤ n and j ≤ n. The density matrix DE calculated from the result of step S104 has dimension R^{n×emb_dim}. Other metrics similar to MSE, such as the Manhattan, Minkowski, or Cosine distance, can also be used; the principle is similar.
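A pairwise MSE computation can be sketched as follows. Note this sketch assumes an n-by-n pairwise reading of the distance computation, which is one plausible interpretation; the patent states a different dimension for DE, so treat the layout here as an assumption:

```python
import numpy as np

# Sketch of pairwise MSE distances between standardized sample vectors
# (an n-by-n reading of the density computation; the exact layout of DE
# in the patent text is ambiguous, so this is an assumption).
def mse_matrix(X):
    # X: (n, d) array of sample vectors.
    diff = X[:, None, :] - X[None, :, :]      # (n, n, d) pairwise differences
    return (diff ** 2).mean(axis=-1)          # MSE between every pair

X = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
DE = mse_matrix(X)
print(DE[0, 1], DE[0, 2])  # 1.0 100.0
```

A point far from all others (here the third row) gets large distance values, which later translates into a higher sampling probability.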
S106: Define the sample density θ. From the existing density matrix DE = {η_1, η_2, ..., η_n}, where η_i ∈ R^{emb_dim}, define a hyper-parameter K; then

θ_i = top_mean_K(η_i)

where top_mean_K sorts the data in descending order (from large to small) and takes the mean of the first K values. The sample density is then represented as

θ = {θ_1, θ_2, ..., θ_n}

where i ≤ n is a positive integer and n is the number of samples (a positive integer).
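The top_mean_K operation can be sketched in a few lines (the matrix values are illustrative):

```python
import numpy as np

# Sketch of theta_i = top_mean_K(eta_i) from S106: sort each row of the
# density matrix in descending order and average the first K values.
def top_mean_k(DE, K):
    top = np.sort(DE, axis=1)[:, ::-1][:, :K]  # K largest values per row
    return top.mean(axis=1)

DE = np.array([[0.1, 0.9, 0.5],
               [2.0, 1.0, 3.0]])
theta = top_mean_k(DE, K=2)
print(theta)  # [0.7 2.5]
```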
S107: defining the sampling probability and sample approximate distribution
Figure BDA0002751175110000072
Then
Figure BDA0002751175110000073
The approximation obeys a uniform distribution of U (a, b), where a, b are the parameters to be fitted.
Defining a super parameter B as a distributed bias coefficient, defining the total quantity of texts in an original data set as N and the number of quasi-samples as N, and operating the original distribution by adopting a scaling and translation method. Order to
Figure BDA0002751175110000074
Then we y ═ y1,y2,...,ynThere is a uniform distribution that obeys approximately U (0,1), where i, N ≦ N and is a positive integer, and N is the number of samples and is a positive integer.
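The patent's exact scaling-and-translation formulas are not reproduced in this text, so the mapping below is a guessed reconstruction, not the patent's formula: min-max rescale θ toward U(0, 1), shrink by the target fraction n/N, and translate by the bias coefficient B:

```python
import numpy as np

# Guessed reconstruction of the S107 contraction-and-translation step
# (the patent's concrete formulas are unavailable here; treat every
# line of this mapping as an assumption).
def approx_distribution(theta, n, N, B):
    a, b = theta.min(), theta.max()            # fitted endpoints of U(a, b)
    u = (theta - a) / (b - a)                  # rescale to roughly U(0, 1)
    return u * (n / N) + B                     # shrink by n/N, translate by B

theta = np.array([0.2, 0.5, 0.8, 1.1])
y = approx_distribution(theta, n=2, N=4, B=0.05)
print(y)
```

Under this reading, the mean of y is close to the desired selection fraction n/N plus the bias B, which is consistent with the acceptance rule of step S108.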
S108: samples are drawn according to the approximate distribution of samples, and n random numbers xi uniformly distributed in the range of 0-1 and subject to U (0,1) are generated by a random number generator according to the description in step S1071,ξ2,...,ξn. The following sampling rule is adopted: define Mark, if xii<yi,MarkiSelecting the sample as 1; otherwise, MarkiDiscard the sample as 0; wherein i is not more than n and is a positive integer, and n is the number of samples and is a positive integer; so we can get samples where Mark is constantly equal to 1, i.e.
Figure BDA0002751175110000081
This is taken as a provisional sampling result, where M is some positive integer close to the initial number of samples N.
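The acceptance rule of S108 can be sketched as follows; the per-sample values y are illustrative:

```python
import numpy as np

# Sketch of the S108 sampling rule: draw xi_i ~ U(0, 1) and keep sample
# i whenever xi_i < y_i (Mark_i = 1); the y values are illustrative.
rng = np.random.default_rng(42)
y = np.array([0.9, 0.1, 0.8, 0.05, 0.7])   # per-sample acceptance levels
xi = rng.uniform(0.0, 1.0, size=y.size)
mark = (xi < y).astype(int)                # Mark_i in {0, 1}
selected = np.flatnonzero(mark)            # indices of the kept samples
print(mark.tolist())
```

Because each sample is kept with probability y_i, the expected number of kept samples is sum(y), so the scaled distribution from S107 directly controls the sampling volume.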
S109: initializing a hyperparameter K01, B, an ultra-parameter0And setting the iteration number L as 0, wherein L is more than 0 and less than or equal to L.
S110: for the l iteration, the following steps are performed in sequence:
step S106 is executed to obtain
Figure BDA0002751175110000082
And
Figure BDA0002751175110000083
as the sample density;
step S107 is executed to obtain
Figure BDA0002751175110000084
As an approximate distribution of the original samples;
step S108 is executed to obtain
Figure BDA0002751175110000085
As a result of the temporal iterative sampling;
s111: output of
Figure BDA0002751175110000086
And
Figure BDA0002751175110000087
the frequency histograms (denoted as graph a and graph b) of (a), determining the image contrast of graph a and graph b, and executing the following steps:
i) if the frequency histogram images of the two frequency histogram images are approximately the same, stopping iteration;
ii) stopping the iteration if the iteration number L is equal to L;
iii) otherwise, adjusting the value of the hyperparameter. The adjustment strategy is as follows, if the maximum value of the image in the graph b is larger than the maximum value in the graph a and the difference is larger, K is increasedlIs marked as Kl+1And vice versa; if the abscissa of the position of the axis of symmetry of the image in graph B is greater than the abscissa of the axis of symmetry of the image in graph a, B is decreasedlIs marked as Bl+1And vice versa; wherein KlAnd BlThe increase and decrease range depends on the actual situation, but K needs to be ensuredlAnd are integers. Updating l, Kl、BlAre respectively l +1 and Kl+1、Bl+1Step S110 is performed.
In general, the hyper-parameter K is related to the maximum value of the sample distribution, and the hyper-parameter B is related to the symmetry axis of the sample distribution.
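One adjustment round of the loop above can be sketched as follows; the similarity test, thresholds, and step sizes are illustrative choices, not the patent's:

```python
import numpy as np

# Sketch of one S110/S111 adjustment round: compare frequency histograms
# of the approximate distribution (graph a) and the temporary sampling
# result (graph b), then nudge K and B.  All thresholds and step sizes
# here are illustrative assumptions.
def adjust_hyperparams(K, B, y, yhat, bins=20, step_B=0.01):
    a, _ = np.histogram(y, bins=bins, range=(0.0, 1.0))
    b, _ = np.histogram(yhat, bins=bins, range=(0.0, 1.0))
    if b.max() > a.max():
        K += 1                      # graph b peaks higher -> increase K
    elif b.max() < a.max():
        K = max(1, K - 1)           # and vice versa
    if yhat.mean() > y.mean():      # symmetry axis further right in graph b
        B -= step_B                 # -> decrease B
    elif yhat.mean() < y.mean():
        B += step_B                 # and vice versa
    return K, B

K, B = adjust_hyperparams(K=1, B=0.0,
                          y=np.linspace(0, 1, 100),
                          yhat=np.linspace(0, 0.5, 100))
print(K, B)
```

Using the histogram mean as a stand-in for the symmetry axis is a simplification; any robust estimate of the distribution center would serve the same role.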
S112: After the values of the hyper-parameters K and B are determined, output ŷ and the Mark data according to step S108 as the final sampling result of the original data set.
According to the sample sampling method provided by the disclosure, a method for measuring the adjacent area by density is adopted, less sampling at a dense part and more sampling at a sparse part of a sample can be realized, the method is suitable for a data screening process before a natural language corpus labeling task, and the problems of too many approximate samples and too few sparse samples are avoided; meanwhile, according to the scheme disclosed by the invention, effective substitute samples of the original samples are searched through multiple iterative searches, so that the comprehensiveness of the sampled samples is improved.
Further, to demonstrate the feasibility of the disclosed scheme, this embodiment verifies it on medical case text sampling; the specific steps are as follows:
There are 124582 existing medical case texts, from which 9500 are to be extracted for the subsequent data annotation task; all of the following processing is done with a Python program.
(1) Define the text sequence set, with n = 124582; the case texts vary in length, and after preliminary statistics the case text length is set to 800.
(2) Define the vector representation V of the text set and generate V = {V_1, V_2, ..., V_n} ∈ R^{n×m} based on the Chinese single-character dictionary; the length of the selected single-character dictionary is 4754.
(3) Define the word-embedding vector representation E of the text set, and convert the vectors from step (2) into R^{n×80} vectors based on an 80-dimensional pre-trained word-embedding weight matrix; the weight matrix was obtained by training a two-layer LSTM network on original corpora such as medical-field textbooks and case texts.
(4) Data normalization was performed.
(5) Execute step S105 and calculate the adjacency density of the sample points based on the MSE metric; the dimension of the density matrix DE is R^{n×80}.
(6) Initialize the hyper-parameters K_0 = 1 and B_0 = 0; the iteration count is set to satisfy 0 < l ≤ 20.
(7) For the 1st iteration, execute step S106 above to obtain θ^(1) = {θ_1^(1), ..., θ_n^(1)} as the sample density;
(8) execute step S107 above to obtain y^(1) as the approximate distribution of the original samples;
(9) execute step S108 above to obtain ŷ^(1) as the temporary iterative sampling result;
(10) The number of samples to extract is n = 9500. Using a Python program, draw the frequency histograms of y^(l) and ŷ^(l) obtained in step (9), compare them, and adjust the hyper-parameter values according to the adjustment strategy of step S111. The image of the original-sample distribution y finally generated through multiple iterations is shown in fig. 2(a). Finally, with the hyper-parameters K = 10 and B = 0.03, the temporary sampling result shown in fig. 2(b) is obtained, where M is the final number of samples, M = 9618; the two images are intuitively similar.
(11) After the values of the hyper-parameters K and B are determined, output ŷ and the Mark data according to step S108 as the final sampling result for the original data set.
After the values of the hyper-parameters K and B have been determined for a given original data set through the multi-step iteration process, the number of drawn samples can be changed as often as actually needed by modifying only the sample count in step (10); the complete process does not need to be executed again.
The method is mainly used to solve the problems of over-sampling similar samples and under-sampling differentiated samples in the preparation work before corpus data annotation, making the sampling result more comprehensive and representative and improving the efficiency and quality of data annotation. In actual processing, the images of the original samples and the sampled samples need to be compared and the hyper-parameters dynamically adjusted to improve the representativeness of the sampled samples. The example presented in this disclosure is an exemplary case of medical case text sampling; the idea and method can be applied to corpus sampling of texts in other fields. Other embodiments obtained without departing from the principles, methods, and teachings of the present disclosure are within the scope of the present disclosure.
Example two:
the embodiment aims to provide a corpus data sampling system based on adjacent density.
A corpus data sampling system based on neighborhood density, comprising:
the data preprocessing module is used for carrying out regularization processing on the corpus data to obtain standardized corpus data;
the hyper-parameter optimization solving module is used for calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method; calculating the approximate distribution of the corpus data samples based on the adjacency density; sampling according to the approximate distribution of the corpus data samples to obtain a temporary sampling result; and iteratively solving the hyper-parameters according to a preset iteration rule;
and the data sampling module is used for obtaining a final corpus data sampling result according to the determined optimal hyper-parameter value.
Example three:
This embodiment aims to provide an electronic device.
An electronic device comprising a memory, a processor, and a computer program stored in the memory for execution by the processor, wherein the processor, when executing the computer program, implements the adjacency-density-based corpus data sampling method described above, comprising:
carrying out regularization processing on the corpus data to obtain standardized corpus data;
calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method;
calculating corpus data sample approximate distribution based on the adjacency density;
sampling according to the approximate distribution of the corpus data samples to obtain a temporary sampling result;
and iterating according to a preset iteration rule to solve for the optimal hyper-parameter values, and obtaining the final corpus data sampling result according to the determined optimal hyper-parameter values.
Example four:
it is an object of the present embodiments to provide a non-transitory computer-readable storage medium.
A non-transitory computer-readable storage medium, on which a computer program is stored, the program, when executed by a processor, implementing the adjacency-density-based corpus data sampling method described above, comprising:
carrying out regularization processing on the corpus data to obtain standardized corpus data;
calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method;
calculating corpus data sample approximate distribution based on the adjacency density;
sampling according to the approximate distribution of the corpus data samples to obtain a temporary sampling result;
and iterating according to a preset iteration rule to solve for the optimal hyper-parameter values, and obtaining the final corpus data sampling result according to the determined optimal hyper-parameter values.
The corpus data sampling method and system based on adjacency density proposed by the present disclosure can be fully implemented and have broad application prospects.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A corpus data sampling method based on adjacency density is characterized by comprising the following steps:
carrying out regularization processing on the corpus data to obtain standardized corpus data;
calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method;
calculating corpus data sample approximate distribution based on the adjacency density;
sampling according to the approximate distribution of the corpus data samples to obtain a temporary sampling result;
and iteratively solving the hyper-parameters according to a preset iteration rule, and obtaining a final corpus data sampling result according to the determined optimal hyper-parameter values.
2. The method for sampling corpus data according to claim 1, wherein the regularization of the corpus data is a mathematical representation of the corpus data, and specifically comprises: first, defining the corpus data as a text sequence set, wherein the text sequence set comprises a plurality of samples and each sample is composed of a plurality of single characters; secondly, performing index representation on the single characters in each sample by using a coding index algorithm to obtain a vector representation of the text sequence set.
3. The method for sampling corpus data according to claim 1, wherein said regularizing further comprises setting a weight matrix and a word embedding matrix, and converting vector representations of said text sequence set into word embedding vector representations of a text training set.
4. The method for sampling corpus data according to claim 1, wherein the calculation of the adjacency density comprises: calculating a density matrix of the sample points in the standardized corpus data by using a distance measurement method, and selecting, for each sample point, the mean value of the first K data when the density values are sorted from large to small based on the density matrix, which is the adjacency density of that sample point.
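For illustration only, the adjacency-density calculation of this claim can be sketched in Python. The claim fixes neither the distance metric nor how a distance becomes a density; the Euclidean distance and the 1/(1+d) transform used below are assumptions of this sketch, not the patent's specification:

```python
import numpy as np

def adjacency_density(X, K):
    """Sketch of the claim-4 computation: for each sample point, take the
    mean of the K largest values in its row of a density matrix.

    Assumed details (not fixed by the claim): Euclidean distance between
    sample vectors, converted to a density via 1 / (1 + d)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distance matrix
    density = 1.0 / (1.0 + dist)               # closer points -> higher density
    np.fill_diagonal(density, 0.0)             # exclude each point's self-density
    # mean of the first K density values sorted from large to small
    topk = np.sort(density, axis=1)[:, -K:]
    return topk.mean(axis=1)
```

On a toy set, a point far from all others receives a lower adjacency density than points inside a tight cluster, which is what the later down-weighting of dense regions relies on.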
5. The method for sampling corpus data according to claim 1, wherein the approximate distribution of the corpus data samples is calculated by a method of square reduction and translation, with the concrete formula:
Figure FDA0002751175100000021
wherein the hyper-parameter B is a distribution bias coefficient, N is the total number of texts in the original data set, n is the quasi-sampling number, and i and n in the formula are positive integers not greater than N, wherein
Figure FDA0002751175100000022
θ represents the adjacency density of the sample points in the corpus data,
Figure FDA0002751175100000023
approximately obeys the uniform distribution U(a, b), where a and b are the parameters to be fitted, and y = {y1, y2, ..., yn} denotes the approximate distribution of the original data set, which approximately obeys the uniform distribution U(0, 1).
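The concrete formula of this claim survives only as image placeholders and cannot be reproduced here. Purely as an illustration of the stated intent — down-weighting high-adjacency-density samples, biasing by B, and targeting a quasi-sampling number n — one assumed form is:

```python
import numpy as np

def approximate_distribution(theta, n, B):
    """Illustrative stand-in for the claim-5 formula (the patent's exact
    formula is in an image and is NOT reproduced here; this form is an
    assumption of this sketch).

    Assumed form: normalize the adjacency densities theta to [0, 1],
    invert them so dense (redundant) samples get small values, raise to
    the bias coefficient B, and scale so the expected number of accepted
    samples is roughly the quasi-sampling number n."""
    theta = np.asarray(theta, dtype=float)
    t = (theta - theta.min()) / (theta.max() - theta.min() + 1e-12)
    y = (1.0 - t) ** B                  # dense regions are down-weighted
    y = y * n / max(y.sum(), 1e-12)     # scale expected acceptances toward n
    return np.clip(y, 0.0, 1.0)        # each y_i is used as an acceptance value
```

Whatever the exact formula, the output plays the role of y in claim 6: a per-sample acceptance value in [0, 1].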
6. The method for sampling corpus data according to claim 1, wherein the step of sampling comprises: generating, by a random number generator, pseudo-random numbers ξ1, ξ2, ..., ξn uniformly distributed in the range 0-1 and obeying U(0, 1), and adopting the following sampling rule: define Mark; if ξi < yi, then Marki = 1 and the sample is selected; otherwise Marki = 0 and the sample is discarded; wherein i is a positive integer not greater than n, and n is the number of samples and is a positive integer; the samples for which Mark is identically equal to 1 are thereby obtained, i.e. the temporary sampling result:
Figure FDA0002751175100000024
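The sampling rule of this claim amounts to independent acceptance tests: draw ξi ~ U(0, 1) and keep sample i iff ξi < yi. A minimal sketch (the function name and NumPy usage are this sketch's choices):

```python
import numpy as np

def mark_sampling(y, rng=None):
    """Claim-6 sampling rule: generate pseudo-random numbers
    xi_1..xi_n ~ U(0, 1); set Mark_i = 1 (select) if xi_i < y_i,
    otherwise Mark_i = 0 (discard)."""
    rng = np.random.default_rng(rng)
    xi = rng.uniform(0.0, 1.0, size=len(y))
    mark = (xi < np.asarray(y)).astype(int)
    selected = np.flatnonzero(mark == 1)   # indices forming the temporary sampling result
    return mark, selected
```

Since ξi lies in [0, 1), a sample with yi = 1 is always kept and a sample with yi = 0 is always discarded.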
7. The method for sampling corpus data according to claim 1, wherein the step of iteratively solving the hyper-parameters according to the preset iteration rule comprises:
initializing hyper-parameters Kl and Bl; iteratively calculating a frequency histogram a of the approximate distribution of the corpus data samples and a frequency histogram b of the temporary sampling result; calculating the similarity of graph a and graph b, and performing the following operations according to the similarity result:
(1) if the similarity calculation result of graph a and graph b does not exceed a preset threshold, stopping the iteration;
(2) if it exceeds the preset threshold, adjusting the values of the hyper-parameters with the following strategy: if the maximum frequency value in graph b exceeds the maximum frequency value in graph a by more than the preset threshold, increasing Kl, denoted Kl+1, and vice versa; if the abscissa of the axis of symmetry of the image in graph b is greater than that of the image in graph a, decreasing Bl, denoted Bl+1, and vice versa; wherein Kl is an integer, and l, Kl and Bl are updated to l+1, Kl+1 and Bl+1 respectively;
(3) if the iteration number l equals L, stopping the iteration.
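The iteration rule of this claim leaves two details open: the similarity measure between histograms a and b, and how a histogram's axis of symmetry is located. The sketch below assumes the maximum absolute difference of normalized histograms and the sample mean, respectively; `sample_fn` is a hypothetical callback returning the approximate-distribution values and the temporary sampling result for given K and B:

```python
import numpy as np

def tune_hyperparameters(sample_fn, K0, B0, L=10, thresh=0.05, bins=20):
    """Sketch of the claim-7 iteration. Assumed details (not in the
    claim): similarity is the max absolute difference of
    density-normalized histograms, and the histogram mean stands in for
    its axis of symmetry."""
    K, B = K0, B0
    for _ in range(L):                     # (3) stop after L iterations at most
        y, y_sampled = sample_fn(K, B)
        a, edges = np.histogram(y, bins=bins, range=(0.0, 1.0), density=True)
        b, _ = np.histogram(y_sampled, bins=edges, density=True)
        if np.abs(a - b).max() <= thresh:  # (1) histograms already similar: stop
            break
        # (2) adjust K by comparing peak frequencies ...
        if b.max() > a.max() + thresh:
            K += 1
        elif a.max() > b.max() + thresh:
            K = max(1, K - 1)
        # ... and B by comparing the (assumed) axes of symmetry
        if np.mean(y_sampled) > np.mean(y):
            B -= 1
        else:
            B += 1
    return K, B
```

When the sampled histogram already matches the approximate distribution, the loop exits on the first pass and the initial K and B are kept, which mirrors stopping condition (1).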
8. A corpus data sampling system based on adjacency density, comprising:
the data preprocessing module is used for carrying out regularization processing on the corpus data to obtain standardized corpus data;
the hyper-parameter optimization solving module is used for calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method; calculating a sampling probability and an approximate distribution of the original samples based on the adjacency densities; sampling samples according to the approximate distribution of the original samples to obtain a temporary sampling result; carrying out iteration solving on the hyper-parameters according to a preset iteration rule;
and the data sampling module is used for obtaining a final sampling result according to the determined super parameter value.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory for execution by the processor, wherein the processor, when executing the program, implements the adjacency-density-based corpus data sampling method according to any one of claims 1-7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the adjacency-density-based corpus data sampling method according to any one of claims 1-7.
CN202011185039.7A 2020-10-29 2020-10-29 Corpus data sampling method and system based on adjacency density Active CN112256823B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011185039.7A CN112256823B (en) 2020-10-29 2020-10-29 Corpus data sampling method and system based on adjacency density

Publications (2)

Publication Number Publication Date
CN112256823A true CN112256823A (en) 2021-01-22
CN112256823B CN112256823B (en) 2023-06-20

Family

ID=74268831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011185039.7A Active CN112256823B (en) 2020-10-29 2020-10-29 Corpus data sampling method and system based on adjacency density

Country Status (1)

Country Link
CN (1) CN112256823B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590764A (en) * 2021-09-27 2021-11-02 智者四海(北京)技术有限公司 Training sample construction method and device, electronic equipment and storage medium
CN116821647A (en) * 2023-08-25 2023-09-29 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977793A (en) * 2019-03-04 2019-07-05 东南大学 Trackside image pedestrian's dividing method based on mutative scale multiple features fusion convolutional network
CN111368304A (en) * 2020-03-31 2020-07-03 绿盟科技集团股份有限公司 Malicious sample category detection method, device and equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Ming et al.: "A classification algorithm for imbalanced data sets based on hybrid sampling", Journal of Chinese Computer Systems *
SU Guoshao et al.: "Gaussian process method for slope reliability analysis", Chinese Journal of Geotechnical Engineering *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590764A (en) * 2021-09-27 2021-11-02 智者四海(北京)技术有限公司 Training sample construction method and device, electronic equipment and storage medium
CN113590764B (en) * 2021-09-27 2021-12-21 智者四海(北京)技术有限公司 Training sample construction method and device, electronic equipment and storage medium
CN116821647A (en) * 2023-08-25 2023-09-29 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation
CN116821647B (en) * 2023-08-25 2023-12-05 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation

Also Published As

Publication number Publication date
CN112256823B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN112084790B (en) Relation extraction method and system based on pre-training convolutional neural network
Wang et al. Incorporating gan for negative sampling in knowledge representation learning
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
Lin et al. Learning cross-aligned latent embeddings for zero-shot cross-modal retrieval
CN109376242A (en) Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN114582470B (en) Model training method and device and medical image report labeling method
US11861925B2 (en) Methods and systems of field detection in a document
CN112256823A (en) Corpus data sampling method and system based on adjacency density
CN111209749A (en) Method for applying deep learning to Chinese word segmentation
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN113378563B (en) Case feature extraction method and device based on genetic variation and semi-supervision
CN112070139A (en) Text classification method based on BERT and improved LSTM
CN113987183A (en) Power grid fault handling plan auxiliary decision-making method based on data driving
Shan et al. Robust encoder-decoder learning framework towards offline handwritten mathematical expression recognition based on multi-scale deep neural network
CN114897167A (en) Method and device for constructing knowledge graph in biological field
CN114328939B (en) Natural language processing model construction method based on big data
CN113360582A (en) Relation classification method and system based on BERT model fusion multi-element entity information
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN113032573B (en) Large-scale text classification method and system combining topic semantics and TF-IDF algorithm
Wang et al. Tibetan word segmentation method based on bilstm_ crf model
CN111507103B (en) Self-training neural network word segmentation model using partial label set
Kong et al. Lena: Locality-expanded neural embedding for knowledge base completion
CN115546801A (en) Method for extracting paper image data features of test document
CN112270185A (en) Text representation method based on topic model
CN117891958B (en) Standard data processing method based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 12 / F, building 1, Aosheng building, 1166 Xinluo street, hi tech Zone, Jinan City, Shandong Province

Applicant after: Zhongyang Health Technology Group Co.,Ltd.

Address before: 12 / F, building 1, Aosheng building, 1166 Xinluo street, hi tech Zone, Jinan City, Shandong Province

Applicant before: SHANDONG MSUNHEALTH TECHNOLOGY GROUP Co.,Ltd.

GR01 Patent grant