CN112256823A - Corpus data sampling method and system based on adjacency density - Google Patents

Corpus data sampling method and system based on adjacency density

Info

Publication number: CN112256823A (application number CN202011185039.7A)
Authority: CN (China)
Prior art keywords: corpus data, sampling, samples, density, sample
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112256823B
Inventors: 张伯政, 吴军, 樊昭磊, 何彬彬
Current Assignee: Shandong Msunhealth Technology Group Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Shandong Msunhealth Technology Group Co Ltd

Events:
    • Application filed by Shandong Msunhealth Technology Group Co Ltd
    • Priority to CN202011185039.7A
    • Publication of CN112256823A
    • Application granted
    • Publication of CN112256823B
    • Legal status: Active

Classifications

    • G06F16/313 Selection or weighting of terms for indexing
    • G06F16/316 Indexing structures
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/3347 Query execution using vector based model
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

The present disclosure provides a corpus data sampling method and system based on adjacency density, comprising the following steps: performing regularization processing on the corpus data to obtain standardized corpus data; calculating the adjacency density of the sample points in the standardized corpus data with a distance metric; calculating the approximate distribution of the corpus data samples based on the adjacency density; drawing samples according to the approximate distribution to obtain a temporary sampling result; and iterating according to a preset iteration rule to solve for the optimal hyper-parameter values, then obtaining the final corpus data sampling result from the determined optimal values. By measuring the neighboring area with density, the scheme samples less where the samples are dense and more where they are sparse; it suits the data screening process before a natural language corpus labeling task and avoids the problems of too many near-duplicate samples and too few sparse samples. Meanwhile, effective substitute samples of the original samples are found through repeated iterative search, which improves the comprehensiveness of the sampled samples.

Description

Corpus data sampling method and system based on adjacency density
Technical Field
The present disclosure relates to the field of data sampling technologies, and in particular to a corpus data sampling method and system based on adjacency density.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The traditional medical field has accumulated a huge amount of patient case text, and the application of supervised natural language processing algorithms in the medical field, including Named Entity Recognition (NER), Relation Extraction (RE), and syntactic analysis, depends heavily on the quality of labeled sample data. Data labeling, however, is usually done manually, so avoiding repeated and invalid labeling, reducing labeling time and manual effort, and improving labeling quality all matter. How to extract corpus data of wide coverage and acceptable quantity from the original data set, and to carry out effective data labeling, analysis, and mining on that basis, is a problem that current supervised training methods urgently need to solve.
Text data in the natural language field is often highly abstract, with mixed and repeated information. A sampling method therefore needs to eliminate samples carrying repeated information while keeping effective information as comprehensively and accurately as possible, providing effective samples for data labeling in tasks such as named entity recognition and relation extraction. Hospital case texts range from dozens of characters to thousands; case writing often follows specific format conventions and is full of homogeneous information, such as the "past history" entry: "history of hypertension and cerebral infarction." To save time, physicians usually write cases from templates, so the writing styles of case texts do not differ greatly; as a result, the sample points in the case-text sample space are relatively tightly packed, their differences are not obvious, and homogenization is serious.
The inventors find that current data sampling methods are numerous. Random sampling is simple and direct, but because of the complexity of text data, labeling requires a degree of manual intervention, and strictly following a random principle is unreasonable. Unequal-proportion stratified sampling can analyze textual differences to some degree, but how to properly pre-classify the texts is itself a problem. Existing traditional sampling methods cannot meet the data-extraction needs of the natural language field and cannot distinguish the homogeneous part of a sample, which in actual data labeling leads to rare samples going unlabeled and to repeated labeling.
Disclosure of Invention
The method measures the neighboring area by density, so that dense parts of the sample are sampled less and sparse parts are sampled more; it is suitable for the data screening process before a natural language corpus labeling task and effectively avoids the problems of too many near-duplicate samples and too few sparse samples.
According to a first aspect of the embodiments of the present disclosure, there is provided a corpus data sampling method based on adjacency density, including:
carrying out regularization processing on the corpus data to obtain standardized corpus data;
calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method;
calculating corpus data sample approximate distribution based on the adjacency density;
drawing samples according to the approximate distribution of the corpus data samples to obtain a temporary sampling result;
and carrying out iteration according to a preset iteration rule to solve the optimal hyper-parameter value, and obtaining a final corpus data sampling result according to the determined optimal hyper-parameter value.
Further, performing regularization processing on the corpus data means giving the corpus data a mathematical representation, specifically: first, define the corpus data as a text sequence set comprising a plurality of samples, each sample composed of a plurality of single characters; second, index the single characters in each sample with a coding index algorithm to obtain the vector representation of the text sequence set.
Further, the regularization process further includes setting a weight matrix and a word embedding matrix, and converting the vector representation of the text sequence set into a word embedding vector representation of a text training set.
Further, calculating the adjacency density comprises calculating a density matrix of the sample points in the standardized corpus data with a distance metric and, for each sample point, taking the mean of the first K density values sorted from large to small based on the density matrix; this mean is the adjacency density of the sample point.
Further, the approximate distribution of the corpus data samples is calculated by a contraction-and-translation method. The hyper-parameter B is a distribution bias coefficient, N is the total number of texts in the original data set, and n is the number of samples to be drawn; i and n do not exceed N and are positive integers. θ denotes the adjacency density of the sample points in the corpus data; after rescaling, it approximately obeys a uniform distribution U(a, b), where a and b are the parameters to be fitted, and y = {y_1, y_2, ..., y_n} denotes the approximate distribution of the original data set, which approximately obeys a uniform distribution U(0, 1).
Further, the sampling method specifically comprises: generating pseudo-random numbers ξ_1, ξ_2, ..., ξ_n uniformly distributed over [0, 1] and obeying U(0, 1) with a random number generator, and adopting the following sampling rule: define Mark; if ξ_i < y_i, set Mark_i = 1 and select the sample; otherwise set Mark_i = 0 and discard it, where i does not exceed n and is a positive integer, and n is the number of samples (a positive integer). The samples with Mark constantly equal to 1 form the temporary sampling result ŷ.
Further, the specific steps of iteratively solving for the hyper-parameters according to the preset iteration rule include:
initializing hyper-parameters K_l and B_l; iteratively calculating the frequency histogram (graph a) of the approximate distribution of the corpus data samples and the frequency histogram (graph b) of the temporary sampling result; calculating the similarity of graph a and graph b; and performing the following operations according to the similarity result:
(1) if the similarity difference between graph a and graph b does not exceed a preset threshold, stopping the iteration;
(2) if the frequency maximum in graph b exceeds the frequency maximum in graph a by more than the preset threshold, increasing K_l, denoted K_{l+1}, and vice versa; if the abscissa of the symmetry axis of the image in graph b is greater than that in graph a, decreasing B_l, denoted B_{l+1}, and vice versa, where K_l is an integer; then updating l, K_l, B_l to l+1, K_{l+1}, B_{l+1} respectively;
(3) if the iteration count l equals L, stopping the iteration.
According to a second aspect of the embodiments of the present disclosure, there is provided a corpus data sampling system based on adjacency density, including:
the data preprocessing module is used for carrying out regularization processing on the corpus data to obtain standardized corpus data;
the hyper-parameter optimization solving module is used for calculating the adjacency density of the sample points in the standardized corpus data with a distance metric, calculating the approximate distribution of the corpus data samples based on the adjacency density, drawing samples according to the approximate distribution to obtain a temporary sampling result, and iteratively solving for the hyper-parameters according to a preset iteration rule;
and the data sampling module is used for obtaining a final corpus data sampling result according to the determined optimal hyper-parameter value.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the corpus data sampling method based on adjacency density.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the corpus data sampling method based on adjacency density.
Compared with the prior art, the beneficial effect of this disclosure is:
(1) The sample sampling method of this scheme measures the neighboring area by density, so that dense parts of the sample are sampled less and sparse parts are sampled more; it suits the data screening process before a natural language corpus labeling task and avoids the problems of too many near-duplicate samples and too few sparse samples.
(2) The scheme searches for effective substitute samples of the original samples through repeated iterative search, improving the comprehensiveness of the sampled samples.
(3) The scheme calculates the adjacency density of the sample points in the standardized corpus data with a distance metric, effectively addressing the relatively compact sample points, indistinct differences, and serious homogenization in the sample space of conventional sampling methods.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a flowchart of the corpus data sampling method based on adjacency density according to the first embodiment of the present disclosure;
fig. 2(a) is a frequency histogram of an approximate distribution of original samples according to a first embodiment of the disclosure;
fig. 2(b) is a frequency histogram of the temporary iterative sampling result in the first embodiment of the disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The first embodiment is as follows:
the embodiment aims to provide a corpus data sampling method based on adjacent density.
Aiming at the problem that existing data sampling methods cannot handle homogeneous samples, the present disclosure provides a corpus data sampling method based on adjacency density; the concrete steps are as follows:
S101: Define the text sequence set D = {D_1, D_2, ..., D_n}, where n is the total number of samples (a positive integer), D_i denotes the i-th sample, D_i = {S_{i,1}, S_{i,2}, ..., S_{i,m}}, S_{i,j} denotes the j-th single character in the i-th sample, and m is the maximum number of single characters over all samples in D (i.e., the maximum sentence length, a positive integer); i and j are positive integers with i ≤ n and j ≤ m.
S102: Define the vector representation V = {V_1, V_2, ..., V_n} ∈ R^{n×m} of the text sequence set, where V_i ∈ R^m is the one-hot code index of the i-th text. The one-hot coding is based on a custom dictionary containing 4754 commonly used Chinese single characters; for example, a single character with one-hot index 3 is represented by a 4754-dimensional vector whose 3rd position is 1 and all other positions are 0.
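The indexing step above can be sketched as follows; a toy dictionary stands in for the patent's 4754-character one, and all names are illustrative:

```python
# Sketch of the S102 index representation (toy dictionary, not the
# patent's 4754-character one; function names are illustrative).
def build_index(dictionary):
    # Map each single character to its 1-based index in the dictionary.
    return {ch: i + 1 for i, ch in enumerate(dictionary)}

def encode(text, char2idx, m, pad=0):
    # Index each character; unknown characters and padding map to 0.
    ids = [char2idx.get(ch, pad) for ch in text[:m]]
    return ids + [pad] * (m - len(ids))

dictionary = ["高", "血", "压", "脑", "梗", "死", "史"]
char2idx = build_index(dictionary)
V1 = encode("高血压史", char2idx, m=6)
print(V1)  # [1, 2, 3, 7, 0, 0]
```

The index sequence can then be expanded into one-hot vectors of the dictionary length when needed.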
S103: word-embedded vector representation E defining a set of text sequences, E ═ E1,E2,...,EnIn which Ei∈Remb _dimEmd _ dim is the dimension of word embedding and is a positive integer, n is the number of samples and is a positive integer; in particular, the method comprises the following steps of,
Ei=Mean(W*Embedding(Vi))。
wherein, Embedding (V)i)∈Rm×emb_dimRepresenting a word embedding matrix, W ∈ Rm×emb_dimRepresenting a weight matrix, wherein the numerical value of the weight matrix can be selected from TF-IDF weight, m is the maximum number of sentence lengths, and operation represents dot multiplication; generating a Word embedding matrix, wherein the Word embedding matrix can be generated by using a pre-trained weight coefficient or by selecting a mature pre-trained model for retraining, such as a Word2Vec model, a BERT model, an ALBERT model and the like; operation of mean (S) epsilon Remd_dimExpressing the mean function operation, the formula is:
Figure BDA0002751175110000061
the mean operation here operates on the first dimension of the tensor, where i is a positive integer and i is ≦ m.
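The weighted mean of S103 can be sketched with NumPy; the matrices and dimensions below are random stand-ins, not trained values:

```python
import numpy as np

# Sketch of E_i = Mean(W * Embedding(V_i)) from S103, with random
# stand-in matrices (m, emb_dim, and the weights are illustrative).
rng = np.random.default_rng(0)
m, emb_dim = 5, 4
emb = rng.normal(size=(m, emb_dim))    # Embedding(V_i): one row per character
W = rng.uniform(size=(m, emb_dim))     # weight matrix, e.g. TF-IDF weights

E_i = (W * emb).mean(axis=0)           # element-wise product, mean over dim 0
print(E_i.shape)  # (4,)
```

Averaging over the first (character) dimension collapses a variable-length text into a single emb_dim-dimensional vector, which is what the later distance computations operate on.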
S104: data normalization, which takes the form:
Figure BDA0002751175110000062
where Mean represents the Mean function and σ (z) represents the standard deviation of the vector zAnd e represents an infinitesimal quantity; z here represents the result of the processing in step S103, and after data normalization, the data dimension remains unchanged, i.e. Rn×emb_dim(ii) a The reason for adopting data standardization is mainly to prevent the data from being too discrete and too large in variance after being transformed.
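The standardization step above can be sketched directly; ε guards against a zero standard deviation (the sample values are illustrative):

```python
import numpy as np

# Sketch of the S104 standardization z' = (z - Mean(z)) / (sigma(z) + eps);
# eps plays the role of the infinitesimal quantity in the text.
def standardize(z, eps=1e-8):
    return (z - z.mean()) / (z.std() + eps)

z = np.array([1.0, 2.0, 3.0, 4.0])
z_std = standardize(z)
print(z_std.mean().round(8), z_std.std().round(4))  # ~0.0 ~1.0
```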
S105: Define the density matrix DE of the text samples. Corpus data sampling follows this principle: sample as little as possible where the samples are dense (homogeneous samples) and as much as possible where they are sparse. Sample sparsity is measured with a distance metric, for example MSE (Mean Squared Error); by the sampling principle a tightly packed point should be drawn with lower probability than a sparse point, and the MSE values of a sample point indirectly measure the area around it, i.e., its probability. MSE is defined as follows:

MSE(y_i, y_j) = (1/n) Σ_{k=1}^{n} (y_{i,k} − y_{j,k})²

where n denotes the sample dimension, y denotes a sample, and i, j are positive integers with i ≤ n and j ≤ n. The density matrix DE calculated from the result of step S104 has dimension R^{n×emb_dim}. Other metrics similar to MSE, such as the Manhattan, Minkowski, or Cosine distance, can also be used; the principle is similar.
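A pairwise MSE computation can be sketched as follows. Note this sketch assumes an n-by-n pairwise reading of the distance computation, which is one plausible interpretation; the patent states a different dimension for DE, so treat the layout here as an assumption:

```python
import numpy as np

# Sketch of pairwise MSE distances between standardized sample vectors
# (an n-by-n reading of the density computation; the exact layout of DE
# in the patent text is ambiguous, so this is an assumption).
def mse_matrix(X):
    # X: (n, d) array of sample vectors.
    diff = X[:, None, :] - X[None, :, :]      # (n, n, d) pairwise differences
    return (diff ** 2).mean(axis=-1)          # MSE between every pair

X = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
DE = mse_matrix(X)
print(DE[0, 1], DE[0, 2])  # 1.0 100.0
```

A point far from all others (here the third row) gets large distance values, which later translates into a higher sampling probability.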
S106: Define the sample density θ. From the existing density matrix DE = {η_1, η_2, ..., η_n}, where η_i ∈ R^{emb_dim}, define a hyper-parameter K; then

θ_i = top_mean_K(η_i)

where top_mean_K sorts the data in descending order (from large to small) and takes the mean of the first K values. The sample density is then represented as

θ = {θ_1, θ_2, ..., θ_n}

where i ≤ n is a positive integer and n is the number of samples (a positive integer).
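The top_mean_K operation can be sketched in a few lines (the matrix values are illustrative):

```python
import numpy as np

# Sketch of theta_i = top_mean_K(eta_i) from S106: sort each row of the
# density matrix in descending order and average the first K values.
def top_mean_k(DE, K):
    top = np.sort(DE, axis=1)[:, ::-1][:, :K]  # K largest values per row
    return top.mean(axis=1)

DE = np.array([[0.1, 0.9, 0.5],
               [2.0, 1.0, 3.0]])
theta = top_mean_k(DE, K=2)
print(theta)  # [0.7 2.5]
```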
S107: defining the sampling probability and sample approximate distribution
Figure BDA0002751175110000072
Then
Figure BDA0002751175110000073
The approximation obeys a uniform distribution of U (a, b), where a, b are the parameters to be fitted.
Defining a super parameter B as a distributed bias coefficient, defining the total quantity of texts in an original data set as N and the number of quasi-samples as N, and operating the original distribution by adopting a scaling and translation method. Order to
Figure BDA0002751175110000074
Then we y ═ y1,y2,...,ynThere is a uniform distribution that obeys approximately U (0,1), where i, N ≦ N and is a positive integer, and N is the number of samples and is a positive integer.
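The patent's exact scaling-and-translation formulas are not reproduced in this text, so the mapping below is a guessed reconstruction, not the patent's formula: min-max rescale θ toward U(0, 1), shrink by the target fraction n/N, and translate by the bias coefficient B:

```python
import numpy as np

# Guessed reconstruction of the S107 contraction-and-translation step
# (the patent's concrete formulas are unavailable here; treat every
# line of this mapping as an assumption).
def approx_distribution(theta, n, N, B):
    a, b = theta.min(), theta.max()            # fitted endpoints of U(a, b)
    u = (theta - a) / (b - a)                  # rescale to roughly U(0, 1)
    return u * (n / N) + B                     # shrink by n/N, translate by B

theta = np.array([0.2, 0.5, 0.8, 1.1])
y = approx_distribution(theta, n=2, N=4, B=0.05)
print(y)
```

Under this reading, the mean of y is close to the desired selection fraction n/N plus the bias B, which is consistent with the acceptance rule of step S108.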
S108: samples are drawn according to the approximate distribution of samples, and n random numbers xi uniformly distributed in the range of 0-1 and subject to U (0,1) are generated by a random number generator according to the description in step S1071,ξ2,...,ξn. The following sampling rule is adopted: define Mark, if xii<yi,MarkiSelecting the sample as 1; otherwise, MarkiDiscard the sample as 0; wherein i is not more than n and is a positive integer, and n is the number of samples and is a positive integer; so we can get samples where Mark is constantly equal to 1, i.e.
Figure BDA0002751175110000081
This is taken as a provisional sampling result, where M is some positive integer close to the initial number of samples N.
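The acceptance rule of S108 can be sketched as follows; the per-sample values y are illustrative:

```python
import numpy as np

# Sketch of the S108 sampling rule: draw xi_i ~ U(0, 1) and keep sample
# i whenever xi_i < y_i (Mark_i = 1); the y values are illustrative.
rng = np.random.default_rng(42)
y = np.array([0.9, 0.1, 0.8, 0.05, 0.7])   # per-sample acceptance levels
xi = rng.uniform(0.0, 1.0, size=y.size)
mark = (xi < y).astype(int)                # Mark_i in {0, 1}
selected = np.flatnonzero(mark)            # indices of the kept samples
print(mark.tolist())
```

Because each sample is kept with probability y_i, the expected number of kept samples is sum(y), so the scaled distribution from S107 directly controls the sampling volume.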
S109: initializing a hyperparameter K01, B, an ultra-parameter0And setting the iteration number L as 0, wherein L is more than 0 and less than or equal to L.
S110: for the l iteration, the following steps are performed in sequence:
step S106 is executed to obtain
Figure BDA0002751175110000082
And
Figure BDA0002751175110000083
as the sample density;
step S107 is executed to obtain
Figure BDA0002751175110000084
As an approximate distribution of the original samples;
step S108 is executed to obtain
Figure BDA0002751175110000085
As a result of the temporal iterative sampling;
s111: output of
Figure BDA0002751175110000086
And
Figure BDA0002751175110000087
the frequency histograms (denoted as graph a and graph b) of (a), determining the image contrast of graph a and graph b, and executing the following steps:
i) if the frequency histogram images of the two frequency histogram images are approximately the same, stopping iteration;
ii) stopping the iteration if the iteration number L is equal to L;
iii) otherwise, adjusting the value of the hyperparameter. The adjustment strategy is as follows, if the maximum value of the image in the graph b is larger than the maximum value in the graph a and the difference is larger, K is increasedlIs marked as Kl+1And vice versa; if the abscissa of the position of the axis of symmetry of the image in graph B is greater than the abscissa of the axis of symmetry of the image in graph a, B is decreasedlIs marked as Bl+1And vice versa; wherein KlAnd BlThe increase and decrease range depends on the actual situation, but K needs to be ensuredlAnd are integers. Updating l, Kl、BlAre respectively l +1 and Kl+1、Bl+1Step S110 is performed.
In general, the hyper-parameter K is related to the maximum value of the sample distribution, and the hyper-parameter B is related to the symmetry axis of the sample distribution.
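One adjustment round of the loop above can be sketched as follows; the similarity test, thresholds, and step sizes are illustrative choices, not the patent's:

```python
import numpy as np

# Sketch of one S110/S111 adjustment round: compare frequency histograms
# of the approximate distribution (graph a) and the temporary sampling
# result (graph b), then nudge K and B.  All thresholds and step sizes
# here are illustrative assumptions.
def adjust_hyperparams(K, B, y, yhat, bins=20, step_B=0.01):
    a, _ = np.histogram(y, bins=bins, range=(0.0, 1.0))
    b, _ = np.histogram(yhat, bins=bins, range=(0.0, 1.0))
    if b.max() > a.max():
        K += 1                      # graph b peaks higher -> increase K
    elif b.max() < a.max():
        K = max(1, K - 1)           # and vice versa
    if yhat.mean() > y.mean():      # symmetry axis further right in graph b
        B -= step_B                 # -> decrease B
    elif yhat.mean() < y.mean():
        B += step_B                 # and vice versa
    return K, B

K, B = adjust_hyperparams(K=1, B=0.0,
                          y=np.linspace(0, 1, 100),
                          yhat=np.linspace(0, 0.5, 100))
print(K, B)
```

Using the histogram mean as a stand-in for the symmetry axis is a simplification; any robust estimate of the distribution center would serve the same role.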
S112: After the values of the hyper-parameters K and B are determined, output ŷ and the Mark data according to step S108 as the final sampling result of the original data set.
According to the sample sampling method provided by the disclosure, a method for measuring the adjacent area by density is adopted, less sampling at a dense part and more sampling at a sparse part of a sample can be realized, the method is suitable for a data screening process before a natural language corpus labeling task, and the problems of too many approximate samples and too few sparse samples are avoided; meanwhile, according to the scheme disclosed by the invention, effective substitute samples of the original samples are searched through multiple iterative searches, so that the comprehensiveness of the sampled samples is improved.
Further, to demonstrate the feasibility of the disclosed scheme, this embodiment verifies it on medical case text sampling; the specific steps are as follows:
There are 124582 existing medical case texts, from which 9500 are to be extracted for the subsequent data annotation task; all of the following processing is done with a Python program.
(1) Define the text sequence set, with n = 124582; the case texts vary in length, and after preliminary statistics the case text length is set to 800.
(2) Define the vector representation V of the text set and generate V = {V_1, V_2, ..., V_n} ∈ R^{n×m} based on the Chinese single-character dictionary; the length of the selected single-character dictionary is 4754.
(3) Define the word-embedding vector representation E of the text set, and convert the vectors from step (2) into R^{n×80} vectors based on an 80-dimensional pre-trained word-embedding weight matrix; the weight matrix was obtained by training a two-layer LSTM network on original corpora such as medical-field textbooks and case texts.
(4) Data normalization was performed.
(5) Execute step S105 and calculate the adjacency density of the sample points based on the MSE metric; the dimension of the density matrix DE is R^{n×80}.
(6) Initialize the hyper-parameters K_0 = 1 and B_0 = 0; the iteration count is set to satisfy 0 < l ≤ 20.
(7) For the 1st iteration, execute step S106 above to obtain θ^(1) = {θ_1^(1), ..., θ_n^(1)} as the sample density;
(8) execute step S107 above to obtain y^(1) as the approximate distribution of the original samples;
(9) execute step S108 above to obtain ŷ^(1) as the temporary iterative sampling result;
(10) The number of samples to extract is n = 9500. Using a Python program, draw the frequency histograms of y^(l) and ŷ^(l) obtained in step (9), compare them, and adjust the hyper-parameter values according to the adjustment strategy of step S111. The image of the original-sample distribution y finally generated through multiple iterations is shown in fig. 2(a). Finally, with the hyper-parameters K = 10 and B = 0.03, the temporary sampling result shown in fig. 2(b) is obtained, where M is the final number of samples, M = 9618; the two images are intuitively similar.
(11) After the values of the hyper-parameters K and B are determined, output ŷ and the Mark data according to step S108 as the final sampling result for the original data set.
After the values of the hyper-parameters K and B have been determined for a given original data set through the multi-step iteration process, the number of drawn samples can be changed as often as actually needed by modifying only the sample count in step (10); the complete process does not need to be executed again.
The method is mainly used to solve the problems of over-sampling similar samples and under-sampling differentiated samples in the preparation work before corpus data annotation, making the sampling result more comprehensive and representative and improving the efficiency and quality of data annotation. In actual processing, the images of the original samples and the sampled samples need to be compared and the hyper-parameters dynamically adjusted to improve the representativeness of the sampled samples. The example presented in this disclosure is an exemplary case of medical case text sampling; the idea and method can be applied to corpus sampling of texts in other fields. Other embodiments obtained without departing from the principles, methods, and teachings of the present disclosure are within the scope of the present disclosure.
Example two:
the embodiment aims to provide a corpus data sampling system based on adjacent density.
A corpus data sampling system based on neighborhood density, comprising:
the data preprocessing module is used for carrying out regularization processing on the corpus data to obtain standardized corpus data;
the hyper-parameter optimization solving module is used for calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method; calculating the approximate distribution of the corpus data samples based on the adjacency density; sampling according to the approximate distribution of the corpus data samples to obtain a temporary sampling result; and iteratively solving the hyper-parameters according to a preset iteration rule;
and the data sampling module is used for obtaining a final corpus data sampling result according to the determined optimal hyper-parameter value.
Example three:
This embodiment aims to provide an electronic device.
An electronic device comprising a memory, a processor, and a computer program stored in the memory for execution by the processor, wherein the processor, when executing the computer program, implements the adjacency-density-based corpus data sampling method described above, comprising:
carrying out regularization processing on the corpus data to obtain standardized corpus data;
calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method;
calculating corpus data sample approximate distribution based on the adjacency density;
sampling according to the approximate distribution of the corpus data samples to obtain a temporary sampling result;
and iterating according to a preset iteration rule to solve for the optimal hyper-parameter values, and obtaining the final corpus data sampling result according to the determined optimal hyper-parameter values.
Example four:
it is an object of the present embodiments to provide a non-transitory computer-readable storage medium.
A non-transitory computer-readable storage medium, on which a computer program is stored, the program, when executed by a processor, implementing the adjacency-density-based corpus data sampling method described above, comprising:
carrying out regularization processing on the corpus data to obtain standardized corpus data;
calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method;
calculating corpus data sample approximate distribution based on the adjacency density;
sampling according to the approximate distribution of the corpus data samples to obtain a temporary sampling result;
and iterating according to a preset iteration rule to solve for the optimal hyper-parameter values, and obtaining the final corpus data sampling result according to the determined optimal hyper-parameter values.
The corpus data sampling method and system based on adjacency density proposed by the present disclosure can be fully implemented and have broad application prospects.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A corpus data sampling method based on adjacency density is characterized by comprising the following steps:
carrying out regularization processing on the corpus data to obtain standardized corpus data;
calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method;
calculating corpus data sample approximate distribution based on the adjacency density;
sampling according to the approximate distribution of the corpus data samples to obtain a temporary sampling result;
and iteratively solving the hyper-parameters according to a preset iteration rule, and obtaining a final corpus data sampling result according to the determined optimal hyper-parameter values.
2. The method for sampling corpus data according to claim 1, wherein the regularization of the corpus data is a mathematical representation of the corpus data, and specifically comprises: first, defining the corpus data as a text sequence set, wherein the text sequence set comprises a plurality of samples and each sample is composed of a plurality of single characters; secondly, performing index representation on the single characters in each sample by using a coding index algorithm to obtain a vector representation of the text sequence set.
3. The method for sampling corpus data according to claim 1, wherein said regularizing further comprises setting a weight matrix and a word embedding matrix, and converting vector representations of said text sequence set into word embedding vector representations of a text training set.
4. The method for sampling corpus data according to claim 1, wherein the calculation of the adjacency density comprises: calculating a density matrix of the sample points in the standardized corpus data by using a distance measurement method, and selecting, for each sample point, the mean value of the first K data when the density values are sorted from large to small based on the density matrix, which is the adjacency density of that sample point.
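For illustration only, the adjacency-density calculation of this claim can be sketched in Python. The claim fixes neither the distance metric nor how a distance becomes a density; the Euclidean distance and the 1/(1+d) transform used below are assumptions of this sketch, not the patent's specification:

```python
import numpy as np

def adjacency_density(X, K):
    """Sketch of the claim-4 computation: for each sample point, take the
    mean of the K largest values in its row of a density matrix.

    Assumed details (not fixed by the claim): Euclidean distance between
    sample vectors, converted to a density via 1 / (1 + d)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distance matrix
    density = 1.0 / (1.0 + dist)               # closer points -> higher density
    np.fill_diagonal(density, 0.0)             # exclude each point's self-density
    # mean of the first K density values sorted from large to small
    topk = np.sort(density, axis=1)[:, -K:]
    return topk.mean(axis=1)
```

On a toy set, a point far from all others receives a lower adjacency density than points inside a tight cluster, which is what the later down-weighting of dense regions relies on.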
5. The method for sampling corpus data according to claim 1, wherein the approximate distribution of the corpus data samples is calculated by a method of square reduction and translation, with the concrete formula:
Figure FDA0002751175100000021
wherein the hyper-parameter B is a distribution bias coefficient, N is the total number of texts in the original data set, n is the quasi-sampling number, and i and n in the formula are positive integers not greater than N, wherein
Figure FDA0002751175100000022
θ represents the adjacency density of the sample points in the corpus data,
Figure FDA0002751175100000023
approximately obeys the uniform distribution U(a, b), where a and b are the parameters to be fitted, and y = {y1, y2, ..., yn} denotes the approximate distribution of the original data set, which approximately obeys the uniform distribution U(0, 1).
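The concrete formula of this claim survives only as image placeholders and cannot be reproduced here. Purely as an illustration of the stated intent — down-weighting high-adjacency-density samples, biasing by B, and targeting a quasi-sampling number n — one assumed form is:

```python
import numpy as np

def approximate_distribution(theta, n, B):
    """Illustrative stand-in for the claim-5 formula (the patent's exact
    formula is in an image and is NOT reproduced here; this form is an
    assumption of this sketch).

    Assumed form: normalize the adjacency densities theta to [0, 1],
    invert them so dense (redundant) samples get small values, raise to
    the bias coefficient B, and scale so the expected number of accepted
    samples is roughly the quasi-sampling number n."""
    theta = np.asarray(theta, dtype=float)
    t = (theta - theta.min()) / (theta.max() - theta.min() + 1e-12)
    y = (1.0 - t) ** B                  # dense regions are down-weighted
    y = y * n / max(y.sum(), 1e-12)     # scale expected acceptances toward n
    return np.clip(y, 0.0, 1.0)        # each y_i is used as an acceptance value
```

Whatever the exact formula, the output plays the role of y in claim 6: a per-sample acceptance value in [0, 1].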
6. The method for sampling corpus data according to claim 1, wherein the step of sampling comprises: generating, by a random number generator, pseudo-random numbers ξ1, ξ2, ..., ξn uniformly distributed in the range 0-1 and obeying U(0, 1), and adopting the following sampling rule: define Mark; if ξi < yi, then Marki = 1 and the sample is selected; otherwise Marki = 0 and the sample is discarded; wherein i is a positive integer not greater than n, and n is the number of samples and is a positive integer; the samples for which Mark is identically equal to 1 are thereby obtained, i.e. the temporary sampling result:
Figure FDA0002751175100000024
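The sampling rule of this claim amounts to independent acceptance tests: draw ξi ~ U(0, 1) and keep sample i iff ξi < yi. A minimal sketch (the function name and NumPy usage are this sketch's choices):

```python
import numpy as np

def mark_sampling(y, rng=None):
    """Claim-6 sampling rule: generate pseudo-random numbers
    xi_1..xi_n ~ U(0, 1); set Mark_i = 1 (select) if xi_i < y_i,
    otherwise Mark_i = 0 (discard)."""
    rng = np.random.default_rng(rng)
    xi = rng.uniform(0.0, 1.0, size=len(y))
    mark = (xi < np.asarray(y)).astype(int)
    selected = np.flatnonzero(mark == 1)   # indices forming the temporary sampling result
    return mark, selected
```

Since ξi lies in [0, 1), a sample with yi = 1 is always kept and a sample with yi = 0 is always discarded.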
7. The method for sampling corpus data according to claim 1, wherein the step of iteratively solving the hyper-parameters according to the preset iteration rule comprises:
initializing hyper-parameters Kl and Bl; iteratively calculating a frequency histogram a of the approximate distribution of the corpus data samples and a frequency histogram b of the temporary sampling result; calculating the similarity of graph a and graph b, and performing the following operations according to the similarity result:
(1) if the similarity calculation result of graph a and graph b does not exceed a preset threshold, stopping the iteration;
(2) if it exceeds the preset threshold, adjusting the values of the hyper-parameters with the following strategy: if the maximum frequency value in graph b exceeds the maximum frequency value in graph a by more than the preset threshold, increasing Kl, denoted Kl+1, and vice versa; if the abscissa of the axis of symmetry of the image in graph b is greater than that of the image in graph a, decreasing Bl, denoted Bl+1, and vice versa; wherein Kl is an integer, and l, Kl and Bl are updated to l+1, Kl+1 and Bl+1 respectively;
(3) if the iteration number l equals L, stopping the iteration.
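The iteration rule of this claim leaves two details open: the similarity measure between histograms a and b, and how a histogram's axis of symmetry is located. The sketch below assumes the maximum absolute difference of normalized histograms and the sample mean, respectively; `sample_fn` is a hypothetical callback returning the approximate-distribution values and the temporary sampling result for given K and B:

```python
import numpy as np

def tune_hyperparameters(sample_fn, K0, B0, L=10, thresh=0.05, bins=20):
    """Sketch of the claim-7 iteration. Assumed details (not in the
    claim): similarity is the max absolute difference of
    density-normalized histograms, and the histogram mean stands in for
    its axis of symmetry."""
    K, B = K0, B0
    for _ in range(L):                     # (3) stop after L iterations at most
        y, y_sampled = sample_fn(K, B)
        a, edges = np.histogram(y, bins=bins, range=(0.0, 1.0), density=True)
        b, _ = np.histogram(y_sampled, bins=edges, density=True)
        if np.abs(a - b).max() <= thresh:  # (1) histograms already similar: stop
            break
        # (2) adjust K by comparing peak frequencies ...
        if b.max() > a.max() + thresh:
            K += 1
        elif a.max() > b.max() + thresh:
            K = max(1, K - 1)
        # ... and B by comparing the (assumed) axes of symmetry
        if np.mean(y_sampled) > np.mean(y):
            B -= 1
        else:
            B += 1
    return K, B
```

When the sampled histogram already matches the approximate distribution, the loop exits on the first pass and the initial K and B are kept, which mirrors stopping condition (1).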
8. A corpus data sampling system based on adjacency density, comprising:
the data preprocessing module is used for carrying out regularization processing on the corpus data to obtain standardized corpus data;
the hyper-parameter optimization solving module is used for calculating the adjacency density of the sample points in the standardized corpus data by using a distance measurement method; calculating a sampling probability and an approximate distribution of the original samples based on the adjacency densities; sampling samples according to the approximate distribution of the original samples to obtain a temporary sampling result; carrying out iteration solving on the hyper-parameters according to a preset iteration rule;
and the data sampling module is used for obtaining a final sampling result according to the determined super parameter value.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory for execution by the processor, wherein the processor, when executing the program, implements the adjacency-density-based corpus data sampling method according to any one of claims 1-7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the adjacency-density-based corpus data sampling method according to any one of claims 1-7.
CN202011185039.7A 2020-10-29 2020-10-29 Corpus data sampling method and system based on adjacency density Active CN112256823B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011185039.7A CN112256823B (en) 2020-10-29 2020-10-29 Corpus data sampling method and system based on adjacency density

Publications (2)

Publication Number Publication Date
CN112256823A true CN112256823A (en) 2021-01-22
CN112256823B CN112256823B (en) 2023-06-20

Family

ID=74268831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011185039.7A Active CN112256823B (en) 2020-10-29 2020-10-29 Corpus data sampling method and system based on adjacency density

Country Status (1)

Country Link
CN (1) CN112256823B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590764A (en) * 2021-09-27 2021-11-02 智者四海(北京)技术有限公司 Training sample construction method and device, electronic equipment and storage medium
CN116821647A (en) * 2023-08-25 2023-09-29 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977793A (en) * 2019-03-04 2019-07-05 东南大学 Trackside image pedestrian's dividing method based on mutative scale multiple features fusion convolutional network
CN111368304A (en) * 2020-03-31 2020-07-03 绿盟科技集团股份有限公司 Malicious sample category detection method, device and equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Ming et al.: "A classification algorithm for imbalanced data sets based on hybrid sampling", Journal of Chinese Computer Systems *
SU Guoshao et al.: "Gaussian process method for slope reliability analysis", Chinese Journal of Geotechnical Engineering *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590764A (en) * 2021-09-27 2021-11-02 智者四海(北京)技术有限公司 Training sample construction method and device, electronic equipment and storage medium
CN113590764B (en) * 2021-09-27 2021-12-21 智者四海(北京)技术有限公司 Training sample construction method and device, electronic equipment and storage medium
CN116821647A (en) * 2023-08-25 2023-09-29 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation
CN116821647B (en) * 2023-08-25 2023-12-05 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation

Also Published As

Publication number Publication date
CN112256823B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN112084790B (en) Relation extraction method and system based on pre-training convolutional neural network
Wang et al. Incorporating gan for negative sampling in knowledge representation learning
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
Lin et al. Learning cross-aligned latent embeddings for zero-shot cross-modal retrieval
CN109376242A (en) Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN114582470B (en) Model training method and device and medical image report labeling method
US11861925B2 (en) Methods and systems of field detection in a document
CN112256823A (en) Corpus data sampling method and system based on adjacency density
CN111209749A (en) Method for applying deep learning to Chinese word segmentation
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN113378563B (en) Case feature extraction method and device based on genetic variation and semi-supervision
CN112070139A (en) Text classification method based on BERT and improved LSTM
CN113987183A (en) Power grid fault handling plan auxiliary decision-making method based on data driving
Shan et al. Robust encoder-decoder learning framework towards offline handwritten mathematical expression recognition based on multi-scale deep neural network
CN114897167A (en) Method and device for constructing knowledge graph in biological field
CN114328939B (en) Natural language processing model construction method based on big data
CN113360582A (en) Relation classification method and system based on BERT model fusion multi-element entity information
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN113032573B (en) Large-scale text classification method and system combining topic semantics and TF-IDF algorithm
Wang et al. Tibetan word segmentation method based on bilstm_ crf model
CN111507103B (en) Self-training neural network word segmentation model using partial label set
Kong et al. Lena: Locality-expanded neural embedding for knowledge base completion
CN115546801A (en) Method for extracting paper image data features of test document
CN112270185A (en) Text representation method based on topic model
CN117891958B (en) Standard data processing method based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 12 / F, building 1, Aosheng building, 1166 Xinluo street, hi tech Zone, Jinan City, Shandong Province

Applicant after: Zhongyang Health Technology Group Co.,Ltd.

Address before: 12 / F, building 1, Aosheng building, 1166 Xinluo street, hi tech Zone, Jinan City, Shandong Province

Applicant before: SHANDONG MSUNHEALTH TECHNOLOGY GROUP Co.,Ltd.

GR01 Patent grant