Disclosure of Invention
The invention addresses the urgent need for a method that simplifies event materials, improves the office efficiency of workers, and streamlines the process of public affairs.
The invention provides a method for discovering repeated materials of a theme integration service, which comprises the following steps:
acquiring a material name and text information of a file material of the theme integration service;
extracting global semantic information characteristics of the material name and the text information based on a characteristic extractor;
and processing the global semantic information characteristics according to a logistic regression algorithm, and judging whether the materials are repetitive materials.
Further, the extracting global semantic information features of the material name and the text information based on the feature extractor specifically includes:
removing the material name of the file material, the region name and the special symbol of the text information, and acquiring the processed text information;
adding a flag bit cls to the word segmentation module of a BERT model, splicing the processed text information of two file materials into spliced text information, performing word segmentation with the word segmentation module of the BERT model, and acquiring global semantic information features, wherein the spliced text information comprises the i-th material x_i and the j-th material x_j, the global semantic information features comprise a one-dimensional feature vector h of the spliced text information, and i ≥ 1, j ≥ 1, i ≠ j:
h = BERT([cls] ⊕ x_i ⊕ x_j)[0]
wherein BERT(·) represents the embedded vector encoded by a uniform 12-layer Transformer structure, and [0] represents the first-dimension vector.
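The pre-processing and splicing steps named in this claim can be sketched in Python. The region list, helper names, and token markers below are illustrative assumptions, not part of the disclosure:

```python
import re

# Hypothetical region gazetteer; a real system would load a full list.
REGION_NAMES = ["Beijing", "Shanghai"]

def preprocess(text: str) -> str:
    """Remove region names and special symbols from the material text."""
    for region in REGION_NAMES:
        text = text.replace(region, "")
    text = re.sub(r"[^\w\s]", "", text)      # drop special symbols
    return re.sub(r"\s+", " ", text).strip() # collapse whitespace

def splice(material_i: str, material_j: str) -> str:
    """Prefix the cls flag bit and join the two processed texts."""
    return "[CLS] " + preprocess(material_i) + " [SEP] " + preprocess(material_j)

pair = splice("Beijing ID-card copy!", "ID card copy (Shanghai)")
```

The spliced string would then be tokenized by the BERT word segmentation module; `[SEP]` is the conventional BERT separator and is assumed here, since the disclosure only names the cls flag bit.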
Further, the classifying of the global semantic information features according to a logistic regression algorithm to judge whether the material is a repeated material comprises calculating a text information similarity probability and judging whether the material is a repeated material according to that probability, wherein the text information similarity probability is calculated as:
P = exp(w·h) / (1 + exp(w·h))
wherein exp is the exponential function with the natural constant e as its base, w is the algorithm weight vector, h is the global semantic information feature, and P is the text information similarity probability.
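The similarity probability above is the standard logistic (sigmoid) function of the weighted feature vector. A minimal sketch, with a hypothetical weight vector `w` and feature vector `h`:

```python
import math

def similarity_probability(w, h, b=0.0):
    """Logistic regression: P = 1 / (1 + exp(-(w·h + b))),
    algebraically equal to exp(w·h + b) / (1 + exp(w·h + b))."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

A zero score yields P = 0.5, the natural midpoint between the repeated and non-repeated classes.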
Further, the classification processing of the global semantic information features according to a logistic regression algorithm to determine whether the material is a repetitive material further includes an active learning method, specifically:
setting text information similarity probability thresholds of 0.8 and 0.2;
a material with text information similarity probability greater than or equal to 0.8 is judged a repeated material; a material with probability less than or equal to 0.2 is judged a non-repeated material; materials with probability less than 0.8 and greater than 0.2 are treated as potentially misclassified, and the misclassified materials are retrained.
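The two-threshold decision rule above can be sketched as a small function; the label strings are illustrative:

```python
def classify(p: float) -> str:
    """Apply the 0.8 / 0.2 thresholds to a similarity probability.
    Scores in the open interval (0.2, 0.8) are routed to manual
    labeling and retraining."""
    if p >= 0.8:
        return "duplicate"
    if p <= 0.2:
        return "non-duplicate"
    return "uncertain"
```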
Further, the retraining comprises:
obtaining misclassified materials by a machine learning method;
manually labeling whether the misclassified material data are repeated, and performing fine-tuning learning on the misclassified data again, which specifically comprises:
adjusting the pre-training weights and the logistic regression algorithm, using cross entropy as the loss function L, and updating the weights with the Adam gradient descent method:
L = −[y log ŷ + (1 − y) log(1 − ŷ)]
wherein y is the manually labeled indicator of whether the two materials are repeated, ŷ is the classification model's predicted value, and L measures the degree of difference between the model's predicted value and the actual value.
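The cross-entropy loss and a single Adam weight update can be sketched from scratch as follows; this is an illustration of the update rule, not the patent's implementation:

```python
import math

def cross_entropy(y: int, p: float, eps: float = 1e-12) -> float:
    """L = -[y log(p) + (1 - y) log(1 - p)], clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def adam_step(w, g, m, v, t, lr=2e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for weight vector w given gradient g.
    m, v are the running first/second moment estimates; t is the step count."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]          # bias correction
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

The bias-corrected step divides by the root of the second-moment estimate, which is why the update size does not scale with the raw gradient magnitude.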
Further, the method further comprises:
the materials constituting the theme integration service files are combined to form a union X = {x_1, x_2, x_3, ..., x_n}, wherein X represents the union of the materials of the theme integration service files and n is the total number of materials;
randomly selecting two materials from the n materials, combining them into a group, and judging whether the two materials are repeated with f(x_i, x_j) ∈ {0, 1}, wherein 0 represents that the two materials are not repeated and 1 represents that they are repeated;
extracting all material combinations with f = 1 for output, and deleting either material x_i or x_j from each such combination, wherein x_i represents the i-th material and x_j represents the j-th material.
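The pairwise traversal and deletion step can be sketched as follows, with a caller-supplied duplicate judgment standing in for f:

```python
from itertools import combinations

def deduplicate(materials, is_duplicate):
    """Check every unordered pair (x_i, x_j); when f = 1 (duplicate),
    delete one member of the pair (x_j is chosen here; the claim allows
    either)."""
    removed = set()
    for xi, xj in combinations(materials, 2):
        if xi in removed or xj in removed:
            continue  # already deleted in an earlier pair
        if is_duplicate(xi, xj):
            removed.add(xj)
    return [x for x in materials if x not in removed]
```

The traversal visits n(n−1)/2 pairs, matching the exhaustive grouping described above.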
The present invention also provides a subject integration service repeated material discovery system, the system comprising:
a material name and text information acquisition unit, used for acquiring the material name and text information of the file material of the theme integration service;
the global semantic information feature acquisition unit is used for extracting global semantic information features of the material names and the text information based on the feature extractor;
and a repeated material judging unit, used for classifying the global semantic information features according to a logistic regression algorithm and judging whether the material is a repeated material.
Further, the global semantic information feature obtaining unit includes:
the processed text information acquisition module is used for removing the material name of the file material, the region name and the special symbol of the text information and acquiring the processed text information;
a global semantic information feature acquisition module, used for adding a flag bit cls to the word segmentation module of the BERT model, splicing the processed text information of the two file materials into spliced text information, performing word segmentation with the word segmentation module of the BERT model, and acquiring global semantic information features, wherein the spliced text information comprises the i-th material x_i and the j-th material x_j, the global semantic information features comprise a one-dimensional feature vector h of the spliced text information, and i ≥ 1, j ≥ 1, i ≠ j:
h = BERT([cls] ⊕ x_i ⊕ x_j)[0]
wherein BERT(·) represents the embedded vector encoded by a uniform 12-layer Transformer structure, and [0] represents the first-dimension vector.
The present invention also provides a computer device comprising a memory in which a computer program is stored and a processor which, when running the computer program stored in the memory, executes any one of the above methods for discovering repeated materials of a theme integration service.
The present invention also provides a computer-readable storage medium for storing a computer program for executing the above-mentioned method for discovering repetitive materials of a theme integration service.
The invention has the advantages that:
the invention solves the problems of urgently needing a method for simplifying event materials, improving the office efficiency of workers and simplifying the process of public affairs.
1. The method for discovering repeated materials of a theme integration service provided by the invention reduces manual matching and checking time, improving working efficiency while reducing human-resource cost; it reduces the error rate of repeated-material evaluation by means of a threshold; and it further forms a combined list of matters and materials, simplifying the process of public affairs.
2. A BERT model extracts the feature vector of each material, and a logistic regression model calculates the repetition probability for classification. An active learning mechanism based on TONE is introduced, reducing the demand for manually labeled data. With this method, all theme integration service materials can be intelligently examined, repeated materials can be found, and service quality can be improved.
3. A feature extractor extracts semantic feature vectors from the material names and text contents of the theme integration service materials and sends them to a classifier to judge whether the materials are repeated. A decision threshold θ determines whether a classification score is reliably correct: if so, the result is output; if not, a professional evaluates and labels the material pair to output labeled data, while the trainer re-learns the data of that material group using the TONE mechanism and adjusts the model to improve classification accuracy.
4. The BERT model of the feature extractor expresses text information accurately, and logistic regression directly accepts the feature extractor's output. Used together as a whole with the TONE and active learning mechanisms for fine-tuning learning, the combination trains quickly, resists noise well, and rapidly improves model accuracy from a small number of samples, so the model can be applied to practical tasks while reducing labor cost.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention.
In a first embodiment, the method for discovering repeated materials of a theme integration service in the first embodiment includes:
acquiring a material name and text information of a file material of the theme integration service;
extracting global semantic information characteristics of the material name and the text information based on the characteristic extractor;
and processing the global semantic information characteristics according to a logistic regression algorithm, and judging whether the materials are repetitive materials.
Specifically, in this embodiment, for each material of the theme integration service, the material name and corresponding text information of the material are extracted. The feature extractor extracts the feature vector of the material, and the logistic regression algorithm processes the global semantic information features for classification, obtaining a text information similarity probability and judging whether the material is a repeated material according to that probability.
In this embodiment, the feature extractor uses a BERT model, calculates Self-Attention to obtain information representation of sentence level by adjusting weights of the pre-trained BERT model, and captures context semantic information in a specific environment by using Fine-tune (transfer learning).
In a second embodiment, the method for discovering repeated materials of a topic integration service according to the first embodiment is further limited, where the extracting global semantic information features of material names and text information based on the feature extractor specifically includes:
removing the material name of the file material and the region name and the special symbol of the text information to obtain the processed text information;
adding a flag bit cls to the word segmentation module of a BERT model, splicing the processed text information of two file materials into spliced text information, performing word segmentation with the word segmentation module of the BERT model, and acquiring global semantic information features, wherein the spliced text information comprises the i-th material x_i and the j-th material x_j, the global semantic information features comprise a one-dimensional feature vector h of the spliced text information, and i ≥ 1, j ≥ 1, i ≠ j:
h = BERT([cls] ⊕ x_i ⊕ x_j)[0]
wherein BERT(·) represents the embedded vector encoded by a uniform 12-layer Transformer structure, and [0] represents the first-dimension vector.
In practical application, text features are main factors influencing the classification effect, and particularly in the judgment of the situation that the material names are inconsistent but the same material, the text semantics of the text features need to be analyzed and the features need to be extracted.
Specifically, the material names and original text information of the two file materials are preprocessed (region names and special symbols are removed), and the processed text information of the two file materials is spliced into spliced text information. Word segmentation (token segmentation) is performed on the spliced text with the BERT dictionary, and the embedding layer of the BERT model converts the text information into an embedded vector E with a fixed 768 dimensions. Before conversion, a flag bit cls is added before all tokens; this flag bit occupies the first dimension of the vector and is used to compute the semantic representation of the global information. The BERT model output is obtained by stacking 12 layers of uniform Transformer-encoded embedded vectors, where the single-layer Transformer encoding is calculated as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
wherein MultiHead represents the multi-head attention mechanism, Concat represents splicing, Attention represents the self-attention matrix, Q, K and V represent the three matrices for calculating self-attention, head_i represents the i-th self-attention matrix, W_i^Q, W_i^K and W_i^V represent the linear mappings of the i-th input vector, d_k represents the dimension of K, W^O represents the parameter matrix in the linear mapping, W is a parameter matrix, and d represents the matrix dimension. Three linear mappings are applied to the input matrix to obtain the three state matrices used to calculate self-attention.
Full-text semantic information is obtained by calculating self-attention; residual connection and layer normalization accelerate convergence, and the output serves as the input encoding vector of the next Transformer layer. Stacking multiple encoding layers mines deeper information and improves accuracy. In this embodiment, the cls vector of the last encoder layer is taken as the first-dimension vector of the output vector, and the final global semantic information feature h is output:
h = E_out[0]
where E_out is the output of the last (12th) encoder layer. The cls vector (the first dimension of the output vector) of the last encoder layer is used as the final global semantic information feature because, after 12 layers of encoding, it most fully aggregates the global semantics.
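The single-layer attention computation described above can be sketched with NumPy. Head counts and dimensions below are illustrative toy sizes, not the 12-layer, 768-dimensional BERT configuration:

```python
import numpy as np

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    """MultiHead = Concat(head_1..head_h) W^O, where
    head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)."""
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo
```

In a full Transformer layer this would be followed by the residual connection, layer normalization, and feed-forward sublayer mentioned above, which are omitted here.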
In a third embodiment, the method for discovering repeated materials of the theme integration service according to the second embodiment is further limited: the global semantic information features are classified according to a logistic regression algorithm to judge whether the material is a repeated material, which comprises calculating a text information similarity probability and judging whether the material is a repeated material according to that probability, wherein the text information similarity probability is calculated as:
P = exp(w·h) / (1 + exp(w·h))
wherein exp is the exponential function with the natural constant e as its base, w is the algorithm weight vector, h is the global semantic information feature, and P is the text information similarity probability.
In practical application, a logistic regression algorithm, a support vector machine, PCA, LDA, a neural network, or the like can be adopted for material classification.
Classification models based on supervised learning include Bayesian models, logistic regression, decision trees, support vector machines, neural networks, and other algorithms. The neural network is a common classification algorithm; the current task uses a pre-trained model based on self-attention, which obtains sentence-level information representations through the self-attention mechanism, and after fine-tuning the BERT model can accurately express text information. Therefore, only one simple neural network layer is needed to classify well; that structure is equivalent to logistic regression, and logistic regression saves processing time compared with a neural network. Because the logistic regression model is simple, training is short and efficient, a large number of features can be processed, the output is a value between 0 and 1 with a probabilistic interpretation, a custom classification threshold can conveniently be determined according to the two kinds of classification losses, and noise resistance is strong. Compared with Bayesian models, decision trees, support vector machines, and the like, the logistic regression model can also be combined with a pre-training and fine-tuning framework. In summary, this embodiment selects a logistic regression algorithm to classify the materials.
In a fourth embodiment, the method for discovering the repeated materials of the theme integration service according to the third embodiment is further limited, the global semantic information features are classified according to a logistic regression algorithm, and whether the repeated materials are the repeated materials is judged, and the method further includes an active learning method, specifically:
setting a text information similarity probability threshold, wherein the text information similarity probability threshold comprises 0.8 and 0.2;
a material with text information similarity probability greater than or equal to 0.8 is judged a repeated material; a material with probability less than or equal to 0.2 is judged a non-repeated material; materials with probability less than 0.8 and greater than 0.2 are treated as potentially misclassified, and the misclassified materials are retrained.
In this embodiment, a threshold of 0.5 would normally be set as the system's judgment standard: a similarity score greater than 0.5 defaults to repeated material, and a score less than 0.5 to non-repeated material. However, because of insufficient system training, incomplete feature extraction, and similar problems, preliminary manual verification found that only materials with an actual similarity value greater than 0.8 could reliably be treated as repeated, so an absolute margin of 0.3 is added around 0.5 and the thresholds become 0.8 and 0.2.
In a fifth embodiment, the method for discovering repetitive materials of a theme integration service according to the first embodiment is further defined, wherein the retraining comprises:
obtaining misclassified materials by a machine learning method;
manually labeling whether the misclassified material data are repeated, and performing fine-tuning learning on the misclassified material data again, which specifically comprises:
adjusting the pre-training weights and the logistic regression algorithm, using cross entropy as the loss function L, and updating the weights with the Adam gradient descent method:
L = −[y log ŷ + (1 − y) log(1 − ŷ)]
wherein y is the manually labeled indicator of whether the two materials are repeated, ŷ is the classification model's predicted value, and L measures the degree of difference between the model's predicted value and the actual value.
In practical application, the number of manually labeled data sets available in the early stage is still too small for deep network model training. To reduce subsequent labor cost and make use of the early manually labeled data, the TONE and active learning methods are adopted, and their performance is checked and improved in practice. TONE stands for Train On or Near Error: a score boundary is preset, and when a material's score falls within the boundary, fine-tuning learning is performed again even if the judgment is correct. The general idea of active learning is to manually label the misclassified material data, specifically: sample data that are difficult to classify are obtained by a machine learning method, manually confirmed and audited again, and then the manually labeled data are trained again with a supervised or semi-supervised learning model, so that the model's effect gradually improves and human experience is integrated into the machine learning model. When the method predicts a score, a boundary θ is preset so that materials outside the boundary can be predicted accurately, giving more accurate repeated-material judgments; for the uncertain materials inside the boundary, a professional labels whether they are repeated, the text contents and labels of the two materials are learned, and the TONE method is used during learning, so that training is still performed inside the boundary even when the judgment is correct, improving accuracy.
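The TONE-style routing of pairs "on or near error" to relabeling can be sketched as follows; the boundary values mirror the thresholds used elsewhere in this disclosure, and the data shapes are illustrative:

```python
def select_for_relabeling(scored_pairs, low=0.2, high=0.8):
    """Split (pair, probability) tuples into confidently classified pairs
    and uncertain pairs inside the (low, high) boundary. Uncertain pairs
    go to a professional for labeling and are used for fine-tuning again,
    even when the current prediction happens to be correct (TONE)."""
    confident, uncertain = [], []
    for pair, p in scored_pairs:
        (uncertain if low < p < high else confident).append((pair, p))
    return confident, uncertain
```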
Sixth, the present embodiment is further limited to the method for discovering repetitive materials of a theme integration service according to the first embodiment, and the method further includes:
the materials constituting the theme integration service files are combined to form a union X = {x_1, x_2, x_3, ..., x_n}, wherein X represents the union of the materials of the theme integration service files and n is the total number of materials;
two different materials are selected from the n materials and combined into a group, and whether the two materials are repeated is judged with f(x_i, x_j) ∈ {0, 1}, wherein 0 represents that the two materials are not repeated and 1 represents that they are repeated;
all material combinations with f = 1 are extracted for output, and either material x_i or x_j is deleted from each combination, x_i representing the i-th material and x_j the j-th material.
Specifically, the repeated materials in all material combinations are extracted for output, and either material x_i or x_j in each combination is deleted, ensuring that the theme integration service has no repeated material.
Seventh, the subject integration service repeated material discovery system according to the present embodiment, includes:
a material name and text information acquisition unit for acquiring a material name and text information of a document material of the subject integration service;
the global semantic information characteristic acquisition unit is used for extracting the global semantic information characteristics of the material names and the text information based on the characteristic extractor;
and a repeated material judging unit, used for classifying the global semantic information features according to a logistic regression algorithm and judging whether the material is a repeated material.
An eighth embodiment is the theme integration service repeated material discovery system according to the seventh embodiment, wherein the global semantic information feature acquisition unit comprises:
the processed text information acquisition module is used for removing the material name of the file material, the region name and the special symbol of the text information and acquiring the processed text information;
a global semantic information feature acquisition module, used for adding a flag bit cls to the word segmentation module of the BERT model, splicing the processed text information of the two file materials into spliced text information, performing word segmentation with the word segmentation module of the BERT model, and acquiring global semantic information features, wherein the spliced text information comprises the i-th material x_i and the j-th material x_j, the global semantic information features comprise a one-dimensional feature vector h of the spliced text information, and i ≥ 1, j ≥ 1, i ≠ j:
h = BERT([cls] ⊕ x_i ⊕ x_j)[0]
wherein BERT(·) represents the embedded vector encoded by a uniform 12-layer Transformer structure, and [0] represents the first-dimension vector.
Ninth embodiment, a computer device according to the present embodiment, includes a memory in which a computer program is stored, and a processor, and when the processor executes the computer program stored in the memory, the processor executes the method for discovering repetitive materials of the theme integration service according to any one of the first to sixth embodiments.
Embodiment ten is a computer-readable storage medium according to this embodiment, which is used to store a computer program that executes the method for discovering repetitive materials of a theme integration service according to any one of embodiment one to embodiment six.
The eleventh embodiment is described with reference to fig. 1, and this embodiment provides a specific embodiment to the method for discovering repetitive materials of a subject integration service described in the first embodiment, and is used to explain embodiments one to six, specifically:
acquiring a material name and text information of a file material of the theme integration service;
extracting global semantic information characteristics of the material name and the text information based on a characteristic extractor;
and processing the global semantic information characteristics according to a logistic regression algorithm, and judging whether the material is a repeated material.
The material name and original text information are preprocessed (region names and special symbols are removed), word segmentation (token segmentation) is performed with the BERT dictionary, the embedding layer of the BERT model converts the text information into an embedded vector E with a fixed 768 dimensions, a flag bit cls is added before all tokens before conversion to compute the global information, and the BERT model output is obtained by stacking 12 layers of uniform Transformer-encoded embedded vectors, where the single-layer Transformer encoding is calculated as follows:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
wherein MultiHead represents the multi-head attention mechanism, Concat represents splicing, Attention represents the self-attention matrix, Q, K and V represent the three matrices for calculating self-attention, head_i represents the i-th self-attention matrix, W_i^Q, W_i^K and W_i^V represent the linear mappings of the i-th input vector, d_k represents the dimension of K, W^O represents the parameter matrix in the linear mapping, W is a parameter matrix, and d represents the matrix dimension. Three linear mappings are applied to the input matrix to obtain the three state matrices used to calculate self-attention.
Full-text semantic information is obtained by calculating self-attention; residual connection and layer normalization accelerate convergence, and the output serves as the input encoding vector of the next Transformer layer. Stacking multiple encoding layers mines deeper information and improves accuracy. Because the cls token added in advance does not belong to the input text, it attends to all tokens during attention calculation, so the cls vector (the first dimension of the output vector) of the last encoder layer is taken as the final global semantic information feature h:
h = E_out[0]
where E_out is the output of the last (12th) encoder layer.
the embodiment adopts a logic classification algorithm to calculate the text information similarity probability:
wherein exp is an exponential function with a natural constant e as the base,
and P is a text information similarity probability.
The advantages of the logistic regression algorithm are a simple structure and strong interpretability: the influence of each feature on the final result can be read from its weight. Training is fast, the computational load during classification depends only on the number of features, accuracy depends only on the quality of the features, and few computing resources are occupied. The output is a probability, so the result and its classification threshold are easy to adjust.
The number of manually labeled data sets available in the early stage is still too small for deep network model training. To reduce subsequent labor cost and make use of the early manually labeled data, the TONE and active learning methods are adopted, and their performance is tested and improved in practice. TONE stands for Train On or Near Error: a score boundary is preset, and when a material's score falls within the boundary, fine-tuning learning is performed again even if the judgment is correct. The general idea of active learning is to obtain sample data that are hard to classify by a machine learning method and label them, with manual reconfirmation and auditing; the manually labeled data are then trained again with a supervised or semi-supervised learning model, so that the model's effect gradually improves and human experience is integrated into the machine learning model. When the method predicts a score, a boundary θ is preset so that accurate predictions are obtained outside the boundary, giving more accurate repeated-material judgments; when a case is uncertain inside the boundary, whether it is a repeated material is labeled manually, the text contents and labels of the two materials are learned, and the TONE method is used during learning, so that training is still performed inside the boundary even when the judgment is correct, improving accuracy.
In the present embodiment, the thresholds are 0.8 and 0.2. A material with text information similarity probability greater than or equal to 0.8 is a repeated material; a material with probability less than or equal to 0.2 is a non-repeated material; materials with probability less than 0.8 and greater than 0.2 are treated as misclassified, and the misclassified materials are retrained.
In practice, because the data set is not large enough, training a neural network from scratch to obtain the desired high-level feature representation would take much time for mediocre results. A publicly released pre-trained model is therefore combined with a classifier (pre-training uses a large amount of unlabeled corpus for two text-reconstruction tasks, masked language modeling (MLM) and next-sentence-prediction (NSP), giving the model strong representational capability); the classifier is implemented with a logistic regression algorithm, and fine-tuning is trained on the material data set (fine adjustment for the specific downstream task on top of the trained feature extractor).
The retraining comprises:
obtaining misclassified materials by a machine learning method;
manually labeling whether the misclassified material data are repeated, and performing fine-tuning learning on the misclassified material data again, which specifically comprises:
after the classifier loads the feature extractor weights, the pre-trained weights and the logistic regression algorithm are fine-tuned, with the cross entropy used as the loss function L, in the standard binary form L = -[y log(p) + (1 - y) log(1 - p)], where y is the manual label (1 for repeated, 0 for non-repeated) and p is the predicted similarity probability;
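The binary cross-entropy loss named above can be sketched as follows (a minimal scalar version; the clipping constant is a common numerical-safety choice, not from the specification):

```python
import math

def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy L = -[y*log(p) + (1-y)*log(1-p)],
    with p clipped away from 0 and 1 to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
```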
the weight update uses the Adam gradient descent method.
In the training and learning process, the number of update epochs is set to 3, the learning rate is 2e-5 (scientific notation for 0.00002), the exponential decay rate of the first-moment estimate is 0.9, and the exponential decay rate of the second-moment estimate is 0.999. Adam optimization ensures that the parameter update size does not change with the scaling of the gradient magnitude, so the optimal solution can be found more stably.
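A pure-Python sketch of a single Adam update using the hyperparameters stated above (lr = 2e-5, first-moment decay 0.9, second-moment decay 0.999); this is the standard Adam step for one scalar weight, not code from the specification:

```python
import math

def adam_step(w, g, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for scalar weight w with gradient g.
    m, v are the running first/second moment estimates; t is the
    1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

Because the update divides by the square root of the second-moment estimate, scaling every gradient by a constant leaves the first step's size essentially unchanged, which is the scale-invariance property mentioned above.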
The method also comprises the following step: merging the materials required to be provided by all matters forming the theme integration service, representing the union of the theme integration service file materials, wherein n is the total number of materials:
{x_1, x_2, x_3, ..., x_n};
the object of the subject integrated repeat material discovery method is to examine the repeat material in all materials, i.e., to traverse all materials, select two different materials from n materials each time to combine into a group, and determine whether two materials are repeated:
wherein 0 means two materials are not repeated, 1 means material is repeated, all the materials with f =1 are combined, and any one part of the materials (x) is combined i Or x j ) Abandon, ensure theme integration services no duplicate material.
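The pairwise traversal and discard step can be sketched as follows (the `is_repeated` predicate stands in for the trained classifier's f judgment; all names are illustrative):

```python
from itertools import combinations

def deduplicate(materials, is_repeated):
    """Traverse all C(n, 2) pairs; whenever a pair is judged repeated
    (f = 1), discard one member (here x_j) so that no duplicate
    materials remain."""
    removed = set()
    for i, j in combinations(range(len(materials)), 2):
        if i in removed or j in removed:
            continue  # skip pairs containing an already-discarded material
        if is_repeated(materials[i], materials[j]):
            removed.add(j)  # keep x_i, abandon x_j
    return [m for k, m in enumerate(materials) if k not in removed]
```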
The training process dynamically schedules the actual learning rate; the specific algorithm is as follows:

material set to be judged: D = {(x_i, x_j)} (1 ≤ i ≤ n, 1 ≤ j ≤ n, i ≠ j);
learning rate η; learning batch T; exponential decay rate of the first-moment estimate β1; exponential decay rate of the second-moment estimate β2;

// initialize classifier weights
// weight matrix: feature extractor weights and classifier weights
Pre_MS(D):
    for each pair (x_i, x_j) in D:
        // compute the material features and predict the similarity probability
        if the score falls within the threshold boundary: request manual labeling and add the pair to T
        else: // below is the classification section
            output the repeated / non-repeated judgment
    end for
Train(T, epoch):   // input the material set to be learned and the number of updates
    compute the loss of the predicted material combinations and update the weights
while True:
    Pre_MS(D)          // predict the material set to be judged
    if len(T) == 0:    // classification is accurate; end the task and output the result
        break
    Train(T, epoch)    // classification-inaccurate results exist; update the weights
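Under the assumption that `predict`, `label` (the manual annotator), and `train` behave as described above, the outer active-learning loop can be sketched as follows (all names are illustrative stand-ins, not from the specification):

```python
def active_learning_loop(D, predict, label, train, epoch=3, max_rounds=100):
    """Repeat: predict every pair; pairs whose similarity probability
    falls in the uncertain band (0.2, 0.8) form the learning set T and
    are manually labeled and trained on; stop once T is empty."""
    for _ in range(max_rounds):
        T = [pair for pair in D if 0.2 < predict(pair) < 0.8]
        if len(T) == 0:
            break  # classification is confident everywhere; output results
        train([(pair, label(pair)) for pair in T], epoch)
    return {pair: predict(pair) for pair in D}
```

In this sketch `train` is expected to update the model so that previously uncertain pairs move out of the (0.2, 0.8) band; `max_rounds` is a safety cap I added so the loop always terminates.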
Wherein, the program makes two judgments. The first judgment determines whether the text similarity probability falls within the threshold range: when the text information similarity probability is less than 0.8 and greater than 0.2, retraining is required, and the second judgment is not performed within this range. The second judgment processes the result of the first, i.e., when the text information similarity probability does not fall within the range of less than 0.8 and greater than 0.2, the file judgment result is output: a text information similarity probability greater than 0.5 indicates a repeated material (e.g., a similarity of 0.8 belongs to the repeated materials), and a text information similarity probability less than 0.5 indicates a non-repeated material (e.g., a similarity of 0.2 does not).
The technical solutions provided by the present invention are further described in detail with reference to the drawings, for the purpose of highlighting advantages and benefits, and are not intended to limit the present invention, and any modifications, combinations of embodiments, improvements, equivalents, etc. based on the spirit of the present invention should be included in the protection scope of the present invention.