CN115357718B - Method, system, device and storage medium for discovering repeated materials of theme integration service

Publication number
CN115357718B
Authority
CN
China
Prior art keywords
text information
materials
repeated
integration service
global semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211282962.1A
Other languages
Chinese (zh)
Other versions
CN115357718A (en
Inventor
齐浩亮
苗晓刚
韩咏
孔蕾蕾
韩中元
曹霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan University
Original Assignee
Foshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan University
Priority to CN202211282962.1A
Publication of CN115357718A
Application granted
Publication of CN115357718B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, system, device and storage medium for discovering repeated materials of a theme integration service, and relates to the field of information processing. The method comprises the following steps: acquiring the material name and text information of a file material of the theme integration service; extracting global semantic information features of the material name and the text information with a feature extractor; and processing the global semantic information features with a logistic regression algorithm to judge whether the materials are repeated materials. The method can improve working efficiency and reduce the error rate of repeated material evaluation.

Description

Method, system, device and storage medium for discovering repeated materials of theme integration service
Technical Field
The present invention relates to the field of information processing, and in particular, to a method, system, device, and storage medium for discovering repeated materials of a theme integration service.
Background
At present, government affair services are handled in a centralized manner that integrates online and offline channels, with the aim of continuously improving the efficiency of government services. The main approach is to gather the affairs handled by multiple departments and to collect and process their materials together, which simplifies the handling flow and accelerates the construction and refinement of the digital government theme integration service system.
However, the material merging stage of the theme integration service system generally relies on the material merging function of an integrated government affairs platform, which removes duplicates only when material names are identical and otherwise offers manual deduplication. This approach works poorly: materials that have different names but are substantially the same, such as 'real estate right certification' and 'management place house property legal certification', cannot be found automatically, while manual deduplication involves a heavy workload and omissions easily occur during long, repetitive inspection. Moreover, each single item has its own responsible department and handling materials, and these include many materials whose names and contents do not correspond one to one, such as the applicant's identification material, identity card or passport. Comparing material names alone is therefore unreliable. The theme service also spans multiple levels of government services and numerous events, with on average close to a hundred materials per event, so the comparison is time-consuming and laborious and repeated materials are easily missed.
Therefore, there is an urgent need for a method that simplifies event materials, improves the working efficiency of staff and simplifies the handling of public affairs.
Disclosure of Invention
The invention addresses the urgent need for a method that simplifies event materials, improves the office efficiency of staff and simplifies the handling of public affairs.
The invention provides a method for discovering repeated materials of a theme integration service, which comprises the following steps:
acquiring a material name and text information of a file material of the theme integration service;
extracting global semantic information features of the material name and the text information based on a feature extractor;
and processing the global semantic information features according to a logistic regression algorithm, and judging whether the materials are repeated materials.
Further, extracting the global semantic information features of the material name and the text information based on the feature extractor specifically includes:
removing region names and special symbols from the material name and the text information of the file material to obtain processed text information;
adding a flag bit cls in the word segmentation module of a BERT model, splicing the processed text information of two file materials into spliced text information, performing word segmentation on the spliced text information with the word segmentation module of the BERT model, and acquiring the global semantic information features, wherein the spliced text information comprises the ith material $x_i$ and the jth material $x_j$, and the global semantic information features comprise the one-dimensional feature vector $h_{ij}$ of the spliced text information, with $i \geq 1$ and $j \neq i$:
$$h_{ij} = \mathrm{Transformer}_{12}(x_i, x_j)[0],$$
wherein $\mathrm{Transformer}_{12}(\cdot)$ represents the embedded vector encoded by 12 identical Transformer layers, and $[0]$ represents the first-dimension vector.
Further, classifying the global semantic information features according to the logistic regression algorithm to judge whether the materials are repeated materials includes calculating a text information similarity probability and judging from it whether the materials are repeated, wherein the text information similarity probability is calculated as:
$$P = \frac{1}{1 + \exp\left(-w^{\top} h_{ij}\right)}$$
wherein exp is the exponential function with the natural constant e as its base, $w$ is the algorithm weight vector, $h_{ij}$ is the global semantic information feature vector of the spliced text information, and $P$ is the text information similarity probability.
Further, classifying the global semantic information features according to the logistic regression algorithm to judge whether the materials are repeated materials further includes an active learning method, specifically:
setting text information similarity probability thresholds of 0.8 and 0.2;
when the text information similarity probability is greater than or equal to 0.8, the materials are repeated materials; when the text information similarity probability is less than or equal to 0.2, the materials are non-repeated materials; and materials whose text information similarity probability is smaller than 0.8 and larger than 0.2 are treated as misclassified, and the misclassified materials are used for retraining.
Further, the retraining comprises:
obtaining the misclassified materials by a machine learning method;
manually labeling whether each pair of misclassified materials is repeated, and performing fine-tuning learning on the misclassified data again, specifically:
adjusting the pre-training weights and the logistic regression algorithm, using cross entropy as the loss function L, and updating the weights with the Adam gradient descent method:
$$L = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$
wherein $y$ is the manually assigned label indicating whether the two materials are repeated, $\hat{y}$ is the predicted value of the classification model, and $L$ measures the degree of difference between the model's predicted value and the actual value.
Further, the method further comprises:
combining the materials constituting the theme integration service file into a union $X = \{x_1, x_2, \ldots, x_n\}$, where $X$ represents the union of the materials of the theme integration service file and n is the total number of materials;
randomly selecting two materials from the n materials to form a group, and judging whether the two materials are repeated:
$$f(x_i, x_j) \in \{0, 1\}$$
wherein 0 represents that the two materials are not repeated and 1 represents that they are repeated;
extracting all material combinations with $f = 1$ for output, and deleting either material $x_i$ or $x_j$ in each combination, where $x_i$ represents the ith material and $x_j$ represents the jth material.
The present invention also provides a theme integration service repeated material discovery system, the system comprising:
a material name and text information acquisition unit, used for acquiring the material name and text information of the file material of the theme integration service;
a global semantic information feature acquisition unit, used for extracting global semantic information features of the material name and the text information based on the feature extractor;
and a repeated material judging unit, used for classifying the global semantic information features according to a logistic regression algorithm and judging whether the materials are repeated materials.
Further, the global semantic information feature acquisition unit includes:
a processed text information acquisition module, used for removing region names and special symbols from the material name and the text information of the file material to obtain processed text information;
a global semantic information feature acquisition module, used for adding a flag bit cls in the word segmentation module of the BERT model, splicing the processed text information of two file materials into spliced text information, performing word segmentation on the spliced text information with the word segmentation module of the BERT model, and acquiring the global semantic information features, wherein the spliced text information comprises the ith material $x_i$ and the jth material $x_j$, and the global semantic information features comprise the one-dimensional feature vector $h_{ij}$ of the spliced text information, with $i \geq 1$ and $j \neq i$:
$$h_{ij} = \mathrm{Transformer}_{12}(x_i, x_j)[0],$$
wherein $\mathrm{Transformer}_{12}(\cdot)$ represents the embedded vector encoded by 12 identical Transformer layers, and $[0]$ represents the first-dimension vector.
The present invention also provides a computer device comprising a memory having a computer program stored therein and a processor that executes the subject integration service repeated material discovery method according to any one of the above when the processor runs the computer program stored in the memory.
The present invention also provides a computer-readable storage medium for storing a computer program for executing the above-mentioned method for discovering repetitive materials of a theme integration service.
The invention has the advantages that:
the invention addresses the urgent need for a method that simplifies event materials, improves the office efficiency of staff and simplifies the handling of public affairs.
1. The repeated material discovery method for the theme integration service provided by the invention reduces manual matching and checking time, improves working efficiency while saving human resources, and reduces the error rate of repeated material evaluation by means of thresholds. The method further produces a list of joint matters and materials, which simplifies the handling of public affairs.
2. A feature vector of each material is extracted with a BERT model, and a repetition probability is calculated by a logistic regression model for classification. An active learning mechanism based on TONE is introduced, which reduces the amount of manually labeled data required. With this method, all theme integration service materials can be checked automatically, repeated materials can be found, and service quality can be improved.
3. A feature extractor extracts semantic feature vectors from the material names and text contents of the theme integration service materials, and the vectors are fed to a classifier that judges whether the materials are repeated. A decision threshold θ determines whether the classification score counts as correctly classified. If it does, the result is output. If it does not, a professional evaluates and labels the material pair, the labeled data is output, and a trainer re-learns the data of that material group using the TONE mechanism, adjusting the model to improve classification accuracy.
4. The BERT model used as the feature extractor represents text information accurately, and logistic regression accepts the feature extractor's output as its input. The two are fine-tuned as a whole together with the TONE and active learning mechanisms. Training time is short, noise resistance is strong, and model accuracy can be improved quickly from a small number of samples, so the model can be applied to practical tasks and labor cost is reduced.
Drawings
Fig. 1 is a schematic diagram illustrating an operation of a repeated material discovery system of a theme integration service according to an eleventh embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention.
In a first embodiment, the method for discovering repeated materials of a theme integration service includes:
acquiring a material name and text information of a file material of the theme integration service;
extracting global semantic information features of the material name and the text information based on the feature extractor;
and processing the global semantic information features according to a logistic regression algorithm, and judging whether the materials are repeated materials.
Specifically, in this embodiment, the material name and the corresponding text information are extracted for each material of the theme integration service. A feature extractor extracts the feature vectors of the materials, and a logistic regression algorithm calculates the repetition probability for classification: the logistic regression algorithm processes the global semantic information features to obtain a text information similarity probability, and whether the materials are repeated is judged according to that probability.
In this embodiment, the feature extractor uses a BERT model. By adjusting the weights of the pre-trained BERT model, self-attention is computed to obtain a sentence-level information representation, and fine-tuning (transfer learning) is used to capture contextual semantic information in the specific domain.
In a second embodiment, the method for discovering repeated materials of a theme integration service according to the first embodiment is further limited, where extracting the global semantic information features of the material name and the text information based on the feature extractor specifically includes:
removing region names and special symbols from the material name and the text information of the file material to obtain processed text information;
adding a flag bit cls in the word segmentation module of a BERT model, splicing the processed text information of two file materials into spliced text information, performing word segmentation on the spliced text information with the word segmentation module of the BERT model, and acquiring the global semantic information features, wherein the spliced text information comprises the ith material $x_i$ and the jth material $x_j$, and the global semantic information features comprise the one-dimensional feature vector $h_{ij}$ of the spliced text information, with $i \geq 1$ and $j \neq i$:
$$h_{ij} = \mathrm{Transformer}_{12}(x_i, x_j)[0],$$
wherein $\mathrm{Transformer}_{12}(\cdot)$ represents the embedded vector encoded by 12 identical Transformer layers, and $[0]$ represents the first-dimension vector.
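The removal and splicing steps of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the region list `REGION_NAMES`, the function names `clean_text` and `splice_pair`, and the literal `[CLS]` and `[SEP]` markers are hypothetical, since the text only states that a flag bit cls is added and that the two processed texts are spliced.

```python
import re

# Hypothetical region names to strip; the source does not list them.
REGION_NAMES = ("Foshan City", "Guangdong Province")


def clean_text(text: str, regions=REGION_NAMES) -> str:
    """Remove region names and special symbols, then collapse whitespace."""
    for region in regions:
        text = text.replace(region, " ")
    # Keep word characters (letters, digits, underscore, CJK) and whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def splice_pair(name_i: str, body_i: str, name_j: str, body_j: str) -> str:
    """Join two cleaned materials into one sequence led by a cls marker."""
    left = clean_text(f"{name_i} {body_i}")
    right = clean_text(f"{name_j} {body_j}")
    # The marker strings are assumptions that mimic common BERT conventions.
    return f"[CLS] {left} [SEP] {right}"
```

In a real pipeline the spliced string would then go to the BERT word segmentation module; that step is left out here because it depends on the chosen tokenizer.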
In practical application, text features are the main factor affecting classification quality. In particular, judging materials whose names differ but which are the same material requires analyzing the text semantics and extracting features.
Specifically, the material names and original text information of two file materials are preprocessed (region names and special symbols are removed), and the processed text information of the two file materials is spliced into spliced text information. The spliced text is segmented into tokens using the BERT dictionary, and the embedding layer of the BERT model converts the tokens into embedded vectors E of fixed dimension 768. Before conversion, a flag bit cls is added in front of all tokens; it is the first-dimension vector and is used to compute the semantic representation of the global information. The BERT model output is obtained by stacking 12 identical Transformer encoding layers, where a single Transformer encoding layer is computed as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$
$$\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
wherein MultiHead represents the multi-head attention mechanism, Concat represents splicing, Attention represents the self-attention matrix, Q, K and V represent the three matrices used to compute self-attention, $\mathrm{head}_i$ represents the ith self-attention matrix, $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ each represent a mapping of the ith input vector, $d_k$ represents the dimension of K, $W^{O}$ represents the parameter matrix in the linear mapping, W is a parameter matrix, and d represents the matrix dimension. Three linear mappings of the input matrix yield the three state matrices Q, K and V used to compute self-attention.
Full-text semantic information is obtained by computing self-attention; residual connections and layer normalization accelerate convergence, and the output serves as the input encoding vector of the next Transformer layer. Stacking multiple encoding layers mines information in depth and improves accuracy. In this embodiment, the cls vector of the last encoder layer, that is, the first-dimension vector of the output, is taken as the final global semantic information feature $h_{ij}$:
$$h_{ij} = \mathrm{Transformer}_{12}(x_i, x_j)[0]$$
The cls vector of the last encoder layer is used as the final global semantic information feature because the representation is more focused after 12 layers of encoding.
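The self-attention computation described in this embodiment can be illustrated with a small single-head sketch in plain Python. It is an assumption-laden toy rather than a BERT layer: there are no learned projection matrices, no multi-head concatenation, no residual connection and no layer normalization, and the names `softmax`, `attention`, `tokens` and `cls_vector` are introduced only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query row to every key row, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output column at a time.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


# Toy input: three token vectors, the first one standing in for cls.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = attention(tokens, tokens, tokens)
cls_vector = encoded[0]  # the [0] slice used as the global feature
```

Taking row 0 of the output mirrors the `[0]` selection in the text: because the cls position attends to every token, its encoded vector summarizes the whole spliced sequence.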
In a third embodiment, the method for discovering repeated materials of the theme integration service according to the second embodiment is further limited: classifying the global semantic information features according to the logistic regression algorithm to judge whether the materials are repeated materials includes calculating a text information similarity probability and judging from it whether the materials are repeated, where the text information similarity probability is calculated as:
$$P = \frac{1}{1 + \exp\left(-w^{\top} h_{ij}\right)}$$
wherein exp is the exponential function with the natural constant e as its base, $w$ is the algorithm weight vector, $h_{ij}$ is the global semantic information feature vector of the spliced text information, and $P$ is the text information similarity probability.
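As a small worked illustration of the similarity probability, the sketch below evaluates a logistic function on a feature vector. The function name `similarity_probability` and the optional `bias` term are assumptions; the text mentions only a weight vector, so the bias defaults to zero.

```python
import math


def similarity_probability(features, weights, bias=0.0):
    """Logistic regression output P = 1 / (1 + exp(-(w . h + b)))."""
    z = sum(w * h for w, h in zip(weights, features)) + bias
    # Branch on the sign of z so that exp never receives a large positive value.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A zero score maps to a probability of 0.5, and larger positive scores move the probability toward 1, which is what lets thresholds be placed on the output.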
In practical application, a logistic regression algorithm, a support vector machine, PCA, LDA, a neural network and the like can be used for material classification.
Classification models based on supervised learning include Bayesian models, logistic regression, decision trees, support vector machines, neural networks and other algorithms. Neural networks are a common choice. The current task uses a pre-trained model based on self-attention, which obtains sentence-level information representations, and after fine-tuning the BERT model represents text information accurately. A single-layer neural network is therefore enough to classify well. Such a network is structurally equivalent to logistic regression, and logistic regression takes less processing time than a deeper neural network. The logistic regression model is simple, trains quickly, handles a large number of features, and outputs a value between 0 and 1 that can be read as a probability, so classification thresholds can be customized according to the losses of the two kinds of misclassification; it is also robust to noise. Unlike Bayesian models, decision trees and support vector machines, a logistic regression model can be combined with the pre-training and fine-tuning framework. In summary, this embodiment selects a logistic regression algorithm to classify the materials.
In a fourth embodiment, the method for discovering repeated materials of the theme integration service according to the third embodiment is further limited: classifying the global semantic information features according to the logistic regression algorithm to judge whether the materials are repeated materials further includes an active learning method, specifically:
setting text information similarity probability thresholds of 0.8 and 0.2;
when the text information similarity probability is greater than or equal to 0.8, the materials are repeated materials; when the text information similarity probability is less than or equal to 0.2, the materials are non-repeated materials; and materials whose text information similarity probability is smaller than 0.8 and larger than 0.2 are treated as misclassified, and the misclassified materials are used for retraining.
In this embodiment, under the system's normal decision rule the threshold would be set to 0.5: a similarity score greater than 0.5 defaults to repeated material, and a score less than 0.5 defaults to non-repeated material. However, because of insufficient training, incomplete feature extraction and similar problems, earlier manual verification found that materials could be reliably judged as repeated only when the actual similarity value was greater than 0.8. A margin of 0.3 is therefore applied on either side of 0.5, and the thresholds become 0.8 and 0.2.
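The two-threshold rule can be written as a short routing function. This is a sketch only: the function name `triage` and the three string labels are hypothetical, while the 0.8 and 0.2 cut-offs come from the embodiment above.

```python
def triage(probability: float, upper: float = 0.8, lower: float = 0.2) -> str:
    """Route one material pair by its similarity probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= upper:
        return "repeated"
    if probability <= lower:
        return "non-repeated"
    # Scores strictly between the thresholds go to manual labeling and retraining.
    return "needs_review"
```

For example, a pair scored 0.93 is reported as repeated, a pair scored 0.07 is reported as non-repeated, and a pair scored 0.55 is passed to a reviewer.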
In a fifth embodiment, the method for discovering repeated materials of a theme integration service according to the first embodiment is further limited, wherein the retraining comprises:
obtaining the misclassified materials by a machine learning method;
manually labeling whether each pair of misclassified materials is repeated, and performing fine-tuning learning on the misclassified material data again, specifically:
adjusting the pre-training weights and the logistic regression algorithm, using cross entropy as the loss function L, and updating the weights with the Adam gradient descent method:
$$L = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$
wherein $y$ is the manually assigned label indicating whether the two materials are repeated, $\hat{y}$ is the predicted value of the classification model, and $L$ measures the degree of difference between the model's predicted value and the actual value.
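To make the loss and the update rule concrete, the sketch below fits only a logistic regression head with cross entropy and Adam in plain Python. It is a simplified assumption: the embodiment also adjusts the pre-trained BERT weights, which is omitted here, and the function names, learning rate and step count are illustrative choices, not values from the source.

```python
import math


def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross entropy for one pair; eps guards against log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns new weights and both moment estimates."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]  # bias-corrected first moment
    v_hat = [vi / (1 - b2 ** t) for vi in v]  # bias-corrected second moment
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


def train_head(samples, dim, steps=200):
    """Fit logistic weights on (features, label) pairs with Adam."""
    w, m, v = [0.0] * dim, [0.0] * dim, [0.0] * dim
    for t in range(1, steps + 1):
        grad = [0.0] * dim
        for h, y in samples:
            z = sum(wi * hi for wi, hi in zip(w, h))
            y_hat = 1.0 / (1.0 + math.exp(-z))
            # For a logistic output, d(BCE)/dw equals (y_hat - y) * h.
            for k in range(dim):
                grad[k] += (y_hat - y) * h[k] / len(samples)
        w, m, v = adam_step(w, grad, m, v, t)
    return w
```

With two toy samples whose features point in opposite directions, the learned weights take opposite signs and the average loss falls below its starting value of log 2.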
In practical application, the amount of manually labeled data available early on is still small relative to what training a deep network model needs. To reduce later labor cost and make use of the early labeled data, the method draws on TONE and active learning, and its performance is checked and improved in practice. TONE stands for Train On or Near Error: a score limit is preset, and when a material's score falls within the limit, fine-tuning learning is performed again even if the judgment was correct. The general idea of active learning is to manually label the misclassified material data, specifically: sample data that is hard to classify is obtained by a machine learning method and manually confirmed and audited, and the manually labeled data is then used to train a supervised or semi-supervised learning model again, so that the model gradually improves and human experience is incorporated into the machine learning model. When predicting scores, the method presets a limit θ. Materials outside the limit are predicted accurately, giving more reliable repeated material judgments. Materials inside the limit are uncertain, so a professional labels whether they are repeated materials, and the text contents and labels of the two materials are learned. Because the TONE method is used during learning, training is still performed inside the limit even when the method's judgment is correct, which improves accuracy.
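The Train On or Near Error selection can be sketched as a filter over scored pairs. The sketch assumes that every pair already carries a human label, which goes beyond the text (where a professional labels only the uncertain pairs); the function name `select_for_retraining`, the tuple layout and the 0.5 decision cut are likewise illustrative assumptions.

```python
def select_for_retraining(scored_pairs, lower=0.2, upper=0.8, cut=0.5):
    """Pick pairs that are wrong ('on error') or uncertain ('near error').

    scored_pairs: iterable of (pair_id, probability, human_label) tuples,
    where human_label is 1 for repeated and 0 for non-repeated.
    """
    selected = []
    for pair_id, probability, label in scored_pairs:
        predicted = 1 if probability >= cut else 0
        is_error = predicted != label
        # Inside the band the pair is retrained even if the prediction was right.
        is_near = lower < probability < upper
        if is_error or is_near:
            selected.append((pair_id, label))
    return selected
```

Confident and correct pairs are left out of the retraining set, which is how the mechanism keeps the amount of labeled data that must be revisited small.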
In a sixth embodiment, the method for discovering repeated materials of a theme integration service according to the first embodiment is further limited, and the method further includes:
combining the materials constituting the theme integration service file into a union $X = \{x_1, x_2, \ldots, x_n\}$, where $X$ represents the union of the materials of the theme integration service file and n is the total number of materials;
randomly selecting two materials from the n materials to form a group, and judging whether the two materials are repeated:
$$f(x_i, x_j) \in \{0, 1\}$$
wherein 0 represents that the two materials are not repeated and 1 represents that they are repeated;
extracting all material combinations with $f = 1$ for output, and deleting either material $x_i$ or $x_j$ in each combination, where $x_i$ represents the ith material and $x_j$ represents the jth material.
Specifically, the repeated materials among all material combinations are extracted and output, and either material $x_i$ or $x_j$ in each such combination is deleted, ensuring that the theme integration service is free of repeated materials.
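The pairwise check and deletion step can be sketched with `itertools.combinations`. Two points are assumptions rather than details from the source: the sketch enumerates every pair once instead of drawing pairs at random, and it always keeps the first material of a repeated pair (the text allows deleting either one). The judge `is_duplicate` is a caller-supplied stand-in for f.

```python
from itertools import combinations


def deduplicate(materials, is_duplicate):
    """Drop one material from every pair judged to be repeated.

    materials: list of material identifiers x_1 .. x_n
    is_duplicate: callable f(x_i, x_j) returning 0 or 1
    """
    removed = set()
    duplicate_pairs = []
    for x_i, x_j in combinations(materials, 2):
        # A material that is already deleted needs no further comparison.
        if x_i in removed or x_j in removed:
            continue
        if is_duplicate(x_i, x_j) == 1:
            duplicate_pairs.append((x_i, x_j))
            removed.add(x_j)  # keep x_i, delete x_j
    kept = [x for x in materials if x not in removed]
    return kept, duplicate_pairs
```

The function returns both the surviving materials and the list of pairs with f = 1, so the repeated combinations remain available for output and audit.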
In a seventh embodiment, the theme integration service repeated material discovery system includes:
a material name and text information acquisition unit, used for acquiring a material name and text information of a file material of the theme integration service;
a global semantic information feature acquisition unit, used for extracting the global semantic information features of the material name and the text information based on the feature extractor;
and a repeated material judging unit, used for classifying the global semantic information features according to a logistic regression algorithm and judging whether the materials are repeated materials.
Embodiment eight: this embodiment further limits the theme integration service repeated material discovery system according to the seventh embodiment, wherein the global semantic information feature acquisition unit includes:
a processed text information acquisition module for removing region names and special symbols from the material name and the text information of the file material to obtain processed text information;
a global semantic information feature acquisition module for adding a flag bit cls in the word segmentation module of the BERT model, splicing the processed text information of two file materials into spliced text information, and performing word segmentation on the spliced text information with the word segmentation module of the BERT model to obtain the global semantic information features, wherein the spliced text information comprises the i-th material x_i and the j-th material x_j, and the global semantic information features comprise a one-dimensional feature vector of the spliced text information
Figure 497235DEST_PATH_IMAGE010
where i is not less than 1 and j is not equal to 1:
Figure DEST_PATH_IMAGE021
,
wherein,
Figure 634955DEST_PATH_IMAGE003
represents the encoded embedding vector of 12 transformer layers of identical structure, and [0] denotes the first-dimension vector.
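The preprocessing and splicing steps of this module can be illustrated with a short Python sketch. It rests on several assumptions that are not in the patent text: the region-name list is hypothetical, special symbols are taken to mean any character that is neither a word character nor whitespace, and the two materials are separated with the [SEP] marker that BERT conventionally uses for sentence pairs. In practice a BERT tokenizer inserts these markers itself; they are written out here only to make the input layout visible.

```python
import re
from typing import Iterable

# Hypothetical region names; a real deployment would supply its own list.
REGION_NAMES = ("Foshan", "Guangdong")


def preprocess(text: str, region_names: Iterable[str] = REGION_NAMES) -> str:
    """Remove region names and special symbols, then collapse whitespace."""
    for name in region_names:
        text = text.replace(name, " ")
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation and other symbols
    return " ".join(text.split())


def build_pair_input(a: str, b: str) -> str:
    """Splice two preprocessed materials behind a leading [CLS] marker."""
    return f"[CLS] {preprocess(a)} [SEP] {preprocess(b)} [SEP]"


example = build_pair_input("Foshan: ID card (copy)", "Guangdong ID card copy!")
```

After preprocessing, the two example strings become identical, which is the situation the classifier is expected to recognize as a repeated material.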
Embodiment nine: a computer device according to this embodiment includes a memory and a processor, the memory storing a computer program; when the processor executes the computer program stored in the memory, the processor performs the method for discovering repeated materials of the theme integration service according to any one of embodiments one to six.
Embodiment ten: a computer-readable storage medium according to this embodiment stores a computer program which, when executed, performs the method for discovering repeated materials of the theme integration service according to any one of embodiments one to six.
Embodiment eleven: this embodiment is described with reference to fig. 1. It provides a specific example of the method for discovering repeated materials of a theme integration service described in the first embodiment and also serves to explain embodiments one to six, specifically:
acquiring a material name and text information of a file material of the theme integration service;
extracting global semantic information characteristics of the material name and the text information based on a characteristic extractor;
and processing the global semantic information characteristics according to a logistic regression algorithm, and judging whether the material is a repeated material.
The material name and the original text information are preprocessed (region names and special symbols are removed) and split into tokens using the BERT vocabulary. A flag bit cls is added before all tokens so that global information can be computed, and the embedding layer of the BERT model then converts the text information into embedding vectors E with a fixed dimension of 768. Stacking the encoded embedding vectors of 12 identically structured transformer layers yields the output of the BERT model, where the encoded embedding vector of a single transformer layer is calculated as follows:
Figure DEST_PATH_IMAGE022
where MultiHead denotes the multi-head attention mechanism, Concat denotes concatenation, Attention denotes the self-attention matrix, Q, K and V denote the three matrices used to calculate self-attention, head_i denotes the i-th self-attention matrix,
Figure DEST_PATH_IMAGE023
each denote a mapping of the i-th input vector,
Figure DEST_PATH_IMAGE024
denotes the dimension of K, W^O denotes the parameter matrix of the linear mapping, W is a parameter matrix, and d denotes the matrix dimension. Three linear mappings are applied to the input matrix to obtain the three state matrices used to calculate self-attention:
Figure DEST_PATH_IMAGE025
Full-text semantic information is obtained by calculating self-attention; residual connections and layer normalization accelerate convergence, and the output serves as the input encoding vector of the next transformer layer. Stacking multiple encoding layers mines deeper information and improves accuracy. Because a token that does not belong to the input text (cls) is added in advance and takes part in the attention calculation with all tokens, the cls vector of the last encoder layer (the first dimension of the output vector) is taken as the final global semantic information feature:
Figure DEST_PATH_IMAGE026
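The equation images of this passage are not reproduced in the text. The symbol definitions given for the single-layer encoding (MultiHead, Concat, Attention, Q, K, V, head_i, the per-head mappings and the dimension of K) match the standard multi-head self-attention formulation of the Transformer. Assuming that correspondence, and reading the matrix printed as "W 0" as the output projection W^O, the formulation can be written as:

```latex
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \\
\mathrm{head}_i &= \mathrm{Attention}\bigl(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\bigr) \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\end{aligned}
```

Here h is the number of attention heads and d_k is the dimension of K; the exact notation of the original figures may differ.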
This embodiment uses a logistic classification (logistic regression) algorithm to calculate the text information similarity probability:
Figure DEST_PATH_IMAGE027
wherein exp is an exponential function with a natural constant e as the base,
Figure DEST_PATH_IMAGE028
and P is a text information similarity probability.
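Since the equation image is not reproduced, the following Python sketch shows the usual logistic regression form of such a probability, P = 1 / (1 + exp(-(w·h + b))), as an assumption about what the figure contains. The names h (feature vector), w (weights) and b (bias) are illustrative and are not taken from the patent.

```python
import math
from typing import Sequence


def similarity_probability(h: Sequence[float], w: Sequence[float], b: float = 0.0) -> float:
    """Logistic (sigmoid) probability P = 1 / (1 + exp(-(w.h + b)))."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    # Use the algebraically equivalent form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A score of zero gives P = 0.5, and the probability approaches 1 or 0 as the weighted score becomes strongly positive or negative.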
The logistic classification algorithm has several advantages. Its structure is simple and highly interpretable: the influence of each feature on the final result can be read from the feature weights. Training is fast: the amount of computation during classification depends only on the number of features, the accuracy depends only on the quality of the features, and few computing resources are required. The output is easy to adjust: the result is a probability, so the threshold can be tuned conveniently.
The amount of manually labeled data available at the early stage is still small for training a deep network model. To reduce subsequent labor cost and make use of the early manually labeled data, TONE and active learning are adopted, so that the method is tested and its performance improved in practice. TONE stands for Train On or Near Error: a score boundary is preset, and when the score of a material falls within the boundary, fine-tuning is performed again even if the judgment is correct. The general idea of active learning is to use a machine learning method to find the sample data that are hard to classify, have them manually confirmed, reviewed and labeled, and then train a supervised or semi-supervised learning model again on the manually labeled data, gradually improving the model and integrating human experience into it. When predicting scores, the method presets a threshold θ. Outside the boundary the prediction is taken as accurate, giving a more reliable repeated-material judgment. Inside the boundary the case is uncertain, so whether the pair is repeated is labeled manually and the text content and label of the two materials are learned. TONE is applied during learning, so training is carried out even for pairs inside the boundary that were judged correctly, which improves accuracy.
In this embodiment the thresholds are 0.8 and 0.2. Materials whose text information similarity probability is greater than or equal to 0.8 are repeated materials; materials whose probability is less than or equal to 0.2 are non-repeated materials; and materials whose probability is smaller than 0.8 and larger than 0.2 are treated as misclassified and are retrained.
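The threshold rule of this embodiment can be expressed as a small Python function. The function name triage and the returned labels are illustrative choices; only the values 0.8 and 0.2 come from the text.

```python
HIGH, LOW = 0.8, 0.2  # thresholds stated in this embodiment


def triage(p: float, low: float = LOW, high: float = HIGH) -> str:
    """Map a similarity probability to one of the three outcomes."""
    if p >= high:
        return "repeated"
    if p <= low:
        return "non-repeated"
    return "retrain"  # inside the boundary: label manually and fine-tune
```

For example, triage(0.93) yields "repeated", triage(0.05) yields "non-repeated", and triage(0.5) yields "retrain".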
In practice, because the data set is not large enough, training a neural network from scratch to obtain the desired high-level feature representation would take a long time and give only mediocre results. Therefore a publicly released pre-trained model (pre-trained on a large unlabeled corpus with two tasks, masked language model (MLM) text reconstruction and next sentence prediction (NSP) of inter-sentence relations, which gives the model strong representation capability) is combined with a classifier. The classifier is implemented with a logistic regression algorithm, and the combined model is fine-tuned on the material data set (fine-tuning adapts the already trained feature extractor to the specific downstream task).
The retraining comprises:
obtaining misclassified materials by a machine learning method;
manually labeling whether the misclassified material pairs are repeated, and fine-tuning again on the misclassified material data, specifically:
after the classifier loads the feature extractor weights, the pre-trained weights and the logistic regression parameters are adjusted, with cross entropy used as the loss function L:
Figure DEST_PATH_IMAGE029
The weights are updated with the Adam gradient descent method.
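The loss figure is not reproduced in the text. For a two-class repeated/non-repeated decision, cross entropy is commonly written as L = -[y log(p) + (1 - y) log(1 - p)]; the Python sketch below assumes that binary form, with y the manual label and p the predicted probability. The clipping constant eps is an implementation detail added here to avoid log(0).

```python
import math


def cross_entropy(y: int, p: float, eps: float = 1e-12) -> float:
    """Binary cross entropy for one material pair (y is 0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite at p = 0 or p = 1
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```

The loss shrinks as the predicted probability moves toward the manual label, so a pair labeled as repeated costs less at p = 0.9 than at p = 0.6.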
During training, the number of update epochs is set to 3 and the learning rate to 2e-5 (written in scientific notation, i.e. 2 × 10^-5); the exponential decay rate of the first moment estimate is 0.9 and that of the second moment estimate is 0.999. Adam optimization keeps the size of the parameter updates from changing when the gradient is rescaled, so the optimal solution can be found more stably.
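To make the scale-invariance remark concrete, the following pure-Python sketch performs one Adam update with the hyperparameters quoted above. It follows the standard Adam update rule with bias correction; the function name adam_step and the list-based parameter layout are illustrative, and the small constant eps is the usual stabilizer rather than a value given in the patent.

```python
import math
from typing import List, Tuple


def adam_step(
    params: List[float], grads: List[float],
    m: List[float], v: List[float], t: int,
    lr: float = 2e-5, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
) -> Tuple[List[float], List[float], List[float]]:
    """One Adam update at step t (t starts at 1); returns new params, m, v."""
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g    # second moment estimate
        m_hat = mi / (1 - beta1 ** t)            # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_params, new_m, new_v


# Same starting point, gradients that differ by a factor of 100.
small, _, _ = adam_step([1.0], [0.5], [0.0], [0.0], t=1)
large, _, _ = adam_step([1.0], [50.0], [0.0], [0.0], t=1)
```

Both calls move the parameter by approximately the learning rate, because the update divides the first moment by the square root of the second moment and the gradient scale cancels.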
The method further includes merging the materials required by all the matters that make up the theme integration service into a union, where n is the total number of materials:
{x_1, x_2, x_3, ..., x_n};
The goal of the theme integration service repeated material discovery method is to find the repeated materials among all materials, i.e. to traverse all materials, each time selecting two different materials from the n materials to form a pair and judging whether the two materials are repeated:
Figure DEST_PATH_IMAGE030
where 0 means that the two materials are not repeated and 1 means that they are repeated. For every pair with f = 1, either material (x_i or x_j) is discarded, ensuring that the theme integration service contains no duplicate material.
The training process dynamically adjusts the actual learning rate; the specific algorithm f is as follows:
Material set to be judged: D = {(x_i, x_j)} (1 ≤ i ≤ n, 1 ≤ j ≤ n, i ≠ j);
where the learning rate is η, the learning batch is T, and the exponential decay rate of the first moment estimate is
Figure DEST_PATH_IMAGE031
The exponential decay rate of the second moment estimate is
Figure DEST_PATH_IMAGE032
Figure DEST_PATH_IMAGE033
// initialize classifier weights
Figure DEST_PATH_IMAGE034
// weight matrix: feature extractor weights and classifier weights
Figure DEST_PATH_IMAGE035
Figure DEST_PATH_IMAGE036
// calculate material features
Figure DEST_PATH_IMAGE037
Figure DEST_PATH_IMAGE038
// TONE-based trainer
Figure DEST_PATH_IMAGE039
// if not certain, request manual labeling
Figure DEST_PATH_IMAGE040
else // the classification part follows
Figure DEST_PATH_IMAGE041
end for
Train(T, epoch) // input: the material set to be learned and the number of updates
Figure DEST_PATH_IMAGE042
Figure DEST_PATH_IMAGE043
// predict the material pairs, calculate the loss and update the weights
Figure DEST_PATH_IMAGE044
Figure DEST_PATH_IMAGE045
while True:
    Pre_MS(D) // predict the material set to be judged
    if len(T) == 0: // every classification is accurate: end the task and output the result
        break
    Train(T, epoch) // some classifications are inaccurate: update the weights
Wherein,
Figure DEST_PATH_IMAGE046
is 0. Specifically, the first judgment in the program checks whether the text information similarity probability lies within the threshold range: a pair whose probability is less than 0.8 and greater than 0.2 must be retrained, and the second judgment is not performed for it. The second judgment handles the pairs that pass the first one, i.e. when the text information similarity probability does not fall between 0.2 and 0.8, the judgment result is output: a probability greater than 0.5 indicates repeated material (for example, a probability of 0.8 is repeated material), and a probability less than 0.5 indicates non-repeated material (for example, a probability of 0.2 is not repeated material).
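The control flow of the program above (predict, collect the uncertain pairs into T, request manual labels, train, repeat until T is empty) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: ToyModel is a hypothetical stand-in that scores pairs by word overlap and memorizes manual labels instead of fine-tuning BERT, oracle stands for the human annotator, and max_rounds is a safety guard that does not appear in the original program.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


class ToyModel:
    """Stand-in for the fine-tuned feature extractor plus classifier."""

    def __init__(self) -> None:
        self.memory: Dict[Pair, float] = {}

    def predict(self, pair: Pair) -> float:
        if pair in self.memory:  # learned from a manual label
            return self.memory[pair]
        a, b = (set(s.lower().split()) for s in pair)
        return len(a & b) / len(a | b) if a | b else 0.0  # word overlap

    def train(self, labelled: Dict[Pair, int]) -> None:
        for pair, y in labelled.items():
            self.memory[pair] = float(y)


def pre_ms(model: ToyModel, pairs: Sequence[Pair], low: float = 0.2, high: float = 0.8):
    """Split pairs into confident decisions and the set T to be learned."""
    decided: Dict[Pair, int] = {}
    T: List[Pair] = []
    for pair in pairs:
        p = model.predict(pair)
        if low < p < high:   # first judgment: inside the boundary
            T.append(pair)
        else:                # second judgment: classify at 0.5
            decided[pair] = int(p > 0.5)
    return decided, T


def run(model: ToyModel, pairs: Sequence[Pair],
        oracle: Callable[[Pair], int], max_rounds: int = 10) -> Dict[Pair, int]:
    for _ in range(max_rounds):
        decided, T = pre_ms(model, pairs)
        if len(T) == 0:      # every pair is classified confidently
            return decided
        model.train({pair: oracle(pair) for pair in T})
    raise RuntimeError("uncertain pairs remain after max_rounds")


D = [("id card copy", "id card copy"),
     ("id card copy", "id card original"),
     ("tax form", "business license")]
result = run(ToyModel(), D, oracle=lambda pair: 0)
```

In this example only the second pair falls inside the boundary (overlap 0.5), so it is the only one sent for manual labeling; after one training round all three pairs are decided.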
The above further describes the technical solutions provided by the present invention in detail with reference to the drawings in order to highlight their advantages and benefits. It is not intended to limit the present invention, and any modification, combination of embodiments, improvement or equivalent replacement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (8)

1. A method for discovery of repeated materials for a subject integration service, the method comprising:
acquiring a material name and text information of a file material of the theme integration service;
extracting global semantic information characteristics of the material name and the text information based on a characteristic extractor;
processing the global semantic information characteristics according to a logistic regression algorithm, and judging whether the materials are repetitive materials;
the feature extractor-based global semantic information feature extraction for extracting material names and text information specifically comprises the following steps:
removing the material name of the file material, the region name and the special symbol of the text information, and acquiring the processed text information;
adding a flag bit cls in the word segmentation module of a BERT model, splicing the processed text information of two file materials into spliced text information, performing word segmentation on the spliced text information by using the word segmentation module of the BERT model, and acquiring the global semantic information features, wherein the spliced text information comprises the i-th material x_i and the j-th material x_j, and the global semantic information features comprise a one-dimensional feature vector of the spliced text information
Figure 611521DEST_PATH_IMAGE001
where i is not less than 1 and j is not equal to 1:
Figure 979048DEST_PATH_IMAGE002
,
wherein,
Figure 220674DEST_PATH_IMAGE003
represents the encoded embedding vector of 12 transformer layers of identical structure, and [0] denotes the first-dimension vector.
2. The method for discovering repeated materials of the subject integration service according to claim 1, wherein classifying the global semantic information features according to the logistic regression algorithm and judging whether the materials are repeated comprises calculating a text information similarity probability and judging whether the materials are repeated according to the text information similarity probability, the text information similarity probability being calculated as:
Figure 682748DEST_PATH_IMAGE004
wherein exp is an exponential function with a natural constant e as the base,
Figure 263902DEST_PATH_IMAGE005
and P is a text information similarity probability.
3. The method for discovering repeated materials of the subject integration service according to claim 2, wherein classifying the global semantic information features according to the logistic regression algorithm and judging whether the materials are repeated further comprises an active learning method, specifically comprising:
setting a text information similarity probability threshold, wherein the text information similarity probability threshold comprises 0.8 and 0.2;
materials whose text information similarity probability is greater than or equal to 0.8 are repeated materials; materials whose text information similarity probability is less than or equal to 0.2 are non-repeated materials; and materials whose text information similarity probability is smaller than 0.8 and larger than 0.2 are misclassified, and the misclassified materials are retrained.
4. The method of claim 3, wherein the retraining comprises:
obtaining misclassified materials by a machine learning method;
manually labeling whether the misclassified material pairs are repeated, and fine-tuning again on the misclassified data, specifically comprising:
adjusting a pre-training weight and a logistic regression algorithm, using cross entropy as a loss function L, and updating the weight by using an Adam gradient descent method:
Figure 777230DEST_PATH_IMAGE006
wherein y is the manually annotated label indicating whether the two materials are repeated,
Figure 443835DEST_PATH_IMAGE007
is the predicted value of the classification model, and L measures the degree of difference between the model's predicted value and the actual value.
5. The subject integration service repeated material discovery method of claim 1, wherein the method further comprises:
combining the materials of the matters constituting the theme integration service into a union
Figure 709600DEST_PATH_IMAGE008
which represents the union of the materials of the theme integration service, wherein n is the total number of materials;
selecting any two materials from the n materials to form a pair, and judging whether the two materials are repeated:
Figure 207577DEST_PATH_IMAGE009
wherein 0 indicates that the two materials are not repeated and 1 indicates that they are repeated;
extracting and outputting all material pairs with f = 1, and deleting either material x_i or x_j of each pair, wherein x_i denotes the i-th material and x_j denotes the j-th material.
6. A subject integration service repeated material discovery system, the system comprising:
a material name and text information acquisition unit for acquiring the material name and text information of a file material of the subject integration service;
the global semantic information characteristic acquisition unit is used for extracting global semantic information characteristics of the material names and the text information based on the characteristic extractor;
the repeated material judging unit is used for classifying and processing the global semantic information characteristics according to a logistic regression algorithm and judging whether the repeated material exists;
the global semantic information feature acquisition unit includes:
the processed text information acquisition module is used for removing the material name of the file material, the region name and the special symbol of the text information and acquiring the processed text information;
a global semantic information feature acquisition module for adding a flag bit cls in the word segmentation module of the BERT model, splicing the processed text information of two file materials into spliced text information, and performing word segmentation on the spliced text information by using the word segmentation module of the BERT model to acquire the global semantic information features, wherein the spliced text information comprises the i-th material x_i and the j-th material x_j, and the global semantic information features comprise a one-dimensional feature vector of the spliced text information
Figure 916907DEST_PATH_IMAGE001
where i is not less than 1 and j is not equal to 1:
Figure 54496DEST_PATH_IMAGE002
,
wherein,
Figure 874685DEST_PATH_IMAGE003
represents the encoded embedding vector of 12 transformer layers of identical structure, and [0] denotes the first-dimension vector.
7. A computer device, characterized by: comprising a memory having a computer program stored therein and a processor that, when executing the computer program stored by the memory, performs the subject integration service repeated material discovery method as claimed in any one of claims 1 to 5.
8. A computer-readable storage medium for storing a computer program for executing the subject integration service repeated material discovery method as set forth in any one of claims 1 to 5.
CN202211282962.1A 2022-10-20 2022-10-20 Method, system, device and storage medium for discovering repeated materials of theme integration service Active CN115357718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211282962.1A CN115357718B (en) 2022-10-20 2022-10-20 Method, system, device and storage medium for discovering repeated materials of theme integration service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211282962.1A CN115357718B (en) 2022-10-20 2022-10-20 Method, system, device and storage medium for discovering repeated materials of theme integration service

Publications (2)

Publication Number Publication Date
CN115357718A CN115357718A (en) 2022-11-18
CN115357718B true CN115357718B (en) 2023-01-24

Family

ID=84007874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211282962.1A Active CN115357718B (en) 2022-10-20 2022-10-20 Method, system, device and storage medium for discovering repeated materials of theme integration service

Country Status (1)

Country Link
CN (1) CN115357718B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109114B (en) * 2023-04-13 2023-09-15 佛山科学技术学院 Normalized government service data processing method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832290A (en) * 2020-05-25 2020-10-27 北京三快在线科技有限公司 Model training method and device for determining text relevancy, electronic equipment and readable storage medium
CN111914061A (en) * 2020-07-13 2020-11-10 上海乐言信息科技有限公司 Radius-based uncertainty sampling method and system for text classification active learning
CN113160917A (en) * 2021-05-18 2021-07-23 山东健康医疗大数据有限公司 Electronic medical record entity relation extraction method
CN115017879A (en) * 2022-05-27 2022-09-06 深圳证券信息有限公司 Text comparison method, computer device and computer storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT

Also Published As

Publication number Publication date
CN115357718A (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN110188195B (en) Text intention recognition method, device and equipment based on deep learning
CN111428028A (en) Information classification method based on deep learning and related equipment
CN112732871A (en) Multi-label classification method for acquiring client intention label by robot
CN113378563B (en) Case feature extraction method and device based on genetic variation and semi-supervision
CN114997169B (en) Entity word recognition method and device, electronic equipment and readable storage medium
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN114416979A (en) Text query method, text query equipment and storage medium
CN113868422A (en) Multi-label inspection work order problem traceability identification method and device
CN112270187A (en) Bert-LSTM-based rumor detection model
CN115357718B (en) Method, system, device and storage medium for discovering repeated materials of theme integration service
CN112685374B (en) Log classification method and device and electronic equipment
CN114691525A (en) Test case selection method and device
CN115392254A (en) Interpretable cognitive prediction and discrimination method and system based on target task
CN116304020A (en) Industrial text entity extraction method based on semantic source analysis and span characteristics
CN114881173A (en) Resume classification method and device based on self-attention mechanism
CN114610888A (en) Automatic monitoring and synthesizing method for defect report of developer group chat
CN114266252A (en) Named entity recognition method, device, equipment and storage medium
CN107766560B (en) Method and system for evaluating customer service flow
CN111523301B (en) Contract document compliance checking method and device
CN113722494A (en) Equipment fault positioning method based on natural language understanding
CN110362828B (en) Network information risk identification method and system
CN117372144A (en) Wind control strategy intelligent method and system applied to small sample scene
CN114610882A (en) Abnormal equipment code detection method and system based on electric power short text classification
CN112749530B (en) Text encoding method, apparatus, device and computer readable storage medium
CN117389821A (en) Log abnormality detection method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: Qi Haoliang, Miao Xiaogang, Han Yong, Qi Sixian, Kong Leilei, Han Zhongyuan, Cao Xia
Inventor before: Qi Haoliang, Miao Xiaogang, Han Yong, Kong Leilei, Han Zhongyuan, Cao Xia
CP03 Change of name, title or address
Address after: No.18, Jiangwan 1st Road, Foshan, Guangdong 528011
Patentee after: Foshan University
Country or region after: China
Address before: No.18, Jiangwan 1st Road, Foshan, Guangdong 528011
Patentee before: FOSHAN University
Country or region before: China