CN115563512A - Semantic matching model generation method and system based on remote supervision - Google Patents

Semantic matching model generation method and system based on remote supervision

Info

Publication number
CN115563512A
Authority
CN
China
Prior art keywords
semantic
vector
text data
model
original text
Prior art date
Legal status
Pending
Application number
CN202211166854.8A
Other languages
Chinese (zh)
Inventor
程栋
谭锐
潘希尧
张泽宏
王晔
Current Assignee
Shanghai Big Data Co ltd
Original Assignee
Shanghai Big Data Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Big Data Co ltd filed Critical Shanghai Big Data Co ltd
Priority to CN202211166854.8A priority Critical patent/CN115563512A/en
Publication of CN115563512A publication Critical patent/CN115563512A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The invention provides a semantic matching model generation method and system based on remote supervision, relating to the technical field of machine learning, comprising the following steps: acquiring a plurality of original text data of a preset application scene, and performing data enhancement on each original text data to obtain a plurality of enhanced text data, wherein each enhanced text data is associated with a standard semantic category label corresponding to the preset application scene; performing weight fine-tuning on a pre-trained language model according to each enhanced text data and the associated standard semantic category label to obtain a fine-tuned model; performing sentence embedding on each original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and automatically labeling each semantic vector to obtain a plurality of semantic matching vector pairs and a plurality of semantic mismatching vector pairs; and training according to each semantic matching vector pair and each semantic mismatching vector pair to obtain a semantic matching model. The method effectively reduces the cost of large-scale manual labeling and, to a certain extent, alleviates the cold-start difficulty of semantic matching models.

Description

Semantic matching model generation method and system based on remote supervision
Technical Field
The invention relates to the technical field of machine learning, in particular to a semantic matching model generation method and system based on remote supervision.
Background
The most direct goal of semantic matching is to judge whether two utterances express the same meaning. For sentences, semantic matching involves two main tasks: 1. finding a reasonable word-embedding scheme for describing key features; 2. performing a binary semantic-similarity classification on the semantic vectors of the two sentences.
Semantic matching techniques commonly used in industry generally obtain vector representations and compute inter-sentence similarity from word-frequency features, from word embeddings of the text, or from sentence-level embeddings produced by a pre-trained language model. However, similarity can only be turned into a matching decision by thresholding, which is a clear limitation. An alternative approach treats semantic matching as a binary classification task over sentence pairs, implemented with word2vec + LSTM or with a pre-trained language model, but this generally requires labeled semantic-matching training data at considerable scale, and labeled training data is often scarce or incomplete, making such a model difficult to cold-start. Moreover, reaching satisfactory accuracy typically requires at least 200,000 training samples; manually labeling data at this scale is prohibitively costly, and data volumes of this order also strain computational resources.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a semantic matching model generation method based on remote supervision, which comprises the following steps:
step S1, acquiring a plurality of original text data of a preset application scene, and performing data enhancement on each original text data to obtain a plurality of enhanced text data, wherein each enhanced text data is associated with a standard semantic category label corresponding to the preset application scene;
step S2, performing weight fine-tuning on a pre-trained language model according to each enhanced text data and the associated standard semantic category label to obtain a fine-tuned model;
step S3, performing sentence embedding on each original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and automatically labeling each semantic vector to obtain a plurality of semantic matching vector pairs and a plurality of semantic mismatching vector pairs;
step S4, training according to each semantic matching vector pair and each semantic mismatching vector pair to obtain a semantic matching model.
Preferably, the step S1 includes:
step S11, acquiring each original text data from each service system associated with the preset application scene, wherein each original text data is associated with an original semantic category label defined in the corresponding service system;
step S12, correspondingly standardizing each original semantic category label into the standard semantic category label in the preset application scene, and performing data fusion on each original text data with the same standard semantic category label;
step S13, performing category scoring on each standard semantic category label according to each original text data after data fusion to obtain a corresponding score value, and judging whether the score value is greater than a preset score threshold value:
if yes, adding each original text data associated with the corresponding standard semantic category label into a high-quality data set, and then turning to step S14;
if not, adding each original text data associated with the corresponding standard semantic category label into a low-quality data set, and then turning to the step S15;
step S14, performing word extraction on each original text data in the high-quality data set, and constructing a semantic dictionary based on each semantic representative word obtained by the word extraction;
step S15, configuring the standard semantic category label for each original text data in the low-quality data set according to the semantic dictionary;
step S16, performing data enhancement on each original text data associated with each standard semantic category label to obtain each enhanced text data.
Preferably, in step S13, category scoring is performed for each standard semantic category according to the data length of its original text data and the proportion of its data amount relative to all original text data, to obtain the corresponding score value.
Preferably, the step S14 includes:
step S141, extracting keywords and subject words from each original text data in the high-quality data set;
and S142, performing duplication elimination processing on the extracted keywords and the extracted subject words, and screening the keywords and the subject words after the duplication elimination processing to obtain each semantic representative word.
Preferably, in step S16, the data enhancement modes include:
performing non-keyword replacement on each original text data associated with each standard semantic category label, so that each original text data together with the new text data formed by non-keyword replacement serves as the enhanced text data; and/or
randomly down-sampling the original text data associated with each standard semantic category label, and taking the sampled original text data as the enhanced text data associated with that standard semantic category label.
Preferably, the step S3 includes:
step S31, embedding sentences of the original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and adding the semantic vectors into a vector set;
step S32, performing text clustering on each semantic vector in the vector set to obtain a centroid of a plurality of clustering clusters formed after text clustering, respectively calculating the distance between the semantic vector serving as the centroid and other semantic vectors for each clustering cluster, and judging whether the distance is within a preset threshold range:
if yes, adding the corresponding semantic vector into a fault set, and then turning to the step S34;
if not, go to step S33;
step S33, determining whether the distance is not greater than the lower limit of the threshold range:
if yes, adding the corresponding semantic vector into a credible set, and then turning to the step S34;
if not, adding the corresponding semantic vector into an untrusted set, and then turning to step S34;
step S34, judging whether the credible set is an empty set:
if yes, quitting;
if not, automatically labeling the semantic vectors in the credible set in the same clustering cluster as a plurality of semantic matching vector pairs, and automatically labeling the semantic vectors in the credible set and the non-credible set of the same clustering cluster and the semantic vectors in the credible set and the semantic vectors in other clustering clusters as a plurality of semantic mismatching vector pairs;
step S35, removing the semantic vector in each trusted set from the vector set, and then returning to step S32.
Preferably, the semantic matching model comprises a BERT model; an anti-perturbation layer is added to the embedding layer of the BERT model, the output of the BERT model is connected to a softmax layer, the input of the BERT model serves as the input of the semantic matching model, and the output of the softmax layer serves as the output of the semantic matching model.
The invention also provides a generation system of a semantic matching model based on remote supervision, which applies the generation method and comprises the following steps:
the data enhancement module is used for acquiring a plurality of original text data of a preset application scene, and performing data enhancement on the original text data to obtain a plurality of enhanced text data, wherein each enhanced text data is associated with a standard semantic category label corresponding to the preset application scene;
the weight fine-tuning module is connected with the data enhancement module and used for carrying out weight fine-tuning on a pre-trained language model according to each enhanced text data and the associated standard semantic category label to obtain a fine-tuned model;
the sample construction module is respectively connected with the data enhancement module and the weight fine-tuning module and is used for embedding sentences of the original text data according to the fine-tuned model to obtain a plurality of semantic vectors and automatically labeling the semantic vectors to obtain a plurality of semantic matching vector pairs and a plurality of semantic mismatching vector pairs;
and the model training module is connected with the sample construction module and used for training according to each semantic matching vector pair and each semantic mismatching vector pair to obtain a semantic matching model.
Preferably, the sample construction module comprises:
a semantic vector generating unit, configured to perform sentence embedding on each original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and add each semantic vector to a vector set;
a semantic vector clustering unit connected to the semantic vector generating unit, configured to perform text clustering on each semantic vector in the vector set to obtain a centroid of multiple clustering clusters formed after text clustering, calculate, for each clustering cluster, a distance between the semantic vector serving as the centroid and another semantic vector, add the corresponding semantic vector to a fault set when the distance is within a preset threshold range, add the corresponding semantic vector to a trusted set when the distance is not greater than a lower limit of the threshold range, and add the corresponding semantic vector to an untrusted set when the distance is not within the threshold range and is greater than the lower limit;
a judging unit, connected to the semantic vector clustering unit, configured to, when the trusted set is not an empty set, automatically label, as a plurality of semantic matching vector pairs, between the semantic vectors in the trusted set in the same cluster, automatically label, as a plurality of semantic mismatching vector pairs, between the semantic vectors in the trusted set and the semantic vectors in the untrusted set in the same cluster, and between the semantic vectors in the trusted set and the semantic vectors in other clusters;
the iteration unit is connected with the judging unit and is used for removing the semantic vectors in each credible set from the vector sets;
and the semantic vector clustering unit iteratively performs text clustering on the removed vector set until the judging unit judges that the credible set is an empty set.
Preferably, the semantic matching model comprises a BERT model; an anti-perturbation layer is added to the embedding layer of the BERT model, the output of the BERT model is connected to a softmax layer, the input of the BERT model serves as the input of the semantic matching model, and the output of the softmax layer serves as the output of the semantic matching model.
The technical scheme has the following advantages or beneficial effects:
1) Semantic matching vector pairs and semantic mismatching vector pairs are labeled automatically through remote supervision, effectively reducing the cost of large-scale manual labeling and, to a certain extent, alleviating the cold-start difficulty of the semantic matching model;
2) The semantic matching model is trained as a sentence-pair binary-classification task on the pre-trained BERT model, assisted by an anti-perturbation layer structure, so that the trained semantic matching model better handles the boundary cases of semantic matching.
Drawings
FIG. 1 is a flow chart illustrating a method for generating a semantic matching model based on remote supervision according to a preferred embodiment of the present invention;
FIG. 2 is a sub-flowchart of step S1 according to a preferred embodiment of the present invention;
FIG. 3 is a sub-flowchart of step S14 according to a preferred embodiment of the present invention;
FIG. 4 is a sub-flowchart of step S3 according to a preferred embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a system for generating a semantic matching model based on remote supervision according to a preferred embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present invention is not limited to the embodiment, and other embodiments may be included in the scope of the present invention as long as the gist of the present invention is satisfied.
In a preferred embodiment of the present invention, based on the above problems in the prior art, there is provided a method for generating a semantic matching model based on remote supervision, as shown in fig. 1, including:
the method comprises the following steps of S1, obtaining a plurality of original text data of a preset application scene, and performing data enhancement on each original text data to obtain a plurality of enhanced text data, wherein each enhanced text data is associated with a standard semantic category label corresponding to the preset application scene;
s2, carrying out weight fine adjustment on the pre-trained language model according to each enhanced text data and the associated standard semantic category label to obtain a fine-adjusted model;
step S3, performing sentence embedding on each original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and automatically labeling each semantic vector to obtain a plurality of semantic matching vector pairs and a plurality of semantic mismatching vector pairs;
and S4, training according to each semantic matching vector pair and each semantic mismatching vector pair to obtain a semantic matching model.
Specifically, the preset application scenario includes, but is not limited to, a city-operation scenario, and the corresponding original text data includes, but is not limited to, 12345 citizen-hotline data and city emergency-incident data. To address the excessive cost of manually labeling the data set needed to train a semantic matching model, this embodiment uses remote supervision to automatically label the semantic matching and mismatching vector pairs. Before automatic labeling, because directly collected original text data is usually of relatively low quality, data enhancement must be performed on each original text data to prepare it for remotely supervised automatic labeling. The effectiveness of this data enhancement largely determines whether the whole remote-supervision process can keep labeling noise consistently low.
More specifically, the data enhancement process is shown in fig. 2, and step S1 includes:
s11, acquiring original text data from each service system associated with a preset application scene, wherein each original text data is associated with an original semantic category label defined in the corresponding service system;
s12, correspondingly standardizing each original semantic category label into a standard semantic category label in a preset application scene, and performing data fusion on each original text data with the same standard semantic category label;
step S13, performing category scoring on each standard semantic category label according to each original text data after data fusion to obtain a corresponding score value, and judging whether the score value is greater than a preset score threshold value:
if yes, adding each original text data associated with the corresponding standard semantic category label into a high-quality data set, and then turning to the step S14;
if not, adding each original text data associated with the corresponding standard semantic category label into the low-quality data set, and then turning to the step S15;
step S14, extracting words from each original text data in the high-quality data set, and constructing a semantic dictionary based on each semantic representative word obtained by extracting the words;
step S15, configuring standard semantic category labels for each original text data in the low-quality data set according to the semantic dictionary;
and S16, performing data enhancement on each original text data associated with each standard semantic category label to obtain each enhanced text data.
Specifically, in this embodiment, taking the city-operation scenario as the preset application scenario, the service systems include, but are not limited to, a 12345 citizen-hotline platform and an emergency-incident processing platform, whose original text data are the 12345 citizen-hotline data and the city emergency-incident data, respectively. The same type of event may carry differently named original semantic category labels in the two systems (for example, two different injury-related labels in the hotline data and the emergency-incident data), so the data must first undergo category fusion to broaden the semantic categories: a single standard semantic category label, such as a personnel-injury category, is defined to subsume the corresponding original labels from both systems.
More specifically, each standard semantic category label is scored to obtain a corresponding score value. Preferably, the scoring considers the data length of the original text data in each standard semantic category and the proportion of its data amount relative to all original text data. It can be understood that the longer the text in a standard semantic category and the larger its share of the data, the higher the corresponding score value; the specific scoring rule can be customized as needed. A higher score value indicates better category quality of the corresponding original text data. Category quality is quantized by configuring a score threshold in advance: original text data whose score value exceeds the threshold is considered high quality, otherwise low quality. From the high-quality original text data, reasonable and highly semantically representative words are extracted as semantic representative words to build a semantic dictionary; standard semantic category labels are then configured for each original text data in the low-quality data set according to this dictionary. This relabels events that the service system had obviously misclassified or left unclassified, enriching the semantic categories.
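One way to read the scoring rule above — a minimal sketch assuming the score is a weighted combination of normalized average text length and the category's share of all data; the weights, the length-normalization constant, and the threshold are illustrative assumptions, not values given in the description:

```python
def category_score(texts, total_count, w_len=0.5, w_share=0.5, max_len=200.0):
    """Score one standard semantic category: longer texts and a larger share
    of the overall data yield a higher score. Weights are assumptions."""
    avg_len = sum(len(t) for t in texts) / len(texts)
    share = len(texts) / total_count
    return w_len * min(avg_len / max_len, 1.0) + w_share * share

def split_by_quality(categories, threshold=0.3):
    """categories: dict label -> list of texts. Returns (high_quality, low_quality)."""
    total = sum(len(v) for v in categories.values())
    high, low = {}, {}
    for label, texts in categories.items():
        (high if category_score(texts, total) > threshold else low)[label] = texts
    return high, low
```

Categories landing in the high-quality set would feed the semantic-dictionary construction of step S14; the rest are relabeled in step S15.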
The above process of extracting reasonable words with high semantic representation is shown in fig. 3, and step S14 includes:
step S141, extracting keywords and subject words from each original text data in the high-quality data set;
and S142, performing duplication elimination processing on the extracted keywords and the extracted subject words, and screening the keywords and the subject words after the duplication elimination processing to obtain each semantic representative word.
Specifically, in this embodiment, after the extracted keywords and topic words are deduplicated, manual screening removes keywords and topic words that are unreasonable or whose phrasing does not match existing usage, yielding the semantic representative words and further improving the effect of subsequent data enhancement.
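The keyword extraction of steps S141–S142 could be sketched with a simple TF-IDF scorer over pre-tokenized documents; the description does not specify the extraction algorithm, so the scoring formula, the length filter, and `top_k` are all assumptions (in practice a Chinese toolkit such as jieba or LTP would supply tokenization and topic-word extraction):

```python
import math
from collections import Counter

def extract_representative_words(docs_tokens, top_k=5, min_len=2):
    """docs_tokens: list of token lists, one per high-quality document.
    Scores tokens by TF-IDF (deduplication is inherent to the dict),
    filters very short tokens, and returns the top_k candidates."""
    n_docs = len(docs_tokens)
    df = Counter()                       # document frequency per token
    for tokens in docs_tokens:
        df.update(set(tokens))
    tf = Counter(t for tokens in docs_tokens for t in tokens)
    scores = {t: tf[t] * math.log((1 + n_docs) / (1 + df[t]))
              for t in tf if len(t) >= min_len}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]]
```

Tokens occurring in every document score zero and drop out, which mimics discarding words with low semantic representativeness; the manual screening pass would then prune the returned list.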
During category fusion, some categories may end up with far fewer or far more samples than others, leaving the fused event-description categories extremely unevenly distributed; the quality and distribution of the category labels therefore need further optimization through data enhancement. Specifically, the data enhancement modes include:
performing non-keyword replacement on each original text data associated with each standard semantic category label to take each original text data and new text data formed by performing non-keyword replacement as enhanced text data; and/or
And selectively and randomly sampling the original text data associated with the standard semantic category labels, and taking the original text data obtained by random sampling as the enhanced text data associated with the standard semantic category labels.
Specifically, in this embodiment, non-keyword replacement can be applied to standard semantic categories with little data to increase the number of samples; a non-keyword is a word that does not affect the core semantics, such as an address entity or a time entity. Random down-sampling can be applied to standard semantic categories with larger data amounts to reduce the number of samples, so that the number of events is ultimately distributed evenly across the standard semantic categories.
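A minimal sketch of the two enhancement modes — non-keyword replacement for under-represented categories and random down-sampling for over-represented ones. The substitution dictionary and the balancing target are illustrative assumptions, not part of the patent's specification:

```python
import random

def replace_non_keywords(text, non_keywords, substitutes):
    """Generate augmented variants by swapping entities that do not affect the
    core semantics (e.g. address or time entities). The original text is kept
    alongside each replacement variant."""
    variants = [text]
    for word in non_keywords:
        if word in text:
            variants.extend(text.replace(word, sub) for sub in substitutes.get(word, []))
    return variants

def balance_categories(categories, target, seed=0):
    """Down-sample categories with more than 'target' items by random sampling,
    leaving smaller categories untouched, so category sizes even out."""
    rng = random.Random(seed)
    return {label: (rng.sample(texts, target) if len(texts) > target else list(texts))
            for label, texts in categories.items()}
```

Combining both functions per category approximates the "and/or" clause of step S16: grow small categories with variants, shrink large ones by sampling.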
After data enhancement, the weights of the pre-trained language model can be fine-tuned so that weight parameters originally trained for general-purpose scenarios better fit the emergency-event text classification task of the city-operation scenario. The pre-trained language model is preferably a BERT model. The remote supervision exploits the existing relationships between event-text entities and service-category entities in the service systems' knowledge base: sentence-embedding weights are fine-tuned on top of a BERT task model for event-text service-category classification to obtain semantic vector representations, and the matching and mismatching relationships between pairs of event texts are then constructed from the results of iterative clustering of the semantic vectors, completing the labeling of the semantic matching data set.
Specifically, as shown in fig. 4, step S3 includes:
step S31, embedding sentences of the original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and adding the semantic vectors into a vector set;
step S32, performing text clustering on each semantic vector in the vector set to obtain the centroid of a plurality of clustering clusters formed after the text clustering, respectively calculating the distance between the semantic vector serving as the centroid and other semantic vectors for each clustering cluster, and judging whether the distance is within a preset threshold range:
if yes, adding the corresponding semantic vector into the fault set, and then turning to the step S34;
if not, go to step S33;
step S33, determining whether the distance is not greater than the lower limit of the threshold range:
if yes, adding the corresponding semantic vector into the credible set, and then turning to the step S34;
if not, adding the corresponding semantic vector into the untrusted set, and then turning to step S34;
step S34, judging whether the credible set is an empty set:
if yes, quitting;
if not, automatically labeling pairs of semantic vectors within the credible set of the same cluster as semantic matching vector pairs, and automatically labeling pairs formed between the credible set and the untrusted set of the same cluster, as well as between semantic vectors in the credible set and semantic vectors in other clusters, as semantic mismatching vector pairs;
and step S35, removing the semantic vectors in each credible set from the vector set, and then returning to the step S32.
Specifically, in this embodiment, weight fine-tuning yields weights adapted to classifying emergency and 12345 hotline semantics, and the semantic vectors obtained by sentence-embedding the original text data with these weights express a relatively clear division between 12345 and emergency services in semantic space. The semantic vectors are then fed into a text clustering model based on cosine similarity to obtain the centroid of each cluster, which can be understood simply as the semantic center of that cluster. The distance from each sample to its cluster centroid is computed, and an untrusted threshold and a trusted threshold are set; a suitable threshold range is found through manual spot-checking and repeated trials, dividing the samples into a credible set (distance to the centroid not greater than the lower limit of the threshold range), an untrusted set (distance not less than the upper limit), and a fault set (distance inside the range). Semantic-vector matching pairs are formed among samples within a cluster's credible set; mismatching pairs are formed between a cluster's credible-set samples and its untrusted-set samples, and between a cluster's credible-set samples and any samples in other clusters. The process then iterates: the credible set obtained in the previous round is removed, the remaining vectors are re-clustered, and the new centroids yield a new credible set, untrusted set, and fault set. Iteration ends when the credible set no longer contains any samples.
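The centroid-distance partition and pair assembly just described can be sketched as follows, assuming cosine distance and externally supplied clusters; the threshold values are placeholders for the manually tuned range the description mentions:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def centroid(cluster):
    n = len(cluster)
    return [sum(v[i] for v in cluster) / n for i in range(len(cluster[0]))]

def partition_cluster(cluster, lower, upper):
    """Split one cluster into credible / fault / untrusted sets by each
    vector's cosine distance to the cluster centroid."""
    c = centroid(cluster)
    trusted, fault, untrusted = [], [], []
    for v in cluster:
        d = cosine_distance(v, c)
        if d <= lower:
            trusted.append(v)      # close to the centroid: reliably similar
        elif d < upper:
            fault.append(v)        # inside the threshold range: set aside
        else:
            untrusted.append(v)    # beyond the range: reliably dissimilar
    return trusted, fault, untrusted

def label_pairs(trusted, untrusted, other_clusters):
    """Credible-credible pairs match; credible vs. same-cluster untrusted
    samples and vs. any sample of other clusters do not."""
    match = [(a, b) for i, a in enumerate(trusted) for b in trusted[i + 1:]]
    mismatch = [(a, b) for a in trusted for b in untrusted]
    mismatch += [(a, b) for a in trusted for oc in other_clusters for b in oc]
    return match, mismatch
```

An outer loop would remove each round's credible sets from the vector pool, re-cluster, and repeat until `partition_cluster` returns empty credible sets everywhere.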
In this method, word embedding is performed with BERT model weights fine-tuned on the text classification task. BERT's attention mechanism avoids the word-sense and structural limitations of traditional models, so the embedding vectors express both contextual semantic features and business-scenario semantic features well. The embedded semantic vectors are then fed into a cosine-similarity-based text clustering model for iterative clustering, and for each clustering result the credible, untrusted, and fault sets are divided by thresholding each sample's distance to the centroid.
Semantic matching pairs are then assembled from the credible, untrusted, and fault sets, and the granularity of semantic matching is constrained by controlling the radius of the credible set, realizing automatic labeling of the semantic matching data set through remote supervision.
With the remotely supervised automatic labeling of this technical scheme, 2,000 texts with coarse service-category labels can be automatically expanded into a semantic matching data set of 200,000–300,000 pairs, with the data scale varying slightly with the granularity of semantic matching.
After the clustering iteration finishes, the semantic-vector matching pairs and mismatching pairs constitute the automatically labeled training set, on which the semantic matching model can then be trained. In a preferred embodiment of the invention, the semantic matching model comprises a BERT model; an anti-perturbation layer is added to the embedding layer of the BERT model, the output of the BERT model is connected to a softmax layer, the input of the BERT model serves as the input of the semantic matching model, and the output of the softmax layer serves as the output of the semantic matching model.
Specifically, in this embodiment, BERT-wwm is selected as the pre-trained language model, because in the BERT-base released by Google, Chinese is segmented at character granularity, disregarding the Chinese word segmentation of traditional NLP. BERT-wwm applies the whole-word-masking method to Chinese: it is trained on Chinese Wikipedia (both simplified and traditional) and uses LTP as the word-segmentation tool, so that all Chinese characters forming the same word are masked together. A softmax layer is added on top of BERT-wwm so that the model's output is normalized into probabilities with better interpretability. An anti-perturbation architecture is further added to optimize classification on fine-grained boundary cases. Semantic matching is thus a sentence-pair binary-classification downstream task on BERT-wwm: the pre-trained language model is invoked for the classification task and fine-tuned with adversarial perturbation added.
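The whole-word-masking idea behind BERT-wwm can be illustrated in miniature: a segmenter (e.g. LTP) groups character-level tokens into word spans, and all characters of a selected word are masked together. The function name, span format, and mask rate below are illustrative assumptions, not BERT-wwm's actual pretraining code:

```python
import random

def whole_word_mask(tokens, word_spans, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """tokens: character-level token list; word_spans: lists of token indices
    that a word segmenter grouped into whole words. Each chosen word has all
    of its characters replaced by the mask token at once."""
    rng = random.Random(seed)
    out = list(tokens)
    for span in word_spans:
        if rng.random() < mask_rate:
            for i in span:
                out[i] = mask_token
    return out
```

Compare with character-granularity masking, which could mask "上" while leaving "海" visible, making the word "上海" trivially recoverable.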
The semantic matching model is trained in the form of a sentence-pair binary classification task on the pre-trained Bert model, assisted by an anti-perturbation model structure so that the model can better handle the boundary cases of semantic matching. The reason for adding adversarial perturbation is that, for the pairing logic under remote supervision, semantic matching cannot accurately distinguish matching from mismatching on very fine-grained boundary cases; therefore, adversarial perturbation is added in the subsequent training process to optimize the model's matching capability on such fine-grained boundary cases. Concretely, a perturbation is added to the Embedding layer of the Bert model for adversarial training: since the output of the Embedding layer is taken directly from the Embedding parameter matrix, the perturbation can be applied directly to the Embedding parameter matrix. Although the adversarial samples obtained in this way are less diverse (because, across different samples, the same token shares the same perturbation), they still act as a regularizer. A semantic matching model trained on the remotely supervised, automatically labeled data set can reach an accuracy above 95%, and in service it can accurately enable the analysis of 12345-hotline multi-person complaints and the spatio-temporal semantic aggregation of emergency events.
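The embedding-matrix perturbation described above is in the spirit of FGM-style (Fast Gradient Method) adversarial training. The following is a minimal NumPy sketch of the perturbation step only; the function name, the epsilon value, and the matrix shapes are illustrative assumptions, not specified in the patent:

```python
import numpy as np

def fgm_perturbation(grad, epsilon=1.0):
    """FGM-style perturbation r = epsilon * g / ||g||_2, applied to the whole
    embedding parameter matrix, so every occurrence of a token shares it."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad)
    return epsilon * grad / norm

# hypothetical embedding parameter matrix and its gradient from one backward pass
embedding = np.random.randn(8, 4)   # vocab_size x hidden_dim
grad = np.random.randn(8, 4)
embedding_adv = embedding + fgm_perturbation(grad, epsilon=0.5)
# a second forward/backward pass on embedding_adv yields the adversarial loss;
# the original embedding matrix is then restored before the optimizer step
```

Because the perturbation is taken in the gradient direction and rescaled to a fixed norm, it pushes each training sample toward the decision boundary, which is exactly where the fine-grained matching errors occur.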
In summary, in a city-operation scenario, semantic matching analysis is performed on event descriptions from sources such as the 12345 citizen hotline and city emergency events. Based on a pre-trained language model and deep learning techniques, and by introducing external knowledge through remote supervision, automatic labeling of the target data set is achieved, thereby supporting upper-level intelligent analysis applications for specific services, such as semantic merging in the spatio-temporal dimension of emergency events, 12345 multi-person complaint analysis, one-person multi-complaint analysis, and other analysis applications based on the semantic matching algorithm model.
The invention also provides a generation system of a semantic matching model based on remote supervision, which applies the generation method and comprises the following steps:
the data enhancement module 1 is used for acquiring a plurality of original text data of a preset application scene, and performing data enhancement on each original text data to obtain a plurality of enhanced text data, wherein each enhanced text data is associated with a standard semantic category label corresponding to the preset application scene;
the weight fine-tuning module 2 is connected with the data enhancement module 1 and is used for carrying out weight fine-tuning on the pre-trained language model according to each enhanced text data and the associated standard semantic category label to obtain a fine-tuned model;
the sample construction module 3 is respectively connected with the data enhancement module 1 and the weight fine-tuning module 2, and is used for embedding sentences of each original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and automatically labeling each semantic vector to obtain a plurality of semantic matching vector pairs and a plurality of semantic mismatching vector pairs;
and the model training module 4 is connected with the sample construction module 3 and used for training according to each semantic matching vector pair and each semantic mismatching vector pair to obtain a semantic matching model.
In a preferred embodiment of the present invention, the sample construction module 3 comprises:
the semantic vector generating unit 31 is configured to perform sentence embedding on each original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and add each semantic vector to a vector set;
the semantic vector clustering unit 32 is connected with the semantic vector generating unit 31, and is used for performing text clustering on each semantic vector in the vector set to obtain the centroids of a plurality of clustering clusters formed after text clustering, and for calculating, for each clustering cluster, the distance between the semantic vector serving as the centroid and each other semantic vector; the corresponding semantic vector is added to the fault set when the distance is within a preset threshold range, added to the trusted set when the distance is not greater than the lower limit value of the threshold range, and added to the untrusted set when the distance is not within the threshold range and is greater than the lower limit value;
the judging unit 33 is connected with the semantic vector clustering unit 32, and is used for automatically labeling, when the trusted set is not an empty set, the semantic vectors in the trusted set within the same clustering cluster as a plurality of semantic matching vector pairs, and for automatically labeling each semantic vector in the trusted set paired with each semantic vector in the untrusted set of the same clustering cluster, as well as with each semantic vector in other clustering clusters, as a plurality of semantic mismatching vector pairs;
the iteration unit 34 is connected with the judging unit 33 and is used for removing the semantic vectors in each trusted set from the vector set;
the semantic vector clustering unit 32 iteratively performs text clustering on the removed vector set until the judging unit 33 judges that the trusted set is an empty set.
In a preferred embodiment of the present invention, the semantic matching model includes a Bert model; an anti-perturbation layer is added to the embedding layer of the Bert model, the output of the Bert model is connected to a softmax layer, the input of the Bert model serves as the input of the semantic matching model, and the output of the softmax layer serves as the output of the semantic matching model.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A semantic matching model generation method based on remote supervision is characterized by comprising the following steps:
step S1, acquiring a plurality of original text data of a preset application scene, and performing data enhancement on each original text data to obtain a plurality of enhanced text data, wherein each enhanced text data is associated with a standard semantic category label corresponding to the preset application scene;
step S2, carrying out weight fine-tuning on a pre-trained language model according to each enhanced text data and the associated standard semantic category label to obtain a fine-tuned model;
step S3, performing sentence embedding on each original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and automatically labeling each semantic vector to obtain a plurality of semantic matching vector pairs and a plurality of semantic mismatching vector pairs;
step S4, training according to each semantic matching vector pair and each semantic mismatching vector pair to obtain the semantic matching model.
2. The generation method according to claim 1, wherein the step S1 includes:
step S11, acquiring each original text data from each service system associated with the preset application scene, wherein each original text data is associated with an original semantic category label defined in the corresponding service system;
step S12, correspondingly standardizing each original semantic category label into the standard semantic category label in the preset application scene, and performing data fusion on each original text data with the same standard semantic category label;
step S13, performing category scoring on each standard semantic category label according to each original text data after data fusion to obtain a corresponding score value, and judging whether the score value is greater than a preset score threshold value:
if yes, adding each original text data associated with the corresponding standard semantic category label into a high-quality data set, and then turning to step S14;
if not, adding each original text data associated with the corresponding standard semantic category label into a low-quality data set, and then turning to the step S15;
step S14, performing word extraction on each original text data in the high-quality data set, and constructing a semantic dictionary based on each semantic representative word obtained by the word extraction;
step S15, configuring the standard semantic category label for each original text data in the low-quality data set according to the semantic dictionary;
step S16, performing data enhancement on each original text data associated with each standard semantic category label to obtain each enhanced text data.
3. The generation method according to claim 2, wherein in step S13, the category scoring is performed according to the ratio of the data length and the data amount of the original text data under each standard semantic category label to those of all the original text data, to obtain the corresponding score value.
4. The generation method according to claim 2, wherein the step S14 includes:
step S141, extracting keywords and subject words from each original text data in the high-quality data set;
and S142, performing duplication elimination processing on the extracted keywords and the extracted subject words, and screening the keywords and the subject words after the duplication elimination processing to obtain each semantic representative word.
5. The generation method according to claim 2, wherein in step S16, the data enhancement mode includes:
performing non-keyword replacement on each original text data associated with each standard semantic category label to take each original text data and new text data formed by performing non-keyword replacement as the enhanced text data; and/or
performing selective random sampling on the original text data associated with each standard semantic category label, and taking the original text data obtained by the random sampling as the enhanced text data associated with the standard semantic category label.
6. The generation method according to claim 1, wherein the step S3 includes:
step S31, embedding sentences of the original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and adding the semantic vectors into a vector set;
step S32, performing text clustering on each semantic vector in the vector set to obtain a centroid of a plurality of clustering clusters formed after text clustering, respectively calculating the distance between the semantic vector serving as the centroid and other semantic vectors for each clustering cluster, and judging whether the distance is within a preset threshold range:
if yes, adding the corresponding semantic vector into a fault set, and then turning to the step S34;
if not, go to step S33;
step S33, determining whether the distance is not greater than the lower limit of the threshold range:
if yes, adding the corresponding semantic vector into a trusted set, and then turning to the step S34;
if not, adding the corresponding semantic vector into an untrusted set, and then turning to step S34;
step S34, judging whether the trusted set is an empty set:
if yes, quitting;
if not, automatically labeling the semantic vectors in the trusted set within the same clustering cluster as a plurality of semantic matching vector pairs, and automatically labeling each semantic vector in the trusted set paired with each semantic vector in the untrusted set of the same clustering cluster, as well as with each semantic vector in other clustering clusters, as a plurality of semantic mismatching vector pairs;
step S35, removing the semantic vectors in each trusted set from the vector set, and then returning to step S32.
7. The generation method according to claim 1, wherein the semantic matching model comprises a Bert model, wherein an anti-perturbation layer is added to the embedding layer of the Bert model, wherein an output of the Bert model is connected to a softmax layer, wherein an input of the Bert model is used as an input of the semantic matching model, and wherein an output of the softmax layer is used as an output of the semantic matching model.
8. A generation system of a semantic matching model based on remote supervision, characterized in that the generation method of any one of claims 1-7 is applied, the generation system comprising:
the data enhancement module is used for acquiring a plurality of original text data of a preset application scene, and performing data enhancement on the original text data to obtain a plurality of enhanced text data, wherein each enhanced text data is associated with a standard semantic category label corresponding to the preset application scene;
the weight fine-tuning module is connected with the data enhancement module and used for carrying out weight fine-tuning on a pre-trained language model according to each enhanced text data and the associated standard semantic category label to obtain a fine-tuned model;
the sample construction module is respectively connected with the data enhancement module and the weight fine tuning module and is used for embedding sentences of the original text data according to the fine-tuned model to obtain a plurality of semantic vectors and automatically labeling the semantic vectors to obtain a plurality of semantic matching vector pairs and a plurality of semantic mismatching vector pairs;
and the model training module is connected with the sample construction module and used for training according to each semantic matching vector pair and each semantic mismatching vector pair to obtain a semantic matching model.
9. The generation system of claim 8, wherein the sample construction module comprises:
a semantic vector generating unit, configured to perform sentence embedding on each original text data according to the fine-tuned model to obtain a plurality of semantic vectors, and add each semantic vector to a vector set;
a semantic vector clustering unit connected to the semantic vector generating unit, configured to perform text clustering on each semantic vector in the vector set to obtain the centroids of a plurality of clustering clusters formed after text clustering, and to calculate, for each clustering cluster, the distance between the semantic vector serving as the centroid and each other semantic vector; the corresponding semantic vector is added to a fault set when the distance is within a preset threshold range, added to a trusted set when the distance is not greater than a lower limit of the threshold range, and added to an untrusted set when the distance is not within the threshold range and is greater than the lower limit;
a judging unit, connected to the semantic vector clustering unit, configured to automatically label, when the trusted set is not an empty set, the semantic vectors in the trusted set within the same clustering cluster as a plurality of semantic matching vector pairs, and to automatically label each semantic vector in the trusted set paired with each semantic vector in the untrusted set of the same clustering cluster, and with each semantic vector in other clustering clusters, as a plurality of semantic mismatching vector pairs;
the iteration unit is connected with the judging unit and is used for removing the semantic vectors in each trusted set from the vector set;
and the semantic vector clustering unit iteratively performs text clustering on the removed vector set until the judging unit judges that the trusted set is an empty set.
10. The generation system of claim 8, wherein the semantic matching model comprises a Bert model, wherein an anti-perturbation layer is added to the embedding layer of the Bert model, wherein an output of the Bert model is connected to a softmax layer, wherein an input of the Bert model is used as an input of the semantic matching model, and wherein an output of the softmax layer is used as an output of the semantic matching model.
CN202211166854.8A 2022-09-23 2022-09-23 Semantic matching model generation method and system based on remote supervision Pending CN115563512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211166854.8A CN115563512A (en) 2022-09-23 2022-09-23 Semantic matching model generation method and system based on remote supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211166854.8A CN115563512A (en) 2022-09-23 2022-09-23 Semantic matching model generation method and system based on remote supervision

Publications (1)

Publication Number Publication Date
CN115563512A true CN115563512A (en) 2023-01-03

Family

ID=84741099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211166854.8A Pending CN115563512A (en) 2022-09-23 2022-09-23 Semantic matching model generation method and system based on remote supervision

Country Status (1)

Country Link
CN (1) CN115563512A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117350302A (en) * 2023-11-04 2024-01-05 湖北为华教育科技集团有限公司 Semantic analysis-based language writing text error correction method, system and man-machine interaction device
CN117350302B (en) * 2023-11-04 2024-04-02 湖北为华教育科技集团有限公司 Semantic analysis-based language writing text error correction method, system and man-machine interaction device
CN117555644A (en) * 2024-01-11 2024-02-13 之江实验室 Front-end page construction method and device based on natural language interaction
CN117555644B (en) * 2024-01-11 2024-04-30 之江实验室 Front-end page construction method and device based on natural language interaction

Similar Documents

Publication Publication Date Title
US20220405592A1 (en) Multi-feature log anomaly detection method and system based on log full semantics
US20230031738A1 (en) Taxpayer industry classification method based on label-noise learning
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN115563512A (en) Semantic matching model generation method and system based on remote supervision
CN111291195A (en) Data processing method, device, terminal and readable storage medium
Chang et al. An unsupervised iterative method for Chinese new lexicon extraction
KR20180120488A (en) Classification and prediction method of customer complaints using text mining techniques
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
Zhang et al. Video-aided unsupervised grammar induction
CN113468317B (en) Resume screening method, system, equipment and storage medium
CN112967144B (en) Financial credit risk event extraction method, readable storage medium and device
CN112199496A (en) Power grid equipment defect text classification method based on multi-head attention mechanism and RCNN (Rich coupled neural network)
CN114896305A (en) Smart internet security platform based on big data technology
CN116089873A (en) Model training method, data classification and classification method, device, equipment and medium
CN110008699A (en) A kind of software vulnerability detection method neural network based and device
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN114153978A (en) Model training method, information extraction method, device, equipment and storage medium
CN110704638A (en) Clustering algorithm-based electric power text dictionary construction method
CN110728139A (en) Key information extraction model and construction method thereof
CN114997169A (en) Entity word recognition method and device, electronic equipment and readable storage medium
Chandra et al. Aviation-BERT: A preliminary aviation-specific natural language model
Chen et al. An effective crowdsourced test report clustering model based on sentence embedding
CN115455416A (en) Malicious code detection method and device, electronic equipment and storage medium
Mishra et al. Explainability for NLP
CN112133308A (en) Method and device for multi-label classification of voice recognition text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination