CN113780471A

CN113780471A - Data classification model updating and application method, device, storage medium and product

Info

Publication number: CN113780471A
Application number: CN202111144796.4A
Authority: CN
Inventors: 尹泽夏; 王小波
Original assignee: Jingdong City Beijing Digital Technology Co Ltd
Current assignee: Jingdong City Beijing Digital Technology Co Ltd
Priority date: 2021-09-28
Filing date: 2021-09-28
Publication date: 2021-12-10

Abstract

Embodiments of the invention provide a method, device, storage medium and product for updating and applying a data classification model. A first labeled sample set and a first unlabeled sample set are obtained from a municipal administration event data pool, and preset data enhancement processing is performed on each to obtain a corresponding second labeled sample set and second unlabeled sample set; label prediction results for the unlabeled samples in the second unlabeled sample set are obtained according to an initial data classification model; mixed sample data enhancement processing is performed on the second labeled sample set and the second unlabeled sample set after label prediction; and the parameters of the initial data classification model are updated according to the second labeled sample set and the second unlabeled sample set after the mixed sample data enhancement processing, to obtain a target data classification model. By adopting a semi-supervised learning method and increasing sample diversity through data enhancement and mixing, the robustness, stability and prediction performance of the model are improved and costs are reduced.

Description

Data classification model updating and application method, device, storage medium and product

Technical Field

The embodiment of the invention relates to the field of artificial intelligence, in particular to a method, equipment, a storage medium and a product for updating and applying a data classification model.

Background

At present, municipal administration events can be classified with a data classification model so that each event is distributed to the corresponding department for processing. However, the distribution of municipal administration event data may change over time, for example when the environment changes because of a sudden epidemic or a natural disaster. To maintain the prediction performance of the classification model, the model needs to be updated at regular intervals.

In the prior art, in order to ensure the predictive capability of a data classification model, the model is generally required to be updated (fine-tuned) at intervals by using newly collected data, and common methods include retraining, online training/updating, incremental learning and the like. However, the above-mentioned methods all require the acquisition of labeled data to perform the training and updating of the model.

In practical applications, labeled data is often very difficult to obtain: either the online system must provide a timely feedback mechanism, or the data must be labeled manually, both of which incur substantial additional economic and time costs. If the latest labeled data cannot be obtained in time, the prediction capability of the data classification model may degrade; even if labeled data is obtained, when its volume is insufficient, existing model updating methods bring only limited improvement in prediction capability and may even degrade the model.

Disclosure of Invention

The embodiment of the invention provides a method, equipment, a storage medium and a product for updating and applying a data classification model, which are used for improving the model effect and reducing the model updating cost.

In a first aspect, an embodiment of the present invention provides an updating method for a data classification model, including:

acquiring a first labeled sample set and a first unlabeled sample set from a municipal administration event data pool, and respectively performing preset data enhancement processing on the first labeled sample set and the first unlabeled sample set to obtain a corresponding second labeled sample set and a second unlabeled sample set;

classifying each unlabeled sample in the second unlabeled sample set according to the initial data classification model to obtain a corresponding label prediction result;

performing mixed sample data enhancement processing on the second labeled sample set and the second unlabeled sample set after label prediction;

and updating parameters of the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data to obtain a target data classification model for classifying the municipal administration event data.

In a second aspect, an embodiment of the present invention provides a classification method for urban area management event data, where a data classification model obtained by applying the update method for the data classification model according to the first aspect is applied; the method comprises the following steps:

acquiring urban area management event data to be classified, wherein the urban area management event data comprises event content data and event occurrence place data;

inputting the municipal administration event data into the data classification model to obtain a corresponding classification result;

and sending the municipal administration event data to user equipment corresponding to the classification result according to the classification result.

In a third aspect, an embodiment of the present invention provides an update apparatus for a data classification model, including:

an acquisition module, configured to acquire a first labeled sample set and a first unlabeled sample set from a municipal administration event data pool;

the data enhancement module is used for respectively carrying out preset data enhancement processing on the first labeled sample set and the first unlabeled sample set to obtain a corresponding second labeled sample set and a second unlabeled sample set;

the label prediction module is used for classifying each unlabeled sample in the second unlabeled sample set according to the initial data classification model to obtain a corresponding label prediction result;

the mixing module is used for performing mixed sample data enhancement processing on the second labeled sample set and the second unlabeled sample set after label prediction;

and the model updating module is used for updating parameters of the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data to obtain a target data classification model for classifying the municipal administration event data.

In a fourth aspect, an embodiment of the present invention provides a classification device for urban area management event data, where a data classification model obtained by using the data classification model updating method according to the first aspect is applied; the apparatus comprises:

an acquisition module, configured to acquire urban area management event data to be classified, wherein the urban area management event data comprises event content data and event occurrence place data;

the classification module is used for inputting the municipal administration event data into the data classification model to obtain a corresponding classification result;

and the sending module is used for sending the urban area management event data to the user equipment corresponding to the classification result according to the classification result.

In a fifth aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory;

the memory stores computer-executable instructions;

the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of the first aspect.

In a sixth aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory;

the memory stores computer-executable instructions;

the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of the second aspect.

In a seventh aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the method according to the first aspect or the second aspect is implemented.

In an eighth aspect, an embodiment of the present invention provides a computer program product, which includes computer executable instructions, and when executed by a processor, the computer executable instructions implement the method according to the first aspect or the second aspect.

According to the method, the device, the storage medium and the product for updating and applying the data classification model, a first labeled sample set and a first unlabeled sample set are obtained from a municipal administration event data pool, and preset data enhancement processing is respectively carried out on the first labeled sample set and the first unlabeled sample set to obtain a corresponding second labeled sample set and a corresponding second unlabeled sample set; classifying each unlabeled sample in the second unlabeled sample set according to the initial data classification model to obtain a corresponding label prediction result; performing mixed sample data enhancement processing on the second labeled sample set and the second unlabeled sample set after label prediction; and updating parameters of the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data to obtain a target data classification model for classifying the municipal administration event data. According to the embodiment, a semi-supervised learning method is adopted under the condition that labeled samples are few, the diversity of the samples is increased through data enhancement and mixing, the labeled samples and the unlabeled samples are fully utilized, the robustness and stability of the model are improved, the model prediction effect is improved, meanwhile, the unlabeled samples are collected more easily, and the economic cost, the labor cost and the time cost are saved in the process of updating the data classification model.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a schematic view of an application scenario according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for updating a data classification model according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for updating a data classification model according to another embodiment of the present invention;

FIG. 4 is a flowchart of a method for updating a data classification model according to another embodiment of the present invention;

FIG. 5 is a flowchart of a classification method for municipal administration event data according to an embodiment of the present invention;

FIG. 6 is a block diagram of an update apparatus for a data classification model according to an embodiment of the present invention;

FIG. 7 is a block diagram of a classification device for municipal administration event data according to an embodiment of the present invention;

FIG. 8 is a block diagram of an electronic device according to an embodiment of the present invention.

With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

First, terms related to embodiments of the present invention are explained:

Supervised learning: a machine learning method that learns or builds a pattern (a function) from training data and infers new instances according to that pattern. The training data consists of input objects and expected outputs. The output of the function may be a continuous value or a predicted classification label. The task of a supervised learner is, after observing a number of labeled training examples, to predict the output of this function for any input that may occur.

Unsupervised learning: a machine learning method that automatically classifies or groups input data without being given previously labeled training examples. Its main applications include cluster analysis, association rule learning and dimensionality reduction. It is an alternative to strategies such as supervised learning and reinforcement learning.

Semi-supervised learning: semi-supervised learning is a machine learning method that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning is between unsupervised learning and supervised learning, and when unlabeled data is combined with a small amount of labeled data for use, the learning accuracy can be obviously improved. The collection of labeled data for learning problems typically requires skilled personnel or physical experimentation.

Data enhancement: a technique for artificially expanding the training data set by letting limited data yield more equivalent data. It is an effective means of overcoming insufficient training data and is widely applied in various fields of deep learning. However, because the generated data differs from real data, it also inevitably introduces noise.

Word vector: also known as word embedding, a collective term in natural language processing (NLP) for a set of language modeling and feature learning techniques in which words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of lower dimension. Methods of generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge base methods, and explicit representation in terms of the contexts in which words appear. Word and phrase embeddings, when used as the underlying input representation, have been shown to improve the performance of NLP tasks such as parsing and sentiment analysis.

TF-IDF: a common weighting technique in information retrieval and data mining. TF stands for term frequency and IDF for inverse document frequency. In a given document, the term frequency (TF) is the frequency with which a given word appears in that document. The inverse document frequency (IDF) measures the general importance of a term; it is obtained by dividing the total number of documents by the number of documents containing the term and taking the base-10 logarithm of the quotient. The TF-IDF value is the product of TF and IDF.
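The TF-IDF definition above can be illustrated with a minimal sketch. The toy corpus, the whitespace-style tokenization and the function names `tf`, `idf` and `tf_idf` are assumptions made for illustration and do not appear in the original text.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log10(total docs / docs containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log10(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF is the product of TF and IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus of tokenized event descriptions.
corpus = [
    ["street", "lamp", "broken"],
    ["garbage", "not", "collected"],
    ["street", "flooded", "after", "rain"],
    ["illegal", "parking", "on", "street"],
]
score = tf_idf("lamp", corpus[0], corpus)
```

In this toy corpus "lamp" occurs in one document out of four while "street" occurs in three, so "lamp" receives the higher weight in the first document.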

In the prior art, in order to ensure the predictive capability of a data classification model, the model is generally required to be updated (fine-tuned) at intervals by using newly collected data, and common methods include retraining, online training/updating, incremental learning and the like.

Retraining: the historical training data and the newly received data are combined into a new training data set, and a new model is trained to replace the old online model.

Online training/updating: the online model is updated with the real-time data stream received online and can be adjusted quickly according to online feedback data, so that the model reflects online changes in time and online prediction accuracy improves; each sample is generally required to be used only once.

Incremental learning: in the field of machine learning, incremental learning aims to solve a common defect of model training, namely catastrophic forgetting: when a general machine learning model (especially a deep learning method based on back propagation) is trained on a new task, its performance on the old task usually drops significantly. With incremental learning, the model retains most of the previously learned knowledge while learning new knowledge, that is, it performs well on both the old task and the new task.

However, all of the above methods require labeled data to train and update the model. In practical applications, labeled data is often very difficult to obtain: either the online system must provide a timely feedback mechanism, or the data must be labeled manually, both of which incur substantial additional economic and time costs. If the latest labeled data cannot be obtained in time, the prediction capability of the data classification model may degrade; even if labeled data is obtained, when its volume is insufficient, existing model updating methods bring only limited improvement in prediction capability and may even degrade the model.

To solve the above technical problems, the embodiments of the invention apply a semi-supervised learning method that makes full use of a large amount of unlabeled data to update the online data classification model automatically when no new labeled data, or only a small amount of it, is available, thereby maintaining the performance of the online data classification model.

Specifically, in the embodiment of the present invention, a first labeled sample set and a first unlabeled sample set are obtained from a municipal administration event data pool, and preset data enhancement processing is performed on the first labeled sample set and the first unlabeled sample set, so as to obtain a second labeled sample set and a second unlabeled sample set corresponding to the first labeled sample set and the first unlabeled sample set; classifying each unlabeled sample in the second unlabeled sample set according to the initial data classification model to obtain a corresponding label prediction result; performing mixed sample data enhancement processing on the second labeled sample set and the second unlabeled sample set after label prediction; and updating parameters of the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data to obtain a target data classification model for classifying the municipal administration event data.

The target data classification model can be used for classifying the municipal administration event data, and specifically can acquire the municipal administration event data to be classified, wherein the municipal administration event data comprises event content data and event occurrence place data; inputting the municipal administration event data into the data classification model to obtain a corresponding classification result; and sending the municipal administration event data to user equipment corresponding to the classification result according to the classification result.
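The classify-then-send flow described above might look like the following sketch. The `model` callable (returning a probability vector), the `departments` list and the `send` callback are hypothetical stand-ins introduced for illustration; the original text does not specify these interfaces.

```python
def classify_and_dispatch(event, model, departments, send):
    """Classify one event and forward it to the matching department's device.

    event:       dict with 'content' and 'location' text fields (assumed layout)
    model:       callable (content, location) -> list of class probabilities
    departments: list mapping class index -> user equipment identifier
    send:        callable (device, event) that delivers the event
    """
    probs = model(event["content"], event["location"])
    # The classification result is taken here as the most probable class.
    label = max(range(len(probs)), key=probs.__getitem__)
    send(departments[label], event)
    return label
```

Taking the most probable class as the classification result is one simple choice; a deployment could equally apply a confidence threshold before dispatching.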

The embodiments of the present invention take the classification of municipal administration event data only as an example. As shown in FIG. 1, a specific application scenario may include a database 101 and a server 102. The database 101 holds a data pool for storing the municipal administration event data, which includes a labeled data pool and an unlabeled data pool. The server 102 may obtain a first labeled sample set and a first unlabeled sample set from the municipal administration event data pool of the database 101 and then perform the subsequent steps of the data classification model updating method according to the embodiment of the present invention.

Of course, models in other application scenarios may also be updated by the model updating method according to the embodiment of the present invention, for example, language recognition, image recognition and understanding, computer vision, real-time language translation, enterprise management, market analysis, decision optimization, material allocation and transportation, and the like, and details are not repeated here.

The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.

Fig. 2 is a flowchart of an updating method of a data classification model according to an embodiment of the present invention. The embodiment provides an updating method of a data classification model, the execution subject of the updating method is an electronic device such as a server, a desktop computer, a notebook computer, a smart phone, a tablet computer and the like, and the updating method of the data classification model specifically comprises the following steps:

S201, acquiring a first labeled sample set and a first unlabeled sample set from a municipal administration event data pool, and respectively performing preset data enhancement processing on the first labeled sample set and the first unlabeled sample set to obtain a corresponding second labeled sample set and a corresponding second unlabeled sample set.

In this embodiment, the data pool stores historical data and may be divided into a labeled data pool and an unlabeled data pool, which store labeled data and unlabeled data, respectively. The labeled data in the labeled data pool may be obtained through a user feedback mechanism, that is, by manually labeling part of the data in the unlabeled data pool; once unlabeled data has been labeled, it leaves the unlabeled data pool and enters the labeled data pool.
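The two-pool arrangement can be pictured with a small in-memory sketch. The class name `EventDataPool`, its method names and the use of dictionaries keyed by an event identifier are all assumptions for illustration; the original text does not prescribe a storage layout.

```python
class EventDataPool:
    """Toy data pool with separate labeled and unlabeled partitions."""

    def __init__(self):
        self.unlabeled = {}  # event_id -> text
        self.labeled = {}    # event_id -> (text, label)

    def add(self, event_id, text, label=None):
        """Store new online data in the pool that matches its label status."""
        if label is None:
            self.unlabeled[event_id] = text
        else:
            self.labeled[event_id] = (text, label)

    def annotate(self, event_id, label):
        """Label an unlabeled item: it leaves one pool and enters the other."""
        text = self.unlabeled.pop(event_id)  # raises KeyError if not unlabeled
        self.labeled[event_id] = (text, label)
```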

For the scenario of classifying municipal administration event data, the data may include event content data and event occurrence location data, and the events need to be classified so that they can be distributed to the user equipment of the corresponding departments. Generally, only a very small number of municipal administration events are handled (distributed to the corresponding departments) quickly, and the data of these events is labeled data (the label being the category). Most other events take some time to handle, so a large amount of unlabeled data is generated online, for which only the text description of the event content and the event location is available.

In this embodiment, online data of municipal administration events is collected; labeled online data is stored in the labeled data pool and unlabeled online data in the unlabeled data pool. Data is then drawn from the labeled data pool and the unlabeled data pool to obtain the first labeled sample set and the first unlabeled sample set, respectively.

After the first labeled sample set and the first unlabeled sample set are obtained, preset data enhancement processing may be performed on each of them so that the limited data yields more equivalent data, which expands the training samples, increases the utilization of the labeled and unlabeled samples, and improves the model. The preset data enhancement processing includes, but is not limited to, word replacement, back translation, random noise injection, and the like.

Optionally, in this embodiment, a labeled sample keeps its original label after data enhancement. For unlabeled samples, n (n ≥ 1) rounds of data enhancement are performed on each sample to obtain n different enhanced samples, which share the same label; continually correcting the deviation between the model's predictions for different enhanced samples derived from the same sample can significantly improve the prediction capability of the model. In this embodiment, data enhancement is performed only once on each labeled sample, and the enhanced samples form the second labeled sample set, whose size equals that of the first labeled sample set. Data enhancement is performed n (n ≥ 1) times on each unlabeled sample, so that each unlabeled sample yields n enhanced samples, which form the second unlabeled sample set; its size is n times that of the first unlabeled sample set. Alternatively, the first unlabeled sample set may also be added to the second unlabeled sample set, in which case the size of the second unlabeled sample set is n + 1 times that of the first unlabeled sample set.
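The set sizes described above can be checked with a short sketch. The `augment` function below is a deliberately simple stand-in (random token dropping) for the word replacement, back translation or noise injection named in the text, and all function and parameter names are assumptions for illustration.

```python
import random


def augment(tokens, rng, drop_prob=0.1):
    """Toy augmentation (an assumption): randomly drop tokens, keep at least one."""
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept if kept else [rng.choice(tokens)]


def build_second_sets(labeled, unlabeled, n=2, keep_original=False, seed=0):
    """Build the second labeled and second unlabeled sample sets.

    labeled:   list of (tokens, label) pairs
    unlabeled: list of token lists
    Each entry of the second unlabeled set carries the index of its source
    sample so predictions for the same source can be averaged later.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    rng = random.Random(seed)
    # Labeled samples: one augmentation each, label unchanged.
    second_labeled = [(augment(x, rng), y) for x, y in labeled]
    second_unlabeled = []
    for idx, u in enumerate(unlabeled):
        if keep_original:
            second_unlabeled.append((idx, list(u)))  # gives n + 1 per source
        for _ in range(n):
            second_unlabeled.append((idx, augment(u, rng)))
    return second_labeled, second_unlabeled
```

With `keep_original=False` the second unlabeled set is n times the size of the first; with `keep_original=True` it is n + 1 times the size.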

S202, classifying the unlabeled samples in the second unlabeled sample set according to the initial data classification model to obtain corresponding label prediction results.

In this embodiment, each unlabeled sample in the second unlabeled sample set may be classified according to the initial data classification model, so that every unlabeled sample has a corresponding label. Optionally, the initial data classification model may be trained on the first labeled sample set; optionally, it may instead be the data classification model currently deployed online.

It should be noted that, if the preset data enhancement processing is performed on the first unlabeled sample set multiple times in S201, each unlabeled sample yields n corresponding enhanced samples, and by the principle of data enhancement these n enhanced samples should have the same label as the original unlabeled sample. Therefore, for the enhanced samples obtained from the same unlabeled sample, a label prediction result is obtained for each through the initial data classification model, the results are averaged, and the average is taken as the label prediction result of all of these enhanced samples. Each label prediction result is a probability vector, that is, the probabilities corresponding to each class of label, so the average is computed directly on the probability vectors, that is, as the per-class mean probability.

Optionally, label prediction may be performed on each unlabeled sample together with its n enhanced samples, n + 1 unlabeled samples in total; the n + 1 prediction results are averaged, and the average is taken as the label prediction result of all n + 1 samples.
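The averaging of label prediction results can be sketched as follows. The `predict` callable stands in for the initial data classification model and is assumed to return a probability vector; the `(source_index, sample)` pairing and the function names are likewise illustrative assumptions.

```python
def average_predictions(prob_vectors):
    """Element-wise mean of probability vectors of equal length."""
    k = len(prob_vectors[0])
    if any(len(p) != k for p in prob_vectors):
        raise ValueError("all probability vectors must have the same length")
    m = len(prob_vectors)
    return [sum(p[c] for p in prob_vectors) / m for c in range(k)]


def guess_labels(augmented, predict):
    """Average the model's predictions over samples sharing a source.

    augmented: list of (source_index, sample) pairs
    predict:   callable sample -> list of class probabilities
    Returns {source_index: averaged probability vector}; the averaged vector
    is then used as the label of every sample derived from that source.
    """
    grouped = {}
    for idx, sample in augmented:
        grouped.setdefault(idx, []).append(predict(sample))
    return {idx: average_predictions(ps) for idx, ps in grouped.items()}
```

Because each input vector sums to 1, the per-class mean also sums to 1 and remains a valid probability vector.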

S203, performing mixed sample data enhancement processing on the second labeled sample set and the second unlabeled sample set after label prediction.

In this embodiment, mixed sample data augmentation (MSDA) is used because it is simple to implement and improves performance in practice; it is widely applied in fields such as image recognition, GANs and semi-supervised learning. A representative MSDA algorithm is Mixup, which realizes data augmentation by randomly mixing two training samples and their labels in a certain proportion. This kind of mixing increases sample diversity, smooths the decision boundaries between classes, reduces misclassification of some samples, improves model robustness, and makes training more stable.

Optionally, as shown in fig. 3, in step S203, performing enhancement processing on the mixed sample data on the second labeled sample set and the second unlabeled sample set after label prediction may specifically include:

S2031, mixing the second labeled sample set and the second unlabeled sample set after label prediction to obtain a data total set;

S2032, randomly sampling the data total set to obtain a third sample set with the same number of samples as the second labeled sample set, and determining the remaining samples in the data total set as a fourth sample set;

S2033, performing mixed sample data enhancement processing on the samples in the second labeled sample set and the third sample set by using a Mixup algorithm to obtain a second labeled sample set after the mixed sample data enhancement processing;

S2034, performing mixed sample data enhancement processing on the samples in the second unlabeled sample set after label prediction and the fourth sample set by using a Mixup algorithm to obtain a second unlabeled sample set after the mixed sample data enhancement processing.

In this embodiment, the second labeled sample set X and the second unlabeled sample set U after label prediction are mixed to obtain the data total set W. A third sample set W₁ with the same number of samples as the second labeled sample set X is then obtained from W by random sampling, and the remaining samples in W form a fourth sample set W₂ with the same number of samples as the second unlabeled sample set U.

Mixup is applied sample by sample to the second labeled sample set X and the third sample set W₁ to obtain the second labeled sample set X′ after mixed sample data enhancement processing; Mixup is likewise applied sample by sample to the second unlabeled sample set U and the fourth sample set W₂ to obtain the second unlabeled sample set U′ after mixed sample data enhancement processing. It should be noted that this embodiment does not limit the order of S2033 and S2034.
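Steps S2031 and S2032 amount to pooling X and U and then partitioning the pool at random into two parts whose sizes match X and U. A minimal sketch follows, assuming samples are held in Python lists; shuffling the pool and slicing it is one way to sample without replacement, and the function name `split_total_set` is illustrative.

```python
import random


def split_total_set(X, U, seed=0):
    """Mix X and U into W, then split W into W1 (|X| items) and W2 (|U| items)."""
    rng = random.Random(seed)
    W = list(X) + list(U)  # S2031: the data total set
    rng.shuffle(W)         # a random permutation yields a random sample
    W1 = W[:len(X)]        # S2032: sample set paired with X
    W2 = W[len(X):]        # S2032: remaining samples, paired with U
    return W1, W2
```

Each sample in X is then mixed with the sample at the same position in W1, and each sample in U with the sample at the same position in W2.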

The specific process of performing mixed sample data enhancement processing with the Mixup algorithm is as follows:

acquiring an initial mixing weight λ from a Beta distribution; determining the maximum of λ and 1-λ as the optimized mixing weight λ′; and computing the weighted sums of the word vectors and of the probability vectors of the two samples to be mixed, with weights λ′ and 1-λ′, as the word vector and the probability vector of the sample after mixed sample data enhancement processing.

In this embodiment, for two samples (x₁, p₁) and (x₂, p₂) whose label probabilities are known, the sample representation and label probability (x′, p′) after mixed sample data enhancement processing can be obtained through the following formulas:

λ ~ Beta(α, α)

λ′ = max(λ, 1-λ)

x′ = λ′x₁ + (1-λ′)x₂

p′ = λ′p₁ + (1-λ′)p₂

where (x₁, p₁) is a sample of the second labeled sample set X and (x₂, p₂) is a sample of the third sample set W₁; or (x₁, p₁) is a sample of the second unlabeled sample set U and (x₂, p₂) is a sample of the fourth sample set W₂; x is the vector representation of the words or sentence, and p is the corresponding label probability vector. Both parameters of the Beta distribution are set to the same value α, and the optimized mixing weight λ′ takes the maximum of λ and 1-λ, i.e. λ′ ≥ 0.5, so that during mixing (x₁, p₁) contributes at least as large a share as (x₂, p₂).
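A minimal sketch of the Mixup calculation above, using only the Python standard library. The function name `mixup` and the default α = 0.75 are illustrative assumptions; the embodiment does not fix a value for α.

```python
import random

def mixup(sample_a, sample_b, alpha=0.75, rng=random):
    """Mix two (x, p) pairs; sample_a always keeps the larger weight."""
    x1, p1 = sample_a
    x2, p2 = sample_b
    lam = rng.betavariate(alpha, alpha)   # initial weight drawn from Beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)             # optimized weight, always >= 0.5
    x_mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    p_mixed = [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]
    return x_mixed, p_mixed, lam

# Two toy samples with one-hot vectors and one-hot label probabilities.
x_new, p_new, lam = mixup(([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0]))
```

Since both label vectors are probability distributions and the two weights sum to 1, the mixed label vector is again a probability distribution.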

Of course, other mixed sample data enhancement algorithms may also be adopted in the embodiment, which is not limited herein.

S204, updating parameters of the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the mixed sample data enhancement processing, to obtain a target data classification model for classifying the municipal administration event data.

In this embodiment, since the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data both have corresponding labels, the initial data classification model can be updated based on the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data, so as to obtain the final data classification model.

Specifically, as shown in fig. 4, in step S204, performing parameter update on the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data may include:

S2041, acquiring a first loss of the initial data classification model according to the second labeled sample set after the mixed sample data enhancement processing;

S2042, acquiring a second loss of the initial data classification model according to the second unlabeled sample set after the mixed sample data enhancement processing;

S2043, combining the first loss and the second loss to obtain a final loss;

S2044, updating parameters of the initial data classification model according to the final loss.

In this embodiment, a loss is computed for the initial data classification model on the second labeled sample set after the mixed sample data enhancement processing to obtain the first loss, and on the second unlabeled sample set after the mixed sample data enhancement processing to obtain the second loss; this embodiment does not limit the order of S2041 and S2042. The first loss and the second loss are then combined into a final loss, and the parameters of the initial data classification model are updated based on the final loss. If the model has not converged, loss calculation and parameter updating are repeated until the model converges or a preset number of iterations is reached, at which point the model update ends.
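The embodiment does not specify which loss functions are used or how the two losses are combined. The sketch below assumes one common choice for this kind of semi-supervised training: cross-entropy for the first loss, squared error for the second loss, and a weighted sum as the final loss. The names `final_loss`, `predict` and `unlabeled_weight` are hypothetical.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, predicted))

def squared_error(target, predicted):
    """Mean squared difference between two probability vectors."""
    return sum((t - q) ** 2 for t, q in zip(target, predicted)) / len(target)

def final_loss(labeled_batch, unlabeled_batch, predict, unlabeled_weight=1.0):
    """Both batches are lists of (x, p) pairs after Mixup; predict maps x to probabilities."""
    first = sum(cross_entropy(p, predict(x)) for x, p in labeled_batch) / len(labeled_batch)
    second = sum(squared_error(p, predict(x)) for x, p in unlabeled_batch) / len(unlabeled_batch)
    return first + unlabeled_weight * second, first, second

def uniform(x):
    """Hypothetical stand-in model: always predicts a uniform two-class distribution."""
    return [0.5, 0.5]

total, first, second = final_loss(
    [([1.0, 0.0], [1.0, 0.0])],   # one mixed labeled sample
    [([0.5, 0.5], [0.5, 0.5])],   # one mixed unlabeled sample
    uniform,
)
```

In a real update, the final loss would be differentiated with respect to the model parameters and the step repeated until convergence or the preset iteration count.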

In the method for updating a data classification model provided by this embodiment, a first labeled sample set and a first unlabeled sample set are obtained from a municipal administration event data pool, and preset data enhancement processing is performed on the first labeled sample set and the first unlabeled sample set respectively to obtain a corresponding second labeled sample set and a corresponding second unlabeled sample set; classifying each unlabeled sample in the second unlabeled sample set according to the initial data classification model to obtain a corresponding label prediction result; performing mixed sample data enhancement processing on the second labeled sample set and the second unlabeled sample set after label prediction; and updating parameters of the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data to obtain a target data classification model for classifying the municipal administration event data. According to the embodiment, a semi-supervised learning method is adopted under the condition that labeled samples are few, the diversity of the samples is increased through data enhancement and mixing, the labeled samples and the unlabeled samples are fully utilized, the robustness and stability of the model are improved, the model prediction effect is improved, meanwhile, the unlabeled samples are collected more easily, and the economic cost, the labor cost and the time cost are saved in the process of updating the data classification model.

Optionally, the labeled data pool and the unlabeled data pool in the above embodiment may update their data periodically on a first-in first-out basis, that is, data generated longer ago is removed and data generated more recently is retained. Specifically, at every predetermined time interval, the oldest data in the data pool spanning a predetermined duration is automatically removed; taking 3 days as an example, every 3 days the oldest 3 days of data in the data pool is automatically removed. This avoids concept drift of the data over time and ensures the timeliness of the data, and therefore the timeliness of the updated model; in particular, municipal administration event data can change with policies and the environment, for example during a sudden epidemic or a natural disaster. Meanwhile, the labeled data pool and the unlabeled data pool need to be updated synchronously, so that the data in both pools is generated in the same time period and follows the same distribution, which ensures the reliability of the model.
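The first-in first-out update can be illustrated with a small sketch. The pool layout (a list of timestamped records) and the function name `evict_oldest` are assumptions made for the example; the 3-day window follows the example in the text.

```python
from datetime import datetime, timedelta

def evict_oldest(pool, window=timedelta(days=3)):
    """Drop the records that fall in the oldest `window` of the pool.

    `pool` is a list of (timestamp, record) pairs.
    """
    if not pool:
        return []
    cutoff = min(ts for ts, _ in pool) + window
    return [(ts, rec) for ts, rec in pool if ts >= cutoff]

start = datetime(2021, 9, 1)
labeled_pool = [(start + timedelta(days=d), f"labeled-{d}") for d in range(10)]
unlabeled_pool = [(start + timedelta(days=d), f"unlabeled-{d}") for d in range(10)]

# Apply the same eviction to both pools so they keep covering the same period.
labeled_pool = evict_oldest(labeled_pool)
unlabeled_pool = evict_oldest(unlabeled_pool)
```

If the two pools could start at different times, a shared cutoff timestamp would be a safer way to keep them synchronized than computing the cutoff per pool as done here.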

Optionally, on the basis of the foregoing embodiment, when the first labeled sample set and the first unlabeled sample set are obtained in S201, online data of the same time period may be collected from the labeled data pool and the unlabeled data pool respectively to obtain the corresponding first labeled sample set and first unlabeled sample set. Because the first labeled sample set and the first unlabeled sample set come from the same time period and follow the same distribution, the reliability of model training is ensured. Optionally, the online data of the same time period is collected from the labeled data pool and the unlabeled data pool by random sampling; or it is collected according to the weight of each piece of online data, where the weight is configured for each piece of online data in advance according to a preset rule. For example, weights may be configured in chronological order so that more recent online data has a greater weight, and the online data may then be collected in descending order of weight.
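Both collection strategies can be sketched in a few lines. The function `sample_period`, its parameters, and the use of plain numbers as timestamps are illustrative assumptions; the weighted branch implements the example rule from the text, in which newer data has a greater weight and is taken first.

```python
import random

def sample_period(pool, start, end, k, weighted=False, seed=None):
    """Pick up to k records whose timestamp lies in [start, end).

    `pool` is a list of (timestamp, record) pairs with numeric timestamps.
    """
    rng = random.Random(seed)
    candidates = [(ts, rec) for ts, rec in pool if start <= ts < end]
    if weighted:
        # Assumed rule: newer data gets a larger weight, so take the newest first.
        candidates.sort(key=lambda item: item[0], reverse=True)
        return candidates[:k]
    return rng.sample(candidates, min(k, len(candidates)))

pool = [(t, f"event-{t}") for t in range(10)]
newest = sample_period(pool, start=2, end=8, k=3, weighted=True)
picked = sample_period(pool, start=2, end=8, k=3, seed=0)
```

Calling the same function with the same `start` and `end` on the labeled pool and on the unlabeled pool yields two sample sets from the same time period.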

Optionally, in the above embodiment, when preset data enhancement processing is performed on the first labeled sample set and the first unlabeled sample set respectively in S201, the preset data enhancement processing includes at least one of the following:

1) Word replacement: words of the sample text are replaced without changing the subject matter of the sentence.

1.1) Dictionary-based replacement: a word is randomly taken from the sentence and replaced with a synonym from a synonym dictionary.

1.2) Word-vector-based replacement: using pre-trained word embeddings such as Word2Vec, GloVe, FastText or Sent2Vec, the word vectors of some words of the sample are replaced with the word vectors of their nearest preset words in the embedding space.

1.3) TF-IDF-based word replacement: words with low TF-IDF scores are considered to carry little information in the NLP field, so words in the sample text whose TF-IDF score is lower than a first preset threshold are replaced with words in a preset dictionary whose TF-IDF score is lower than a second preset threshold.

2) Back translation: the text is paraphrased by machine translation while its meaning is retained, which expands the unlabeled text. The sample text is translated from the current language into a preset language and then back into the current language, and the new text is compared with the original text. If the new text differs from the original text, it is used as a data enhancement of the original text. Back translation may be run through several different languages at the same time to generate more data variants.

3) Random noise injection: noise is added to the text so that the trained model is robust to perturbations. Common approaches include shuffling the sentences of the training text to create an enhanced version, and randomly inserting or deleting words.

Through the preset data enhancement processing, the training samples can be expanded by generating more equivalent data from limited data, which increases the utilization of the labeled samples and the unlabeled samples and improves the model effect.
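Two of the enhancement operations, dictionary-based replacement and random word deletion, might look like the following sketch. The toy dictionary `SYNONYMS`, the function names, and the deletion probability are all hypothetical; unlike item 1.1, the sketch only picks among words that actually have a dictionary entry.

```python
import random

# Hypothetical toy synonym dictionary; a real system would use a full thesaurus.
SYNONYMS = {"broken": ["damaged"], "road": ["street"]}

def replace_with_synonym(words, rng):
    """Replace one randomly chosen word that has an entry in SYNONYMS."""
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if not candidates:
        return list(words)
    i = rng.choice(candidates)
    out = list(words)
    out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_delete(words, rng, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]

rng = random.Random(0)
text = "the road lamp is broken".split()
augmented = replace_with_synonym(text, rng)
noisy = random_delete(text, rng)
```

For labeled samples the original label would be kept unchanged after such an operation, since the subject matter of the sentence is meant to stay the same.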

Fig. 5 is a flowchart of a classification method for municipal administration event data according to an embodiment of the present invention. The execution subject of the classification method is an electronic device such as a server, a desktop computer, a notebook computer, a smartphone or a tablet computer, and may be the same as or different from the execution subject of the above embodiment. The classification method applies the data classification model obtained by the updating method of the data classification model of the above embodiment, and specifically includes the following steps:

S301, acquiring municipal administration event data to be classified, wherein the municipal administration event data includes event content data and event occurrence place data;

S302, inputting the municipal administration event data into the data classification model to obtain a corresponding classification result;

S303, sending the municipal administration event data to the user equipment corresponding to the classification result.

In this embodiment, the municipal administration event data to be classified is classified with the updated data classification model and then distributed, according to the classification result, to the user equipment of the corresponding department, where the user can process it. The classification result obtained from the updated data classification model is more accurate and more timely.
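Steps S301 to S303 amount to a classify-then-route flow, sketched below. The routing table, the keyword-based stand-in classifier, the device names and the `send` callback are all hypothetical placeholders for the trained model and the real transport.

```python
def classify_and_dispatch(event, classify, routes, send):
    """Classify one event and forward it to the device mapped to its class."""
    label = classify(event["content"], event["location"])
    device = routes[label]
    send(device, event)
    return label, device

# Hypothetical routing table from classification result to user equipment.
routes = {"sanitation": "device-sanitation", "traffic": "device-traffic"}

def classify(content, location):
    """Stand-in for the target data classification model."""
    return "sanitation" if "garbage" in content else "traffic"

outbox = []

def send(device, event):
    """Stand-in transport that records what would be sent."""
    outbox.append((device, event))

event = {"content": "garbage piled up at the corner", "location": "Example Road 1"}
label, device = classify_and_dispatch(event, classify, routes, send)
```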

Fig. 6 is a structural diagram of an updating apparatus of a data classification model according to an embodiment of the present invention. The updating device of the data classification model provided in this embodiment may perform the processing flow provided by the embodiment of the updating method of the data classification model. As shown in fig. 6, the updating device 400 of the data classification model includes an acquisition module 401, a data enhancement module 402, a label prediction module 403, a mixing module 404, and a model updating module 405.

The acquisition module 401 is configured to acquire a first labeled sample set and a first unlabeled sample set from a municipal administration event data pool;

a data enhancement module 402, configured to perform preset data enhancement processing on the first labeled sample set and the first unlabeled sample set respectively to obtain a corresponding second labeled sample set and a corresponding second unlabeled sample set;

a label prediction module 403, configured to classify each unlabeled sample in the second unlabeled sample set according to the initial data classification model to obtain a corresponding label prediction result;

a mixing module 404, configured to perform mixed sample data enhancement processing on the second labeled sample set and the second unlabeled sample set after label prediction;

and the model updating module 405 is configured to perform parameter updating on the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data, so as to obtain a target data classification model for classifying the municipal administration event data.

On the basis of any of the above embodiments, when the data enhancement module 402 performs preset data enhancement processing on the first labeled sample set and the first unlabeled sample set respectively to obtain a corresponding second labeled sample set and a second unlabeled sample set, the data enhancement module is configured to:

performing preset data enhancement processing on each labeled sample in the first labeled sample set once, and keeping the label of the labeled sample after the preset data enhancement processing unchanged to obtain a second labeled sample set;

and performing preset data enhancement processing on each unlabeled sample of the first unlabeled sample set one or more times to obtain a second unlabeled sample set.

Based on any of the above embodiments, when the label prediction module 403 classifies each unlabeled sample in the second unlabeled sample set according to the initial data classification model to obtain a corresponding label prediction result, it is configured to:

for the second unlabeled sample set, taking the plurality of unlabeled samples obtained by performing the preset data enhancement processing multiple times on the same unlabeled sample, obtaining a label prediction result for each of them through the initial data classification model, averaging these label prediction results, and determining the average as the label prediction result of each of the plurality of unlabeled samples.
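The averaging of label predictions over several enhanced versions of one unlabeled sample can be sketched as follows. The function names and the dictionary of fake model outputs are assumptions for illustration only.

```python
def average_predictions(predictions):
    """Element-wise mean of several label probability vectors."""
    k = len(predictions)
    return [sum(col) / k for col in zip(*predictions)]

def guess_labels(augmented_views, predict):
    """Assign the averaged prediction to every augmented view of one sample."""
    shared = average_predictions([predict(view) for view in augmented_views])
    return [(view, shared) for view in augmented_views]

# Hypothetical model outputs for two augmented views of the same unlabeled sample.
fake_outputs = {"view-a": [0.8, 0.2], "view-b": [0.6, 0.4]}
labeled_views = guess_labels(["view-a", "view-b"], fake_outputs.get)
```

Averaging gives every enhanced version of the same sample an identical label prediction, which smooths out the effect of any single enhancement.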

On the basis of any of the above embodiments, the preset data enhancement processing includes at least one of:

replacing at least one word in the sample text with a corresponding synonym;

replacing the word vector of at least one word in the word vectors of the sample with the nearest preset word vector in the space;

replacing words with TF-IDF scores lower than a first preset threshold in the sample text with words with TF-IDF scores lower than a second preset threshold in a preset dictionary;

translating the sample text from the current language to a preset language, and translating the sample text from the preset language back to the current language to obtain a text different from the sample text;

adding random noise to the sample text.
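As an illustration of the TF-IDF-based replacement in the list above, the sketch below scores words with a smoothed TF-IDF variant and swaps low-scoring words for filler words. The smoothing formula, the threshold value, the tiny corpus and the `fillers` list (standing in for low-score words of the preset dictionary) are all assumptions; the embodiment does not specify them.

```python
import math
import random
from collections import Counter

def tfidf_scores(doc, corpus):
    """TF-IDF of each word in `doc`, with an assumed smoothed IDF."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        doc_freq = sum(1 for other in corpus if word in other)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[word] = (count / len(doc)) * idf
    return scores

def replace_low_tfidf(doc, corpus, fillers, threshold, rng):
    """Swap words scoring below `threshold` for words from `fillers`."""
    scores = tfidf_scores(doc, corpus)
    return [rng.choice(fillers) if scores[w] < threshold else w for w in doc]

corpus = [
    "the road is broken".split(),
    "the lamp is dark".split(),
    "the bridge is closed".split(),
]
fillers = ["a", "was"]   # stands in for low-score words of the preset dictionary
augmented = replace_low_tfidf(corpus[0], corpus, fillers, threshold=0.3, rng=random.Random(0))
```

In this toy corpus, "the" and "is" occur in every document and fall below the threshold, while "road" and "broken" are kept.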

On the basis of any of the above embodiments, when performing mixed sample data enhancement processing on the second labeled sample set and the second unlabeled sample set after label prediction, the mixing module 404 is configured to:

mixing the second labeled sample set and a second unlabeled sample set after label prediction to obtain a data total set;

randomly sampling the data total set to obtain a third sample set with the number equal to that of the samples of the second labeled sample set, and determining the residual samples in the data total set as a fourth sample set;

performing mixed sample data enhancement processing on the samples in the second labeled sample set and the third sample set by using a Mixup algorithm to obtain a second labeled sample set after the mixed sample data enhancement processing;

and performing mixed sample data enhancement processing on the samples in the second unlabeled sample set and the fourth sample set after label prediction by adopting a Mixup algorithm to obtain a second unlabeled sample set after the mixed sample data enhancement processing.

On the basis of any of the above embodiments, when performing the enhancement processing on the mixed sample data by using the Mixup algorithm, the mixing module 404 is configured to:

acquiring an initial mixing weight λ from a Beta distribution, and determining the maximum of λ and 1-λ as the optimized mixing weight λ′;

and computing the weighted sums of the word vectors and of the probability vectors of the two samples to be mixed, with weights λ′ and 1-λ′, as the word vector and the probability vector of the sample after mixed sample data enhancement processing.

On the basis of any of the above embodiments, when the model updating module 405 performs parameter updating on the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the mixed sample data enhancement processing, it is configured to:

acquiring a first loss for the initial data classification model according to the second labeled sample set after the enhancement processing of the mixed sample data, acquiring a second loss for the initial data classification model according to the second unlabeled sample set after the enhancement processing of the mixed sample data, and combining the first loss and the second loss to obtain a final loss;

and updating parameters of the initial data classification model according to the final loss.

On the basis of any of the above embodiments, when the acquisition module 401 acquires the first labeled sample set and the first unlabeled sample set, it is configured to:

collecting online data of municipal administration events, storing the online data with a label into a labeled data pool, and storing the online data without a label into an unlabeled data pool;

and respectively acquiring online data of the same time period from the labeled data pool and the unlabeled data pool to obtain a corresponding first labeled sample set and a corresponding first unlabeled sample set.

On the basis of any of the above embodiments, when the acquisition module 401 respectively collects online data of the same time period from the labeled data pool and the unlabeled data pool, it is configured to:

respectively acquiring online data of the same time period from the labeled data pool and the unlabeled data pool by random sampling; or

respectively acquiring online data of the same time period from the labeled data pool and the unlabeled data pool according to the weight of the online data, wherein the weight of the online data is configured in advance according to a preset rule.

Based on any of the above embodiments, before each unlabeled sample in the second unlabeled sample set is classified according to the initial data classification model, the model updating module 405 is further configured to:

training the initial data classification model based on the first labeled sample set.

The updating device of the data classification model provided in the embodiment of the present invention may be specifically configured to execute the method embodiments provided in fig. 2 to 4, and specific functions are not described herein again.

According to the updating device of the data classification model provided by the embodiment of the invention, a first labeled sample set and a first unlabeled sample set are obtained from a municipal administration event data pool, and preset data enhancement processing is respectively carried out on the first labeled sample set and the first unlabeled sample set to obtain a corresponding second labeled sample set and a corresponding second unlabeled sample set; classifying each unlabeled sample in the second unlabeled sample set according to the initial data classification model to obtain a corresponding label prediction result; performing mixed sample data enhancement processing on the second labeled sample set and the second unlabeled sample set after label prediction; and updating parameters of the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data to obtain a target data classification model for classifying the municipal administration event data. According to the embodiment, a semi-supervised learning method is adopted under the condition that labeled samples are few, the diversity of the samples is increased through data enhancement and mixing, the labeled samples and the unlabeled samples are fully utilized, the robustness and stability of the model are improved, the model prediction effect is improved, meanwhile, the unlabeled samples are collected more easily, and the economic cost, the labor cost and the time cost are saved in the process of updating the data classification model.

Fig. 7 is a structural diagram of a classification device for municipal administration event data according to an embodiment of the present invention. The classification device for municipal administration event data provided in this embodiment may execute the processing flow provided in the embodiment of the classification method. As shown in fig. 7, the classification device 500 for municipal administration event data includes: an obtaining module 501, a classifying module 502 and a sending module 503.

An obtaining module 501, configured to obtain municipal administration event data to be classified, where the municipal administration event data includes event content data and event occurrence location data;

a classification module 502, configured to input the municipal administration event data into the data classification model to obtain a corresponding classification result;

a sending module 503, configured to send the municipal administration event data to the user equipment corresponding to the classification result.

The classification device for municipal administration event data according to the embodiment of the present invention may be specifically configured to execute the method embodiment provided in fig. 5, and specific functions are not described herein again.

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device provided in the embodiment of the present invention may perform the processing flow provided in the embodiments of the updating method of the data classification model and/or the classification method of the municipal administration event data. As shown in fig. 8, the electronic device 60 includes a memory 61, a processor 62, and a computer program, wherein the computer program is stored in the memory 61 and is configured to be executed by the processor 62 to perform the updating method of the data classification model and/or the classification method of the municipal administration event data described in the above embodiments. Furthermore, the electronic device 60 may have a communication interface 63 for transmitting control commands and/or data.

The electronic device in the embodiment shown in fig. 8 may be used to implement the technical solutions of the above embodiments of the updating method of the data classification model and/or the classification method of the municipal administration event data; the implementation principle and the technical effect are similar and are not described herein again.

It should be noted that the updating method of the data classification model and the classification method of the municipal administration event data may be executed on the same electronic device or on different electronic devices.

In addition, the present embodiment also provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method for updating the data classification model and/or the method for classifying the municipal administration event data described in the above embodiments.

The present embodiment also provides a computer program product, which includes computer executable instructions, and when the computer executable instructions are executed by a processor, the computer executable instructions implement the method for updating the data classification model and/or the method for classifying the municipal administration event data according to the above embodiments.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative; for example, the division of the units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.

The above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention, and are not limited thereto; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A method for updating a data classification model, comprising:

2. The method according to claim 1, wherein the performing preset data enhancement processing on the first labeled sample set and the first unlabeled sample set respectively to obtain a corresponding second labeled sample set and a second unlabeled sample set comprises:

3. The method of claim 2, wherein classifying each unlabeled sample in the second unlabeled sample set according to the initial data classification model to obtain a corresponding label prediction result comprises:

4. The method according to any one of claims 1-3, wherein the pre-set data enhancement process comprises at least one of:

replacing at least one word in the sample text with a corresponding synonym;

random noise is added to the sample text.

5. The method according to any one of claims 1-3, wherein said performing a mixed sample data enhancement process on the second labeled sample set and the second unlabeled sample set after label prediction comprises:

6. The method of claim 5, wherein the performing the enhancement processing on the mixed sample data by using the Mixup algorithm comprises:

7. The method of claim 1, wherein the updating parameters of the initial data classification model according to the second labeled sample set and the second unlabeled sample set after the enhancement processing of the mixed sample data comprises:

8. The method of claim 1, wherein obtaining the first set of labeled samples and the first set of unlabeled samples comprises:

9. The method of claim 8, wherein collecting online data from the tagged data pool and the untagged data pool for a same time period comprises:

10. The method of claim 1, wherein before classifying each unlabeled sample in the second unlabeled sample set according to the initial data classification model, the method further comprises:

11. A classification method for municipal administration event data, characterized in that the method applies the data classification model obtained by the updating method of the data classification model according to any one of claims 1-10; the method comprises the following steps:

12. An apparatus for updating a data classification model, comprising:

13. A classification device for municipal administration event data, characterized in that the classification device applies the data classification model obtained by the updating method of the data classification model according to any one of claims 1 to 10; the apparatus comprises:

14. An electronic device, comprising: at least one processor; and a memory;

the memory stores computer-executable instructions;

execution of the computer-executable instructions stored by the memory by the at least one processor causes the at least one processor to perform the method of any one of claims 1-10 or claim 11.

15. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-10 or claim 11.

16. A computer program product comprising computer executable instructions, characterized in that the computer executable instructions, when executed by a processor, implement the method according to any of claims 1-10 or claim 11.