CN115438745A

CN115438745A - Metadata intelligent identification method and device based on semi-supervised learning

Info

Publication number: CN115438745A
Application number: CN202211157967.1A
Authority: CN
Inventors: 蒋静; 程环宇; 顾颖程; 朱力鹏; 周爱华; 潘森; 乔俊峰
Original assignee: State Grid Smart Grid Research Institute Co ltd; State Grid Corp of China SGCC; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Current assignee: State Grid Smart Grid Research Institute Co ltd; State Grid Corp of China SGCC; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2022-12-06

Abstract

The invention provides a metadata intelligent identification method and a device based on semi-supervised learning, wherein the method comprises the following steps: generating a metadata keyword identifier according to the conditional random field, wherein the metadata keyword identifier is used for extracting metadata characteristics corresponding to any identifier; acquiring an initial labeled data set, and training a metadata keyword identifier according to the initial labeled data set to generate a metadata classifier; acquiring an unlabeled data set, predicting the unlabeled data set according to a metadata classifier, and generating a prediction result; generating an intermediate training data set according to the prediction result; circularly self-training the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a circularly self-trained metadata classifier; and intelligently identifying the metadata according to the metadata classifier after the cyclic self-training. By the method and the device, the problem of low metadata identification construction efficiency in the related technology is solved.

Description

Metadata intelligent identification method and device based on semi-supervised learning

Technical Field

The invention relates to the field of metadata identification, in particular to a metadata intelligent identification method and device based on semi-supervised learning.

Background

With the rapid development of the information age, data middle platforms and Internet of Things management platforms have begun to take shape. At the same time, informatization generates a large amount of complex data, which poses a major challenge to power grid data management technology. On the one hand, as complex and diverse application scenarios emerge, power grid data grows exponentially and accumulates in large volumes, so that the original data processing methods can no longer meet the application requirements of complex scenarios. On the other hand, the data in the data middle platform comes from many sources, is large in volume and varied in type, and is continuously updated, which makes metadata difficult to manage and maintain. At present, metadata management, classification, and identification are still performed manually, which is inefficient and cannot guarantee the consistency and correctness of the metadata with respect to the data in the data middle platform. Therefore, the related art suffers from low efficiency in constructing metadata identifiers.

Disclosure of Invention

The invention provides a metadata intelligent identification method and device based on semi-supervised learning, which at least solve the problem of low metadata identification construction efficiency in the related technology.

According to a first aspect of the embodiments of the present invention, there is provided a metadata intelligent identification method based on semi-supervised learning, the method including: generating a metadata keyword identifier according to the conditional random field, wherein the metadata keyword identifier is used for extracting metadata characteristics corresponding to any identifier; acquiring an initial labeled data set, and training the metadata keyword identifier according to the initial labeled data set to generate a metadata classifier; acquiring an unlabeled data set, predicting the unlabeled data set according to the metadata classifier, and generating a prediction result, wherein the prediction result comprises a labeled data set and an unlabeled data set; generating an intermediate training data set according to the prediction result; performing circular self-training on the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a metadata classifier after circular self-training; and intelligently identifying the metadata according to the metadata classifier after the cyclic self-training.

Optionally, the generating metadata keyword identifiers from the conditional random field model comprises: representing metadata and an identifier corresponding to the metadata in the form of an undirected graph; determining an observation sequence and an identification sequence according to the conditional random field, wherein the observation sequence corresponds to the metadata, and the identification sequence corresponds to the identification corresponding to the metadata; determining a target function and a feature set according to the observation sequence and the identification sequence, wherein the target function is used for obtaining the identification sequence with the maximum probability corresponding to the observation sequence, and the feature set is a set of metadata features; and generating a metadata keyword identifier according to the target function and the feature set.

Optionally, the observation sequence includes a transfer feature function determined based on the conditional random field, wherein the transfer feature function acts on an undirected graph edge and represents a relationship between a previous output state and a current output state.

Optionally, the obtaining an unlabeled data set, predicting the unlabeled data set according to the metadata classifier, and generating a prediction result, where the prediction result includes a labeled data set and an unlabeled data set, includes: predicting the unlabeled data according to the metadata features corresponding to any identifier, wherein the unlabeled data is any piece of data in the unlabeled data set; if the unlabeled data has metadata features of the same type, generating, for the unlabeled data, the identifier corresponding to the metadata features of the same type, and storing the unlabeled data with the generated identifier into the labeled data set; if the unlabeled data does not have metadata features of the same type, generating no identifier for the unlabeled data, and storing the unlabeled data without a generated identifier into the unlabeled data set.

Optionally, the generating an intermediate training data set according to the prediction result includes: and generating an intermediate training data set according to the data with high classification characteristic weight in the unlabeled data set and the labeled data set.

Optionally, the performing a cyclic self-training on the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a metadata classifier after the cyclic self-training includes: training the metadata keyword identifier according to the intermediate training data set to generate a metadata classifier; predicting the unlabeled data set according to the metadata classifier to generate a prediction result; generating an intermediate training data set according to the prediction result; and circularly generating a metadata classifier, generating a prediction result and generating an intermediate training data set until the metadata classifier is converged to obtain the metadata classifier after circular self-training.

Optionally, the performing a cyclic self-training on the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a metadata classifier after the cyclic self-training further includes: determining classification precision according to the labeled data set in the prediction result; if the classification precision does not reach a preset value, circularly generating a metadata classifier, a prediction result and an intermediate training data set; and if the classification precision reaches a preset value, finishing the circular self-training to obtain the metadata classifier after the circular self-training.

According to a second aspect of the embodiments of the present invention, there is also provided a metadata intelligent identification apparatus based on semi-supervised learning, the apparatus including: a first generation module, configured to generate a metadata keyword identifier according to a conditional random field, wherein the metadata keyword identifier is used for extracting metadata features corresponding to any identifier; a second generation module, configured to acquire an initial labeled data set and train the metadata keyword identifier according to the initial labeled data set to generate a metadata classifier; a third generation module, configured to obtain an unlabeled data set, predict the unlabeled data set according to the metadata classifier, and generate a prediction result, where the prediction result includes a labeled data set and an unlabeled data set; a fourth generation module, configured to generate an intermediate training data set according to the prediction result; a cyclic self-training module, configured to cyclically self-train the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a metadata classifier after cyclic self-training; and an identification module, configured to intelligently identify the metadata according to the metadata classifier after cyclic self-training.

Optionally, the first generation module comprises: a representing unit, configured to represent the metadata and the identifier corresponding to the metadata in the form of an undirected graph; a first determining unit, configured to determine an observation sequence and an identification sequence according to the conditional random field, where the observation sequence corresponds to the metadata, and the identification sequence corresponds to the identifier corresponding to the metadata; a second determining unit, configured to determine an objective function and a feature set according to the observation sequence and the identification sequence, where the objective function is used to obtain the identification sequence with the highest probability corresponding to the observation sequence, and the feature set is a set of metadata features; and a generating unit, configured to generate a metadata keyword identifier according to the objective function and the feature set.

Optionally, the third generation module comprises: a prediction unit, configured to predict the unlabeled data according to the metadata features corresponding to any identifier, wherein the unlabeled data is any piece of data in the unlabeled data set; a first storage unit, configured to generate, when the unlabeled data has metadata features of the same type, the identifier corresponding to those metadata features for the unlabeled data, and to store the unlabeled data with the generated identifier into the labeled data set; and a second storage unit, configured to generate no identifier for the unlabeled data when the unlabeled data does not have metadata features of the same type, and to store the unlabeled data without a generated identifier into the unlabeled data set.

Optionally, the fourth generating module includes: and the generating unit is used for generating an intermediate training data set according to the data with high classification feature weight in the unlabeled data set and the labeled data set.

Optionally, the cyclic self-training module comprises: the first generation unit is used for training the metadata keyword identifier according to the intermediate training data set to generate a metadata classifier; the second generation unit is used for predicting the unlabeled data set according to the metadata classifier and generating a prediction result; a third generating unit, configured to generate an intermediate training data set according to the prediction result; and the obtaining unit is used for circularly generating the metadata classifier, generating the prediction result and generating the intermediate training data set until the metadata classifier is converged to obtain the metadata classifier after circular self-training.

Optionally, the obtaining unit includes: the determining submodule is used for determining classification precision according to the labeled data set in the prediction result; the generation submodule is used for circularly generating a metadata classifier, a prediction result and an intermediate training data set if the classification precision does not reach a preset value; and the obtaining sub-module is used for finishing the circular self-training when the classification precision reaches a preset value so as to obtain the metadata classifier after the circular self-training.

According to a third aspect of the embodiments of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus; wherein the memory is used for storing the computer program; a processor for performing the method steps in any of the above embodiments by running the computer program stored on the memory.

According to a fourth aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the method steps in any of the above embodiments when the computer program is run.

In an embodiment of the present invention, a metadata keyword identifier is generated based on a conditional random field; acquiring an initial labeled data set, and training a metadata keyword identifier according to the initial labeled data set to generate a metadata classifier; acquiring an unlabeled data set, predicting the unlabeled data set according to a metadata classifier, and generating a prediction result; generating an intermediate training data set according to the prediction result; circularly self-training the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a circularly self-trained metadata classifier; and intelligently identifying the metadata according to the metadata classifier after the cyclic self-training. The metadata features are extracted through the metadata keyword identifier, the metadata keyword identifier is trained by using labeled data to generate the metadata classifier, and part of labeled data and label-free data are used for circularly self-training the metadata keyword identifier and the metadata classifier, so that the purpose of labeling the metadata through the metadata classifier can be realized, the technical effect of improving the metadata labeling efficiency is achieved, and the problem of low metadata identifier construction efficiency in the related technology is solved.

In the embodiment of the invention, the metadata and the identifier corresponding to the metadata are represented in the form of an undirected graph; and determining an observation sequence and an identification sequence according to the conditional random field, and determining a target function and a feature set according to the observation sequence and the identification sequence, thereby realizing the purposes of constructing a model and using the conditional random field to realize metadata identification.

In the embodiment of the invention, an intermediate training data set is generated according to the unlabeled data set and the labeled data set, the metadata keyword identifier and the metadata classifier are circularly self-trained by using the intermediate training data set, and the useful structure information contained in the unlabeled sample data is utilized, so that the effect of improving the data availability is achieved, and the problem of insufficient labeled samples is solved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

In order to more clearly illustrate the technical solutions in the embodiments of the present invention or in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.

FIG. 1 is a diagram illustrating a hardware environment of an alternative semi-supervised learning based intelligent identification method for metadata according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating an alternative semi-supervised learning-based intelligent metadata identification method according to an embodiment of the present invention;

FIG. 3 is a block diagram of an alternative semi-supervised learning based intelligent metadata identification apparatus according to an embodiment of the present invention;

FIG. 4 is a block diagram of an alternative electronic device according to an embodiment of the invention.

Detailed Description

In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to a first aspect of the embodiment of the invention, a metadata intelligent identification method based on semi-supervised learning is provided. Optionally, in this embodiment, the above-mentioned metadata intelligent identification method based on semi-supervised learning may be applied to a hardware environment as shown in fig. 1. As shown in fig. 1, the terminal 102 may include a memory 104, a processor 106, and a display 108 (optional components). The terminal 102 may be communicatively coupled to a server 112 via a network 110, the server 112 may be configured to provide services (e.g., application services, etc.) for the terminal or clients installed on the terminal, and a database 114 may be provided on the server 112 or separate from the server 112 for providing data storage services for the server 112. Additionally, a processing engine 116 may be run in the server 112, and the processing engine 116 may be used to perform the steps performed by the server 112.

Optionally, the terminal 102 may be, but is not limited to, a terminal capable of processing data, such as a mobile terminal (e.g., a mobile phone or a tablet computer), a notebook computer, or a PC (Personal Computer), and the network may include, but is not limited to, a wireless network or a wired network. The wireless network includes Bluetooth, WIFI (Wireless Fidelity), and other networks that enable wireless communication. The wired network may include, but is not limited to, wide area networks, metropolitan area networks, and local area networks. The server 112 may include, but is not limited to, any hardware device capable of performing computations.

In addition, in this embodiment, the above-mentioned metadata intelligent identification method based on semi-supervised learning may also be applied to, but not limited to, an independent processing device with a relatively high processing capability without data interaction. For example, the processing device may be, but is not limited to, a terminal device with a relatively high processing capability, that is, the operations of the above metadata intelligent identification method based on semi-supervised learning may be integrated into a separate processing device. The above is merely an example, and this is not limited in this embodiment.

Optionally, in this embodiment, the above-mentioned metadata intelligent identification method based on semi-supervised learning may be executed by the server 112, by the terminal 102, or by the server 112 and the terminal 102 together. When the terminal 102 executes the method according to the embodiment of the present invention, the method may also be executed by a client installed on the terminal 102.

Taking the application of the metadata intelligent identification method based on semi-supervised learning to a central processing unit as an example, FIG. 2 is a schematic flowchart of an optional metadata intelligent identification method based on semi-supervised learning according to an embodiment of the present invention. As shown in FIG. 2, the method may include the following steps:

Step S201, generating a metadata keyword identifier according to the conditional random field, wherein the metadata keyword identifier is used for extracting metadata features corresponding to any identifier. Optionally, the metadata and its corresponding identifier are mapped to the observation sequence and the identification sequence in the conditional random field, respectively, and the metadata features corresponding to the metadata identifier are extracted.

Step S202, an initial labeled data set is obtained, and a metadata keyword identifier is trained according to the initial labeled data set to generate a metadata classifier. Optionally, the metadata keyword identifier is trained using the initial labeled dataset to generate a metadata classifier, which can generate a corresponding identifier for the metadata.

Step S203, obtaining the unlabeled data set, predicting the unlabeled data set according to the metadata classifier, and generating a prediction result, wherein the prediction result comprises the labeled data set and the unlabeled data set. Optionally, the metadata classifier is used to predict data in the unlabeled dataset, and if the corresponding identifier is generated by prediction, the corresponding data is stored in the labeled dataset, and if the corresponding identifier is not generated by prediction, the corresponding data is stored in the unlabeled dataset.

Step S204, generating an intermediate training data set according to the prediction result. Optionally, partial data is selected from the labeled data set and the unlabeled data set to generate an intermediate training data set for cyclic training of the model.

Step S205, cyclically self-training the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain the metadata classifier after cyclic self-training. Optionally, the metadata keyword identifier and the metadata classifier are trained automatically and cyclically using the intermediate training data set until the metadata classifier converges.

Step S206, intelligently identifying the metadata according to the metadata classifier after cyclic self-training. Optionally, the trained, converged metadata classifier is used to generate corresponding identifiers for the metadata.

In the embodiment of the invention, a metadata keyword identifier is generated according to a conditional random field; acquiring an initial labeled data set, and training a metadata keyword identifier according to the initial labeled data set to generate a metadata classifier; acquiring an unlabeled data set, predicting the unlabeled data set according to a metadata classifier, and generating a prediction result; generating an intermediate training data set according to the prediction result; circularly self-training the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a circularly self-trained metadata classifier; and intelligently identifying the metadata according to the metadata classifier after the cyclic self-training. The metadata features are extracted through the metadata keyword identifier, the metadata keyword identifier is trained by using labeled data to generate the metadata classifier, and part of labeled data and label-free data are used for circularly self-training the metadata keyword identifier and the metadata classifier, so that the purpose of labeling the metadata through the metadata classifier can be realized, the technical effect of improving the metadata labeling efficiency is achieved, and the problem of low metadata identifier construction efficiency in the related technology is solved.

As an alternative embodiment, generating a metadata key identifier from a conditional random field model includes: representing the metadata and the identifier corresponding to the metadata into an undirected graph; determining an observation sequence and an identification sequence according to the conditional random field, wherein the observation sequence corresponds to metadata, and the identification sequence corresponds to an identification corresponding to the metadata; determining a target function and a feature set according to the observation sequence and the identification sequence, wherein the target function is used for obtaining the identification sequence with the maximum probability corresponding to the observation sequence, and the feature set is a set of metadata features; and generating a metadata keyword identifier according to the target function and the feature set.

Optionally, the metadata and the identifier corresponding to the metadata are represented in the form of an undirected graph $G = (E, F)$, where $G$ denotes the undirected graph, $E$ denotes the nodes of the undirected graph, which correspond to the metadata, and $F$ denotes the edges of the undirected graph, which correspond to the identifiers of the metadata. The observation sequence in the conditional random field is defined as $U = (u_1, u_2, u_3, \ldots, u_n)$, corresponding to the metadata, and the identification sequence in the conditional random field is defined as $V = (v_1, v_2, v_3, \ldots, v_n)$, corresponding to the identifiers of the metadata. According to conditional random field theory, the probability of an identification sequence given the features and the observation sequence can be expressed, in the standard linear-chain form, as:

$$P(v \mid u, \lambda) \propto \exp\left(\sum_{i}\sum_{k} \lambda_k\, x_k(v_{i-1}, v_i, u, i) + \sum_{i}\sum_{k} \mu_k\, y_k(v_i, u, i)\right)$$

where $\lambda$ represents the feature set, $k$ indexes the features, $x_k(v_{i-1}, v_i, u, i)$ is the transfer feature function of the observation sequence between identification positions $i-1$ and $i$, and $y_k(v_i, u, i)$ is the state feature function of the observation sequence at identification position $i$.

Assuming that the number of features is $K$, the conditional probability $P_\theta(v \mid u)$ of the identification sequence is:

$$P_\theta(v \mid u) = \frac{1}{Z(u)} \exp\left(\sum_{i=1}^{n}\sum_{k=1}^{K} \lambda_k\, x_k(v_{i-1}, v_i, u, i) + \sum_{i=1}^{n}\sum_{k=1}^{K} \mu_k\, y_k(v_i, u, i)\right)$$

where $\lambda_k$ and $\mu_k$ are the weights of the respective features, and $Z(u)$ is a normalization factor that depends only on the observation sequence, acts on the whole sequence, and is obtained by summing over all candidate identification sequences.

Identifying the original data means that, given an observation sequence, the identification sequence with the maximum probability is solved for, so the objective function is:

$$\hat{v} = \arg\max_{v} P_\lambda(v \mid u)$$

where $\hat{v}$ is the identification sequence of maximum probability, $P_\lambda(v \mid u)$ is the conditional probability of the identification sequence, and the feature set $\lambda$ is estimated using the improved iterative scaling (IIS) method.
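The conditional probability and the objective function can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and does not come from the source: the identifier set `LABELS`, the two toy feature functions, and the weights `LAMBDA` and `MU`. The normalization factor Z(u) is computed by brute-force enumeration, which is only practical for very short sequences.

```python
import itertools
import math

LABELS = ["NAME", "TYPE"]  # hypothetical identifier set
LAMBDA = 1.5               # assumed weight of the transfer feature
MU = 2.0                   # assumed weight of the state feature


def transfer_feature(prev_label, label, tokens, i):
    """x_k(v_{i-1}, v_i, u, i): fires when a NAME is followed by a TYPE."""
    return 1.0 if (prev_label == "NAME" and label == "TYPE") else 0.0


def state_feature(label, tokens, i):
    """y_k(v_i, u, i): fires when a token ending in '_id' is tagged NAME."""
    return 1.0 if (label == "NAME" and tokens[i].endswith("_id")) else 0.0


def score(labels, tokens):
    """Weighted sum of all feature functions over the whole sequence."""
    total = 0.0
    for i in range(len(tokens)):
        total += MU * state_feature(labels[i], tokens, i)
        if i > 0:
            total += LAMBDA * transfer_feature(labels[i - 1], labels[i], tokens, i)
    return total


def conditional_probability(labels, tokens):
    """P(v | u) = exp(score) / Z(u), with Z(u) summed over every candidate v."""
    candidates = itertools.product(LABELS, repeat=len(tokens))
    z = sum(math.exp(score(c, tokens)) for c in candidates)
    return math.exp(score(labels, tokens)) / z


def most_probable_sequence(tokens):
    """Objective function: the identification sequence of maximum probability."""
    candidates = itertools.product(LABELS, repeat=len(tokens))
    return max(candidates, key=lambda c: score(c, tokens))


print(most_probable_sequence(["user_id", "varchar"]))
```

Because Z(u) is the same for every candidate identification sequence, maximizing the unnormalized score is equivalent to maximizing the conditional probability, which is why `most_probable_sequence` never computes Z(u).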

As an alternative embodiment, the observation sequence includes a transfer feature function determined based on the conditional random field, wherein the transfer feature function acts on an undirected graph edge and represents the relationship between the previous output state and the current output state. Optionally, according to conditional random field theory, the observation sequence includes a transfer feature function $x_k(v_{i-1}, v_i, u, i)$, which acts on an undirected graph edge and represents the relationship between the previous output state and the current output state, namely the relationship between identification position $i-1$ and identification position $i$. In the embodiment of the invention, the transfer feature function of the observation sequence is determined through the conditional random field, so that the model takes into account the relationship between adjacent metadata identifiers, which are the hidden states of the sequence.
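Because the transfer feature function only couples position i-1 with position i, the maximum-probability identification sequence can be found by dynamic programming rather than by enumerating every sequence. The sketch below uses the Viterbi algorithm, a standard technique that the source does not name; the scoring functions, identifier names, and weights are hypothetical.

```python
def viterbi(tokens, labels, state_score, transfer_score):
    """Return the label sequence with the highest total score."""
    if not tokens:
        return []
    # best[i][lab] = (score of the best path ending in lab at position i, previous label)
    best = [{lab: (state_score(lab, tokens, 0), None) for lab in labels}]
    for i in range(1, len(tokens)):
        column = {}
        for lab in labels:
            # Pick the predecessor that maximizes path score plus transfer score.
            prev, value = max(
                ((p, best[i - 1][p][0] + transfer_score(p, lab, tokens, i)) for p in labels),
                key=lambda pair: pair[1],
            )
            column[lab] = (value + state_score(lab, tokens, i), prev)
        best.append(column)
    # Backtrack from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab][0])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


def state_score(label, tokens, i):
    """Hypothetical weighted state features (mu_k * y_k)."""
    if label == "NAME" and tokens[i].endswith("_id"):
        return 2.0
    if label == "TYPE" and tokens[i] in {"varchar", "int"}:
        return 1.0
    return 0.0


def transfer_score(prev_label, label, tokens, i):
    """Hypothetical weighted transfer feature (lambda_k * x_k): NAME then TYPE."""
    return 1.5 if (prev_label == "NAME" and label == "TYPE") else 0.0


columns = ["user_id", "varchar", "order_id", "int"]
print(viterbi(columns, ["NAME", "TYPE"], state_score, transfer_score))
# -> ['NAME', 'TYPE', 'NAME', 'TYPE']
```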

As an alternative embodiment, obtaining an unlabeled data set, predicting the unlabeled data set according to the metadata classifier, and generating a prediction result, where the prediction result includes a labeled data set and an unlabeled data set, includes: predicting the unlabeled data according to the metadata features corresponding to any identifier, wherein the unlabeled data is any piece of data in the unlabeled data set; if the unlabeled data has metadata features of the same type, generating, for the unlabeled data, the identifier corresponding to those metadata features, and storing the unlabeled data with the generated identifier into the labeled data set; if the unlabeled data does not have metadata features of the same type, generating no identifier for the unlabeled data, and storing it into the unlabeled data set. A prediction result containing the labeled data set and the unlabeled data set is thereby obtained.

Optionally, the unlabeled data set is predicted using the metadata classifier, which predicts according to the metadata features in the feature set. Specifically, for any piece of unlabeled data in the unlabeled data set, if the unlabeled data has metadata features of the same type, the identifier corresponding to those metadata features is generated for the unlabeled data, and the unlabeled data with the generated identifier is stored into the labeled data set; if the unlabeled data does not have metadata features of the same type, no identifier is generated for it, and it is stored into the unlabeled data set. In the embodiment of the invention, the corresponding identifier is determined according to the metadata features, so that intelligent and efficient identification of the metadata is realized.
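The partitioning of the prediction result can be sketched as follows. This is a simplified illustration under stated assumptions: the helper `extract_features`, the mapping `feature_to_identifier`, and the sample column names are hypothetical, and a real classifier would score features rather than look them up in a table.

```python
def extract_features(record):
    """Hypothetical feature extractor: the underscore-separated parts of a name."""
    return record.lower().split("_")


def predict_and_partition(unlabeled, feature_to_identifier):
    """Split records into a labeled set and a still-unlabeled set."""
    labeled_set, unlabeled_set = [], []
    for record in unlabeled:
        features = extract_features(record)
        # First feature of the record that the classifier already knows.
        match = next((f for f in features if f in feature_to_identifier), None)
        if match is not None:
            labeled_set.append((record, feature_to_identifier[match]))
        else:
            unlabeled_set.append(record)
    return labeled_set, unlabeled_set


known = {"id": "identifier_field", "time": "timestamp_field"}
labeled, remaining = predict_and_partition(["user_id", "create_time", "remark"], known)
print(labeled)    # records that received an identifier
print(remaining)  # records left without an identifier
```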

As an alternative embodiment, generating the intermediate training data set according to the prediction result includes: generating the intermediate training data set according to the data with high classification feature weight in the unlabeled data set and the labeled data set. Optionally, data with high classification feature weights in the unlabeled data set and the labeled data set of the prediction result is selected to generate the intermediate training data set, and the model is further trained using the intermediate training data set. In the embodiment of the invention, the intermediate training data set is generated according to the data with high classification feature weight in the unlabeled data set and the labeled data set, and the metadata keyword identifier and the metadata classifier are cyclically self-trained using the intermediate training data set. This exploits the useful structural information contained in the unlabeled sample data, which improves data availability and addresses the problem of insufficient labeled samples.
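One possible reading of this selection step is a simple threshold on a per-item weight. The sketch below assumes each prediction carries a numeric classification feature weight and that "high" means at or above a chosen threshold; the Prediction type, the threshold value of 0.8, and the sample items are all hypothetical, since the embodiment does not specify how the weight is computed or compared.

```python
from typing import List, NamedTuple, Optional


class Prediction(NamedTuple):
    item: str
    label: Optional[str]  # None means the item is still in the unlabeled data set
    weight: float         # assumed classification feature weight


def build_intermediate_set(predictions: List[Prediction],
                           threshold: float = 0.8) -> List[Prediction]:
    """Keep items from both data sets whose weight is at or above the threshold."""
    return [p for p in predictions if p.weight >= threshold]


# Hypothetical prediction result mixing labeled and unlabeled items.
predictions = [
    Prediction("cust_id", "customer", 0.95),
    Prediction("xyz", None, 0.85),
    Prediction("tmp", "time", 0.40),
]
intermediate = build_intermediate_set(predictions, threshold=0.8)
```

A top-k selection per identifier would be an equally plausible alternative; the threshold form is used here only because it is the shortest to state.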

As an alternative embodiment, the circularly self-training the metadata keyword identifier and the metadata classifier according to the intermediate training data set, and obtaining the circularly self-trained metadata classifier includes: training a metadata keyword identifier according to the intermediate training data set to generate a metadata classifier; predicting the unlabeled data set according to the metadata classifier to generate a prediction result; generating an intermediate training data set according to the prediction result; and circularly generating a metadata classifier, generating a prediction result and generating an intermediate training data set until the metadata classifier is converged to obtain the circularly self-trained metadata classifier.

Optionally, the model is trained automatically and cyclically using the intermediate training data set. The following steps are repeated: training the metadata keyword identifier according to the intermediate training data set to generate a metadata classifier; predicting the unlabeled data set according to the metadata classifier to generate a prediction result; and generating an intermediate training data set according to the prediction result. The cycle continues until the metadata classifier converges, yielding the metadata classifier after cyclic self-training. In the embodiment of the invention, through cyclic self-training, the model can learn more metadata features and achieve a better classification effect.
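The train, predict, and regenerate cycle can be sketched with a toy model. In this sketch the "classifier" is just a token-to-identifier dictionary standing in for the conditional random field based model, convergence is approximated by "no new item was labeled in a round", and the names train, predict, self_train, and max_rounds are hypothetical.

```python
from typing import Dict, List, Tuple

Labeled = List[Tuple[str, str]]
Model = Dict[str, str]  # toy model: token -> identifier


def train(data: Labeled) -> Model:
    """Toy training: remember the first identifier seen for each token."""
    model: Model = {}
    for item, label in data:
        for token in item.split("_"):
            model.setdefault(token, label)
    return model


def predict(model: Model, unlabeled: List[str]) -> Tuple[Labeled, List[str]]:
    """Label items that share a known token; leave the rest unlabeled."""
    labeled: Labeled = []
    rest: List[str] = []
    for item in unlabeled:
        hits = [model[t] for t in item.split("_") if t in model]
        if hits:
            labeled.append((item, hits[0]))
        else:
            rest.append(item)
    return labeled, rest


def self_train(initial: Labeled, unlabeled: List[str],
               max_rounds: int = 10) -> Tuple[Model, List[str]]:
    """Repeat train -> predict -> extend training data until nothing new is labeled."""
    training = list(initial)
    model = train(training)
    for _ in range(max_rounds):
        newly_labeled, unlabeled = predict(model, unlabeled)
        if not newly_labeled:  # treated here as convergence
            break
        training.extend(newly_labeled)
        model = train(training)
    return model, unlabeled


model, remaining = self_train([("cust_id", "customer")],
                              ["cust_name", "name_full", "zzz"])
```

In this run "name_full" is labeled only in the second round, after "cust_name" has taught the model the token "name", which illustrates how repeated rounds let the model pick up additional features.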

As an alternative embodiment, cyclically self-training the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain the metadata classifier after cyclic self-training further includes: determining the classification precision according to the labeled data set in the prediction result; if the classification precision does not reach a preset value, continuing the cycle of generating a metadata classifier, generating a prediction result, and generating an intermediate training data set; and if the classification precision reaches the preset value, ending the cyclic self-training to obtain the metadata classifier after cyclic self-training. Optionally, the classification precision of the prediction data is calculated; if it does not reach the preset value, cyclic self-training of the model continues, and if it reaches the preset value, cyclic self-training stops. In the embodiment of the invention, the point at which cyclic self-training of the model stops is determined according to the classification precision; compared with using model convergence as the stopping condition, a model with a better classification effect can be obtained.
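A precision-based stopping check might look like the sketch below. It assumes that a set of reference identifiers is available for at least some of the predicted items; the embodiment does not say how precision is measured, so the reference dictionary, the preset value of 0.9, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def classification_precision(predicted: List[Tuple[str, str]],
                             reference: Dict[str, str]) -> float:
    """Fraction of predicted identifiers that match the reference, over checked items."""
    checked = [(item, label) for item, label in predicted if item in reference]
    if not checked:
        return 0.0
    correct = sum(1 for item, label in checked if reference[item] == label)
    return correct / len(checked)


def should_stop(predicted: List[Tuple[str, str]],
                reference: Dict[str, str],
                preset: float = 0.9) -> bool:
    """Stop cyclic self-training once precision reaches the preset value."""
    return classification_precision(predicted, reference) >= preset


# Hypothetical labeled data set from one prediction round, and reference identifiers.
predicted = [("cust_id", "customer"), ("create_date", "time"), ("org_name", "customer")]
reference = {"cust_id": "customer", "create_date": "time", "org_name": "organization"}
precision = classification_precision(predicted, reference)
```

Here two of three predictions match the reference, so the loop would continue under a preset of 0.9 and stop under a preset of 0.6.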

According to another aspect of the embodiment of the invention, a semi-supervised learning based metadata intelligent identification device for implementing the above semi-supervised learning based metadata intelligent identification method is also provided. Fig. 3 is a block diagram of an alternative metadata intelligent identification apparatus based on semi-supervised learning according to an embodiment of the present invention, and as shown in fig. 3, the apparatus may include: a first generating module 301, configured to generate a metadata keyword identifier according to the conditional random field, where the metadata keyword identifier is used to extract a metadata feature corresponding to any identifier; a second generating module 302, configured to obtain an initial labeled data set, train a metadata keyword identifier according to the initial labeled data set, and generate a metadata classifier; a third generating module 303, configured to obtain an unlabeled data set, predict the unlabeled data set according to the metadata classifier, and generate a prediction result, where the prediction result includes a labeled data set and an unlabeled data set; a fourth generating module 304, configured to generate an intermediate training data set according to the prediction result; a cyclic self-training module 305, configured to perform cyclic self-training on the metadata keyword identifier and the metadata classifier according to the intermediate training data set, so as to obtain a metadata classifier after the cyclic self-training; and the identification module 306 is used for intelligently identifying the metadata according to the metadata classifier after the cyclic self-training.

It should be noted that the first generating module 301 in this embodiment may be configured to execute the step S201, the second generating module 302 in this embodiment may be configured to execute the step S202, the third generating module 303 in this embodiment may be configured to execute the step S203, the fourth generating module 304 in this embodiment may be configured to execute the step S204, the cyclic self-training module 305 in this embodiment may be configured to execute the step S205, and the identifying module 306 in this embodiment may be configured to execute the step S206.

Through the module, the metadata features are extracted according to the metadata keyword identifier, the metadata classifier is generated by training the metadata keyword identifier with the marked data, and the metadata classifier and the metadata keyword identifier are circularly self-trained by using part of the marked data and part of the unmarked data, so that the purpose of marking the metadata through the metadata classifier can be realized, the technical effect of improving the metadata marking efficiency is achieved, and the problem of low metadata identification construction efficiency in the related technology is solved.

As an alternative embodiment, the first generating module comprises: a representing unit, configured to represent the metadata and the identifier corresponding to the metadata in the form of an undirected graph; a first determining unit, configured to determine an observation sequence and an identification sequence according to the conditional random field, wherein the observation sequence corresponds to the metadata, and the identification sequence corresponds to the identifier corresponding to the metadata; a second determining unit, configured to determine an objective function and a feature set according to the observation sequence and the identification sequence, wherein the objective function is used for obtaining the identification sequence with the maximum probability corresponding to the observation sequence, and the feature set is a set of metadata features; and a generating unit, configured to generate the metadata keyword identifier according to the objective function and the feature set.

As an alternative embodiment, the observation sequence includes a transfer feature function determined based on the conditional random field, wherein the transfer feature function acts on an undirected graph edge to represent a relationship between a previous output state and a current output state.

As an alternative embodiment, the third generating module comprises: the prediction unit is used for predicting the unlabeled data according to the metadata characteristics corresponding to any identifier, wherein the unlabeled data is any piece of data in the unlabeled data set; the first storage unit is used for generating an identifier corresponding to the same type of metadata features for the unmarked data when the unmarked data have the same type of metadata features, and storing the unmarked data with the generated identifier into the marked data set; and the second storage unit is used for not generating an identifier for the unmarked data when the unmarked data does not have the metadata characteristics of the same type, and storing the unmarked data without the generated identifier into the unmarked data set.

As an alternative embodiment, the fourth generating module includes: a generating unit, configured to generate the intermediate training data set according to the data with high classification feature weight in the unlabeled data set and the labeled data set.

As an alternative embodiment, the cyclic self-training module comprises: the first generation unit is used for training the metadata keyword identifier according to the intermediate training data set to generate a metadata classifier; the second generation unit is used for predicting the unmarked data set according to the metadata classifier and generating a prediction result; a third generating unit, configured to generate an intermediate training data set according to the prediction result; and the obtaining unit is used for circularly generating the metadata classifier, generating the prediction result and generating the intermediate training data set until the metadata classifier is converged to obtain the metadata classifier after circular self-training.

As an alternative embodiment, the obtaining unit includes: the determining submodule is used for determining classification precision according to the labeled data set in the prediction result; the generation submodule is used for circularly generating a metadata classifier, a prediction result and an intermediate training data set if the classification precision does not reach a preset value; and the obtaining submodule is used for finishing the circular self-training when the classification precision reaches a preset value so as to obtain the metadata classifier after the circular self-training.

It should be noted here that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments. It should also be noted that the above modules, as part of the apparatus, may run in a hardware environment as shown in fig. 1, and may be implemented by software or by hardware, where the hardware environment includes a network environment.

According to another aspect of the embodiment of the present invention, there is also provided an electronic device for implementing the above-mentioned semi-supervised learning based intelligent metadata identification method, where the electronic device may be a server, a terminal, or a combination thereof.

Fig. 4 is a block diagram of an alternative electronic device according to an embodiment of the present invention, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403 and a communication bus 404, where the processor 401, the communication interface 402 and the memory 403 complete communication with each other through the communication bus 404, and the memory 403 is used for storing a computer program; the processor 401, when executing the computer program stored in the memory 403, implements the following steps:

generating a metadata keyword identifier according to the conditional random field, wherein the metadata keyword identifier is used for extracting metadata characteristics corresponding to any identifier; acquiring an initial labeled data set, and training a metadata keyword identifier according to the initial labeled data set to generate a metadata classifier; acquiring an unlabeled data set, predicting the unlabeled data set according to a metadata classifier, and generating a prediction result, wherein the prediction result comprises a labeled data set and an unlabeled data set; generating an intermediate training data set according to the prediction result; circularly self-training the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a circularly self-trained metadata classifier; and intelligently identifying the metadata according to the metadata classifier after the cyclic self-training.

Alternatively, in this embodiment, the communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The memory may include RAM, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.

As an example, as shown in fig. 4, the memory 403 may include, but is not limited to, a first generation module 301, a second generation module 302, a third generation module 303, a fourth generation module 304, a cyclic self-training module 305, and an identification module 306 in the metadata intelligent identification apparatus based on semi-supervised learning. In addition, other module units in the above metadata intelligent identification apparatus based on semi-supervised learning may also be included, but are not limited to this, and are not described in detail in this example.

The processor may be a general-purpose processor, including but not limited to a CPU (Central Processing Unit) or an NP (Network Processor); it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In addition, the electronic device further includes: and the display is used for displaying the intelligent identification result of the metadata based on the semi-supervised learning.

Optionally, for a specific example in this embodiment, reference may be made to the example described in the foregoing embodiment, and this embodiment is not described herein again.

It can be understood by those skilled in the art that the structure shown in fig. 4 is only an illustration, and the device implementing the above metadata intelligent identification method based on semi-supervised learning may be a terminal device, and the terminal device may be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 4 does not limit the structure of the electronic apparatus. For example, the terminal device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 4, or have a different configuration than shown in FIG. 4.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.

According to still another aspect of an embodiment of the present invention, there is also provided a storage medium. Optionally, in this embodiment, the storage medium may be used to store program code for executing the metadata intelligent identification method based on semi-supervised learning.

Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the above embodiment.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:

generating a metadata keyword identifier according to the conditional random field, wherein the metadata keyword identifier is used for extracting metadata characteristics corresponding to any identifier; acquiring an initial labeled data set, and training a metadata keyword identifier according to the initial labeled data set to generate a metadata classifier; acquiring an unlabeled data set, predicting the unlabeled data set according to a metadata classifier, and generating a prediction result, wherein the prediction result comprises a labeled data set and an unlabeled data set; generating an intermediate training data set according to the prediction result; circularly self-training the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a circularly self-trained metadata classifier; and intelligently identifying the metadata according to the metadata classifier after the cyclic self-training.

Optionally, the specific example in this embodiment may refer to the example described in the above embodiment, which is not described again in this embodiment.

Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a U disk, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.

According to yet another aspect of an embodiment of the present invention, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium; the processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to make the computer device execute the metadata intelligent identification method steps based on semi-supervised learning in any of the embodiments.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions to enable one or more computer devices (which may be personal computers, servers, or network devices) to execute all or part of the steps of the metadata intelligent identification method based on semi-supervised learning according to the embodiments of the present invention.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present invention, it should be understood that the disclosed client can be implemented in other manners. The above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is only a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, and may also be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution provided in this embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims

1. A metadata intelligent identification method based on semi-supervised learning is characterized by comprising the following steps:

generating a metadata keyword identifier according to the conditional random field, wherein the metadata keyword identifier is used for extracting metadata characteristics corresponding to any identifier;

acquiring an initial labeled data set, and training the metadata keyword identifier according to the initial labeled data set to generate a metadata classifier;

acquiring an unlabeled data set, predicting the unlabeled data set according to the metadata classifier, and generating a prediction result, wherein the prediction result comprises a labeled data set and an unlabeled data set;

generating an intermediate training data set according to the prediction result;

circularly self-training the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a circularly self-trained metadata classifier;

and intelligently identifying the metadata according to the metadata classifier after the cyclic self-training.

2. The semi-supervised learning based intelligent metadata identification method according to claim 1, wherein the generating of the metadata keyword identifier according to the conditional random field model comprises:

representing metadata and an identifier corresponding to the metadata in the form of an undirected graph;

determining an observation sequence and an identification sequence according to the conditional random field, wherein the observation sequence corresponds to the metadata, and the identification sequence corresponds to the identification corresponding to the metadata;

determining an objective function and a feature set according to the observation sequence and the identification sequence, wherein the objective function is used for obtaining the identification sequence with the maximum probability corresponding to the observation sequence, and the feature set is a set of metadata features;

and generating a metadata keyword identifier according to the objective function and the feature set.

3. The semi-supervised learning based metadata intelligent identification method according to claim 2, wherein the observation sequence includes a transfer feature function determined based on a conditional random field, wherein the transfer feature function acts on an undirected graph edge and represents a relationship between a previous output state and a current output state.

4. The semi-supervised learning based intelligent metadata identification method according to claim 1, wherein acquiring the unlabeled data set, predicting the unlabeled data set according to the metadata classifier, and generating the prediction result, wherein the prediction result includes a labeled data set and an unlabeled data set, comprises:

predicting the unlabeled data according to the metadata features corresponding to any identifier, wherein the unlabeled data is any piece of data in the unlabeled data set;

if the unmarked data has the metadata features of the same type, generating an identifier corresponding to the metadata features of the same type for the unmarked data, and storing the unmarked data with the generated identifier into a marked data set;

if the unlabeled data does not have the same metadata characteristics, not generating an identifier for the unlabeled data, and storing the unlabeled data without the generated identifier into an unlabeled data set.

5. The semi-supervised learning based intelligent identification method for metadata as claimed in claim 4, wherein the generating of the intermediate training data set according to the prediction result comprises:

and generating an intermediate training data set according to the data with high classification characteristic weight in the unlabeled data set and the labeled data set.

6. The intelligent metadata identification method based on semi-supervised learning according to claim 1, wherein the cycle self-training of the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain the metadata classifier after cycle self-training comprises:

training the metadata keyword identifier according to the intermediate training data set to generate a metadata classifier;

predicting the unlabeled data set according to the metadata classifier to generate a prediction result;

and circularly generating a metadata classifier, generating a prediction result and generating an intermediate training data set until the metadata classifier is converged to obtain the metadata classifier after circular self-training.

7. The intelligent metadata identification method based on semi-supervised learning according to claim 6, wherein the metadata keyword identifier and the metadata classifier are circularly self-trained according to the intermediate training data set, and obtaining the circularly self-trained metadata classifier further comprises:

determining classification precision according to the labeled data set in the prediction result;

if the classification precision does not reach a preset value, circularly generating a metadata classifier, a prediction result and an intermediate training data set;

and if the classification precision reaches a preset value, finishing the circular self-training to obtain the metadata classifier after the circular self-training.

8. A semi-supervised learning based intelligent metadata identification device, characterized in that the device comprises:

a first generation module, configured to generate a metadata keyword identifier according to a conditional random field, wherein the metadata keyword identifier is used for extracting metadata characteristics corresponding to any identifier;

the second generation module is used for acquiring an initial labeled data set and training the metadata keyword identifier according to the initial labeled data set to generate a metadata classifier;

a third generation module, configured to obtain an unlabeled data set, predict the unlabeled data set according to the metadata classifier, and generate a prediction result, where the prediction result includes a labeled data set and an unlabeled data set;

the fourth generation module is used for generating an intermediate training data set according to the prediction result;

the cyclic self-training module is used for carrying out cyclic self-training on the metadata keyword identifier and the metadata classifier according to the intermediate training data set to obtain a metadata classifier after cyclic self-training;

and the identification module is used for intelligently identifying the metadata according to the metadata classifier which is circularly self-trained.

9. An electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein said processor, said communication interface and said memory communicate with each other via said communication bus,

the memory for storing a computer program;

the processor for performing the method steps of any one of claims 1 to 7 by running the computer program stored on the memory.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.