CN113095423B

CN113095423B - Streaming data classification method based on online abductive learning, and device for implementing the same

Info

Publication number: CN113095423B
Application number: CN202110430304.1A
Authority: CN
Inventors: 李宇峰; 周志华; 黄宇轩
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-04-21
Filing date: 2021-04-21
Publication date: 2024-05-28
Anticipated expiration: 2041-04-21
Also published as: CN113095423A

Abstract

The invention discloses a streaming data classification method based on online abductive learning, and a device for implementing it. In the method, incoming unlabeled (or weakly supervised) streaming data is fed to the current learner to obtain predicted pseudo-labels for the current data; abductive reasoning is then performed on the predicted pseudo-labels using a knowledge base (and any weak supervision labels) to obtain revised pseudo-labels; finally, the learner is updated with the revised pseudo-labels. This process runs continuously as streaming data arrives. On the one hand, the invention can exploit domain knowledge expressed in first-order logic and, through online abductive learning, outperform traditional online learning methods; on the other hand, it can process large volumes of streaming data quickly, make use of unlabeled or weakly labeled data, and handle new categories that may appear in the data.

Description

Streaming data classification method based on online abductive learning, and device for implementing the same

Technical Field

The invention relates to a streaming data classification method based on online abductive learning and a device for implementing it, and belongs to the technical field of artificial intelligence and pattern recognition on large-scale data.

Background

Online learning is a mainstream machine learning paradigm that has achieved notable results in classification tasks on streaming and large-scale data. It targets settings in which large amounts of labeled data arrive continuously and device storage is limited, and it updates the current model using only the newly arrived training samples. Most existing online learning techniques are implemented with purely data-driven machine learning models; their drawbacks include requiring large amounts of labeled data and difficulty in using weakly labeled data or domain knowledge.

Disclosure of Invention

Purpose of the invention: to address the problems and shortcomings of the prior art, the invention provides a streaming data classification method based on online abductive learning, and a device for implementing it.

Technical solution: a streaming data classification method based on online abductive learning receives streaming data and feeds it to the current learner to obtain predicted pseudo-labels for the current samples; converts the predicted pseudo-labels into pseudo-facts and performs abductive reasoning using a knowledge base and the weakly labeled data to obtain revised pseudo-facts; and finally converts the revised pseudo-facts back into pseudo-labels and updates the learner. This process runs continuously as streaming data arrives. For scenarios in which streaming training data and a knowledge base are both available, the online abductive learning method classifies weakly labeled or unlabeled data.

The streaming data is unlabeled streaming data or streaming data with weak supervision labels.

The method mainly comprises three parts, which are executed continuously as data arrives:

(1) Pseudo-label prediction: take one batch of streaming data, feed all of its samples to the learner, and obtain the pseudo-label of each sample as output.

(2) Abductive reasoning and labeling: convert the pseudo-labels into pseudo-facts, input them into the knowledge base, and use logical inference to verify whether the pseudo-facts are consistent with the knowledge base. If they are consistent, the pseudo-labels are left unchanged; if not, the method attempts to revise the pseudo-facts under the principle of minimal inconsistency so that the revised pseudo-facts agree with the knowledge base, converts them back into pseudo-labels, and returns them to the learner.

(3) Learner update: the pseudo-labels obtained by abductive reasoning are treated as ground-truth labels and used, together with the samples of the current batch, to update the learner.
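As an illustration only, the three parts above can be sketched in Python for a single batch. The sketch makes several assumptions that are not in the patent: the learner is a toy nearest-centroid classifier over one-dimensional features (`CentroidLearner`), the weak supervision for each sample is a set of candidate labels, and the knowledge base is reduced to the rule that a label must belong to that candidate set. All names are hypothetical.

```python
class CentroidLearner:
    """Toy online learner (an assumption): one running-mean centroid per class."""

    def __init__(self, init_centroids):
        # Each class starts from one seed value so every centroid is defined.
        self.total = dict(init_centroids)
        self.count = {label: 1 for label in init_centroids}

    def centroid(self, label):
        return self.total[label] / self.count[label]

    def predict(self, x):
        # The pseudo-label is the class whose centroid is closest to x.
        return min(self.count, key=lambda label: abs(x - self.centroid(label)))

    def update(self, x, label):
        # Online update: fold the sample into the running mean of its class.
        self.total[label] += x
        self.count[label] += 1


def revise(x, pseudo_label, candidates, learner):
    """Stand-in for part (2): keep a consistent pseudo-label, otherwise pick
    the closest label that the (toy) knowledge base allows."""
    if pseudo_label in candidates:
        return pseudo_label
    return min(candidates, key=lambda label: abs(x - learner.centroid(label)))


def process_batch(batch, learner):
    """batch is a list of (feature, candidate_label_set) pairs."""
    pseudo = [learner.predict(x) for x, _ in batch]              # part (1)
    revised = [revise(x, p, cands, learner)                      # part (2)
               for (x, cands), p in zip(batch, pseudo)]
    for (x, _), y in zip(batch, revised):                        # part (3)
        learner.update(x, y)
    return pseudo, revised


learner = CentroidLearner({"a": 0.0, "b": 10.0})
batch = [(1.0, {"a", "b"}), (6.0, {"a"}), (9.0, {"b"})]
pseudo, revised = process_batch(batch, learner)
```

In this example the sample 6.0 is first predicted as "b" because it is closer to the centroid 10.0, but its candidate set only allows "a", so the pseudo-label is revised to "a" before the learner is updated.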

Locating the erroneous labels. The principle of minimal inconsistency is applied: the smallest possible number of pseudo-facts is modified so that the modified facts are as consistent as possible with the knowledge base. When the number of labels exceeds a preset number, this search can use a gradient-free optimization method; when it is below the preset number, exhaustive search can be used directly. Specifically, the method first tries to find a single fact, corresponding to one pseudo-label, that can be marked as abducible and then revised by abduction to yield pseudo-facts consistent with the knowledge base. If no such fact exists, that is, modifying any single fact cannot restore consistency with the knowledge base, the method tries to find two facts, marks them as abducible, and attempts abduction to obtain pseudo-labels consistent with the knowledge base. If consistency still cannot be reached, the number of modifiable labels keeps increasing until a set of facts is found whose modification makes the result consistent with the knowledge base.
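The text does not name a specific gradient-free optimizer for the case of many labels. The sketch below substitutes plain random search purely as a stand-in, under assumptions that are not in the patent: facts are binary, the knowledge base is a Python predicate (`even_parity` is a hypothetical rule), and the trial budget and seed are arbitrary. Unlike the exhaustive search, random search does not guarantee that the returned revision changes the fewest possible facts.

```python
import random


def hamming(a, b):
    """Number of positions at which two fact vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def random_search_revise(z, kb, trials=500, seed=0):
    """Look for a KB-consistent vector that differs from z in few positions.

    z  : list of 0/1 pseudo-facts (binary facts are an assumption here)
    kb : predicate returning True when a fact vector is consistent
    Returns (revised_facts, number_of_changes), or (None, None) if no
    consistent vector was sampled within the trial budget.
    """
    if kb(z):
        return list(z), 0
    rng = random.Random(seed)
    flip_prob = 1.0 / len(z)            # about one flip per candidate on average
    best, best_changes = None, None
    for _ in range(trials):
        candidate = [1 - v if rng.random() < flip_prob else v for v in z]
        changes = hamming(candidate, z)
        # Keep the consistent candidate that modifies the fewest facts so far.
        if (best is None or changes < best_changes) and kb(candidate):
            best, best_changes = candidate, changes
    return best, best_changes


def even_parity(facts):
    """Hypothetical knowledge base: the number of true facts must be even."""
    return sum(facts) % 2 == 0


pseudo_facts = [1] + [0] * 19           # 20 facts with odd parity: inconsistent
revised, n_changes = random_search_revise(pseudo_facts, even_parity)
```

The objective mirrors the principle of minimal inconsistency: among sampled candidates that satisfy the knowledge base, the one with the smallest Hamming distance to the original pseudo-facts is kept.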

A device implementing the streaming data classification method based on online abductive learning comprises: a processor, and a memory coupled to the processor. The memory stores a domain knowledge base and instructions that, when executed by the processor, cause the processor to perform the online abductive learning streaming data classification method described above.

Drawings

FIG. 1 is a flow chart of the classification process of the method of the invention;

FIG. 2 is a flow chart of pseudo-label prediction in the method of the invention;

FIG. 3 is a flow chart of the abductive reasoning and labeling process of the method of the invention;

FIG. 4 is a block diagram of the device of the invention.

Detailed Description

The present application is further illustrated below with reference to specific embodiments. These embodiments are intended to illustrate the application and not to limit its scope; after reading the application, equivalent modifications made by those skilled in the art fall within the scope defined by the appended claims.

The method performs online learning on incoming streaming data using a knowledge base and a learner to be trained. The learner may be any learner suited to the task at hand, such as a neural network or a decision tree. Before online learning begins, the learner may be pre-trained with supervision or may start without pre-training. The knowledge base may contain domain knowledge rules expressed in first-order logic, or knowledge in other forms of language and programs that support reasoning and computation.

The method can be executed by an electronic device, such as a terminal device or a server device; in other words, the method may be performed by software or hardware installed on a terminal device or a server device. Server devices include, but are not limited to, a single server, a server cluster, a cloud server, or a cloud server cluster. Terminal devices include, but are not limited to, smart terminals such as a smartphone, a personal computer, a notebook computer, a tablet computer, an e-reader, a network television, or a wearable device.

As shown in FIG. 1, for continuously arriving streaming data, a batch of data is first taken and the current knowledge base is updated. Then pseudo-label prediction (see FIG. 2), abductive reasoning over the pseudo-labels, and the learner update are performed in sequence. These three steps are repeated until the proportion of samples in the batch that are consistent with the knowledge base exceeds the threshold r. Once a batch has been learned, the next batch of the streaming data is taken and the process repeats. Because learning proceeds online over the streaming samples, the time cost is low and training is fast. In addition, because only weakly labeled or unlabeled data is required, the labeling requirement is lower than that of traditional online learning methods.
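A minimal sketch of this per-batch loop is given below. It is one possible reading of the stopping rule, not a definitive implementation: the consistent proportion is measured on the freshly predicted pseudo-labels, a `max_rounds` cap (at least 1) is added as a safeguard, batches are assumed non-empty, and the per-batch knowledge base update is omitted. The four callables and the memorizing toy learner are hypothetical.

```python
def learn_batch(batch, predict, revise, update, is_consistent,
                r=0.9, max_rounds=10):
    """Repeat predict / revise / update on one batch until more than a
    fraction r of the predicted pseudo-labels agrees with the knowledge base.
    Returns (rounds_used, final_consistent_ratio)."""
    ratio = 0.0
    for round_no in range(1, max_rounds + 1):
        pseudo = [predict(x) for x in batch]
        ratio = sum(is_consistent(x, y) for x, y in zip(batch, pseudo)) / len(batch)
        if ratio > r:
            break
        revised = [revise(x, y) for x, y in zip(batch, pseudo)]
        for x, y in zip(batch, revised):
            update(x, y)
    return round_no, ratio


# Toy setting (an assumption): the learner memorizes labels, and the
# knowledge base says the label of an integer sample is its parity.
memory = {}


def predict(x):
    return memory.get(x, 0)


def is_consistent(x, y):
    return y == x % 2


def revise(x, y):
    return y if is_consistent(x, y) else x % 2


def update(x, y):
    memory[x] = y


stream = [[1, 2, 3, 4], [5, 6], [1, 2]]
history = [learn_batch(b, predict, revise, update, is_consistent) for b in stream]
```

Here the first two batches need two rounds each, because half of the initial predictions violate the rule, while the third batch contains only samples already learned and passes the threshold in the first round.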

Abductive reasoning and labeling process of the online abductive learning method

The abductive reasoning and labeling process of the online abductive learning method consists of the following three sub-steps:

1. Checking consistency against the knowledge base. First, the learner predicts for a sample a pseudo-label y_pseudo made up of n sub-labels, i.e., y_pseudo = [y₁, y₂, …, y_n]. The pseudo-label is converted into a pseudo-fact z_pseudo = [z₁, z₂, …, z_n]; the pseudo-fact, together with any weak supervision label attached to the sample, is then input into the knowledge base, and logical inference is used to verify whether the pseudo-fact is consistent with the knowledge base. If it is consistent, the pseudo-label is not modified. Otherwise, sub-steps 2 and 3 below are performed to revise the pseudo-fact by abduction.

2. Locating the erroneous pseudo-facts. The principle of minimal inconsistency is applied: the smallest possible number of facts is modified so that the modified facts agree with the knowledge base. When n is large (greater than a preset value), the search can use a gradient-free optimization method; when n is small (less than the preset value), exhaustive search can be used directly. Specifically, the method first tries to find a single pseudo-fact z_i, marks it as abducible, and hands it to sub-step 3 for abductive reasoning to obtain a revised pseudo-fact consistent with the knowledge base. If no such z_i exists, that is, modifying any single fact z_i cannot yield facts consistent with the knowledge base, the method tries to find two facts z_i and z_j, marks both as abducible, and attempts abduction to obtain revised facts consistent with the knowledge base. If consistency still cannot be reached, the number of modifiable positions keeps increasing until a set of pseudo-fact positions is found whose modification restores consistency with the knowledge base.

3. Obtaining the revised pseudo-label by abductive reasoning. Sub-step 2 yields the positions of the facts to be abduced. These positions are marked as abducible, and the facts (together with the weak supervision label, if any) are passed to the knowledge base for abduction, so that the revised facts at these positions are consistent with the knowledge base; the result is finally converted back into a pseudo-label.
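The three sub-steps can be sketched with exhaustive search for small n. The example rests on assumptions that are not in the patent: each sub-label is a digit string, each fact is an integer in 0..9, and the knowledge base holds a single hypothetical rule, z₁ + z₂ = z₃. The function names are illustrative.

```python
from itertools import combinations, product

DOMAIN = range(10)                      # assumed value range of every fact


def to_facts(pseudo_label):
    """Convert sub-labels such as "3" into integer pseudo-facts."""
    return [int(y) for y in pseudo_label]


def to_label(facts):
    """Convert pseudo-facts back into sub-labels."""
    return [str(z) for z in facts]


def kb_consistent(facts):
    """Hypothetical rule: the first two digits must add up to the third."""
    return facts[0] + facts[1] == facts[2]


def abduce(facts, positions):
    """Sub-step 3: try values at the abducible positions until the facts
    satisfy the knowledge base; return None if no assignment works."""
    for values in product(DOMAIN, repeat=len(positions)):
        candidate = list(facts)
        for pos, val in zip(positions, values):
            candidate[pos] = val
        if kb_consistent(candidate):
            return candidate
    return None


def revise(pseudo_label):
    """Sub-steps 1 and 2: return the label unchanged when it is consistent;
    otherwise search position subsets of size 1, 2, ... in that order."""
    facts = to_facts(pseudo_label)
    if kb_consistent(facts):                                  # sub-step 1
        return list(pseudo_label), ()
    for k in range(1, len(facts) + 1):                        # sub-step 2
        for positions in combinations(range(len(facts)), k):
            fixed = abduce(facts, positions)
            if fixed is not None:
                return to_label(fixed), positions
    return None, None


one_fix = revise(["3", "5", "9"])
unchanged = revise(["2", "2", "4"])
two_fixes = revise(["9", "9", "0"])
```

Because subsets are enumerated in order of increasing size, the first consistent result changes the fewest positions: `["3", "5", "9"]` is repaired by changing only position 0, while `["9", "9", "0"]` cannot be repaired with a single change within the 0..9 range and needs two.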

The abductive reasoning and labeling process built from sub-steps 1, 2, and 3 is shown in FIG. 3. Specifically, for an input sample and its pseudo-label, sub-step 1 is carried out at 310 and 320: the pseudo-fact converted from the pseudo-label is checked against the knowledge base, and if it is consistent, the input pseudo-label is returned directly at 390. Otherwise, sub-step 2 is carried out at 330, 340, 350, and 385, which search for the positions of the erroneous pseudo-facts; this search calls sub-step 3, i.e., 360, 370, and 380, to perform abductive reasoning and obtain the revised pseudo-facts. Finally, the revised pseudo-facts are converted back into pseudo-labels at 390. Because the method searches first for revisions that change the fewest labels, the returned pseudo-label necessarily conforms to the principle of minimal inconsistency.

FIG. 4 is a schematic diagram of an online abductive learning device according to an embodiment of the invention. As shown in FIG. 4, the online abductive learning device 400 may include at least one processor 410, a memory 420, a storage 430 (e.g., a non-volatile memory), and a communication interface 440, which are connected together via a bus 450.

Bus 450 provides a communication channel between the components of the online abductive learning device 400. The at least one processor 410 may control the online abductive learning device 400 and may execute an operating system, firmware, and the like to drive it. The at least one processor 410 executes at least one computer-readable instruction (i.e., the elements described above as implemented in software) stored or encoded in the memory. The memory 420 may serve as working memory for the processor 410. The memory 420 may include volatile memory, such as static random access memory (SRAM) or dynamic random access memory (DRAM), or non-volatile memory, such as phase-change RAM (PRAM), magnetic RAM (MRAM), resistive RAM (ReRAM), or ferroelectric RAM (FRAM). The storage 430 may store data generated by the at least one processor 410, operating system or firmware code for execution by the at least one processor 410, and the domain knowledge base. The storage 430 may include non-volatile memory, such as NAND flash memory, PRAM, MRAM, ReRAM, or FRAM. The communication interface 440 may include a network communication interface and a user input interface (such as a mouse, keyboard, microphone, or camera) for receiving information such as streaming data.

In one embodiment, computer-executable instructions are stored in the memory that, when executed, cause the at least one processor 410 to: feed incoming unlabeled (or weakly labeled) streaming data to the current learner to obtain predicted pseudo-labels for the current samples; perform abductive reasoning on the predicted pseudo-labels using the knowledge base (and the weakly labeled data) to obtain revised pseudo-labels; and finally update the learner in the online abductive learning device using the revised pseudo-labels.

When executed, the computer-executable instructions stored in the memory cause the processor 410 to perform the operations and functions described above in connection with FIGS. 1-3 in the various embodiments of the invention.

According to one embodiment, a program product such as a machine-readable medium (e.g., a non-transitory machine-readable medium) is provided. The machine-readable medium may carry instructions (i.e., the elements described above as implemented in software) that, when executed by a machine, cause the machine to perform the operations and functions described above in connection with FIGS. 1-3 in the various embodiments of the invention. In particular, a system or device may be provided with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, and a computer or processor of the system or device may read out and execute the instructions stored in the readable storage medium.

In this case, the program code itself read from the readable medium may implement the functions of any of the above embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present specification.

Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-Rs, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, non-volatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or the cloud over a communications network.

Claims

1. A streaming data classification method based on online abductive learning, characterized in that: streaming data is received and fed to a current learner to obtain predicted pseudo-labels for the current samples; the predicted pseudo-labels are converted into pseudo-facts, and abductive reasoning is performed using a knowledge base and weakly labeled data to obtain revised pseudo-facts; finally, the revised pseudo-facts are converted into pseudo-labels and the learner is updated; the above process is executed continuously as streaming data arrives; for scenarios in which streaming training data and a knowledge base are both available, weakly labeled or unlabeled data is classified by the online abductive learning method;

the streaming data is unlabeled streaming data or streaming data with weak supervision labels;

the method comprises three parts, which are executed continuously as data arrives:

(1) pseudo-label prediction: taking one batch of streaming data, feeding all of its samples to the learner, and obtaining the pseudo-label of each sample as output;

(2) abductive reasoning and labeling: converting the pseudo-labels into pseudo-facts, inputting them into the knowledge base, and using logical inference to verify whether the pseudo-facts are consistent with the knowledge base; if they are consistent, the pseudo-labels are not modified; if they are inconsistent, attempting to revise the pseudo-facts under the principle of minimal inconsistency so that the revised pseudo-facts are consistent with the knowledge base, converting the revised pseudo-facts into pseudo-labels, and returning them to the learner;

(3) learner update: the pseudo-labels obtained by abductive reasoning are treated as ground-truth labels and used, together with the samples of the current batch, to update the learner;

the smallest possible number of pseudo-facts is modified so that the modified facts are as consistent as possible with the knowledge base; when the number of labels is greater than a preset number, the search is performed with a gradient-free optimization method, and when the number of labels is less than the preset number, exhaustive search is performed directly; the process of locating the erroneous labels is: first, trying to find a single fact corresponding to one pseudo-label, marking that fact as abducible, and performing abduction to obtain revised pseudo-facts consistent with the knowledge base; if no such fact exists, that is, modifying any single pseudo-fact cannot achieve consistency with the knowledge base, searching for the pseudo-facts corresponding to two labels, marking them as abducible, and attempting abduction to obtain pseudo-labels consistent with the knowledge base; if consistency with the knowledge base is still not achieved, continuing to increase the number of modifiable labels until facts are found whose modification achieves consistency with the knowledge base.

2. A device implementing the streaming data classification method based on online abductive learning, characterized by comprising: a processor, and a memory coupled to the processor; the memory stores a domain knowledge base and instructions that, when executed by the processor, cause the processor to perform the online abductive learning streaming data classification method described above.