CN114444619B - Sample generation method, training method, data processing method and electronic device

Info

Publication number
CN114444619B
Authority
CN
China
Prior art keywords
sample
significant
sample set
samples
characterization vector
Prior art date
Legal status
Active
Application number
CN202210340191.0A
Other languages
Chinese (zh)
Other versions
CN114444619A (en)
Inventor
李硕 (Li Shuo)
许晓文 (Xu Xiaowen)
聂磊 (Nie Lei)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210340191.0A
Priority to CN202210754096.5A
Publication of CN114444619A
Application granted
Publication of CN114444619B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a sample generation method, a training method, a data processing method and an electronic device, and relates to the technical field of artificial intelligence, in particular to the technical fields of industrial safety, data mining, computer vision and deep learning. The specific implementation scheme is as follows: obtaining a sample characterization vector set according to a first sample set, wherein the first sample set comprises a plurality of samples whose categories have not been determined; clustering the first sample set according to the sample characterization vector set to obtain at least one clustered sample set; and generating a significant sample set according to the at least one clustered sample set.

Description

Sample generation method, training method, data processing method and electronic device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to the technical fields of industrial safety, data mining, computer vision and deep learning, and more particularly to a sample generation method, a training method, a data processing method and an electronic device.
Background
With the development of computer technology, artificial intelligence technology has also been developed. Artificial intelligence techniques may include computer vision techniques, speech recognition techniques, natural language processing techniques, machine learning, deep learning, big data processing techniques, knowledge-graph techniques, and the like.
Artificial intelligence technology has found wide application in a variety of fields. For example, artificial intelligence techniques can be utilized to generate samples for training deep learning models.
Disclosure of Invention
The invention provides a sample generation method, a training method, a data processing method and an electronic device.
According to an aspect of the present invention, there is provided a sample generation method including: obtaining a sample characterization vector set according to a first sample set, wherein the first sample set comprises a plurality of samples whose categories have not been determined; clustering the first sample set according to the sample characterization vector set to obtain at least one clustered sample set; and generating a significant sample set according to the at least one clustered sample set.
According to another aspect of the present invention, there is provided a training method of a deep learning model, including: inputting the significant sample into the deep learning model to obtain an output value; determining a loss function value according to the output value and the label value of the significant sample; and adjusting model parameters of the deep learning model according to the loss function value to obtain a trained deep learning model, wherein the significant samples are generated by using the method disclosed by the invention.
According to another aspect of the present invention, there is provided a data processing method including: and inputting the data to be processed into the trained deep learning model to obtain a data processing result, wherein the trained deep learning model is obtained by utilizing the method of the invention for training.
According to another aspect of the present invention, there is provided a sample generation apparatus comprising: a first obtaining module, configured to obtain a sample characterization vector set according to a first sample set, where the first sample set includes a plurality of samples whose categories have not been determined; a second obtaining module, configured to cluster the first sample set according to the sample characterization vector set to obtain at least one clustered sample set; and a generating module, configured to generate a significant sample set according to the at least one clustered sample set.
According to another aspect of the present invention, there is provided a training apparatus for deep learning models, including: a third obtaining module, configured to input the significant sample into the deep learning model to obtain an output value; a first determining module, configured to determine a loss function value according to the output value and the tag value of the significant sample; and a fourth obtaining module, configured to adjust model parameters of the deep learning model according to the loss function value, so as to obtain a trained deep learning model, where the significant samples are generated by using the generating apparatus according to the present invention.
According to another aspect of the present invention, there is provided a data processing apparatus comprising: and a fifth obtaining module, configured to input data to be processed into the trained deep learning model to obtain a data processing result, where the trained deep learning model is obtained by training with a training apparatus according to the present invention.
According to another aspect of the present invention, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method of the present invention.
According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the present invention.
According to another aspect of the invention, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the method of the invention.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the invention. Wherein:
fig. 1 schematically shows an exemplary system architecture to which a sample data generation method, a training method of a deep learning model, a data processing method, and an apparatus according to an embodiment of the present invention may be applied;
FIG. 2 schematically shows a flow diagram of a sample data generation method according to an embodiment of the invention;
FIG. 3 schematically illustrates an example schematic of a training process of a characterization model according to an embodiment of the invention;
FIG. 4 schematically illustrates an example schematic diagram of an optimization process of a characterization model according to an embodiment of the present invention;
FIG. 5 schematically illustrates an example schematic diagram of a sample data generation process according to an embodiment of the present invention;
FIG. 6 schematically shows a flow diagram of a method of training a deep learning model according to an embodiment of the invention;
FIG. 7 schematically illustrates an example schematic of a training process for a deep learning model according to an embodiment of the invention;
FIG. 8 schematically shows a flow diagram of a data processing method according to an embodiment of the invention;
FIG. 9 schematically illustrates an example schematic of an overall method flow according to an embodiment of the present invention;
FIG. 10 schematically illustrates a block diagram of a sample generation apparatus according to an embodiment of the invention;
FIG. 11 schematically shows a block diagram of a training apparatus for deep learning models according to an embodiment of the present invention;
FIG. 12 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present invention; and
fig. 13 schematically shows a block diagram of an electronic device adapted to implement a sample data generation method, a training method of a deep learning model, and a data processing method according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A variety of application scenarios can generate a huge amount of data, and the same or similar samples exist in such mass data. If training and optimization of a model are performed based on all of the data, the cost easily increases sharply. Therefore, in order to reduce the cost of subsequent model training and optimization, mining can be performed on the mass data.
For example, a model-based data gathering method may be utilized for mining of massive amounts of data. That is, a deep learning model may be trained using the sample set, resulting in a trained deep learning model. An erroneous output result is then determined by comparing the output result of the model with the expected output result, and the sample corresponding to the erroneous output result is taken as the mined sample. However, samples corresponding to erroneous output results are not necessarily representative, so the samples obtained by the model-based data gathering method lack directivity, and it is difficult to extract effective, representative significant samples from mass data, thereby increasing the data processing amount and reducing the processing efficiency of the electronic device.
Therefore, the embodiment of the invention provides a sample generation scheme. Firstly, a sample characterization vector set is obtained according to a first sample set whose categories are not determined; then the first sample set is clustered according to the sample characterization vector set to obtain at least one clustered sample set; and a significant sample set is then determined according to the at least one clustered sample set. Therefore, training and optimization of a subsequent model need not be performed on the basis of the whole first sample set: the significant samples can be mined from the first sample set through clustering, so that the data processing amount of an electronic device such as a processor is reduced, and the processing efficiency of the electronic device is improved. On this basis, since the significant samples are effective samples, using them to train and optimize the subsequent model reduces the number of model iterations, improves the training speed of the model, and reduces the cost of subsequent model training and optimization, thereby improving the internal performance of the electronic device in accordance with natural laws and enhancing the core competitiveness of the electronic device.
In the technical scheme of the invention, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good custom of the public order.
In the technical scheme of the invention, before the personal information of the user is acquired or collected, the authorization or the consent of the user is acquired.
Fig. 1 schematically shows an exemplary system architecture to which a sample data generation method, a training method of a deep learning model, a data processing method, and an apparatus according to an embodiment of the present invention may be applied.
It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present invention may be applied, so as to help those skilled in the art understand the technical content of the present invention, and it does not mean that the embodiments of the present invention may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the sample data generation method, the deep learning model training method, the data processing method, and the apparatus may be applied may include a terminal device, but the terminal device may implement the sample data generation method, the deep learning model training method, the data processing method, and the apparatus provided in the embodiments of the present invention without interacting with a server.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be any type of server that provides various services. For example, the server 105 may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and remedies the defects of high management difficulty and weak service extensibility in conventional physical hosts and VPS (Virtual Private Server) services. The server 105 may also be a server of a distributed system or a server that incorporates a blockchain.
It should be noted that the sample data generation method and the data processing method provided in the embodiments of the present invention may be generally executed by the terminal device 101, 102, or 103. Accordingly, the sample data generating apparatus and the data processing apparatus provided in the embodiments of the present invention may also be disposed in the terminal device 101, 102, or 103.
Alternatively, the sample data generation method and the data processing method provided in the embodiment of the present invention may also be generally executed by the server 105. Accordingly, the sample data generating apparatus and the data processing apparatus provided in the embodiments of the present invention may be generally disposed in the server 105. The sample data generation method and the data processing method provided by the embodiment of the present invention may also be executed by a server or a server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, and 103 and/or the server 105. Accordingly, the sample data generating apparatus and the data processing apparatus provided in the embodiments of the present invention may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, and 103 and/or the server 105.
It should be noted that the training method of the deep learning model provided in the embodiment of the present invention may also be generally executed by the server 105. Accordingly, the training device for the deep learning model provided by the embodiment of the present invention may be generally disposed in the server 105. The training method of the deep learning model provided by the embodiment of the present invention may also be executed by a server or a server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, and 103 and/or the server 105. Correspondingly, the training apparatus for deep learning models provided in the embodiment of the present invention may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
Alternatively, the training method of the deep learning model provided by the embodiment of the present invention may be generally executed by the terminal device 101, 102, or 103. Correspondingly, the training device for the deep learning model provided by the embodiment of the invention can also be arranged in the terminal equipment 101, 102 or 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely used as representations of the operations for description, and should not be construed as representing the execution order of the respective operations. The method need not be performed in the exact order shown, unless explicitly stated.
Fig. 2 schematically shows a flowchart of a sample data generation method according to an embodiment of the present invention.
As shown in FIG. 2, the method 200 includes operations S210-S230.
In operation S210, a sample characterization vector set is obtained according to the first sample set. The first sample set includes a plurality of samples whose categories have not been determined.
In operation S220, the first sample set is clustered according to the sample characterization vector set to obtain at least one clustered sample set.
In operation S230, a significant sample set is generated according to the at least one clustered sample set.
According to an embodiment of the present invention, the first sample set may include a plurality of sample data of undetermined categories that need to be subjected to clustering processing. The sample characterization vector set may be obtained by performing feature extraction on the plurality of samples in the first sample set. The clustered sample set may include a plurality of clustered samples. The significant sample set may include a plurality of significant samples. Each clustered sample set may have a significant sample corresponding to it.
According to the embodiment of the invention, the sample characterization vectors corresponding to the samples can be obtained by performing feature extraction on the samples in the first sample set. For example, the samples in the first sample set may be processed using a conventional feature extraction algorithm to obtain sample characterization vectors corresponding to the samples. Alternatively, the samples in the first sample set may be processed using a characterization model to obtain sample characterization vectors corresponding to the samples. The embodiment of the invention does not limit the specific method for obtaining the sample characterization vector set, as long as the sample in the first sample set can be subjected to feature extraction to obtain the corresponding sample characterization vector.
According to embodiments of the present invention, a clustering algorithm is an analysis process that groups a set of physical or abstract objects into classes composed of similar objects; object classification and data mining may be performed through clustering. The clustering algorithm may include at least one of: K-Means clustering, linkage-based hierarchical clustering, density-based clustering, model-based SOM (Self-Organizing Map) clustering, probability-based GMM (Gaussian Mixture Model) clustering, and the like. The embodiment of the invention does not limit the clustering method, as long as the first sample set can be clustered.
According to the embodiment of the invention, the first sample set can be clustered according to the first similarity between the sample characterization vectors in the sample characterization vector set to obtain at least one clustered sample set. The first similarity between sample characterization vectors belonging to the same clustered sample set is greater than or equal to a first predetermined similarity threshold, while the first similarity between sample characterization vectors belonging to different clustered sample sets is less than the first predetermined similarity threshold. The first predetermined similarity threshold may be configured according to actual service requirements, and is not limited herein.
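For illustration only, the thresholding rule above can be sketched in a few lines of Python; the cosine measure and the greedy single-pass grouping are assumptions, since the embodiment fixes only the thresholding behavior:

```python
import numpy as np

def cluster_by_first_similarity(vectors, threshold):
    """A minimal sketch, assuming the first similarity is cosine
    similarity (the embodiment does not fix the measure): samples
    whose similarity to a seed reaches the first predetermined
    threshold join that seed's clustered sample set."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    first_similarities = v @ v.T                 # pairwise similarity matrix
    labels = np.full(len(vectors), -1, dtype=int)
    next_label = 0
    for i in range(len(vectors)):
        if labels[i] == -1:                      # unassigned sample seeds a new set
            members = (first_similarities[i] >= threshold) & (labels == -1)
            labels[members] = next_label
            next_label += 1
    return labels
```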
According to an embodiment of the present invention, the sample generation method of an embodiment of the present invention may be performed by an electronic device. The electronic device may include at least one processor. The processor may be configured to perform the sample generation methods provided by embodiments of the invention. The sample generation method provided by the embodiment of the present invention may be executed by a single processor, or may be executed in parallel by a plurality of processors.
According to the embodiment of the invention, a sample characterization vector set is obtained according to a first sample set whose categories are not determined; the first sample set is then clustered according to the sample characterization vector set to obtain at least one clustered sample set; and a significant sample set is then determined according to the at least one clustered sample set. Therefore, training and optimization of a subsequent model need not be performed on the basis of the whole first sample set: significant samples can be mined from a massive sample set through clustering, so that the data processing amount of an electronic device such as a processor is reduced, and the processing efficiency of the electronic device is improved. On this basis, since the significant samples are effective samples, using them to train and optimize the subsequent model reduces the number of model iterations, improves the training speed of the model, and reduces the cost of subsequent model training and optimization, thereby improving the internal performance of the electronic device in accordance with natural laws and enhancing the core competitiveness of the electronic device.
According to an embodiment of the invention, the sample may comprise one of: sample images, sample text, and sample audio.
According to the embodiment of the invention, in the case that the sample comprises a sample image, the significant sample determined by the sample generation method provided by the embodiment of the invention can be used in the field of image processing. In the case where the samples include sample text, the significant samples determined using the sample generation method provided according to the embodiment of the present invention may be used in the field of text processing. In the case where the samples comprise sample audio, the salient samples determined using the sample generation method provided according to the embodiments of the present invention may be used in the field of speech processing.
With reference to fig. 3 to fig. 5, a sample generation method according to an embodiment of the present invention is further described with reference to the specific embodiment.
According to an embodiment of the present invention, obtaining a sample characterization vector set according to the first sample set may include the following operations.
The first sample set is processed with a characterization model to obtain the sample characterization vector set. The characterization model is obtained by training a self-supervised model, based on a loss function, according to a sample characterization vector of a positive sample and sample characterization vectors of a plurality of negative samples corresponding to the positive sample. The plurality of negative samples are determined from a plurality of candidate negative samples corresponding to the positive sample.
According to the embodiment of the invention, in contrastive learning, a child sample obtained by performing data enhancement on a parent sample is considered a positive sample for the parent sample, because the child sample and the parent sample have the same category and retain the same semantic information. The parent sample refers to the sample that is the subject of the data enhancement processing. For the same parent sample, data enhancement may be performed multiple times, resulting in multiple child samples. Although the multiple child samples are derived from the same parent sample, they differ slightly from one another, that is, they do not completely coincide. A negative sample refers to another sample whose category differs from that of the parent sample. In the embodiment of the invention, the positive samples may comprise the parent sample and the positive samples obtained by performing data enhancement on the parent sample.
According to an embodiment of the invention, the self-supervised model may comprise at least one of: CPC (Contrastive Predictive Coding), AMDIM (Augmented Multiscale Deep InfoMax), MoCo (Momentum Contrast), SimCLR (Simple Framework for Contrastive Learning of Visual Representations), BYOL (Bootstrap Your Own Latent), and the like.
According to embodiments of the invention, the loss function (i.e. the first loss function) may comprise at least one of: InfoNCE (Info Noise Contrastive Estimation) loss, NCE (Noise Contrastive Estimation) loss, and the like. The loss function may further include a loss function obtained by modifying the above loss functions. For example, the loss function may also include distance-based InfoNCE.
According to the embodiment of the present invention, a plurality of negative samples may be determined from the plurality of candidate negative samples according to the second similarity between the sample characterization vector of the positive sample and the sample characterization vectors of the plurality of candidate negative samples corresponding to the positive sample. For example, a second similarity between the sample characterization vector of the positive sample and the sample characterization vector of each candidate negative sample may be determined, resulting in a plurality of second similarities. The plurality of negative samples are then determined from the plurality of candidate negative samples based on a second predetermined similarity threshold and the plurality of second similarities: a candidate negative sample is determined to be a negative sample if the second similarity between the sample characterization vector of the positive sample and the sample characterization vector of the candidate negative sample is less than or equal to the second predetermined similarity threshold. The second predetermined similarity threshold may be configured according to actual service requirements, and is not limited herein.
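As a hedged sketch of this selection rule (cosine similarity is assumed for the second similarity; the embodiment does not fix the measure):

```python
import numpy as np

def select_negative_samples(positive_vec, candidate_vecs, threshold):
    """A candidate negative sample is kept only when its second
    similarity to the positive sample is at most the second
    predetermined similarity threshold."""
    p = positive_vec / np.linalg.norm(positive_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    second_similarities = c @ p                  # one value per candidate
    return candidate_vecs[second_similarities <= threshold]
```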
According to an embodiment of the present invention, the characterization model being obtained by training the self-supervised model using a positive sample and a plurality of negative samples corresponding to the positive sample may include: the characterization model may be obtained by training the self-supervised model using output values determined, based on the first loss function, from the sample characterization vector of the positive sample and the sample characterization vectors of the plurality of negative samples corresponding to the positive sample.
According to an embodiment of the present invention, determining the plurality of negative samples from the plurality of candidate negative samples corresponding to the positive sample may include: determining the plurality of negative samples from the plurality of candidate negative samples based on the sample characterization vector of the positive sample and the sample characterization vectors of the plurality of candidate negative samples corresponding to the positive sample. The sample characterization vector of the positive sample is obtained by processing the positive sample with the self-supervised model. The sample characterization vector of a negative sample is obtained by processing the negative sample with the self-supervised model.
According to embodiments of the invention, a queue may include a plurality of queue elements. The plurality of queue elements are chronologically sequential, i.e., entered into the queue in chronological order. The queue has a "first in first out" feature, i.e. if a new queue element needs to be added to the queue, the oldest enqueued queue element can be dequeued and the new queue element added to the queue if the queue is full.
According to an embodiment of the present invention, a momentum queue may refer to a queue having a certain length. The queue elements in the momentum queue are characterization vectors, i.e., the momentum queue may include a plurality of sample characterization vectors. A characterization vector included in the momentum queue may refer to a sample characterization vector corresponding to a candidate negative sample. The sample characterization vectors comprised by the momentum queue may be dynamically updated, i.e., each round has a momentum queue corresponding to that round. The momentum queue corresponding to the current round is obtained by adding the sample characterization vector of the parent sample of the previous round to the momentum queue corresponding to the previous round and removing the oldest characterization vector from that queue in chronological order, so that the number of sample characterization vectors included in the momentum queue remains unchanged.
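A minimal sketch of such a fixed-length queue, implemented as a circular buffer so that enqueuing the newest vector evicts the oldest (class and method names are assumptions):

```python
import torch

class MomentumQueue:
    """Fixed-length FIFO of sample characterization vectors: the
    stored count stays constant across updates."""
    def __init__(self, length: int, dim: int):
        self.vectors = torch.zeros(length, dim)
        self.ptr = 0                                 # next slot to overwrite

    def update(self, new_vector: torch.Tensor) -> None:
        # overwrite the oldest entry ("first in, first out")
        self.vectors[self.ptr] = new_vector.detach()
        self.ptr = (self.ptr + 1) % self.vectors.size(0)
```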
According to an embodiment of the invention, the self-supervised model may comprise a first encoder and a second encoder. Multiple rounds of training may be performed on the first encoder and the second encoder until a predetermined condition is satisfied. The trained second encoder is then determined to be the characterization model.
According to an embodiment of the present invention, performing multiple rounds of training on the first encoder and the second encoder may include: processing the parent sample corresponding to the current round with the first encoder corresponding to the current round to obtain a sample characterization vector of the parent sample corresponding to the current round; and processing the positive sample corresponding to the current round with the second encoder corresponding to the current round to obtain a sample characterization vector of the positive sample corresponding to the current round. The positive samples are obtained by performing data enhancement on the parent samples. Based on the first loss function, the first encoder and the second encoder corresponding to the current round are trained using the sample characterization vector of the parent sample, the sample characterization vector of the positive sample, and the sample characterization vectors of the plurality of negative samples corresponding to the current round. The sample characterization vectors of the negative samples corresponding to the current round are obtained from the momentum queue corresponding to the current round and the sample characterization vector of the parent sample based on the sample selection policy corresponding to the current round. The momentum queue comprises sample characterization vectors of candidate negative samples, which are obtained by processing the candidate negative samples with the second encoder.
According to the embodiment of the invention, the sample characterization vectors of the negative samples corresponding to the current round are obtained by selecting a part of the sample characterization vectors from the momentum queue corresponding to the current round according to at least one first target distance based on the sample selection strategy corresponding to the current round. The first target distance may be a distance between a sample characterization vector of a parent sample corresponding to the current round and a sample characterization vector of a candidate negative sample included in the momentum queue. For example, for each of the at least one first target distance, in an instance in which the first target distance is determined to be greater than or equal to a first predetermined distance threshold, a sample characterization vector of a candidate negative sample in the momentum queue of the current round corresponding to the first target distance may be determined as a sample characterization vector of a negative sample corresponding to the current round. The first predetermined distance threshold may be configured according to an actual service requirement, and is not limited herein.
According to an embodiment of the present invention, the InfoNCE based on the distance distribution may be determined according to the following formula (1):

$$\mathcal{L} = -\log \frac{\exp\left(q \cdot k^{+} / \tau\right)}{\exp\left(q \cdot k^{+} / \tau\right) + \sum_{i=1}^{K} \mathbb{1}\left[d\left(q, k_{i}\right) \geq d_{0}\right] \exp\left(q \cdot k_{i} / \tau\right)} \tag{1}$$

According to an embodiment of the present invention, $\mathcal{L}$ characterizes the distance-distribution-based InfoNCE; $q$ characterizes the sample characterization vector of the parent sample corresponding to the current round; $k^{+}$ characterizes the sample characterization vector of the positive sample corresponding to the parent sample of the current round; $k_{i}$ characterizes the sample characterization vector of the $i$-th negative sample corresponding to the current round; $i$ is an integer greater than or equal to 1 and less than or equal to $K$, where $K$ may be an integer greater than 1; $K$ characterizes the number of candidate negative samples included in the momentum queue corresponding to the current round; $d(q, k_{i})$ characterizes the first target distance between $q$ and $k_{i}$; $d_{0}$ characterizes the first predetermined distance threshold; and $\tau$ characterizes the hyper-parameter. The indicator $\mathbb{1}[\cdot]$ keeps only those candidates whose first target distance reaches the threshold, in line with the sample selection policy described above.
According to the embodiment of the invention, the loss function value is determined by utilizing the distance-distribution-based InfoNCE, so that the negative samples are determined from the plurality of candidate negative samples; negative samples in the momentum queue that differ too little from the positive sample are thereby effectively prevented from participating in the training of the model, which reduces the probability of the self-supervised model overfitting in the training stage.
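A hedged PyTorch rendering of formula (1) may look as follows; the Euclidean distance and the masking of near negatives follow the embodiment above, while the variable names, the normalization step, and the default temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def distance_based_infonce(q, k_pos, queue, d0, tau=0.07):
    """q: (D,) parent sample characterization vector; k_pos: (D,)
    positive sample characterization vector; queue: (K, D) candidate
    negative vectors from the momentum queue; d0: first predetermined
    distance threshold; tau: temperature hyper-parameter."""
    q = F.normalize(q, dim=0)
    k_pos = F.normalize(k_pos, dim=0)
    queue = F.normalize(queue, dim=1)

    pos_logit = torch.dot(q, k_pos) / tau            # numerator term
    neg_logits = queue @ q / tau                     # (K,) candidate terms

    # keep only candidates whose first target distance reaches d0
    far_enough = (torch.linalg.norm(queue - q, dim=1) >= d0).float()

    denom = pos_logit.exp() + (neg_logits.exp() * far_enough).sum()
    return denom.log() - pos_logit                   # equals -log(pos / denom)
```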
FIG. 3 schematically shows an example schematic of a training process of a characterization model according to an embodiment of the invention.
As shown in fig. 3, in 300, the self-supervised model 303 may include a first encoder 3031 and a second encoder 3032. The parent sample 301 may be processed with the first encoder 3031, resulting in a sample characterization vector 304 of the parent sample. The positive sample 302 corresponding to the parent sample 301 is processed by the second encoder 3032 to obtain a sample characterization vector 305 of the positive sample.
A first target distance between the sample characterization vector 304 of the parent sample and the sample characterization vector 306 of each of the plurality of candidate negative samples in the momentum queue is determined, resulting in a plurality of first target distances 307. A sample characterization vector 308 for each of the plurality of negative samples is determined from the sample characterization vectors 306 of the candidate negative samples comprised by the momentum queue, based on the plurality of first target distances 307 and the first predetermined distance threshold.
Based on the first loss function 309, a first loss function value 310 is obtained from the sample characterization vector 304 of the parent sample, the sample characterization vector 305 of the positive sample, and the sample characterization vectors 308 of the respective negative samples. The model parameters of the first encoder 3031 and the second encoder 3032 are adjusted according to the first loss function value 310, resulting in a trained second encoder 3032. The trained second encoder 3032 is determined as the characterization model.
According to an embodiment of the present invention, the significant sample set may include at least one significant sample.
According to an embodiment of the present invention, the sample data generation method may further include the following operations.
An abnormal sample set is determined, according to the significant sample, from the clustered sample set corresponding to the significant sample, so as to optimize the characterization model by using the significant sample set and the abnormal sample set. The abnormal sample set includes abnormal samples whose class differs from that of the significant sample.
According to an embodiment of the present invention, the abnormal sample set may include at least one abnormal sample. The category of the abnormal sample is different from the category of the significant sample corresponding to the abnormal sample. Abnormal samples in the cluster sample set corresponding to the significant samples can be obtained according to the characteristic information of the significant samples and the characteristic information of the cluster samples in the cluster sample set corresponding to the significant samples. For example, a cluster sample that does not match the feature information of the significant sample is determined as an abnormal sample.
According to an embodiment of the present invention, after the abnormal sample set is determined, based on the second loss function, the second loss function value may be obtained according to the sample characterization vector of the abnormal sample included in the abnormal sample set and the sample characterization vector of the significant sample included in the significant sample set. And adjusting the model parameters of the characterization model according to the second loss function value to obtain the optimized characterization model.
According to an embodiment of the invention, the second loss function may comprise one of: a Contrastive Loss function, a Triplet Loss function, a Ranked List Loss function, a Multi-Similarity Loss function, and the like.
FIG. 4 schematically shows an example schematic of an optimization process of a characterization model according to an embodiment of the invention.
As shown in fig. 4, in 400, a significant sample set 401 may be processed using a characterization model 402, resulting in a sample characterization vector 403 of significant samples included in the significant sample set 401. The abnormal sample set 404 is processed by the characterization model 402, and a sample characterization vector 405 of an abnormal sample in the abnormal sample set 404 is obtained. The sample characterization vector 403 for the significant sample and the sample characterization vector 405 for the abnormal sample may be input into a second loss function 406, resulting in a second loss function value 407. And adjusting the model parameters of the characterization model 402 according to the second loss function value 407 to obtain the optimized characterization model. The second loss function may comprise a triplet loss function.
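For concreteness, one optimization step with the triplet variant of the second loss function might look like the sketch below; pairing each abnormal sample with a significant sample of its cluster and a same-class cluster sample is an assumption about how the batch is constructed:

```python
import torch

triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)  # margin is an assumption

def optimize_step(characterization_model, optimizer, significant, same_class, abnormal):
    """significant: batch of significant samples (anchors);
    same_class: cluster samples sharing the anchors' classes;
    abnormal: abnormal samples of different classes."""
    a = characterization_model(significant)
    p = characterization_model(same_class)
    n = characterization_model(abnormal)
    loss = triplet_loss(a, p, n)                    # second loss function value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```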
According to an embodiment of the present invention, determining an abnormal sample set from a cluster sample set corresponding to a significant sample according to the significant sample may include the following operations.
In response to detecting a marking operation for a significant sample, the clustered sample set corresponding to the significant sample is displayed. Samples different from the category of the significant sample are determined from the clustered sample set corresponding to the significant sample, resulting in an abnormal sample set.
According to the embodiment of the invention, under the condition that the marking operation aiming at the significant samples is detected, the clustered samples in the clustered sample set corresponding to the significant samples can be dynamically displayed, so that under the condition that the significant samples are marked, the clustered samples different from the classes of the significant samples can be determined from the clustered sample set corresponding to the significant samples, and the abnormal sample set is obtained.
According to the embodiment of the invention, the clustered samples in the clustered sample set corresponding to the significant sample can be displayed by utilizing a predetermined plug-in. For example, the predetermined plug-in may be a rendering plug-in having a page rendering function. A display page for displaying the clustered sample set corresponding to the significant sample may be rendered with the rendering plug-in.
According to the embodiment of the invention, the characterization model is optimized by utilizing the significant sample set and the abnormal sample set with different classes, so that the generalization capability of the characterization model can be improved, and the training precision of the characterization model and the subsequent application model can be improved.
According to an embodiment of the present invention, operation S220 may include the following operations.
A density-based clustering algorithm is used to obtain at least one clustered sample set according to the sample characterization vector set. Each clustered sample set has a cluster sample center and includes at least one clustered sample. Determining a significant sample set from the at least one clustered sample set may include the following operation: the cluster sample center is determined to be a significant sample.
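A small sketch of the center rule; taking the member nearest the centroid as the cluster sample center is an assumption, since a density-based algorithm may expose its own notion of center:

```python
import numpy as np

def pick_significant_sample(cluster_vectors):
    """Return the index of the cluster member nearest the centroid,
    used here as the cluster sample center."""
    centroid = cluster_vectors.mean(axis=0)
    distances = np.linalg.norm(cluster_vectors - centroid, axis=1)
    return int(np.argmin(distances))
```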
According to an embodiment of the present invention, the density-based clustering algorithm may include the following: the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, the CFSFDP (Clustering by Fast Search and Find of Density Peaks) algorithm, and the like.
For example, in case the density-based clustering algorithm is the DBSCAN algorithm, the radius of the clustered sample set and the minimum number of samples in the clustered sample set need to be determined. The radius and the minimum sample number may be set adaptively, or may be set according to actual service requirements, which is not limited herein. For example, a distance matrix over all samples in the sample set can be determined and its upper triangular part obtained; the radius of the clustered sample set is then determined according to the magnitudes of the element values included in the upper triangular matrix. Using this radius, a predetermined sample set is pre-clustered to obtain the number of samples included in each of at least one pre-clustered sample set, and the minimum number of samples is determined based on these counts. For example, the average of the numbers of samples included in the at least one pre-clustered sample set may be taken as the minimum number of samples. According to the embodiment of the invention, determining the cluster sample center as the significant sample allows significant samples to be mined from a massive sample set through clustering, reducing the cost of subsequent model training and optimization.
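Expressed as code, the adaptive parameter setting described above might look like the following sketch; reading the radius off a fixed quantile of the pairwise distances is an assumption, as the embodiment only states that the radius follows from the element values of the upper triangular distance matrix:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

def estimate_dbscan_parameters(vectors, quantile=0.05):
    distances = pairwise_distances(vectors)
    upper = distances[np.triu_indices_from(distances, k=1)]
    eps = float(np.quantile(upper, quantile))        # radius of the clustered sample set

    # pre-cluster once, then use the average cluster size as the minimum sample number
    pre = DBSCAN(eps=eps, min_samples=2).fit(vectors)
    sizes = np.bincount(pre.labels_[pre.labels_ >= 0])
    min_samples = int(sizes.mean()) if sizes.size else 2
    return eps, max(min_samples, 2)
```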
According to an embodiment of the present invention, obtaining at least one clustered sample set according to the sample characterization vector set by using a density-based clustering algorithm may include the following operations.
A density-based clustering algorithm is used to obtain at least one initial clustered sample set according to the sample characterization vector set, each initial clustered sample set having an initial cluster sample center. In the case that deviated samples are determined to exist, the initial clustered sample set corresponding to each deviated sample is determined according to the sample characterization vector of the deviated sample and the sample characterization vectors corresponding to the at least one initial cluster sample center, yielding an updated initial clustered sample set. The to-be-reclustered sample set is then clustered according to its corresponding sample characterization vector set, obtaining at least one clustered sample set. The to-be-reclustered sample set comprises at least one of the following: the updated initial clustered sample set and at least one other clustered sample set, where the other clustered sample sets are the initial clustered sample sets other than the updated initial clustered sample set.
According to the embodiment of the invention, the sample characterization vector set can be subjected to preliminary clustering by using a density-based clustering algorithm to obtain at least one initial clustering sample set. The initial clustering sample center is the centroid of the initial clustering sample set.
According to the embodiment of the invention, in the case that the deviated sample is determined to exist, the second target distance between the deviated sample and the center of the at least one initial clustered sample is determined according to the sample characterization vector of the deviated sample and the sample characterization vector corresponding to the center of the at least one initial clustered sample, so as to obtain the at least one second target distance. And determining a target initial clustering sample center from the at least one initial clustering sample center according to the at least one second target distance. And determining the deviated sample as a clustering sample in a clustering sample set corresponding to the target initial clustering sample center. For example, a minimum target distance may be determined from the at least one second target distance. And determining the initial clustering sample center corresponding to the minimum target distance as the target initial clustering sample center.
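The re-matching rule can be sketched as follows, assuming DBSCAN-style labels in which -1 marks a deviated sample and `centers` holds the centroids of the initial clustered sample sets:

```python
import numpy as np

def reassign_deviated_samples(vectors, labels, centers):
    """Attach each deviated sample to the initial clustered sample
    set whose center is at the minimum second target distance."""
    updated = labels.copy()
    for i in np.where(labels == -1)[0]:
        second_target_distances = np.linalg.norm(centers - vectors[i], axis=1)
        updated[i] = int(np.argmin(second_target_distances))  # nearest center wins
    return updated
```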
According to the embodiment of the invention, clustering noise generated in the clustering process can be eliminated by re-matching the deviated samples generated by the density-based clustering algorithm, so that the quality of the determined samples for participating in the subsequent deep learning model training is improved.
According to an embodiment of the present invention, the sample data generating method may further include the following operations.
And determining the distance between the significant sample and at least one historical significant sample included in the historical significant sample set according to the sample characterization vector of the significant sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, so as to obtain at least one distance. And determining whether a matching sample matched with the significant sample exists in the historical significant sample set according to at least one distance.
According to embodiments of the invention, the historical sample set may comprise a historical significant sample set. The set of historically significant samples may include a plurality of historically significant samples. The historically significant samples may have a set of historically clustered samples corresponding to the historically significant samples.
According to the embodiment of the invention, the historical significant sample set can be processed by using the characterization model, and the sample characterization vector set corresponding to the historical significant sample set is obtained. In addition, the historical significant sample set can be processed by using a feature extraction algorithm, and a sample characterization vector set corresponding to the historical significant sample set is obtained.
According to an embodiment of the present invention, it may be determined whether there is a matching sample in the historical set of significant samples that matches the significant sample based on the at least one distance and a second predetermined distance threshold. For example, for a distance of the at least one distance, in the event that it is determined that there is a distance less than or equal to a second predetermined distance threshold, it is determined that there is a matching sample in the historical set of significant samples that matches the significant sample. In the event that it is determined that there is no distance less than or equal to the second predetermined distance threshold, it is determined that there are no matching samples in the historical set of significant samples that match the significant samples. The second predetermined distance threshold may be configured according to actual service requirements, and is not limited herein.
According to an embodiment of the present invention, in a case where it is determined that the number of distances less than or equal to the second predetermined distance threshold is greater than 1, the minimum distance is determined from the plurality of distances, and the historical significant sample corresponding to the minimum distance is determined to be the matching sample that matches the significant sample. In a case where it is determined that the number of distances less than or equal to the second predetermined distance threshold is equal to 1, the historical significant sample corresponding to that distance is determined to be the matching sample that matches the significant sample.
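The matching rule of the two paragraphs above reduces to a short sketch; Euclidean distance between characterization vectors is an assumption:

```python
import numpy as np

def match_historical_significant(sig_vector, historical_vectors, threshold):
    """Return the index of the matching historical significant sample,
    or None when no distance falls within the second predetermined
    distance threshold (i.e. the sample is a new historical one)."""
    distances = np.linalg.norm(historical_vectors - sig_vector, axis=1)
    within = np.where(distances <= threshold)[0]
    if within.size == 0:
        return None
    return int(within[np.argmin(distances[within])])
```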
According to the embodiment of the invention, the significant samples can be added to the historical sample set, the clustering sample set corresponding to the significant samples is added to the historical sample set, and the construction of the historical sample set is completed step by step.
According to an embodiment of the present invention, the significant sample set may include at least one significant sample.
According to an embodiment of the present invention, the sample data generation method may further include the following operations.
And aiming at the significant samples, under the condition that the matched samples matched with the significant samples exist in the historical significant sample set according to the sample characterization vectors of the significant samples and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, merging the clustering sample set corresponding to the significant samples and the clustering sample set corresponding to the matched samples. In the case that it is determined that there is no matching sample matching the significant sample in the historical significant sample set based on the sample characterization vector of the significant sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, the significant sample is determined as a new historical significant sample, and the clustered sample set corresponding to the significant sample is added to the historical sample set.
According to the embodiment of the invention, in the case that the matched sample matched with the significant sample exists in the historical significant sample set, the significant sample, the cluster sample set corresponding to the significant sample and the cluster sample set corresponding to the matched sample can be merged. In the event that it is determined that there are no matching samples in the historical significant sample set that match the significant samples, the cluster sample set corresponding to the significant samples may be added to the historical sample set.
According to the embodiment of the invention, after the historical sample set is updated for multiple times, the historical sample set can be subjected to data cleaning. For example, respective distances between respective historical samples included in the historical sample set may be determined, resulting in a plurality of distances. And according to the plurality of distances and a third preset distance threshold value, re-determining the historical significant samples in the historical sample set and the historical clustering sample set corresponding to the historical significant samples. The third predetermined distance threshold may be configured according to actual service requirements, and is not limited herein. For example, the third predetermined distance threshold may be greater than the second predetermined distance threshold.
According to the embodiment of the invention, the merging of clustered sample sets is performed in a case where a matching sample exists in the historical significant sample set, and the significant sample together with its clustered sample set is added in a case where no matching sample exists. In this way, duplicate samples are avoided and unified management of the historical sample set is realized.
Fig. 5 schematically shows an example schematic diagram of a sample data generation process according to an embodiment of the present invention.
As shown in fig. 5, in 500, the first sample set 501 may include sample 501_1, sample 501_2, sample 501_3, sample 501_4, …, sample 501_p, …, and sample 501_P. P may be an integer greater than 1, and p ∈ {1, 2, …, (P-1), P}.
The first sample set 501 may be processed using the characterization model 502 to obtain a sample characterization vector set 503. The sample characterization vector set 503 may include sample characterization vector 503_1, sample characterization vector 503_2, sample characterization vector 503_3, sample characterization vector 503_4, …, sample characterization vector 503_p, …, and sample characterization vector 503_P. For example, sample 501_p may be processed using the characterization model 502 to obtain sample characterization vector 503_p.
The first sample set 501 may be clustered according to the sample characterization vector set 503 to obtain at least one clustered sample set 504. The at least one clustered sample set 504 may include clustered sample set 504_1, clustered sample set 504_2, …, clustered sample set 504_q, …, and clustered sample set 504_Q. Q may be an integer greater than 1 and less than P, and q ∈ {1, 2, …, (Q-1), Q}. For example, the distances between the samples in the first sample set 501 may be determined according to the sample characterization vectors in the sample characterization vector set 503, resulting in a plurality of distances. Samples whose distances fall within the same predetermined distance range are determined as samples of the same clustered sample set.
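As a non-limiting sketch, the clustering step and the selection of cluster sample centers might look as follows; DBSCAN is one density-based algorithm consistent with the description, and the hyperparameter values and the nearest-to-mean definition of the center are assumptions, not values from the patent.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_and_pick_significant(char_vecs, eps=0.5, min_samples=5):
    """Cluster sample characterization vectors (an ndarray, one row per
    sample) with a density-based algorithm and take each cluster's
    center as its significant sample."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(char_vecs)
    clusters, significant = {}, {}
    for idx, label in enumerate(labels):
        if label != -1:                          # -1 marks DBSCAN noise points
            clusters.setdefault(label, []).append(idx)
    for label, members in clusters.items():
        # The member closest to the cluster mean serves as the cluster
        # sample center, i.e. the significant sample of this cluster.
        mean_vec = char_vecs[members].mean(axis=0)
        offset = int(np.argmin(np.linalg.norm(char_vecs[members] - mean_vec, axis=1)))
        significant[label] = members[offset]
    return clusters, significant
```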
A significant sample set 505 may be generated from the at least one clustered sample set 504. The significant sample set 505 may include significant sample 505_1, significant sample 505_2, …, significant sample 505_q, …, and significant sample 505_Q. For example, the cluster sample center of clustered sample set 504_q may be determined as significant sample 505_q.
For each significant sample in the significant sample set 505, an abnormal sample set 506 may be determined from the clustered sample set corresponding to that significant sample. The abnormal sample set 506 may include abnormal sample 506_1, abnormal sample 506_2, …, abnormal sample 506_r, …, and abnormal sample 506_R. R may be an integer greater than or equal to 1, and r ∈ {1, 2, …, (R-1), R}. For example, in response to detecting a marking operation for significant sample 505_q, the clustered sample set 504_q corresponding to significant sample 505_q may be displayed. Samples different from the category of significant sample 505_q are determined from the clustered sample set 504_q, resulting in the abnormal sample set corresponding to significant sample 505_q.
The characterization model 502 may be optimized using the significant sample set 505 and the abnormal sample set 506. For example, the characterization model 502 may be trained using the significant sample set 505 and the abnormal sample set 506, resulting in an optimized characterization model.
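The patent does not fix the optimization objective; the sketch below assumes a margin-based contrastive step that pushes each abnormal sample away from the significant sample of its cluster, with paired batches for simplicity. All names and the margin value are illustrative.

```python
import torch
import torch.nn.functional as F

def optimize_characterization_model(model, sig_batch, abn_batch, optimizer, margin=1.0):
    """One optimization step using paired significant/abnormal samples
    (assumed objective, not the patent's stated loss)."""
    sig_vecs = model(sig_batch)      # characterization vectors, significant samples
    abn_vecs = model(abn_batch)      # characterization vectors, abnormal samples
    # Require each abnormal sample to lie at least `margin` away from the
    # significant sample of the cluster it was drawn from.
    dist = F.pairwise_distance(sig_vecs, abn_vecs)
    loss = F.relu(margin - dist).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```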
The above is only an exemplary embodiment, but is not limited thereto, and other sample data generation methods known in the art may be included as long as the sample data can be generated.
FIG. 6 schematically shows a flowchart of a training method of a deep learning model according to an embodiment of the present invention.
As shown in FIG. 6, the method 600 includes operations S610-S630.
In operation S610, the significant samples are input into the deep learning model, resulting in an output value.
In operation S620, a loss function value is determined according to the output value and the label value of the significant sample.
In operation S630, model parameters of the deep learning model are adjusted according to the loss function values, resulting in a trained deep learning model.
According to the embodiment of the invention, the significant sample can be generated by using the sample data generation method according to the embodiment of the invention.
According to an embodiment of the invention, the deep learning model may include one of: a text processing model, an audio processing model, and an image processing model. The text processing model may include at least one of: a text recognition model, a text detection model, a text question-answering model, and the like. The audio processing model may include at least one of: an audio recognition model, an audio detection model, an audio synthesis model, and the like. The image processing model may include at least one of: an image recognition model, an image segmentation model, an image classification model, and a target detection model.
According to an embodiment of the invention, the deep learning model may include one of: supervised, semi-supervised and unsupervised models.
According to the embodiment of the invention, the significant sample can be input into the deep learning model to obtain an output value representing the predicted category of the significant sample. The output value and the label value of the significant sample can be input into a loss function to obtain a loss function value. Model parameters of the deep learning model can be adjusted according to the loss function value until a predetermined termination condition is met. The deep learning model obtained when the predetermined termination condition is met is determined as the trained deep learning model. The predetermined termination condition may include reaching a predetermined number of iterations or convergence of the loss function.
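A minimal training loop following this description might look as follows; the epoch count and convergence tolerance are placeholder values standing in for the predetermined termination condition.

```python
import torch

def train_deep_learning_model(model, loader, loss_fn, optimizer,
                              max_epochs=100, tol=1e-4):
    """Train on labeled significant samples until a predetermined number
    of iterations is reached or the loss function converges (sketch)."""
    prev_loss = float("inf")
    for epoch in range(max_epochs):          # predetermined number of iterations
        for samples, labels in loader:       # labeled significant samples
            output = model(samples)          # output value (predicted category)
            loss = loss_fn(output, labels)   # loss function value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                 # adjust model parameters
        if abs(prev_loss - loss.item()) < tol:
            break                            # loss function convergence
        prev_loss = loss.item()
    return model
```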
According to the embodiment of the invention, the labeled significant samples are used as training samples to obtain the deep learning model through training. Since the significant samples are effective samples, training the deep learning model with them reduces the number of model iterations, increases the training speed of the model, and improves the prediction precision of the model. The training cost of the deep learning model is therefore reduced, the internal performance of the electronic device is improved in accordance with natural laws, and the core competitiveness of the electronic device is enhanced.
According to an embodiment of the present invention, the training method of the deep learning model may further include the following operations.
In a case where it is determined, according to the output value and the label value corresponding to the significant sample, that the significant sample is an erroneous sample, a similar sample set corresponding to the erroneous sample is determined from the historical sample set according to the sample characterization vector of the erroneous sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, so that a training operation for the trained deep learning model is performed using the similar sample set.
According to the embodiment of the invention, in a case where the significant sample is determined to be an erroneous sample, a similar sample set corresponding to the erroneous sample can be determined from the historical sample set according to the sample characterization vector of the erroneous sample and the sample characterization vector set of the historical significant sample set. The similar sample set is input into the trained deep learning model, and directed iteration is performed for the erroneous sample. Model parameters of the trained deep learning model are adjusted through a back-propagation mechanism, thereby optimizing the trained deep learning model.
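A sketch of this query step is given below; how many historical clusters are retrieved for one erroneous sample is not specified in the patent, so `top_k` is an assumed knob and all names are illustrative.

```python
import numpy as np

def similar_samples_for_error(err_vec, hist_sig_vecs, hist_clusters, top_k=1):
    """Query the historical sample set with an erroneous sample's
    characterization vector and return the members of the closest
    historical clusters for a directed training pass (sketch)."""
    dists = np.linalg.norm(hist_sig_vecs - err_vec, axis=1)
    nearest = np.argsort(dists)[:top_k]      # closest historical significant samples
    similar = []
    for j in nearest:
        similar.extend(hist_clusters[j])     # cluster sample set of the match
    return similar
```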
According to the embodiment of the invention, the historical sample set is queried based on the erroneous sample, the similar sample set corresponding to the erroneous sample is determined from the historical sample set, and the trained deep learning model is optimized accordingly, so that the generalization capability of the trained deep learning model can be improved, further improving its practical application effect.
The training method of the deep learning model according to the embodiment of the present invention is further described below with reference to fig. 7 and a specific embodiment.
FIG. 7 schematically shows an example schematic of a training process for a deep learning model according to an embodiment of the invention.
As shown in fig. 7, at 700, a significant sample 701 may be input into the deep learning model 702, resulting in an output value 703. A loss function value 705 is determined from the output value 703 and the label value 704 of the significant sample. Model parameters of the deep learning model 702 are adjusted according to the loss function value 705 to obtain a trained deep learning model.
In a case where it is determined, from the output value 703 and the label value 704 corresponding to the significant sample 701, that the significant sample 701 is an erroneous sample, a similar sample set 706 corresponding to the erroneous sample may be determined from the historical sample set, so as to perform a training operation for the trained deep learning model using the similar sample set 706.
The above is only an exemplary embodiment, but is not limited thereto, and other training methods of the deep learning model known in the art may be included as long as the deep learning model can be trained.
Fig. 8 schematically shows a flow chart of a data processing method according to an embodiment of the invention.
As shown in fig. 8, the method 800 includes operation S810.
In operation S810, data to be processed is input into the trained deep learning model, and a data processing result is obtained.
According to an embodiment of the present invention, the trained deep learning model may be obtained by training using a training method of the deep learning model provided according to the embodiment of the present invention.
According to an embodiment of the invention, the data to be processed may comprise at least one of: image data, text data, and audio data.
According to the embodiment of the invention, when the data to be processed is processed using the trained deep learning model, the category of the data to be processed can be determined more accurately, which reduces the cost of manually labeling the data to be processed and improves both the prediction accuracy and the processing efficiency for the data to be processed.
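For completeness, the inference step itself reduces to a single forward pass; the sketch below assumes a PyTorch model and illustrative names.

```python
import torch

def process(trained_model, data_to_process):
    """Feed the data to be processed through the trained deep learning
    model and return the data processing result (sketch)."""
    trained_model.eval()                     # disable training-time behavior
    with torch.no_grad():                    # inference only, no gradients
        return trained_model(data_to_process)
```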
The above is only an exemplary embodiment, but is not limited thereto, and other data processing methods known in the art may be included as long as they can process data.
FIG. 9 schematically shows an example schematic of an overall method flow according to an embodiment of the invention.
According to the embodiment of the invention, the sample data generation method, the training method of the deep learning model, and the data processing method provided by the embodiments of the invention can be applied, for example, to industrial safety production scenarios. The scheme provided by an embodiment of the invention is described below taking an industrial safety production scenario as an example. That is, the sample set 901 may be a production data set in an industrial safety production scenario.
As shown in fig. 9, in 900, there are six processes: a sample data generation process; a process of updating the historical sample set with the significant sample set obtained in the sample data generation process; a process of training the deep learning model with the significant sample set obtained in the sample data generation process; a process of processing data with the trained deep learning model; a process of optimizing the characterization model with the significant sample set and the abnormal sample set obtained in the sample data generation process; and a process of determining a similar sample set from the historical sample set based on an erroneous sample determined with the trained deep learning model, and optimizing the trained deep learning model with the similar sample set.
For the sample generation process, i.e., sample set 901 → characterization model 902 → sample characterization vector set 903 → cluster sample set 905 → data strategy based on cluster distribution 906 → significant sample set 907.
For example, the sample set 901 may be processed using the characterization model 902, resulting in a sample characterization vector set 903 corresponding to the sample set 901. The sample set 901 is clustered (904) according to the sample characterization vector set 903 to obtain at least one clustered sample set 905. A significant sample set 907 is determined from the at least one clustered sample set 905 using a cluster-distribution-based data policy 906.
The update procedure for the historical sample set 911, i.e., significant sample set 907 → historical sample set 911.
For example, in a case where it is determined, according to the sample characterization vector of a significant sample included in the significant sample set 907 and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set 911, that there is a matching sample that matches the significant sample, the clustered sample set corresponding to the significant sample and the clustered sample set corresponding to the matching sample are merged.
In the event that it is determined from the sample characterization vector of the significant sample and the set of sample characterization vectors corresponding to the set of historical significant samples included in the set of historical samples 911 that there is no matching sample in the set of historical significant samples 911 that matches the significant sample, the significant sample is determined as a new historical significant sample, and the set of clustered samples corresponding to the significant sample is added to the set of historical samples 911.
For the training process of the deep learning model 910, the significant sample set 907 → the labeled significant sample set 909 → the deep learning model 910 → the trained deep learning model 912.
For example, the significant sample set 907 may be labeled, resulting in a labeled significant sample set 909. The deep learning model 910 is trained using the labeled significant sample set 909, resulting in a trained deep learning model 912.
For the data processing procedure, the data to be processed 913 → the trained deep learning model 912 → the data processing result 914.
For example, the data to be processed 913 may be input into the trained deep learning model 912, resulting in a data processing result 914.
For the optimization process of the characterization model 902, sample set → characterization model 902 → optimized characterization model. The sample sets may include a significant sample set 907 and an abnormal sample set 908.
For example, the characterization model 902 may be optimized using the significant sample set 907 and the abnormal sample set 908, resulting in an optimized characterization model.
For the optimization process of the trained deep learning model 912, significant samples → erroneous samples 915 → historical sample set 911 → similar sample set 916 → trained deep learning model 912 → optimized deep learning model.
For example, significant samples may be input into the trained deep learning model 912, resulting in output values. In a case where it is determined that a significant sample is an erroneous sample 915 based on the output value and the label value corresponding to the significant sample, a similar sample set 916 corresponding to the erroneous sample 915 is determined from the historical sample set 911 based on the sample characterization vector of the erroneous sample 915 and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set 911. The trained deep learning model 912 is optimized using the similar sample set 916, resulting in an optimized deep learning model.
Fig. 10 schematically shows a block diagram of a sample data generation apparatus according to an embodiment of the present invention.
As shown in fig. 10, the sample generation apparatus 1000 may include a first obtaining module 1010, a second obtaining module 1020, and a generating module 1030.
A first obtaining module 1010, configured to obtain a sample characterization vector set according to the first sample set. The first sample set includes a plurality of samples whose categories have not been determined.
A second obtaining module 1020, configured to cluster the first sample set according to the sample characterization vector set, so as to obtain at least one clustered sample set.
A generating module 1030, configured to generate a significant sample set according to the at least one clustered sample set.
According to an embodiment of the present invention, the first obtaining module 1010 may include a first obtaining unit.
The first obtaining unit is configured to process the first sample set using the characterization model to obtain the sample characterization vector set. The characterization model is obtained by training an auto-supervision model, based on a loss function, according to a sample characterization vector of a positive sample and sample characterization vectors of a plurality of negative samples corresponding to the positive sample. The plurality of negative samples are determined from a plurality of candidate negative samples corresponding to the positive sample.
According to an embodiment of the present invention, determining the plurality of negative samples from the plurality of candidate negative samples corresponding to the positive sample may include: determining the plurality of negative samples corresponding to the positive sample from the plurality of candidate negative samples according to the sample characterization vector of the positive sample and the sample characterization vectors of the plurality of candidate negative samples corresponding to the positive sample. The sample characterization vector of the positive sample is obtained by processing the positive sample using the auto-supervision model. The sample characterization vector of each negative sample is obtained by processing the negative sample using the auto-supervision model.
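The selection rule is not spelled out; one common strategy consistent with this description is hard-negative mining, i.e. keeping the candidate negatives whose characterization vectors lie closest to the positive sample's. The sketch below assumes that strategy, with illustrative names.

```python
import numpy as np

def select_negatives(pos_vec, cand_vecs, num_negatives):
    """Choose negative samples from the candidate negatives using the
    characterization vectors produced by the auto-supervision model
    (hard-negative assumption, illustrative only)."""
    dists = np.linalg.norm(cand_vecs - pos_vec, axis=1)
    hardest = np.argsort(dists)[:num_negatives]   # smallest distance = hardest
    return hardest                                # indices into the candidate pool
```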
According to an embodiment of the invention, the significant sample set includes at least one significant sample.
According to an embodiment of the present invention, the sample generation apparatus 1000 may further include a second determining module.
The second determining module is configured to determine, for the significant sample, an abnormal sample set from the clustered sample set corresponding to the significant sample, so as to optimize the characterization model using the significant sample set and the abnormal sample set. The abnormal sample set includes abnormal samples whose categories are different from that of the significant sample.
According to an embodiment of the present invention, the second determination module may include a display unit and a first determination unit.
The display unit is configured to display, in response to detecting a marking operation for the significant sample, the clustered sample set corresponding to the significant sample.
The first determining unit is configured to determine, from the clustered sample set corresponding to the significant sample, samples different from the category of the significant sample, to obtain the abnormal sample set.
According to an embodiment of the present invention, the second obtaining module 1020 may include a second obtaining unit.
The second obtaining unit is configured to obtain the at least one clustered sample set according to the sample characterization vector set using a density-based clustering algorithm. Each clustered sample set has a cluster sample center and includes at least one clustered sample.
According to an embodiment of the present invention, the generating module 1030 may include a second determining unit.
The second determining unit is configured to determine the cluster sample center as the significant sample.
According to an embodiment of the present invention, the second obtaining unit may include a first obtaining sub-unit, a determining sub-unit, and a second obtaining sub-unit.
The first obtaining subunit is configured to obtain at least one initial clustered sample set according to the sample characterization vector set using the density-based clustering algorithm. Each initial clustered sample set has an initial cluster sample center.
The determining subunit is configured to, in a case where it is determined that a deviated sample exists, determine the initial clustered sample set corresponding to the deviated sample according to the sample characterization vector of the deviated sample and the sample characterization vectors corresponding to the at least one initial cluster sample center, to obtain an updated initial clustered sample set.
The second obtaining subunit is configured to cluster the sample set to be re-clustered according to the sample characterization vector set corresponding to the sample set to be re-clustered, to obtain at least one clustered sample set corresponding to the sample set to be re-clustered. The sample set to be re-clustered includes at least one of the following: the updated initial clustered sample set and at least one other clustered sample set, where the other clustered sample set is an initial clustered sample set other than the updated initial clustered sample set in the at least one initial clustered sample set.
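As one possible reading of the deviated-sample handling, the sketch below re-assigns any member lying beyond an assumed deviation threshold from its own cluster sample center to the nearest initial cluster sample center, and marks the affected sets as updated initial clustered sample sets to be re-clustered; the threshold and the nearest-center rule are assumptions, not the patent's stated procedure.

```python
import numpy as np

def reassign_deviated(char_vecs, clusters, centers, deviation_threshold):
    """Move deviated samples to their nearest initial cluster and collect
    the updated initial clustered sample sets to be re-clustered (sketch)."""
    to_recluster = set()
    for label, members in clusters.items():
        center_vec = char_vecs[centers[label]]
        for idx in list(members):
            if np.linalg.norm(char_vecs[idx] - center_vec) > deviation_threshold:
                # Deviated sample: the nearest initial cluster sample center wins.
                nearest = min(centers, key=lambda l: np.linalg.norm(
                    char_vecs[idx] - char_vecs[centers[l]]))
                if nearest != label:
                    members.remove(idx)
                    clusters[nearest].append(idx)
                    to_recluster.update({label, nearest})
    return clusters, to_recluster
```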
According to an embodiment of the invention, the significant sample set includes at least one significant sample.
According to an embodiment of the present invention, the sample generation apparatus 1000 may further include a third determination module and a fourth determination module.
A third determining module, configured to, for the significant sample, merge the clustered sample set corresponding to the significant sample with the clustered sample set corresponding to the matching sample in a case where it is determined, according to the sample characterization vector of the significant sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, that there is a matching sample in the historical significant sample set that matches the significant sample.
A fourth determining module, configured to determine the significant sample as a new historical significant sample and add the clustered sample set corresponding to the significant sample to the historical sample set in a case where it is determined, according to the sample characterization vector of the significant sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, that there is no matching sample in the historical significant sample set that matches the significant sample.
According to an embodiment of the present invention, the sample generation apparatus 1000 may further include a fifth determination module and a sixth determination module.
A fifth determining module, configured to determine, according to the sample characterization vector of the significant sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, a distance between the significant sample and each of at least one historical significant sample included in the historical significant sample set, to obtain at least one distance.
A sixth determining module, configured to determine, according to the at least one distance, whether there is a matching sample in the historical significant sample set that matches the significant sample.
According to an embodiment of the invention, the sample comprises one of: sample images, sample text, and sample audio.
FIG. 11 is a block diagram schematically illustrating a training apparatus for deep learning models according to an embodiment of the present invention.
As shown in fig. 11, the training apparatus 1100 for deep learning model may include a third obtaining module 1110, a first determining module 1120, and a fourth obtaining module 1130.
A third obtaining module 1110, configured to input the significant sample into the deep learning model to obtain an output value.
A first determining module 1120 is configured to determine a loss function value according to the output value and the label value of the significant sample.
A fourth obtaining module 1130, configured to adjust the model parameters of the deep learning model according to the loss function value, so as to obtain the trained deep learning model.
According to an embodiment of the present invention, the significant sample may be generated by using the sample data generating apparatus according to an embodiment of the present invention.
According to an embodiment of the present invention, the training apparatus 1100 for deep learning model described above may further include a seventh determining module.
A seventh determining module, configured to, in a case where it is determined that the significant sample is an erroneous sample according to the output value and the label value corresponding to the significant sample, determine a similar sample set corresponding to the erroneous sample from the historical sample set according to the sample characterization vector of the erroneous sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, so as to perform a training operation for the trained deep learning model using the similar sample set.
Fig. 12 schematically shows a block diagram of a data processing device according to an embodiment of the present invention.
As shown in fig. 12, the data processing apparatus 1200 may include a fifth obtaining module 1210.
A fifth obtaining module 1210, configured to input data to be processed into the trained deep learning model, so as to obtain a data processing result.
According to an embodiment of the present invention, the trained deep learning model may be obtained by training with a training device of the deep learning model according to the embodiment of the present invention.
The invention also provides an electronic device, a readable storage medium and a computer program product according to the embodiments of the invention.
According to an embodiment of the present invention, an electronic apparatus includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present invention, a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to execute the method as described above.
According to an embodiment of the invention, a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
Fig. 13 schematically shows a block diagram of an electronic device adapted to implement a sample data generation method, a training method of a deep learning model, and a data processing method according to an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 13, the electronic device 1300 includes a computing unit 1301 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1302 or a computer program loaded from a storage unit 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data necessary for the operation of the electronic device 1300 can also be stored. The computing unit 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to bus 1304.
A number of components in the electronic device 1300 are connected to the I/O interface 1305, including: an input unit 1306 such as a keyboard, a mouse, and the like; an output unit 1307 such as various types of displays, speakers, and the like; a storage unit 1308 such as a magnetic disk, optical disk, or the like; and a communication unit 1309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1309 allows the electronic device 1300 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Computing unit 1301 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of computing unit 1301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1301 performs the respective methods and processes described above, such as the sample generation method, the training method of the deep learning model, and the data processing method. For example, in some embodiments, the sample generation method, the training method of the deep learning model, and the data processing method may be implemented as computer software programs that are tangibly embodied on a machine-readable medium, such as storage unit 1308. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program is loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the sample generation method, the training method of the deep learning model, and the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1301 may be configured in any other suitable manner (e.g., by means of firmware) to perform the sample generation method, the training method of the deep learning model, and the data processing method.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in this disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed herein can be achieved, and the present disclosure is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (26)

1. A sample generation method, comprising:
obtaining a sample characterization vector set according to a first sample set, wherein the first sample set comprises a plurality of samples, and the samples are not determined to be of a category;
clustering the first sample set according to the sample characterization vector set to obtain at least one clustered sample set; and
generating a significant sample set according to the at least one clustering sample set;
wherein the set of significant samples comprises at least one significant sample;
the method further comprises the following steps:
for the significant sample,
under the condition that a matched sample matched with the significant sample exists in the historical significant sample set according to the sample characterization vector of the significant sample and a sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, merging the clustering sample set corresponding to the significant sample and the clustering sample set corresponding to the matched sample; and
in the case that it is determined that there is no matching sample matching the significant sample in the historical significant sample set based on the sample characterization vector of the significant sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, determining the significant sample as a new historical significant sample, and adding the clustering sample set corresponding to the significant sample to the historical sample set.
2. The method of claim 1, wherein the deriving a set of sample characterization vectors from the first set of samples comprises:
processing the first sample set by using a characterization model to obtain the sample characterization vector set, wherein the characterization model is obtained by training an auto-supervision model according to a sample characterization vector of a positive sample and sample characterization vectors of a plurality of negative samples corresponding to the positive sample based on a loss function, and the plurality of negative samples are determined from a plurality of candidate negative samples corresponding to the positive sample.
3. The method of claim 2, wherein determining the plurality of negative samples from the plurality of candidate negative samples corresponding to the positive sample comprises:
a plurality of negative samples corresponding to the positive sample are determined from the plurality of candidate negative samples according to the characterization vector of the positive sample and the characterization vectors of a plurality of candidate negative samples corresponding to the positive sample;
wherein the sample characterization vector of the positive sample is obtained by processing the positive sample by using the auto-supervision model;
wherein the sample characterization vector of the negative sample is obtained by processing the negative sample with the auto-supervised model.
4. The method of claim 2 or 3, wherein the set of significant samples comprises at least one significant sample;
the method further comprises the following steps:
and determining an abnormal sample set from the clustering sample set corresponding to the significant samples according to the significant samples so as to optimize the characterization model by using the significant sample set and the abnormal sample set, wherein the abnormal sample set comprises abnormal samples with different categories from the significant samples.
5. The method of claim 4, wherein the determining, according to the significant sample, an abnormal sample set from the clustered sample set corresponding to the significant sample comprises:
in response to detecting a marking operation for the significant sample, displaying a cluster sample set corresponding to the significant sample; and
determining samples different from the category of the significant samples from the clustering sample set corresponding to the significant samples, and obtaining the abnormal sample set.
6. The method according to claim 1 or 2, wherein the clustering the first sample set according to the sample characterization vector set to obtain at least one clustered sample set comprises:
obtaining at least one clustering sample set according to the sample characterization vector set by using a density-based clustering algorithm, wherein the clustering sample set is provided with a clustering sample center and comprises at least one clustering sample;
wherein the generating a significant sample set according to the at least one clustering sample set comprises:
determining the cluster sample center as the significant sample.
7. The method of claim 6, wherein said deriving the at least one clustered sample set from the sample characterization vector set using a density-based clustering algorithm comprises:
obtaining at least one initial clustering sample set according to the sample characterization vector set by using the density-based clustering algorithm, wherein the initial clustering sample set has an initial clustering sample center;
in the case where it is determined that there is a deviated sample,
determining an initial clustering sample set corresponding to the deviated sample according to the sample characterization vector of the deviated sample and the sample characterization vector corresponding to at least one initial clustering sample center to obtain an updated initial clustering sample set; and
clustering the sample set to be re-clustered according to the sample characterization vector set corresponding to the sample set to be re-clustered to obtain at least one clustering sample set corresponding to the sample set to be re-clustered, wherein the sample set to be re-clustered comprises at least one of the following: the updated initial cluster sample set and at least one other cluster sample set, the other cluster sample set being an initial cluster sample set in the at least one initial cluster sample set other than the updated initial cluster sample set.
8. The method of claim 1, further comprising:
determining a distance between the significant sample and at least one historical significant sample included in the historical significant sample set according to the sample characterization vector of the significant sample and a sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, so as to obtain at least one distance; and
determining whether there is a matching sample in the historical significant sample set that matches the significant sample based on the at least one distance.
9. The method of claim 1 or 2, wherein the sample comprises one of: sample images, sample text, and sample audio.
10. A training method of a deep learning model comprises the following steps:
inputting the significant sample into the deep learning model to obtain an output value;
determining a loss function value according to the output value and the label value of the significant sample; and
adjusting the model parameters of the deep learning model according to the loss function value to obtain a trained deep learning model,
wherein the significant sample is generated according to the method of any one of claims 1 to 9.
11. The method of claim 10, further comprising:
in a case where it is determined that the significant sample is an erroneous sample according to the output value and the label value corresponding to the significant sample, a similar sample set corresponding to the erroneous sample is determined from the historical sample set according to a sample characterization vector of the erroneous sample and a sample characterization vector set corresponding to a historical significant sample set included in the historical sample set, so that a training operation for the trained deep learning model is performed using the similar sample set.
12. A method of data processing, comprising:
inputting the data to be processed into the trained deep learning model to obtain a data processing result,
wherein the trained deep learning model is trained according to the method of claim 10 or 11.
13. A sample generation device, comprising:
a first obtaining module, configured to obtain a sample characterization vector set according to a first sample set, where the first sample set includes multiple samples, and the samples are not classified;
the second obtaining module is used for clustering the first sample set according to the sample characterization vector set to obtain at least one clustering sample set; and
the generating module is used for generating a significant sample set according to the at least one clustering sample set;
the significant sample set comprises at least one significant sample;
the device further comprises:
a third determining module, configured to, for the significant sample,
merge the clustering sample set corresponding to the significant sample and the clustering sample set corresponding to the matched sample under the condition that it is determined that a matched sample matched with the significant sample exists in the historical significant sample set according to the sample characterization vector of the significant sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set; and
a fourth determining module, configured to determine the significant sample as a new historical significant sample if it is determined that there is no matching sample matching the significant sample in the historical significant sample set according to the sample characterization vector of the significant sample and the sample characterization vector set corresponding to the historical significant sample set included in the historical sample set, and add the cluster sample set corresponding to the significant sample to the historical sample set.
14. The apparatus of claim 13, wherein the first obtaining means comprises:
a first obtaining unit, configured to process the first sample set by using a characterization model to obtain the sample characterization vector set, wherein the characterization model is obtained by training an auto-supervision model, based on a loss function, according to a sample characterization vector of a positive sample and sample characterization vectors of a plurality of negative samples corresponding to the positive sample, and the plurality of negative samples are determined from a plurality of candidate negative samples corresponding to the positive sample.
15. The apparatus of claim 14, wherein determining the plurality of negative samples from the plurality of candidate negative samples corresponding to the positive sample comprises:
a plurality of negative samples corresponding to the positive sample are determined from the plurality of candidate negative samples according to the characterization vector of the positive sample and the characterization vectors of the plurality of candidate negative samples corresponding to the positive sample;
wherein the sample characterization vector of the positive sample is obtained by processing the positive sample by using the auto-supervision model;
wherein the sample characterization vector of the negative sample is obtained by processing the negative sample by using the auto-supervision model.
16. The apparatus of claim 14 or 15, wherein the significant sample set comprises at least one significant sample;
the device further comprises:
and a second determining module, configured to determine, according to the significant samples, an abnormal sample set from a cluster sample set corresponding to the significant samples, so as to optimize the characterization model by using the significant sample set and the abnormal sample set, where the abnormal sample set includes abnormal samples of which categories are different from those of the significant samples.
17. The apparatus of claim 16, wherein the second determining means comprises:
a display unit, configured to display a cluster sample set corresponding to the significant sample in response to detecting a marking operation for the significant sample; and
a first determining unit, configured to determine, from a cluster sample set corresponding to the significant sample, a sample different from the class of the significant sample, and obtain the abnormal sample set.
18. The apparatus of claim 13 or 14, wherein the second obtaining means comprises:
a second obtaining unit, configured to obtain the at least one cluster sample set according to the sample characterization vector set by using a density-based clustering algorithm, where the cluster sample set has a cluster sample center and includes at least one cluster sample;
wherein the generating module comprises:
a second determining unit, configured to determine the cluster sample center as the significant sample.
19. The apparatus of claim 18, wherein the second obtaining unit comprises:
a first obtaining subunit, configured to obtain at least one initial clustering sample set according to the sample characterization vector set by using the density-based clustering algorithm, where the initial clustering sample set has an initial clustering sample center;
a determining subunit, configured to, in a case where it is determined that a deviated sample exists,
determine an initial clustering sample set corresponding to the deviated sample according to the sample characterization vector of the deviated sample and the sample characterization vector corresponding to at least one initial clustering sample center to obtain an updated initial clustering sample set; and
a second obtaining subunit, configured to cluster the sample sets to be re-clustered according to a sample characterization vector set corresponding to the sample sets to be re-clustered, so as to obtain at least one clustered sample set corresponding to the sample sets to be re-clustered, where the sample sets to be re-clustered include at least one of the following: the updated initial cluster sample set and at least one other cluster sample set, the other cluster sample set being an initial cluster sample set in the at least one initial cluster sample set other than the updated initial cluster sample set.
20. The apparatus of claim 13, further comprising:
a fifth determining module, configured to determine, according to a sample characterization vector of the significant sample and a sample characterization vector set corresponding to a historical significant sample set included in the historical sample set, a distance between the significant sample and at least one historical significant sample included in the historical significant sample set, so as to obtain at least one distance; and
a sixth determining module, configured to determine whether there is a matching sample matching the significant sample in the historical significant sample set according to the at least one distance.
21. The apparatus of claim 13 or 14, wherein the sample comprises one of: sample images, sample text, and sample audio.
22. A training apparatus for deep learning models, comprising:
the third obtaining module is used for inputting the significant samples into the deep learning model to obtain output values;
a first determining module, configured to determine a loss function value according to the output value and a label value of the significant sample; and
a fourth obtaining module, configured to adjust model parameters of the deep learning model according to the loss function value to obtain a trained deep learning model,
wherein the significant sample is generated by an apparatus according to any one of claims 13 to 21.
23. The apparatus of claim 22, further comprising:
a seventh determining module, configured to, in a case where it is determined that the significant sample is an erroneous sample according to the output value and the label value corresponding to the significant sample, determine a similar sample set corresponding to the erroneous sample from the historical sample set according to a sample characterization vector of the erroneous sample and a sample characterization vector set corresponding to a historical significant sample set included in the historical sample set, so as to perform a training operation for the trained deep learning model using the similar sample set.
24. A data processing apparatus comprising:
a fifth obtaining module, which is used for inputting the data to be processed into the trained deep learning model to obtain the data processing result,
wherein the trained deep learning model is trained according to the apparatus of claim 22 or 23.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9 or any one of claims 10 to 11 or claim 12.
26. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9 or any one of claims 10-11 or claim 12.
CN202210340191.0A 2022-04-02 2022-04-02 Sample generation method, training method, data processing method and electronic device Active CN114444619B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210340191.0A CN114444619B (en) 2022-04-02 2022-04-02 Sample generation method, training method, data processing method and electronic device
CN202210754096.5A CN115130581B (en) 2022-04-02 2022-04-02 Sample generation method, training method, data processing method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210340191.0A CN114444619B (en) 2022-04-02 2022-04-02 Sample generation method, training method, data processing method and electronic device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210754096.5A Division CN115130581B (en) 2022-04-02 2022-04-02 Sample generation method, training method, data processing method and electronic equipment

Publications (2)

Publication Number Publication Date
CN114444619A CN114444619A (en) 2022-05-06
CN114444619B true CN114444619B (en) 2022-07-26

Family

ID=81359288

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210340191.0A Active CN114444619B (en) 2022-04-02 2022-04-02 Sample generation method, training method, data processing method and electronic device
CN202210754096.5A Active CN115130581B (en) 2022-04-02 2022-04-02 Sample generation method, training method, data processing method and electronic equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210754096.5A Active CN115130581B (en) 2022-04-02 2022-04-02 Sample generation method, training method, data processing method and electronic equipment

Country Status (1)

Country Link
CN (2) CN114444619B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115664906B (en) * 2022-10-18 2023-05-02 中国人民解放军军事科学院系统工程研究院 Method and device for unsupervised clustering of TDMA signal protocol
CN116012656B (en) * 2023-01-20 2024-02-13 北京百度网讯科技有限公司 Sample image generation method and image processing model training method and device
CN116522781B (en) * 2023-05-04 2024-04-05 北京百度网讯科技有限公司 Sample data generation method, model training method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109242106A (en) * 2018-09-07 2019-01-18 Baidu Online Network Technology (Beijing) Co., Ltd. Sample processing method, device, equipment and storage medium
CN110705602A (en) * 2019-09-06 2020-01-17 Ping An Technology (Shenzhen) Co., Ltd. Large-scale data clustering method and device and computer readable storage medium
CN112784981A (en) * 2021-01-20 2021-05-11 Tsinghua University Training sample set generation method, and training method and device for deep generation model
CN112784893A (en) * 2020-12-29 2021-05-11 Hangzhou Hikvision Digital Technology Co., Ltd. Image data clustering method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5539066B2 (en) * 2010-06-29 2014-07-02 キヤノン株式会社 Clustering processing apparatus and clustering processing method
CN110297907B (en) * 2019-06-28 2022-03-08 谭浩 Method for generating interview report, computer-readable storage medium and terminal device
CN113435545A (en) * 2021-08-14 2021-09-24 北京达佳互联信息技术有限公司 Training method and device of image processing model
CN113705650B (en) * 2021-08-20 2023-07-11 网易(杭州)网络有限公司 Face picture set processing method, device, medium and computing equipment
CN114118287A (en) * 2021-11-30 2022-03-01 北京百度网讯科技有限公司 Sample generation method, sample generation device, electronic device and storage medium

Also Published As

Publication number Publication date
CN115130581A (en) 2022-09-30
CN115130581B (en) 2023-06-23
CN114444619A (en) 2022-05-06

Similar Documents

Publication Title
CN114444619B (en) Sample generation method, training method, data processing method and electronic device
CN113222942A (en) Training method of multi-label classification model and method for predicting labels
CN114429633B (en) Text recognition method, training method and device of model, electronic equipment and medium
CN114612743A (en) Deep learning model training method, target object identification method and device
CN112800919A (en) Method, device and equipment for detecting target type video and storage medium
CN114741517A (en) Training method, device, equipment and medium of text classification model and text classification method, device and equipment
CN115082740A (en) Target detection model training method, target detection method, device and electronic equipment
CN112989170A (en) Keyword matching method applied to information search, information search method and device
CN113627536A (en) Model training method, video classification method, device, equipment and storage medium
CN113947701B (en) Training method, object recognition method, device, electronic equipment and storage medium
CN115632874A (en) Method, device, equipment and storage medium for detecting threat of entity object
CN114037059A (en) Pre-training model, model generation method, data processing method and data processing device
CN114494747A (en) Model training method, image processing method, device, electronic device and medium
CN114647727A (en) Model training method, device and equipment applied to entity information recognition
CN114049516A (en) Training method, image processing method, device, electronic device and storage medium
CN115169489B (en) Data retrieval method, device, equipment and storage medium
CN114444514B (en) Semantic matching model training method, semantic matching method and related device
CN115665783A (en) Abnormal index tracing method and device, electronic equipment and storage medium
CN112860626B (en) Document ordering method and device and electronic equipment
CN113920404A (en) Training method, image processing method, device, electronic device and storage medium
CN114610953A (en) Data classification method, device, equipment and storage medium
CN114359811A (en) Data authentication method and device, electronic equipment and storage medium
CN113360602B (en) Method, apparatus, device and storage medium for outputting information
CN113378781B (en) Training method and device of video feature extraction model and electronic equipment
CN114547448B (en) Data processing method, model training method, device, equipment, storage medium and program

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant