CN117093884B

CN117093884B - Multi-mode contrast learning sample construction method and system based on hierarchical clustering

Info

Publication number: CN117093884B
Application number: CN202311257184.5A
Authority: CN
Inventors: 郝建国; 孔桂兰; 张路霞
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2023-09-27
Filing date: 2023-09-27
Publication date: 2023-12-29
Anticipated expiration: 2043-09-27
Also published as: CN117093884A

Abstract

The application provides a multi-modal contrastive learning sample construction method and system based on hierarchical clustering. A target cluster in which an anchor sample is located is determined from the clusters of a hierarchical cluster map. According to the target cluster centroid of the target cluster and the supervision information of the anchor sample, the hierarchical cluster map is traversed with the position of the target cluster in the map as the starting position, and at least one positive cluster and at least one divergent cluster corresponding to the anchor sample are determined from the map. At least one negative cluster corresponding to the anchor sample is determined according to each cluster, each positive cluster and each divergent cluster in the hierarchical cluster map. A plurality of comparison sample pairs are generated according to the anchor sample, the target cluster, the positive samples in each positive cluster and the negative samples in each negative cluster, so that a model can be trained with the comparison sample pairs. Positive and negative samples are thereby divided correctly and the performance indexes of the model are improved.

Description

Multi-mode contrast learning sample construction method and system based on hierarchical clustering

Technical Field

The invention relates to the technical field of computers, in particular to a multi-mode contrast learning sample construction method and system based on hierarchical clustering.

Background

With the continuous development of computer technology, machine learning, and deep learning in particular, is increasingly used to analyze and process multi-modal data such as images, text and signals. Machine learning can be divided into supervised learning and unsupervised learning. As multi-modal data has grown rapidly in recent years, self-supervised learning has been derived from unsupervised learning in order to improve data processing efficiency; it is a machine learning approach that mines self-supervised information from large-scale unlabeled data.

Contrastive learning, as a form of self-supervised learning, is free from the constraint of data annotation. When multi-modal data is processed, a user-defined proxy task can be used to divide the samples by mining the self-supervised information of the data: samples describing similar things are taken as positive samples, and the other samples are taken as negative samples. For example, the instance discrimination task (Instance Discrimination) is a classical proxy task in contrastive learning. Its idea is as follows: any sample is taken as the anchor sample (Anchor), the positive samples are data-enhanced versions of the anchor sample, and the other samples, which do not share the label or class of the anchor sample, can be regarded as negative samples. Contrastive learning learns feature representations by combining positive and negative samples in pairs and comparing the similarities or differences between them; the goal is to maximize the similarity between samples of the same class while minimizing the similarity between samples of different classes.

However, in nature and in professional fields there are many divergent samples that have the characteristics of both positive samples and negative samples. In this case, a similarity-based machine learning model such as clustering may mistake a divergent sample that is actually a positive sample for a negative sample, or a divergent sample that is actually a negative sample for a positive sample. The proportion of false positive and false negative samples then rises markedly, and the performance indexes of the model are reduced.

Disclosure of Invention

In view of this, the invention provides a multi-modal comparison learning sample construction method and system based on hierarchical clustering, which aims to realize correct division of positive and negative samples and improve performance indexes of a model.

The first aspect of the application provides a multi-mode contrast learning sample construction method based on hierarchical clustering, which comprises the following steps:

determining a target cluster in which an anchor sample is located from each cluster of the hierarchical cluster map; the hierarchical clustering graph is generated by utilizing each cluster, each cluster is obtained by clustering each sample according to the supervision information of each sample obtained by processing the original multi-modal data, and the anchoring sample is any sample in any cluster;

traversing the hierarchical cluster map, with the position of the target cluster in the hierarchical cluster map as the starting position, according to the target cluster centroid of the target cluster and the supervision information of the anchor sample, and determining at least one positive cluster and at least one divergent cluster corresponding to the anchor sample from the hierarchical cluster map;

determining at least one negative cluster corresponding to the anchoring sample according to each cluster, each positive cluster and each divergent cluster in the hierarchical cluster map;

and generating a plurality of comparison sample pairs according to the anchoring samples, the target clusters, the positive samples in each positive cluster and the negative samples in each negative cluster so as to train a model by using the comparison sample pairs.

Optionally, clustering each sample according to the supervision information of each sample, obtained by processing the original multi-modal data, to obtain each cluster includes:

preprocessing the acquired original multi-mode data to obtain a sample set; wherein the sample set comprises a plurality of samples and their associated other modality data;

for each sample, extracting at least one piece of supervision information from the other modal data associated with the sample according to an extraction mode matched with the data type of the other modal data associated with the sample;

generating a corresponding supervision information vector according to the supervision information corresponding to the sample, and persistently storing the supervision information vector;

and carrying out unsupervised grouping on each sample in the sample set according to the supervision information in the supervision information vector corresponding to each sample to obtain a plurality of clusters, wherein the clusters at least comprise cluster centroids, and the cluster centroids are a combination of the supervision information.

Optionally, for each sample, extracting at least one kind of supervision information from the other modal data corresponding to the sample according to an extraction mode matched with the data type of the other modal data corresponding to the sample, including:

for each sample, if the data type of the other modal data associated with the sample is a text type of unstructured type, extracting at least one kind of supervision information from the other modal data associated with the sample based on a natural language processing method;

if the data type of the other mode data associated with the sample is a non-numerical structured type, extracting at least one kind of supervision information from the other mode data associated with the sample based on a preset rule;

and if the data type of the other modality data associated with the sample is a numerical structured type, extracting at least one initial keyword from the other modality data associated with the sample based on the preset rule, and converting each initial keyword into supervision information of a categorical type.

Optionally, the performing unsupervised grouping on each sample in the sample set according to the supervision information in the supervision information vector corresponding to each sample to obtain a plurality of clusters, including:

generating a plurality of supervision information combinations according to the supervision information in each supervision information vector in the sample set;

and for each supervision information combination, taking the supervision information combination as a cluster centroid, and carrying out unsupervised grouping on each sample to obtain a cluster corresponding to the cluster centroid.
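The grouping step above can be illustrated with a minimal sketch. It assumes, as one possible reading, that every distinct combination of supervision information observed in the sample set serves as a cluster centroid and that a sample belongs to the cluster whose centroid equals its own combination. The function name `group_by_supervision` and the labels `c1` to `c3` are hypothetical and do not come from the application.

```python
from collections import defaultdict


def group_by_supervision(vectors, names):
    """Group sample ids by the exact set of supervision information they carry.

    vectors: dict mapping sample id -> list of 0/1 flags (one per name)
    names:   list of supervision information names, one per vector position
    Returns a dict mapping centroid (frozenset of names) -> list of sample ids.
    """
    clusters = defaultdict(list)
    for sample_id, vector in vectors.items():
        # The centroid is the combination of supervision information flagged 1.
        centroid = frozenset(n for n, v in zip(names, vector) if v == 1)
        clusters[centroid].append(sample_id)
    return dict(clusters)


names = ["c1", "c2", "c3"]  # hypothetical supervision information labels
vectors = {
    "x1": [1, 0, 0],
    "x2": [1, 1, 0],
    "x3": [1, 1, 0],
    "x4": [0, 0, 1],
}
clusters = group_by_supervision(vectors, names)
```

Here `x2` and `x3` share the combination {c1, c2} and therefore fall into the same cluster, while `x1` and `x4` each form their own cluster.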

Optionally, the generating the hierarchical cluster map by using each cluster includes:

and according to the inclusion relation among the cluster centroids of the clusters, carrying out hierarchical sequencing on the clusters to obtain a cluster sequence, and connecting the clusters in the cluster sequence to generate a hierarchical cluster map.
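One way to picture the hierarchical ordering is the sketch below. It assumes that cluster centroids are sets of supervision information, that the levels are ordered by centroid size, and that two clusters are connected when one centroid strictly contains the other with no third centroid in between. The function name `build_hierarchy` and this direct-link rule are illustrative assumptions, not details stated in the application.

```python
def build_hierarchy(centroids):
    """Order centroids by inclusion level and link each to its direct supersets.

    centroids: iterable of frozensets of supervision information names
    Returns (ordered, edges), where edges are (general, specific) pairs.
    """
    # Smaller combinations come first; ties are broken alphabetically.
    ordered = sorted(centroids, key=lambda c: (len(c), sorted(c)))
    edges = []
    for general in ordered:
        for specific in ordered:
            # Link only direct inclusions: no centroid lies strictly between.
            if general < specific and not any(
                general < mid < specific for mid in ordered
            ):
                edges.append((general, specific))
    return ordered, edges


a, b, ab, abc = (frozenset(s) for s in ("a", "b", "ab", "abc"))
levels, edges = build_hierarchy([abc, b, ab, a])
```

In this toy input the centroid {a, b, c} is linked only to {a, b}, not to {a} or {b}, because {a, b} lies between them in the inclusion order.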

Optionally, traversing the hierarchical cluster map, with the position of the target cluster in the hierarchical cluster map as the starting position, according to the target cluster centroid of the target cluster and the supervision information of the anchor sample, and determining at least one positive cluster and at least one divergent cluster corresponding to the anchor sample from the hierarchical cluster map includes:

traversing the hierarchical cluster map with the position of the target cluster in the hierarchical cluster map as the starting position, determining each cluster in the hierarchical cluster map whose cluster centroid is a subset of the target cluster centroid as a positive cluster of the anchor sample, and determining each cluster in the hierarchical cluster map that has both a positive cluster attribute and a negative cluster attribute as a divergent cluster of the anchor sample;

the positive cluster attribute is the supervision information in the anchoring sample, and the negative cluster attribute is other supervision information except the supervision information in the anchoring sample.
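The partition described above can be sketched as follows, under stated assumptions: centroids are sets of supervision information; a positive cluster has a non-empty centroid that is a subset of the target cluster centroid; a divergent cluster has a centroid containing both supervision information of the anchor sample and other supervision information; and the negative clusters are the remaining clusters that share nothing with the anchor sample. The last rule is an interpretation of the step that derives negative clusters from all clusters, the positive clusters and the divergent clusters. The function name `classify_clusters` is hypothetical.

```python
def classify_clusters(centroids, target_centroid, anchor_info):
    """Split clusters into positive, divergent and negative for one anchor.

    centroids:       iterable of frozensets (one centroid per cluster)
    target_centroid: centroid of the cluster that contains the anchor sample
    anchor_info:     frozenset of supervision information of the anchor sample
    """
    positive, divergent, negative = [], [], []
    for centroid in centroids:
        if centroid == target_centroid:
            continue  # the target cluster itself is handled separately
        shared = centroid & anchor_info   # positive cluster attributes
        extra = centroid - anchor_info    # negative cluster attributes
        if centroid and centroid <= target_centroid:
            positive.append(centroid)
        elif shared and extra:
            divergent.append(centroid)
        elif not shared:
            negative.append(centroid)
        # Any other case is left unclassified in this sketch.
    return positive, divergent, negative


all_centroids = [frozenset(s) for s in ("a", "b", "ab", "ac", "c", "abc")]
pos, div, neg = classify_clusters(
    all_centroids, frozenset("ab"), frozenset("ab")
)
```

With an anchor sample carrying {a, b}, the clusters {a} and {b} are positive, {a, c} and {a, b, c} are divergent and would be left out of pair construction, and {c} is negative.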

Optionally, the generating a plurality of comparison sample pairs according to the anchoring samples, the target clusters, positive samples in each positive cluster, and negative samples in each negative cluster includes:

determining a representative sample from the target cluster;

selecting, according to the anchor sample and the representative sample, positive samples in the positive sample number from the target cluster and each positive cluster, and adding each positive sample into a positive sample set of the anchor sample; wherein the positive sample number and the negative sample number of the anchor sample are equal to each other and to the preset number of comparison sample pairs to be generated;

selecting, according to the anchor sample and the representative sample, negative samples in the negative sample number from each negative cluster, and adding each negative sample into a negative sample set of the anchor sample;

generating a plurality of contrast sample pairs according to the anchor sample, the positive sample set and the negative sample set; wherein the pair of comparison samples comprises the anchor sample, one of the positive samples, and one of the negative samples.

Optionally, the positive sample number includes a first positive sample number and a second positive sample number, determining the positive sample number of positive samples from the target cluster and each positive cluster according to the anchor sample and the representative sample, and adding each positive sample to a positive sample set of the anchor sample, including:

if the anchoring sample is the representative sample, taking representative samples in each positive cluster as positive samples of the anchoring sample until the positive samples of the positive sample number are obtained, and adding the positive samples of the positive sample number into a positive sample set of the anchoring sample;

if the anchoring sample is not the representative sample, taking the representative sample as a first positive sample of the anchoring sample, and randomly selecting at least one sample from the target cluster as the first positive sample until the first positive sample number is obtained;

Randomly selecting a second positive sample of the second positive sample number from each positive cluster;

adding each of the first positive samples and each of the second positive samples to a positive sample set of the anchor samples; wherein the positive sample set includes a number of positive samples equal to the number of positive samples.

Optionally, determining the number of negative samples from each of the negative clusters according to the anchor samples and the representative samples, and adding each of the negative samples to a negative sample set of the anchor samples, including:

if the anchoring samples are the representative samples, taking the representative samples in each negative cluster as the negative samples of the anchoring samples until the number of the negative samples is obtained, and adding each negative sample into a negative sample set of the anchoring samples;

and if the anchor sample is not the representative sample, randomly selecting negative samples from each negative cluster until the number of the negative samples is obtained, and adding each negative sample into a negative sample set of the anchor sample.
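The sampling rules above can be sketched as follows. Several points are assumptions made only for illustration: each cluster is a list of sample ids whose first element is treated as its representative sample (how the representative is chosen is not specified here), positives and negatives are paired one to one, and fewer pairs are returned when the clusters hold too few samples. The names `build_triplets`, `k` and `k_first` are hypothetical.

```python
import random


def build_triplets(anchor, target, positives, negatives, k, k_first, rng):
    """Build up to k (anchor, positive, negative) comparison sample pairs.

    target:    list of sample ids in the target cluster
    positives: list of positive clusters (each a list of sample ids)
    negatives: list of negative clusters (each a list of sample ids)
    k:         preset number of comparison sample pairs
    k_first:   number of positives drawn from the target cluster
    """
    rep = target[0]  # assumption: first id is the representative sample
    if anchor == rep:
        # Anchor is the representative: use representatives of other clusters.
        pos = [cluster[0] for cluster in positives][:k]
        neg = [cluster[0] for cluster in negatives][:k]
    else:
        # First positives: the representative plus random target-cluster samples.
        others = [s for s in target if s not in (anchor, rep)]
        n_extra = max(0, min(k_first - 1, len(others)))
        first = [rep] + rng.sample(others, n_extra)
        # Second positives: random samples from the positive clusters.
        pool = [s for cluster in positives for s in cluster]
        n_second = max(0, min(k - len(first), len(pool)))
        pos = first + rng.sample(pool, n_second)
        # Negatives: random samples from the negative clusters.
        neg_pool = [s for cluster in negatives for s in cluster]
        neg = rng.sample(neg_pool, min(k, len(neg_pool)))
    n = min(len(pos), len(neg))
    return [(anchor, p, q) for p, q in zip(pos[:n], neg[:n])]


target = ["t0", "t1", "t2"]
positives = [["p0", "p1"], ["q0", "q1"]]
negatives = [["n0", "n1"], ["m0", "m1"]]
rng = random.Random(0)
rep_triplets = build_triplets("t0", target, positives, negatives, 2, 1, rng)
other_triplets = build_triplets("t1", target, positives, negatives, 3, 2, rng)
```

When the anchor is the representative `t0`, the result is deterministic; for the non-representative anchor `t1`, the representative `t0` is always the first positive and the remaining choices depend on the random generator.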

A second aspect of the present application provides a sample construction system, the system comprising:

The first determining unit is used for determining target clusters where the anchoring samples are located from all clusters of the hierarchical cluster map; the hierarchical clustering map is generated by a hierarchical clustering map generating unit by utilizing each cluster, wherein each cluster is obtained by clustering each sample by a cluster generating unit according to the supervision information of each sample obtained by processing the original multi-modal data, and the anchoring sample is any sample in any one of the clusters;

the first cluster determining unit is used for traversing the hierarchical cluster map by taking the position of the target cluster in the hierarchical cluster map as a starting position according to the target cluster centroid of the target cluster and the supervision information of the anchor sample, and determining at least one positive cluster and at least one divergent cluster corresponding to the anchor sample from the hierarchical cluster map;

a second cluster determining unit, configured to determine at least one negative cluster corresponding to the anchor sample according to each cluster, each positive cluster, and each divergent cluster in the hierarchical cluster map;

and the comparison sample pair generating unit is used for generating a plurality of comparison sample pairs according to the anchoring samples, the target clusters, the positive samples in each positive cluster and the negative samples in each negative cluster so as to train a model by using the comparison sample pairs.

The invention provides a multi-modal contrastive learning sample construction method and system based on hierarchical clustering. A target cluster in which an anchor sample is located is determined from the clusters of a hierarchical cluster map; the hierarchical cluster map is generated from the clusters, each cluster is obtained by clustering the samples according to the supervision information of each sample obtained by processing the original multi-modal data, and the anchor sample is any sample in any cluster. According to the target cluster centroid of the target cluster and the supervision information of the anchor sample, the hierarchical cluster map is traversed with the position of the target cluster in the map as the starting position, and at least one positive cluster and at least one divergent cluster corresponding to the anchor sample are determined from the map. At least one negative cluster corresponding to the anchor sample is determined according to each cluster, each positive cluster and each divergent cluster in the hierarchical cluster map. A plurality of comparison sample pairs are generated according to the anchor sample, the target cluster, the positive samples in each positive cluster and the negative samples in each negative cluster, so that a model is trained with the comparison sample pairs. With the technical solution provided by the invention, the constraint of the hierarchical relationship among the clusters in the hierarchical cluster map is used to correctly partition the positive clusters, divergent clusters and negative clusters corresponding to the anchor sample, and the divergent clusters are excluded. High-quality comparison sample pairs of the anchor sample are then constructed from the anchor sample, the target cluster, the positive clusters and the negative clusters, and are used to train and optimize a deep learning model based on contrastive learning, so that the encoder of the multi-modal data produces feature representations with better discrimination and robustness, and the model achieves higher performance indexes on downstream tasks.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a multi-modal comparison learning sample construction method based on hierarchical clustering according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method for processing original multi-modal data to obtain clusters according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an exemplary persistent storage of target original multimodal data after format conversion according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating an exemplary matrix of sample vectors according to an embodiment of the present invention;

FIG. 5 is an exemplary diagram of various clusters of anchor samples provided by an embodiment of the present invention;

FIG. 6 is a flow chart of a method for generating multiple pairs of comparison samples according to an anchor sample, a target cluster, positive samples in each positive cluster, and negative samples in each negative cluster according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a multi-modal comparison learning sample construction system based on hierarchical clustering according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in this disclosure are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by these devices, modules, or units.

It should be noted that the modifiers "one" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be interpreted as "one or more" unless the context clearly indicates otherwise.

Research shows that, in fields such as health and medical big data, fine-grained data labeling depends heavily on professional knowledge. Labeling health and medical data is therefore costly and inefficient, the amount of labeled data is small, and training very large deep learning models is difficult. In addition, the collection quality and completeness of health and medical data are often affected by medical conditions, the subjective willingness of doctors and patients, and other factors, so that some examination and test data are likely to be missing, which results in missing data modalities.

Furthermore, individual variability may lead to different pathological features in different cases of the same disease, which produces more false negative and false positive results; the individuals with false negative or false positive results are often divergent samples. Because divergent samples exist, it is difficult for a model based on self-supervised learning to divide positive and negative samples without ambiguity by means of similarity and the proxy task.

Finally, the analysis of health and medical data often involves not just a one-of-many multi-class classification (Multi-Class Classification) problem, but a multi-label classification (Multi-Label Classification) problem in which several categories of information must be identified, for example when analyzing comorbidities or complications from health and medical data. Troubled by the double problem of divergent samples and multi-label classification, self-supervised contrastive learning can hardly bring its original advantages into full play.

In the prior art, self-supervised learning generally frames the proxy task as a binary classification problem. In reality, however, the target of many applications is neither a pure binary nor a pure multi-class classification problem, but a more complex multi-label classification problem, in which each sample may carry labels of several classes at the same time. It is therefore difficult for conventional division methods to correctly divide the positive and negative samples of multi-modal data that carry multiple labels according to the different semantic information of those labels. If the rules of the proxy task are designed unreasonably, the division of positive and negative samples often contradicts the true classes of the data; the distances between samples in the representation space are then repeatedly disturbed during updates and cannot converge to ideal values. As a result, the encoder of the multi-modal data has difficulty learning, from the comparison of samples, the key features that distinguish different labels and classes, and a model trained with the divided positive and negative samples has poor performance indexes.

Therefore, the invention provides a multi-modal contrastive learning sample construction method and system based on hierarchical clustering, which accurately divide the positive and negative samples of each anchor sample by means of the supervision information of the samples. Specifically, the supervision information of each sample is obtained in advance by processing the original multi-modal data; the samples are clustered with different combinations of the supervision information as cluster centroids to obtain the clusters; and a hierarchical cluster map is generated from the clusters. The constraint of the hierarchical relationship among the clusters in the hierarchical cluster map is then used to correctly partition the positive clusters, divergent clusters and negative clusters corresponding to the anchor sample, and the divergent clusters are ignored. High-quality comparison sample pairs of the anchor sample are constructed from the anchor sample, the target cluster, the positive clusters and the negative clusters, and the model is trained with these pairs, so that the encoder in the contrastive learning model learns feature representations with better discrimination and robustness, which improves the performance indexes of the model on tasks such as classification and retrieval.

Referring to fig. 1, a flow diagram of a hierarchical clustering-based multi-modal comparison learning sample construction method is shown, which specifically includes the following steps:

S101: determining a target cluster in which an anchor sample is located from each cluster of the hierarchical cluster map; the hierarchical cluster map is generated by utilizing each cluster, and each cluster is obtained by clustering each sample according to the supervision information of each sample obtained by processing the original multi-mode data.

In the embodiment of the application, firstly, the acquired original multi-mode data can be processed to obtain a sample set, and the supervision information corresponding to the sample is extracted from the samples in the sample set, so that different combinations of the supervision information of each sample are used as cluster centroids to cluster each sample to obtain a plurality of clusters, and finally each cluster is connected according to the inclusion relation among the cluster centroids of each cluster to generate a corresponding hierarchical cluster map.

Referring to fig. 2, a flow chart of a method for processing original multi-mode data to obtain clusters according to an embodiment of the present invention is shown, where the method specifically includes the following steps:

S201: preprocessing the acquired original multi-mode data to obtain a sample set; wherein the sample set includes a plurality of samples and their corresponding other modality data.

In the embodiment of the application, the original multi-modal data in the storage medium can be loaded through an Application Programming Interface (API) responsible for Input and Output (IO), and the loaded original multi-modal data is preprocessed to obtain a sample set.

In practical application, for structured original multi-modal data, preprocessing operations such as standardization and normalization of data values, data cleaning and data completion can be performed to complete the preprocessing and obtain a sample set. The original multi-modal data comprises a plurality of original samples, and each original sample is associated with data of at least one other modality. Structured original multi-modal data means that the data type of the other modality data associated with the original multi-modal data is a structured type.
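As a minimal sketch of such preprocessing for one numerical column, the functions below perform data completion, normalization and standardization. The specific choices (mean imputation for missing values, min-max scaling, z-score standardization with the population standard deviation) and the function names are assumptions for illustration; the application does not prescribe particular methods.

```python
from statistics import mean, pstdev


def complete(values):
    """Data completion: replace missing entries (None) with the column mean."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def min_max(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: shift to zero mean and scale to unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


column = complete([1.0, None, 3.0])
normalized = min_max(column)
standardized = standardize(column)
```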

For unstructured original multi-modal data, in addition to the above preprocessing operations, operations such as data sampling, data conversion and data enhancement can also be performed; the data enhancement operations can include rotating, cropping and flipping pictures.
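The picture operations named above can be illustrated on a tiny image stored as a nested list of pixel values. This is only a sketch with hypothetical function names; a real pipeline would more likely rely on an image library.

```python
def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def crop(img, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


img = [[1, 2],
       [3, 4]]
rotated = rotate90(img)
flipped = flip_horizontal(img)
cropped = crop(img, 0, 1, 2, 1)
```

Each function returns a new nested list, so the original sample stays unchanged and the enhanced copies can be stored alongside it.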

In this embodiment, the preprocessed or data-enhanced data may be stored in a corresponding memory/file, i.e. the sample set is stored in a persistent manner, as shown in fig. 3.

It should be noted that the raw multimodal data may be health medical data, and embodiments of the present application are not limited herein.

S202: for each sample, at least one piece of supervision information is extracted from the other modality data associated with the sample according to an extraction mode matched with the data type of the other modality data associated with the sample.

In this embodiment of the present application, after the sample set is obtained, the supervision information of each sample in the sample set may be extracted from the other modality data associated with that sample, using the extraction mode that matches the data type of that other modality data.

Optionally, for each sample, it may be determined whether the data type of the other modality data associated with the sample is unstructured text, a non-numerical structured type, or a numerical structured type. If it is unstructured text, at least one kind of supervision information is extracted from the other modality data associated with the sample based on a natural language processing method. If it is a non-numerical structured type, at least one kind of supervision information is extracted from the other modality data associated with the sample based on a preset rule. If it is a numerical structured type, the corresponding initial keywords are extracted from the other modality data associated with the sample based on a preset rule, and the initial keywords are converted into supervision information of a categorical type.

It should be noted that the preset rule may be a rule summarized from the relevant service scenarios. For example, by analyzing the non-numerical structured data of patients with chest and lung diseases, it can be determined that smoking history and family history are two key variables for such patients. A corresponding preset rule can then be generated from these key variables, so that the two structured fields, whether there is a smoking history and whether there is a family history, are extracted from the non-numerical structured other modality data as the supervision information of the associated sample.

It should be noted that, since the supervision information is an enumeration-type or categorical variable contained in the other modality data, an initial keyword that is not categorical may be converted into categorical supervision information.
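The three extraction modes can be sketched as a simple dispatch on the data type. Everything below is a hypothetical illustration: plain keyword matching stands in for a natural language processing method, the field names follow the smoking history and family history example, and the numerical field `systolic_bp` with its cutoff of 140 is an arbitrary value chosen only to show how a numerical keyword might be turned into a categorical flag.

```python
def extract_from_text(text, keywords):
    """Unstructured text: keyword matching as a stand-in for an NLP method."""
    lowered = text.lower()
    return {label for phrase, label in keywords.items() if phrase in lowered}


def extract_by_rule(record, fields):
    """Non-numerical structured data: keep the preset yes/no fields set to yes."""
    return {field for field in fields if record.get(field) == "yes"}


def extract_numeric(record, thresholds):
    """Numerical structured data: bin each preset field into a categorical flag."""
    return {
        f"{field}_high"
        for field, limit in thresholds.items()
        if field in record and record[field] >= limit
    }


def extract_supervision(data_type, data, rule):
    """Pick the extraction mode that matches the data type."""
    if data_type == "unstructured_text":
        return extract_from_text(data, rule)
    if data_type == "structured_non_numeric":
        return extract_by_rule(data, rule)
    if data_type == "structured_numeric":
        return extract_numeric(data, rule)
    raise ValueError(f"unknown data type: {data_type}")


from_text = extract_supervision(
    "unstructured_text",
    "Patient reports chronic cough.",
    {"chronic cough": "chronic_cough"},  # hypothetical phrase-to-label map
)
from_fields = extract_supervision(
    "structured_non_numeric",
    {"smoking_history": "yes", "family_history": "no"},
    ["smoking_history", "family_history"],
)
from_numbers = extract_supervision(
    "structured_numeric",
    {"systolic_bp": 150},
    {"systolic_bp": 140},  # arbitrary illustrative cutoff
)
```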

S203: and generating a corresponding supervision information vector according to the supervision information corresponding to the sample, and performing persistent storage on the supervision information vector.

In the embodiment of the present application, after extracting the supervision information corresponding to each sample, for each sample, a corresponding supervision information vector may be generated according to the supervision information corresponding to the sample, and the supervision information vector may be stored in a persistent manner.

In practical application, the supervision information of each sample can be represented by a 1*n-dimensional supervision information vector, so the supervision information of the m samples in the sample set forms an m*n-dimensional supervision information vector matrix. The resulting supervision information vector matrix is finally stored persistently as a comma-separated values file with the extension CSV. Each row of the CSV file comprises the sample id of one sample and the corresponding supervision information vector.

It should be noted that the supervision information vector matrix may also be stored in a persistent manner in the form of a relational database table.

For example, there are currently 12 samples. As shown in fig. 4, starting from the second row, each row contains one sample id and the corresponding supervision information vector, and the data dimension of each supervision information vector is 9. Each element with value "1" in a supervision information vector indicates that the corresponding sample has one kind of supervision information. Specifically, the i-th row in fig. 4 represents the supervision information contained in sample $x_i$, and the j-th column represents a certain supervision information $c_j$; when the labels, attributes or features of sample $x_i$ reflect supervision information $c_j$, the value in the i-th row and j-th column is 1.
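The matrix layout and CSV storage described above can be sketched as follows. This is only a minimal illustration: the category names, the sample ids and the helper functions `to_vector` and `to_csv` are hypothetical and do not come from the embodiment, and the CSV text is built in memory rather than written to a file.

```python
import csv
import io

# Hypothetical supervision categories c_1..c_n (illustrative only).
CATEGORIES = ["smoking_history", "family_history", "cough"]

# Hypothetical samples: sample id -> supervision information it exhibits.
samples = {
    "s1": {"smoking_history"},
    "s2": {"smoking_history", "family_history"},
    "s3": set(),
}


def to_vector(info, categories):
    """Return the 1 x n multi-hot supervision information vector of one sample."""
    return [1 if c in info else 0 for c in categories]


def to_csv(samples, categories):
    """Serialize the m x n matrix as CSV text: a header row, then one row per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", *categories])
    for sample_id, info in samples.items():
        writer.writerow([sample_id, *to_vector(info, categories)])
    return buffer.getvalue()


csv_text = to_csv(samples, CATEGORIES)
```

As in fig. 4, the first row of `csv_text` is a header and each following row holds one sample id and its vector.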

Note that, the supervision information may be labeling information of the sample, sample attributes, sample categories, sample features, and the like, which may reflect the properties of the sample data body.

It should be noted that converting the supervision information vectors into a supervision information vector matrix facilitates batch processing by the computer. Batch processing means that, because the computer may be limited by hardware conditions such as memory and cannot load all data at one time, it loads the same number of sample ids and corresponding supervision information in each batch.
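A minimal sketch of such batch loading is given below, assuming the matrix is stored as CSV with a header row. The helper `iter_batches`, the column names and the in-memory CSV text are hypothetical stand-ins; in practice the reader would wrap an open file.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows without materializing all rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical CSV content standing in for the persisted matrix.
csv_text = "sample_id,c1,c2\ns1,1,0\ns2,0,1\ns3,1,1\n"
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # skip the header row
batches = list(iter_batches(reader, batch_size=2))
```

Every batch has the same size except possibly the last one, which holds whatever rows remain.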

As an implementation manner of the embodiment of the present application, a corresponding supervision information vector may also be generated according to the label information of the sample and the corresponding supervision information.

In practical application, the label information and the supervision information of each sample can be represented by a 1×n supervision information vector, so that the label information and the supervision information of the m samples in the sample set form an m×n supervision information vector matrix.

S204: carrying out unsupervised grouping on each sample in the sample set according to the supervision information in the supervision information vector corresponding to each sample to obtain a plurality of clusters; the clusters at least comprise cluster centroids, and the cluster centroids are combinations of supervision information corresponding to all samples in the clusters.

In the embodiment of the application, each sample can be subjected to unsupervised grouping according to the supervision information in each structured supervision information vector to form clusters taking specific supervision information combinations as cluster centroids, and finally hierarchical cluster graphs can be generated by utilizing the interconnection of inclusion relations among the cluster centroids of each generated cluster.

It should be noted that the centroid of a cluster is a notional center point or mean of the cluster. Here it is a combination of various kinds of supervision information; it indicates in which aspects the samples of the cluster are similar, that is, which similar features or common attributes caused the data to be clustered together.

Optionally, according to the supervision information in the supervision information vector corresponding to each sample, performing unsupervised grouping on each sample in the sample set, and the process of obtaining a plurality of clusters may be: generating a plurality of supervision information combinations according to the supervision information in each supervision information vector in the sample set; and aiming at each supervision information combination, taking the supervision information combination as a clustering centroid, and carrying out unsupervised grouping on each sample to obtain a cluster corresponding to the clustering centroid.

In some embodiments, a supervision information combination with similar features or common attributes may be mined according to supervision information reflected in each supervision information vector, and each supervision information combination is used as a cluster centroid to cluster each sample to form a plurality of clusters.

It should be noted that, samples in each cluster are clustered together according to the corresponding cluster centroid, and the samples are described as having similar characteristics or common attributes with reference to the cluster centroid, so that the samples have high intra-cluster similarity.
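The grouping step can be sketched as follows, under the simplifying assumption that each sample is assigned to the cluster whose centroid equals its exact combination of supervision information. The sample data and the function name `group_by_centroid` are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: sample id -> supervision information it exhibits.
samples = {
    "s1": {"A"},
    "s2": {"A", "B"},
    "s3": {"A", "B"},
    "s4": set(),
}


def group_by_centroid(samples):
    """Map each supervision-information combination (centroid) to its sample ids."""
    clusters = defaultdict(list)
    for sample_id, info in samples.items():
        # frozenset makes the combination hashable so it can serve as a key.
        clusters[frozenset(info)].append(sample_id)
    return dict(clusters)


clusters = group_by_centroid(samples)
```

Using the combination itself as the dictionary key means that the samples of one cluster share, by construction, the attributes named in its centroid.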

Optionally, the process of generating the hierarchical clustering map by using each cluster may be: and according to the inclusion relation among the cluster centroids of all the clusters, connecting all the clusters in the cluster sequence to generate a hierarchical cluster map.

It should be noted that, not only is there a difference between each cluster in the hierarchical cluster map, but also the relevance among clusters is reserved through the hierarchical relationship among the cluster centroids of each cluster, so that theoretical support is provided for realizing hierarchical comparison among different clusters.

For example, if the cluster centroid $C_a$ of cluster $N_a$ and the cluster centroid $C_b$ of cluster $N_b$ have the inclusion relation $C_a \subset C_b$, the hierarchical relationship between the corresponding clusters $N_a$ and $N_b$ may be written $N_a \prec N_b$. If, in addition, no other cluster has a cluster centroid $C_z$ such that $C_a \subset C_z \subset C_b$, there is an edge between $N_a$ and $N_b$ in the hierarchical cluster map, and the two nodes are directly connected. Further, if the centroids $C_a$, $C_b$ and $C_c$ of any clusters $N_a$, $N_b$ and $N_c$ satisfy the inclusion relations $C_a \subset C_b$ and $C_b \subset C_c$, i.e. the ordering relationship $N_a \prec N_b \prec N_c$, then clusters $N_a$, $N_b$ and $N_c$ are connected by a path in the hierarchical cluster map.

It should be noted that the purpose of sorting the clusters is to sort out the hierarchical relationship among different clusters, identify the relevance and the differences among them, and generate a hierarchical cluster map, so that the samples in each cluster can be compared in a more targeted way.

In the embodiment of the application, after hierarchical ordering is performed on each cluster according to the inclusion relation among centroids of each cluster, edges among the ordered clusters can be created, so that a hierarchical cluster map is formed.
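One way to derive the edges from the inclusion relations is sketched below, assuming centroids are modeled as frozensets. An edge is created between two clusters when one centroid is a proper subset of the other and no third centroid lies strictly between them. The function name `build_edges` and the example centroids are hypothetical, and the edges are stored as (upper, lower) pairs only for convenience, since the map itself is undirected.

```python
def build_edges(centroids):
    """Return (upper, lower) pairs where upper is a proper subset of lower
    and no other centroid lies strictly between the two."""
    edges = set()
    for upper in centroids:
        for lower in centroids:
            if not upper < lower:  # proper-subset test on frozensets
                continue
            has_between = any(upper < mid < lower for mid in centroids)
            if not has_between:
                edges.add((upper, lower))
    return edges


# Hypothetical cluster centroids.
EMPTY = frozenset()
A = frozenset({"A"})
B = frozenset({"B"})
AB = frozenset({"A", "B"})
edges = build_edges([EMPTY, A, B, AB])
```

In this example the empty centroid is not linked directly to `AB`, because `A` and `B` lie between them; the two are connected only by a path.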

In the process of specifically executing step S101, after the hierarchical cluster map is generated, the target positions of the anchor samples in the hierarchical cluster map may be further determined, so as to determine, from the clusters in the hierarchical cluster map, the target clusters to which the anchor samples belong according to the target positions of the anchor samples.

In the embodiment of the present application, the hierarchical cluster map is an undirected graph. As a preferred mode, however, as shown in fig. 5, the cluster whose centroid is empty is set as the top node and the cluster whose centroid is the full set of supervision information is set as the bottom node. Taking a given anchor sample as the starting point, the direction toward the top node is regarded as upward and the direction toward the bottom node as downward. The samples in each cluster of the hierarchical cluster map are traversed in a bottom-up manner, and any traversed sample may be determined to be an anchor sample.

It should be noted that the anchor sample may be any sample in the sample set and serves as the reference sample for comparison. For example, for image data in the multi-modal data, a new sample can be formed from each anchor sample by data augmentation such as rotation, cropping, brightness adjustment and contrast adjustment. Although the new sample looks slightly different from the anchor sample, it describes the same thing as the original sample, so its semantic information is unchanged and it can be regarded as a positive sample of the anchor sample. In addition, data of other modalities describing the same sample are positive samples of one another: the semantics of multi-modal data do not change with the form of data representation, so these data carry the same supervision information and fall in the same cluster describing the same semantics, which provides a foundation for cross-modal analysis of the multi-modal data.

In an ideal state, the samples can form clusters according to their feature similarity. However, divergent samples have partial characteristics of both positive samples and negative samples, so it is difficult to divide positive and negative samples accurately by sample category using only a similarity-based machine learning algorithm or self-supervised learning model. The application therefore traverses the hierarchical cluster map according to the position of the target cluster in it, and determines from the hierarchical cluster map at least one positive cluster and at least one divergent cluster corresponding to the anchor sample.

S102: according to the target cluster centroid of the target clusters and the supervision information of the anchor samples, the hierarchical cluster map is traversed by taking the positions of the target clusters in the hierarchical cluster map as initial positions, and at least one positive cluster and at least one divergent cluster corresponding to the anchor samples are determined from the hierarchical cluster map.

In the specific execution process of step S102, after the target cluster of the anchor sample is determined, the neighborhood clusters associated with the target cluster may be further determined. According to the target cluster centroid of the target cluster and the supervision information of the anchor sample, a search is performed from the position of the target cluster in the hierarchical cluster map, following the association relationship between the neighborhood clusters and the topology of the hierarchical cluster map, to determine at least one positive cluster and at least one divergent cluster of the anchor sample. At least one negative cluster of the anchor sample is then determined from its positive clusters and divergent clusters.

It should be noted that constructing the hierarchical relationship between clusters in the form of a graph converts the problem of evaluating similarity and difference between clusters into a graph search problem; that is, the positive clusters and negative clusters of the anchor sample can be determined by a graph search algorithm. Positive samples in a positive cluster are similar to the anchor sample in some respect and should be pulled together with it; samples in a negative cluster differ greatly from the anchor sample, and their positions in the representation space should be pushed as far from the anchor sample as possible.

Optionally, the hierarchical cluster map is traversed with the position of the target cluster in the hierarchical cluster map as the starting position; each cluster in the hierarchical cluster map whose cluster centroid is a subset of the target cluster centroid is determined to be a positive cluster of the anchor sample, and each cluster that has both positive cluster attributes and negative cluster attributes is determined to be a divergent cluster of the anchor sample.

It should be noted that, when the anchor sample and the samples in any positive cluster are pulled together in the representation space, samples with similar attributes become increasingly compact: the spatial distance (such as the Euclidean distance) between similar attributes becomes smaller and their relevance stronger, while the differences between the attributes of dissimilar samples are not reduced.

In a specific application, starting from the target cluster of the anchor sample and searching along the topology of the hierarchical cluster map, all clusters above the target cluster can be obtained recursively, and all of these parent clusters are determined to be the positive clusters of the anchor sample. For ease of understanding, any cluster above the target cluster is called a parent cluster; that is, the parent clusters are all clusters in the hierarchical cluster map whose centroids are subsets of the target cluster centroid.

Similarly, starting from the target cluster of the anchor sample and searching along the topology of the hierarchical cluster map, all clusters below the target cluster can be obtained recursively, and all of these sub-clusters are determined to be the divergent clusters of the anchor sample. For ease of understanding, any cluster below the target cluster is called a sub-cluster; that is, the sub-clusters are all clusters in the hierarchical cluster map whose centroids are proper supersets of the target cluster centroid. A divergent cluster carries part of the supervision information of some positive clusters and part of the supervision information of some negative clusters at the same time.

Finally, given all positive clusters and all divergent clusters of the anchor sample and the full set of clusters in the hierarchical cluster map, the complement of the union of the positive clusters and the divergent clusters is taken within the full set to obtain all negative clusters of the anchor sample.
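The three-way division described above can be sketched as follows, again modeling centroids as frozensets. Instead of a recursive walk over the edges, the sketch tests the subset relation directly, which reaches the same parent clusters and sub-clusters when edges follow the inclusion relation. The function name `partition_clusters` is hypothetical, and keeping the target cluster itself out of all three sets is an assumption of this sketch.

```python
def partition_clusters(target, centroids):
    """Split cluster centroids into positive, divergent and negative sets
    relative to the target cluster centroid."""
    all_clusters = set(centroids)
    # Parent clusters: centroid is a proper subset of the target centroid.
    positive = {c for c in all_clusters if c < target}
    # Sub-clusters: centroid is a proper superset of the target centroid.
    divergent = {c for c in all_clusters if c > target}
    # Negative clusters: everything left after removing both sets and the target.
    negative = all_clusters - positive - divergent - {target}
    return positive, divergent, negative


# Hypothetical centroids; frozenset("AB") is the set {"A", "B"}.
EMPTY = frozenset()
A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
AB = frozenset("AB")
ABC = frozenset("ABC")
positive, divergent, negative = partition_clusters(AB, [EMPTY, A, B, C, AB, ABC])
```

For the target centroid `AB`, the clusters `EMPTY`, `A` and `B` come out as positive, `ABC` as divergent, and `C` as negative.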

In this embodiment, each piece of supervision information corresponding to the anchor sample may be used as a positive cluster attribute of the anchor sample, and the other supervision information in the hierarchical cluster map, apart from the positive cluster attributes, may be used as negative cluster attributes, so that a cluster in the hierarchical cluster map having both positive cluster attributes and negative cluster attributes may be determined as a divergent cluster of the anchor sample.

It should be noted that, for any sample in a divergent cluster of an anchor sample, whether it takes part in subsequent model training as a positive sample or as a negative sample, the encoder in the model becomes less able to distinguish some of the features. The loss function then converges unstably during loss calculation, and the accuracy of the finally trained model is reduced. Therefore, when the positive and negative samples of each anchor sample are used as contrast learning training data in the subsequent steps, the samples in the divergent clusters can be ignored; that is, they do not take part in constructing the positive and negative sample pairs for contrast learning.

S103: determining at least one negative cluster corresponding to the anchor sample according to the hierarchical cluster map, each positive cluster and each divergent cluster.

In the specific execution of step S103, after all positive clusters and all divergent clusters of an anchor sample are determined, the complement of the union of the positive clusters and the divergent clusters may be taken with respect to all clusters in the hierarchical cluster map to obtain all negative clusters of the anchor sample.

For example, the set of negative clusters equals the set of all clusters in the hierarchical cluster map minus the positive clusters and the divergent clusters.

As an implementation of the embodiments of the present application, when positive samples $x_p$ and negative samples $x_n$ of certain anchor data $x_a$ need to be found, the target cluster of $x_a$ in the hierarchical cluster map may be located first and denoted $N_a$. Let the cluster centroid of $N_a$ be $C_a$. For any cluster $N_p$ whose cluster centroid $C_p$ is a subset of the cluster centroid $C_a$ of the anchor data $x_a$, i.e. $C_p \subseteq C_a$, $N_p$ is a positive cluster of $x_a$ and any sample $x_p \in N_p$ can be used as a positive sample of $x_a$. For any cluster $N_n$, as long as the intersection of its cluster centroid $C_n$ with the cluster centroid $C_a$ of the anchor data $x_a$ is empty, i.e. $C_n \cap C_a = \varnothing$, any sample $x_n \in N_n$ can be used as a negative sample of $x_a$.

S104: generating a plurality of comparison sample pairs according to the anchor sample, the target cluster, the positive samples in each positive cluster and the negative samples in each negative cluster, so that the model is trained by using the comparison sample pairs.

In the specific execution of step S104, after each positive cluster and each negative cluster of the anchor sample are determined, a plurality of comparison sample pairs may be generated according to the preset number of comparison sample pairs, the anchor sample, the target cluster, the positive samples in each positive cluster and the negative samples in each negative cluster, so as to train the model by using the comparison sample pairs.

Specifically, a positive sample set is formed from the samples in the target cluster of each anchor sample and the positive samples in all of its positive clusters, and a negative sample set is formed from the samples in all of its negative clusters. In the embodiment of the application, the number of comparison sample pairs can be preset according to actual demands, and the anchor sample, the positive sample set and the negative sample set are used to generate the preset number of triplets (anchor sample, positive sample, negative sample), i.e. the comparison sample pairs, so that the deep learning model based on contrast learning can be better trained with the generated comparison sample pairs.

Referring to fig. 6, a flow chart of a method for generating a plurality of comparison sample pairs according to an anchor sample, a target cluster, positive samples in each positive cluster, and negative samples in each negative cluster according to an embodiment of the present invention is shown, where the method specifically includes the following steps:

S601: determining a representative sample from the target cluster.

The number of positive samples and the number of negative samples of the anchor sample are equal, and both are equal to the preset number of comparison sample pairs to be generated.

In this embodiment of the present application, for each cluster, one sample in the cluster may be determined as a representative sample of the cluster, and the representative sample may be represented as a virtual center position of the cluster, so that in a subsequent deep learning process based on contrast learning, other samples in the cluster may gradually gather toward the representative sample, and further, the similarity of samples inside the cluster may gradually increase.

As a preferred manner in embodiments of the present application, the first sample in a cluster may be taken as a representative sample of the cluster.

In the embodiment of the present application, the number n of the comparison sample pairs may be preset according to specific situations and requirements related to the service. Wherein the number of positive samples and the number of negative samples are both equal to n.

The positive sample number is the sum of a first positive sample number $n_{in}$ and a second positive sample number $n_{out}$, where the first positive sample number $n_{in}$ is the preset number of positive samples to be selected from the target cluster, and the second positive sample number $n_{out}$ is the preset number of positive samples to be selected from the remaining positive clusters.

S602: selecting, according to the anchor sample and the representative sample, positive samples equal in number to the positive sample number from the target cluster and each positive cluster, and adding each positive sample to a positive sample set of the anchor sample.

Optionally, it may be determined whether the anchor sample $x_a$ is the representative sample of the target cluster. If the anchor sample $x_a$ is the representative sample of the target cluster, the representative samples of the positive clusters are taken as positive samples of the anchor sample until the positive sample number of positive samples is obtained, and each positive sample is added to the positive sample set of the anchor sample. If the anchor sample is not the representative sample, the representative sample is taken as a first positive sample of the anchor sample, and at least one sample is randomly selected from the target cluster as a first positive sample until the first positive sample number of first positive samples is obtained; the second positive sample number of second positive samples are randomly selected from the positive clusters; and each first positive sample and each second positive sample is added to the positive sample set of the anchor sample. The number of positive samples in the positive sample set equals the positive sample number.

In some embodiments, when the anchor sample $x_a$ is the representative sample of the target cluster, the representative samples of the positive clusters are added to the positive sample set from bottom to top, following the hierarchical order of the positive clusters in the hierarchical cluster map, until the number of samples in the positive sample set reaches the preset positive sample number n. If the anchor sample $x_a$ is not the representative sample, the representative sample is added to the positive sample set as a first positive sample of $x_a$, and at least one sample is randomly selected from the target cluster and added to the positive sample set as a first positive sample of $x_a$, until the number of first positive samples reaches the preset first positive sample number $n_{in}$. Then, following the hierarchical order of the positive clusters in the hierarchical cluster map, samples are selected at random from the positive clusters from bottom to top, with preference given to samples that have not yet been accessed, and added to the positive sample set as second positive samples until the number of second positive samples reaches the preset second positive sample number $n_{out}$. The first positive samples and the second positive samples together form the positive sample set of the anchor sample $x_a$, which contains the preset number n of samples. This process is repeated until every sample in the training set has served as the anchor sample $x_a$ and its positive sample set has been selected.
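The two branches of this selection can be sketched as follows. The sketch rests on several assumptions that are not stated in the embodiment: the first sample of each cluster is its representative, $n_{in} \geq 1$, the lowest positive cluster is drawn from before moving upward, and fewer samples are returned when the clusters are too small. The function name `select_positives` and all sample ids are hypothetical.

```python
import random


def select_positives(anchor, target, positive_order, clusters, n_in, n_out, visited, rng):
    """Pick positive samples for one anchor.

    positive_order lists positive-cluster centroids from bottom to top;
    clusters maps a centroid to its sample ids, representative first.
    """
    representative = clusters[target][0]
    if anchor == representative:
        # Representative anchor: use the representatives of the positive clusters.
        representatives = [clusters[c][0] for c in positive_order]
        return representatives[: n_in + n_out]

    # First positives: the representative, then random members of the target cluster.
    mates = [s for s in clusters[target] if s not in (anchor, representative)]
    extra = rng.sample(mates, min(max(n_in - 1, 0), len(mates)))
    first = [representative] + extra

    # Second positives: walk positive clusters bottom-up, unvisited samples first.
    second = []
    for centroid in positive_order:
        if len(second) >= n_out:
            break
        pool = list(clusters[centroid])
        rng.shuffle(pool)
        pool.sort(key=lambda s: s in visited)  # stable sort: unvisited before visited
        take = pool[: n_out - len(second)]
        second.extend(take)
        visited.update(take)
    return first + second


AB, A, EMPTY = frozenset("AB"), frozenset("A"), frozenset()
clusters = {AB: ["r_ab", "x1", "x2"], A: ["r_a", "y1"], EMPTY: ["r_0"]}
order = [A, EMPTY]
rng = random.Random(0)
as_representative = select_positives("r_ab", AB, order, clusters, 1, 1, set(), rng)
as_member = select_positives("x1", AB, order, clusters, 2, 2, set(), rng)
```

Passing a seeded `random.Random` keeps the selection reproducible, and sharing one `visited` set across anchors is what lets later anchors favor samples that have not been used yet.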

It should be noted that, if the anchor sample is the representative sample of the target cluster, the purpose of taking the representative sample of each positive cluster as a positive sample of the anchor sample is to pull the representative samples of the positive clusters toward one another; since the samples in each positive cluster gather around its representative sample, similar positive clusters and the positive samples inside them become more compact in the encoded representation space.

Specifically, a representative sample in the target cluster is taken as a first positive sample to be added into a positive sample set, so that samples in the target cluster are gathered by taking the representative sample as a center; and randomly adding other samples in the target cluster as first positive samples into the positive sample set to enable the samples in the target cluster to be positive samples, so that the similarity in the cluster is enhanced. Meanwhile, the representative samples of each positive cluster are gathered towards the representative samples of the target clusters, so that the characterization space distance between the positive clusters is more compact, and the similarity between the positive clusters is further enhanced.

In practical application, each second positive sample is selected from the positive clusters of the anchor sample from bottom to top, following the hierarchical order of the positive clusters in the hierarchical cluster map; samples that have not been accessed are randomly selected with priority and added to the positive sample set, and each added sample is marked as accessed, until the number of added samples reaches the preset second positive sample number $n_{out}$.

S603: selecting, according to the anchor sample and the representative sample, negative samples equal in number to the negative sample number from the negative clusters, and adding each negative sample to a negative sample set of the anchor sample.

Optionally, whether the anchoring sample is a representative sample in the target cluster can be judged, if the anchoring sample is the representative sample in the target cluster, the representative sample in each negative cluster is taken as the negative sample of the anchoring sample until the number of negative samples is obtained, and each negative sample is added into the negative sample set of the anchoring sample; if the anchor sample is not a representative sample, negative samples are randomly selected from each negative cluster until the number of negative samples is obtained, and each negative sample is added to the negative sample set of the anchor sample.

In a specific application, it is determined whether the anchor sample $x_a$ is the representative sample of the target cluster. If the anchor sample $x_a$ is the representative sample of the target cluster, the representative samples of the negative clusters are added to the negative sample set of the anchor sample $x_a$ from bottom to top, following the hierarchical order of the negative clusters in the hierarchical cluster map, until the number of negative samples in the negative sample set reaches the preset negative sample number n. If the anchor sample $x_a$ is not the representative sample, samples are selected at random from the negative clusters from bottom to top, following the same hierarchical order and giving preference to samples that have not been accessed, and added to the negative sample set of the anchor sample $x_a$ until the number of negative samples in the negative sample set reaches the preset negative sample number n. This process is repeated until every sample in the training set has served as an anchor sample and the negative sample set of each anchor sample has been selected.
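A compact sketch of the negative-sample branch follows, focusing on the preference for samples that have not been accessed. As before, treating the first sample of each cluster as its representative and exhausting the lowest negative cluster first are assumptions of the sketch; `select_negatives` and the sample ids are hypothetical.

```python
import random


def select_negatives(anchor_is_representative, negative_order, clusters, n, visited, rng):
    """Pick up to n negative samples; negative_order runs from bottom to top."""
    if anchor_is_representative:
        # Representative anchor: use the representatives of the negative clusters.
        return [clusters[c][0] for c in negative_order][:n]
    chosen = []
    for centroid in negative_order:
        if len(chosen) >= n:
            break
        pool = list(clusters[centroid])
        rng.shuffle(pool)
        pool.sort(key=lambda s: s in visited)  # stable sort: unvisited before visited
        take = pool[: n - len(chosen)]
        chosen.extend(take)
        visited.update(take)  # mark the drawn samples as accessed
    return chosen


C, D = frozenset("C"), frozenset("D")
clusters = {C: ["r_c", "z1"], D: ["r_d"]}
rng = random.Random(0)
from_representatives = select_negatives(True, [C, D], clusters, 2, set(), rng)
visited = {"r_c"}
from_random = select_negatives(False, [C, D], clusters, 1, visited, rng)
```

Because `r_c` is already marked as accessed, the random branch returns `z1` regardless of the shuffle order.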

S604: generating a plurality of comparison sample pairs according to the anchoring sample, the positive sample set and the negative sample set; wherein the pair of comparison samples comprises an anchor sample, a positive sample and a negative sample.

In embodiments of the present application, after the positive sample set and the negative sample set of the anchor sample $x_a$ are determined, one positive sample $x_p$ and one negative sample $x_n$ may be selected without replacement from the positive sample set and the negative sample set of $x_a$, respectively, and the anchor sample $x_a$, the selected positive sample $x_p$ and the selected negative sample $x_n$ form a comparison sample pair $(x_a, x_p, x_n)$. This is repeated until n comparison sample pairs of the anchor sample $x_a$ are obtained. Cycling over the m anchor samples in this way yields n comparison sample pairs for each anchor sample, i.e. m×n in total. At the same time, by recording whether each sample has already served as a positive or negative sample, the method strives to have every sample serve as a positive or negative sample of some anchor sample.
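Drawing without replacement and pairing the draws into triplets can be sketched as below. The function name `make_triplets`, the sample ids and the per-anchor sample sets are hypothetical, and the same sets are reused for every anchor purely to keep the example short.

```python
import random


def make_triplets(anchor, positives, negatives, n, rng):
    """Form up to n (anchor, positive, negative) triplets without replacement."""
    count = min(n, len(positives), len(negatives))
    drawn_pos = rng.sample(positives, count)  # sample() never repeats an element
    drawn_neg = rng.sample(negatives, count)
    return [(anchor, p, q) for p, q in zip(drawn_pos, drawn_neg)]


rng = random.Random(0)
triplets = make_triplets("x_a", ["p1", "p2"], ["n1", "n2"], 2, rng)

# m anchors with n triplets each give m * n triplets in total.
anchors = ["a1", "a2", "a3"]
all_triplets = [t for a in anchors for t in make_triplets(a, ["p1", "p2"], ["n1", "n2"], 2, rng)]
```

Here m = 3 and n = 2, so `all_triplets` holds 6 triplets.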

It should be noted that the comparison sample pairs of the anchor sample are expressed as triplets $(x_a, x_p, x_n)$, which describe the relation among the anchor sample $x_a$, the positive sample $x_p$ and the negative sample $x_n$. Combined with the contrast learning loss function, this makes the loss value of the deep learning model based on contrast learning trend downward overall, while the encoder of the multi-modal data obtains better performance.
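The embodiment does not name a specific loss function. Purely as an illustration of how a triplet can enter a loss, the sketch below uses the widely known triplet margin loss with Euclidean distance, which is an assumption of this example and not a statement of the method; the function name and the `margin` parameter are likewise hypothetical, and the inputs stand for already encoded sample vectors.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = math.dist(anchor, positive)  # Euclidean distance, Python 3.8+
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Positive close to the anchor and negative far away: no loss.
well_separated = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: the loss penalizes the violated ordering.
badly_separated = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss is zero once the negative sample is farther from the anchor than the positive sample by at least the margin, which matches the goal of pulling positive samples together and pushing negative samples away.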

The invention provides a multi-mode contrast learning sample construction method based on hierarchical clustering, in which the target cluster corresponding to an anchor sample is determined according to the supervision information of the anchor sample. The hierarchical cluster map is generated by connecting the clusters to one another according to the inclusion relations among their cluster centroids, and the clusters are obtained by clustering the sample set with different combinations of supervision information as cluster centroids. Taking the target cluster corresponding to the anchor sample as the starting position, a search is performed according to the association relation of the cluster neighborhood and the topology of the hierarchical cluster map to determine all positive clusters and all divergent clusters corresponding to the anchor sample. All negative clusters of the anchor sample are determined according to all clusters in the hierarchical cluster map and all positive clusters and all divergent clusters of the anchor sample. A positive sample set and a negative sample set corresponding to the anchor sample are determined according to the anchor sample, its target cluster, all positive clusters and all negative clusters, so that a preset number of comparison sample pairs of the anchor sample are generated from the anchor sample, the positive sample set and the negative sample set, and a deep learning model based on contrast learning is trained with the comparison sample pairs. With the technical scheme provided by the invention, the constraint of the hierarchical relation among the clusters in the hierarchical cluster map can be used to construct high-quality comparison sample pairs for each anchor sample, and the deep learning model based on contrast learning is trained and optimized with the constructed comparison sample pairs, so that the encoder of the multi-modal data achieves better discrimination and robustness at the level of data feature representation and the model obtains higher performance indexes when processing downstream tasks.

Based on the above-mentioned multi-modal comparison learning sample construction method based on hierarchical clustering provided in the embodiment of the present application, correspondingly, the embodiment of the present application further provides a multi-modal comparison learning sample construction system based on hierarchical clustering, as shown in fig. 7, where the multi-modal comparison learning sample construction system based on hierarchical clustering specifically includes:

a first determining unit 71, configured to determine a target cluster in which the anchor sample is located from the clusters of the hierarchical cluster map; the hierarchical clustering map is generated by a hierarchical clustering map generating unit by utilizing each cluster, wherein each cluster is obtained by clustering each sample by a cluster generating unit according to the supervision information of each sample obtained by processing the original multi-modal data, and the anchoring sample is any sample in any cluster;

a first cluster determining unit 72, configured to traverse the hierarchical cluster map with the position of the target cluster in the hierarchical cluster map as a starting position according to the target cluster centroid of the target cluster and the supervision information of the anchor sample, and determine at least one positive cluster and at least one divergent cluster corresponding to the anchor sample from the hierarchical cluster map;

a second cluster determining unit 73, configured to determine at least one negative cluster corresponding to the anchor sample according to each cluster, each positive cluster, and each divergent cluster in the hierarchical cluster map;

a comparison sample pair generating unit 74, configured to generate a plurality of comparison sample pairs according to the anchor sample, the target cluster, the positive samples in each positive cluster, and the negative samples in each negative cluster, so as to train the model by using the comparison sample pairs.
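As an illustrative sketch only, the work of the second cluster determining unit 73 can be read as a set difference over the clusters of the hierarchical cluster map. The sketch assumes that clusters are identified by hashable IDs and that the target cluster itself is also excluded; the function name `negative_clusters` and the example IDs are hypothetical and do not come from the embodiments.

```python
def negative_clusters(all_clusters, target, positive, divergent):
    """Return the clusters that are neither target, positive, nor divergent.

    Assumption (not stated in the source): a negative cluster is whatever
    remains in the cluster map after these exclusions.
    """
    excluded = {target} | set(positive) | set(divergent)
    # Keep the original ordering of the cluster map for reproducibility.
    return [c for c in all_clusters if c not in excluded]


# Hypothetical cluster IDs, for illustration only.
all_ids = ["c0", "c1", "c2", "c3", "c4"]
negatives = negative_clusters(all_ids, target="c0", positive=["c1"], divergent=["c2"])
print(negatives)
```

Under this reading, removing the divergent clusters before sampling keeps partially related clusters out of both the positive and the negative sample sets.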

The specific principle and execution process of each unit in the multi-modal comparison learning sample construction system based on hierarchical clustering disclosed in the above embodiment of the present invention are the same as those of the multi-modal comparison learning sample construction method based on hierarchical clustering shown in FIG. 1. Reference may be made to the corresponding parts of that method, which are not described in detail again here.

The invention provides a multi-modal comparison learning sample construction system based on hierarchical clustering, which determines the target cluster in which an anchor sample is located from the clusters of a hierarchical cluster map. The hierarchical cluster map is generated from the clusters, each cluster is obtained by clustering the samples according to the supervision information of each sample obtained by processing the original multi-modal data, and the anchor sample is any sample in any cluster. According to the target cluster centroid of the target cluster and the supervision information of the anchor sample, the system traverses the hierarchical cluster map with the position of the target cluster in the hierarchical cluster map as the starting position, and determines at least one positive cluster and at least one divergent cluster corresponding to the anchor sample. It determines at least one negative cluster corresponding to the anchor sample according to each cluster, each positive cluster and each divergent cluster in the hierarchical cluster map, and generates a plurality of comparison sample pairs according to the anchor sample, the target cluster, the positive samples in each positive cluster and the negative samples in each negative cluster, so as to train a deep learning model based on comparison learning by using the comparison sample pairs. With this technical scheme, the constraint of the hierarchical relation among the clusters in the hierarchical cluster map is used to correctly partition the positive clusters, divergent clusters and negative clusters corresponding to the anchor sample, and the divergent clusters are removed. High-quality comparison sample pairs of the anchor sample are then constructed from the anchor sample, the target cluster, the positive clusters and the negative clusters and used to train and optimize the deep learning model based on comparison learning, so that the encoder of the multi-modal data has better discrimination and robustness at the level of feature representation, and the model achieves higher performance indexes when processing downstream tasks.

Optionally, the cluster generating unit includes:

the preprocessing unit is used for preprocessing the acquired original multi-mode data to obtain a sample set; wherein the sample set comprises a plurality of samples and their associated other modality data;

the extraction unit is used for extracting, for each sample, at least one kind of supervision information from the other modal data associated with the sample according to an extraction mode matched with the data type of that other modal data;

the persistence unit is used for generating a corresponding supervision information vector according to the supervision information corresponding to the sample and persistently storing the supervision information vector;

and the unsupervised clustering unit is used for carrying out unsupervised grouping on each sample in the sample set according to the supervision information in the supervision information vector corresponding to each sample to obtain a plurality of clusters, wherein each cluster comprises at least a cluster centroid, and the cluster centroid is a combination of the supervision information.
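The persistence step can be illustrated with a minimal sketch. It assumes a multi-hot encoding over a fixed vocabulary of supervision information and uses a JSON string as a stand-in for persistent storage; neither choice is specified in the embodiments, and the names `to_supervision_vector`, `persist_vectors` and the labels are hypothetical.

```python
import json


def to_supervision_vector(labels, vocabulary):
    """Encode a sample's supervision information as a multi-hot vector.

    Assumption: one position per vocabulary entry, 1 if present, else 0.
    """
    present = set(labels)
    return [1 if term in present else 0 for term in vocabulary]


def persist_vectors(vectors):
    """Serialize the vectors to JSON (stand-in for a database or file)."""
    return json.dumps(vectors, sort_keys=True)


# Hypothetical vocabulary and samples, for illustration only.
vocabulary = ["label_a", "label_b", "label_c"]
vectors = {
    "s1": to_supervision_vector(["label_a"], vocabulary),
    "s2": to_supervision_vector(["label_c", "label_a"], vocabulary),
}
stored = persist_vectors(vectors)
restored = json.loads(stored)
```

Storing the vectors once means the clustering step can be rerun without repeating the extraction.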

Optionally, the extracting unit includes:

the first extraction subunit is configured to extract, for each sample, at least one kind of supervision information from the other modal data associated with the sample based on a natural language processing method if the data type of the other modal data associated with the sample is an unstructured text type;

the second extraction subunit is configured to extract at least one kind of supervision information from the other modal data associated with the sample based on a preset rule if the data type of the other modal data associated with the sample is a non-numeric structured type;

and the third extraction subunit is used for extracting at least one initial keyword from the other modal data associated with the sample based on a preset rule if the data type of the other modal data associated with the sample is a numerical structured type, and converting each initial keyword into the supervision information of the component type.
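The three extraction subunits amount to a dispatch on the data type of the associated modal data. The sketch below is an assumption-laden illustration: keyword spotting stands in for the natural language processing method, a lookup table stands in for the preset rule, and threshold binning stands in for converting initial keywords from numeric data into supervision information. The function name, type tags, and parameters are all hypothetical.

```python
def extract_supervision(data, data_type, keywords=(), rules=None, bins=()):
    """Extract supervision information according to the data type."""
    rules = rules or {}
    if data_type == "unstructured_text":
        # Stand-in for an NLP method: case-insensitive keyword spotting.
        text = data.lower()
        return [k for k in keywords if k.lower() in text]
    if data_type == "structured_non_numeric":
        # Preset rule: map raw field values to supervision labels.
        return [rules[v] for v in data if v in rules]
    if data_type == "structured_numeric":
        # Preset rule: assign each value to the first bin whose upper
        # bound exceeds it, yielding a discrete label.
        labels = []
        for name, value in data.items():
            for upper, label in bins:
                if value < upper:
                    labels.append(f"{name}:{label}")
                    break
        return labels
    raise ValueError(f"unsupported data type: {data_type}")


from_text = extract_supervision(
    "Report notes Label_A and label_b.", "unstructured_text",
    keywords=("label_a", "label_c"))
from_codes = extract_supervision(
    ["X1", "Z9"], "structured_non_numeric", rules={"X1": "label_a"})
from_numbers = extract_supervision(
    {"score": 7.5}, "structured_numeric", bins=((5, "low"), (10, "high")))
```

A real system would likely replace the keyword matcher with a proper entity extraction model; the dispatch structure is the point of the sketch.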

Optionally, the unsupervised clustering unit includes:

a supervision information combination generating unit for generating a plurality of supervision information combinations according to the supervision information in each supervision information vector in the sample set;

and the unsupervised clustering subunit is used for, for each supervision information combination, carrying out unsupervised grouping on the samples with that supervision information combination as the cluster centroid, to obtain the cluster corresponding to the cluster centroid.
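A minimal sketch of this grouping follows. It assumes that each sample is assigned to the cluster whose centroid is exactly its own set of supervision information, and that only combinations observed in the sample set become centroids; the embodiments do not fix either detail. The name `cluster_by_combination` and the sample data are hypothetical.

```python
from collections import defaultdict


def cluster_by_combination(sample_labels):
    """Group samples by their combination of supervision information.

    Each distinct combination (as a frozenset) acts as a cluster centroid.
    """
    clusters = defaultdict(list)
    for sample_id, labels in sample_labels.items():
        centroid = frozenset(labels)  # order of labels does not matter
        clusters[centroid].append(sample_id)
    return dict(clusters)


# Hypothetical samples and labels, for illustration only.
samples = {"s1": ["a"], "s2": ["a", "b"], "s3": ["b", "a"], "s4": ["c"]}
clusters = cluster_by_combination(samples)
```

Using a frozenset as the centroid makes the inclusion relation between centroids directly testable with set comparisons.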

Optionally, the hierarchical cluster map generating unit includes:

and the hierarchical cluster map generation subunit is used for hierarchically ordering all the clusters according to the inclusion relation among the cluster centroids of all the clusters to obtain a cluster sequence, and connecting all the clusters in the cluster sequence to generate a hierarchical cluster map.
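One way to picture the hierarchical ordering and connection is as a cover graph over set inclusion. The sketch assumes centroids are sets of supervision information, orders them by size, and links each centroid to its immediate supersets only; the embodiments do not state whether only immediate inclusions are connected, and `build_hierarchy` is a hypothetical name.

```python
def build_hierarchy(centroids):
    """Order centroids by size and connect immediate inclusion pairs."""
    # Hierarchical ordering: smaller combinations first, ties broken by name.
    ordered = sorted(centroids, key=lambda c: (len(c), sorted(c)))
    edges = []
    for i, small in enumerate(ordered):
        for big in ordered[i + 1:]:
            # Connect only if no third centroid lies strictly between them.
            if small < big and not any(small < mid < big for mid in ordered):
                edges.append((small, big))
    return ordered, edges


A, B = frozenset("a"), frozenset("b")
AB, ABC = frozenset("ab"), frozenset("abc")
ordered, edges = build_hierarchy([ABC, AB, B, A])
```

Here `small < big` is Python's proper-subset test on frozensets, so the edges follow the inclusion relation among cluster centroids.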

Optionally, the first cluster determining unit includes:

the first cluster determining subunit is used for traversing the hierarchical cluster map by taking the position of the target cluster in the hierarchical cluster map as the starting position, determining each cluster in the hierarchical cluster map whose cluster centroid is a subset of the target cluster centroid as a positive cluster of the anchor sample, and determining each cluster in the hierarchical cluster map having both a positive cluster attribute and a negative cluster attribute as a divergent cluster of the anchor sample;
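The partition into positive and divergent clusters can be sketched with set relations on the centroids. The sketch rests on two assumptions that the text does not confirm: a positive cluster is one whose centroid is a non-empty proper subset of the target centroid, and a cluster with both positive and negative attributes is one whose centroid overlaps the target centroid without being a subset of it. It also replaces graph traversal with a plain scan, and `classify_clusters` is a hypothetical name.

```python
def classify_clusters(target, centroids):
    """Split centroids into positive and divergent relative to the target."""
    positive, divergent = [], []
    for c in centroids:
        if c == target:
            continue  # the target cluster is handled separately
        if c and c < target:
            positive.append(c)   # all of its information is shared
        elif c & target:
            divergent.append(c)  # shares some information, adds other
    return positive, divergent


fs = frozenset
target = fs("ab")
centroids = [fs("a"), fs("b"), fs("ab"), fs("ac"), fs("c"), fs("abc")]
positive, divergent = classify_clusters(target, centroids)
```

In this example the centroid containing only "c" falls into neither list, which under the stated assumptions leaves it as a candidate negative cluster.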

Optionally, the comparison sample pair generating unit includes:

a second determining unit for determining a representative sample from the target clusters;

the positive sample determining unit is used for determining a positive sample number of positive samples from the target cluster and the positive clusters according to the anchor sample and the representative sample, and adding each positive sample to the positive sample set of the anchor sample; wherein the positive sample number and the negative sample number of the anchor sample are equal, and each is equal to the preset number of comparison sample pairs to be generated;

the negative sample determining unit is used for determining a negative sample number of negative samples from the negative clusters according to the anchor sample and the representative sample, and adding each negative sample to the negative sample set of the anchor sample;

a comparison sample pair generating subunit, configured to generate a plurality of comparison sample pairs according to the anchor sample, the positive sample set, and the negative sample set; wherein each comparison sample pair comprises an anchor sample, a positive sample and a negative sample.
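Because the positive sample number and the negative sample number both equal the preset number of comparison sample pairs, assembling the pairs can be sketched as a simple zip. The one-to-one pairing by position is an assumption, and `make_comparison_pairs` and the sample IDs are hypothetical.

```python
def make_comparison_pairs(anchor, positives, negatives, num_pairs):
    """Build (anchor, positive, negative) tuples, one per requested pair."""
    if len(positives) != num_pairs or len(negatives) != num_pairs:
        raise ValueError("positive and negative sets must each hold num_pairs samples")
    # Assumption: the i-th positive is paired with the i-th negative.
    return [(anchor, p, n) for p, n in zip(positives, negatives)]


pairs = make_comparison_pairs("x0", ["p1", "p2"], ["n1", "n2"], num_pairs=2)
print(pairs)
```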

Optionally, the positive sample determining unit includes:

the third determining unit is used for, if the anchor sample is the representative sample, taking the representative samples of the positive clusters as positive samples of the anchor sample until the positive sample number of positive samples is obtained, and adding these positive samples to the positive sample set of the anchor sample;

a fourth determining unit, configured to take the representative sample as a first positive sample of the anchor sample if the anchor sample is not the representative sample, and randomly select at least one sample from the target clusters as the first positive sample until the number of first positive samples is obtained;

a selecting unit, configured to randomly select a second positive sample number of second positive samples from each positive cluster;

an adding unit for adding each first positive sample and each second positive sample to the positive sample set of the anchor sample; wherein the number of positive samples in the positive sample set is equal to the positive sample number.
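The two branches of positive sample selection can be sketched as follows. Several details are assumptions: second positive samples are drawn from the pooled positive clusters rather than per cluster, the anchor itself is never drawn as its own positive, and enough candidates exist for each draw. All names and sample IDs are hypothetical.

```python
import random


def select_positives(anchor, target_cluster, representative,
                     positive_clusters, positive_reps,
                     n_first, n_second, rng):
    """Pick n_first + n_second positive samples for the anchor."""
    if anchor == representative:
        # Representative anchor: use representatives of the positive clusters.
        return positive_reps[: n_first + n_second]
    # First positives: the representative plus random target-cluster samples.
    candidates = [s for s in target_cluster if s not in (anchor, representative)]
    first = [representative] + rng.sample(candidates, n_first - 1)
    # Second positives: random samples pooled from all positive clusters.
    pool = [s for cluster in positive_clusters for s in cluster]
    second = rng.sample(pool, n_second)
    return first + second


rng = random.Random(0)  # seeded only to make the illustration repeatable
target_cluster = ["t0", "t1", "t2", "t3"]
positive_clusters = [["p0", "p1"], ["q0", "q1"]]
positives = select_positives("t1", target_cluster, "t0", positive_clusters,
                             ["p0", "q0"], n_first=2, n_second=2, rng=rng)
rep_positives = select_positives("t0", target_cluster, "t0", positive_clusters,
                                 ["p0", "q0"], n_first=1, n_second=1, rng=rng)
```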

Optionally, the negative-sample determining unit includes:

a fifth determining unit, configured to, if the anchor sample is a representative sample, take the representative sample in each negative cluster as a negative sample of the anchor sample until a negative sample number of negative samples is obtained, and add each negative sample to a negative sample set of the anchor sample;

and a sixth determining unit, configured to randomly select negative samples from each negative cluster until a negative sample number is obtained if the anchor sample is not a representative sample, and add each negative sample to the negative sample set of the anchor sample.
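Negative sample selection follows the same two branches. The sketch assumes that random negatives are drawn without replacement from the pooled negative clusters and that the representative of each negative cluster is known in advance; `select_negatives` and the sample IDs are hypothetical.

```python
import random


def select_negatives(anchor_is_representative, negative_clusters,
                     negative_reps, n_negative, rng):
    """Pick n_negative negative samples for the anchor."""
    if anchor_is_representative:
        # Representative anchor: use representatives of the negative clusters.
        return negative_reps[:n_negative]
    # Otherwise draw at random from all negative clusters pooled together.
    pool = [s for cluster in negative_clusters for s in cluster]
    return rng.sample(pool, n_negative)


rng = random.Random(0)  # seeded only to make the illustration repeatable
negative_clusters = [["n0", "n1"], ["m0", "m1", "m2"]]
rep_negatives = select_negatives(True, negative_clusters, ["n0", "m0"], 2, rng)
random_negatives = select_negatives(False, negative_clusters, ["n0", "m0"], 3, rng)
```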

In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively brief, and reference may be made to the relevant parts of the description of the method embodiment. The system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of protection of the present invention.

Claims

1. The multi-mode contrast learning sample construction method based on hierarchical clustering is characterized by comprising the following steps of:

according to the target cluster centroid of the target clusters and the supervision information of the anchoring samples, traversing the hierarchical cluster map by taking the position of the target clusters in the hierarchical cluster map as a starting position, and determining at least one positive cluster and at least one divergent cluster corresponding to the anchoring samples from the hierarchical cluster map;

generating a plurality of comparison sample pairs according to the anchoring samples, the target clusters, positive samples in each positive cluster and negative samples in each negative cluster so as to train a model by using the comparison sample pairs;

clustering each sample according to the supervision information of each sample obtained by processing the original multi-modal data to obtain each cluster, including: preprocessing the acquired original multi-modal data to obtain a sample set; wherein the sample set comprises a plurality of samples and their associated other modal data; for each sample, extracting at least one kind of supervision information from the other modal data associated with the sample according to an extraction mode matched with the data type of the other modal data associated with the sample; generating a corresponding supervision information vector according to the supervision information corresponding to the sample, and persistently storing the supervision information vector; performing unsupervised grouping on each sample in the sample set according to the supervision information in the supervision information vector corresponding to each sample to obtain a plurality of clusters, wherein each cluster comprises at least a cluster centroid, and the cluster centroid is a combination of the supervision information;

for each sample, extracting at least one kind of supervision information from the other modal data associated with the sample according to an extraction mode matched with the data type of the other modal data associated with the sample comprises: for each sample, if the data type of the other modal data associated with the sample is an unstructured text type, extracting at least one kind of supervision information from the other modal data associated with the sample based on a natural language processing method; if the data type of the other modal data associated with the sample is a non-numerical structured type, extracting at least one kind of supervision information from the other modal data associated with the sample based on a preset rule; and if the data type of the other modal data associated with the sample is a numerical structured type, extracting at least one initial keyword from the other modal data associated with the sample based on the preset rule, and converting each initial keyword into the supervision information of the component type.

2. The method according to claim 1, wherein performing unsupervised grouping on the samples in the sample set according to the supervision information in the supervision information vector corresponding to each sample to obtain a plurality of clusters includes:

3. The method of claim 1, wherein generating a hierarchical cluster map using the respective clusters comprises:

and according to the inclusion relation among the cluster centroids of all the clusters, carrying out hierarchical sequencing on all the clusters to obtain a cluster sequence, and connecting all the clusters in the cluster sequence to generate a hierarchical cluster map.

4. The method according to claim 1, wherein traversing the hierarchical cluster map with the position of the target cluster in the hierarchical cluster map as a starting position according to the target cluster centroid of the target cluster and the supervision information of the anchor sample, and determining at least one positive cluster and at least one divergent cluster corresponding to the anchor sample from the hierarchical cluster map, comprises:

traversing the hierarchical cluster map by taking the position of the target cluster in the hierarchical cluster map as a starting position, determining each cluster in the hierarchical cluster map whose cluster centroid is a subset of the target cluster centroid as a positive cluster of the anchoring sample, and determining each cluster in the hierarchical cluster map having both a positive cluster attribute and a negative cluster attribute as a divergent cluster of the anchoring sample;

5. The method of claim 1, wherein the generating a plurality of contrast sample pairs from the anchor samples, the target clusters, positive samples in each of the positive clusters, and negative samples in each of the negative clusters comprises:

determining a representative sample from the target cluster;

6. The method of claim 5, wherein the positive number of samples comprises a first positive number of samples and a second positive number of samples, wherein determining the positive number of samples from the target cluster and each positive cluster based on the anchor sample and the representative sample, and adding each positive sample to the set of positive samples of the anchor sample comprises:

7. The method of claim 5, wherein determining the number of negative samples from each of the negative clusters based on the anchor samples and the representative samples, and adding each of the negative samples to a negative set of the anchor samples, comprises:

8. A sample construction system, the system comprising:

the first cluster determining unit is used for traversing the hierarchical cluster map by taking the position of the target cluster in the hierarchical cluster map as a starting position according to the target cluster centroid of the target cluster and the supervision information of the anchoring sample, and determining at least one positive cluster and at least one divergent cluster corresponding to the anchoring sample from the hierarchical cluster map;

a comparison sample pair generating unit, configured to generate a plurality of comparison sample pairs according to the anchor samples, the target clusters, positive samples in each positive cluster, and negative samples in each negative cluster, so as to train a model by using the comparison sample pairs;

the cluster generation unit includes: the device comprises a preprocessing unit, an extraction unit, a persistence unit and an unsupervised clustering unit;

the unsupervised clustering unit is configured to perform unsupervised grouping on each sample in the sample set according to the supervision information in the supervision information vector corresponding to each sample, so as to obtain a plurality of clusters, where the clusters at least include cluster centroids, and the cluster centroids are a combination of the supervision information;

the extraction unit includes: a first extraction subunit, a second extraction subunit, and a third extraction subunit;

the first extraction subunit is configured to extract, for each sample, at least one kind of supervision information from other modal data associated with the sample based on a natural language processing method if a data type of the other modal data associated with the sample is an unstructured text type;

the third extraction subunit is configured to extract at least one initial keyword from the other modal data associated with the sample based on the preset rule if the data type of the other modal data associated with the sample is a numeric structured type, and convert each initial keyword into supervision information of component type.