CN114298122B - Data classification method, apparatus, device, storage medium and computer program product


Info

Publication number: CN114298122B
Application number: CN202111232559.3A
Authority: CN (China)
Prior art keywords: data, sample data, classification, classification model, sample
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN114298122A
Inventor: 郭卉 (Guo Hui)
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111232559.3A
Publication of application CN114298122A
Application granted; publication of CN114298122B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data classification method, apparatus, device, storage medium and computer program product, and relates to the field of machine learning. The method comprises the following steps: acquiring semantic features of sample data in a sample data set; clustering the sample data set based on the semantic features to obtain candidate clustering results; extracting data features of the sample data through a data classification model; determining a center characterization corresponding to a target cluster based on the data features; and performing iterative training on the data classification model based on the center characterization to obtain a target classification model. By integrating the semantic features corresponding to the sample data into the data classification model for joint iterative training, the training efficiency of the data classification model is improved, as are the classification accuracy of the target classification model and the relevance of the classification retrieval results.

Description

Data classification method, apparatus, device, storage medium and computer program product
Technical Field
Embodiments of the present application relate to the field of machine learning, and in particular, to a data classification method, apparatus, device, storage medium, and computer program product.
Background
Data classification is commonly used in fields such as speech classification retrieval and image classification retrieval. Large-scale data is classified based on the similarity of data features, so that data with similar features form one classification set. In the process of data retrieval, the category corresponding to the search data is determined first, and then data in that category is selected to obtain the corresponding data, completing the data retrieval.
In the related art, vector retrieval is generally performed by extracting a feature vector of the input data, selecting a data category associated with the input data based on that feature vector, determining the classification category corresponding to the input data based on the data category, and extracting the data in that category as the retrieval result corresponding to the input data.
However, because this data classification method performs classification retrieval only on the extracted feature vector of the input data, the semantic information contained in the input data is ignored. The retrieval results may therefore be similar in appearance yet semantically irrelevant, and the accuracy of classification retrieval is low.
Disclosure of Invention
The embodiments of the application provide a data classification method, apparatus, device, storage medium, and computer program product, which can improve the accuracy of data classification. The technical solution is as follows.
In one aspect, a method for classifying data is provided, the method comprising:
extracting semantic features of sample data in a sample data set, wherein the semantic features are used for indicating classification semantics corresponding to the sample data;
clustering the sample data set based on the semantic features to obtain a candidate clustering result corresponding to the sample data set, wherein the candidate clustering result is used for aggregating sample data with semantic association relations in the sample data set into the same cluster;
extracting data features of the sample data through a data classification model, wherein the data features are used for indicating data element distribution features corresponding to the sample data;
determining a center characterization corresponding to a target cluster in the candidate clustering result based on the data features;
and performing iterative training on the data classification model based on the center characterization to obtain a target classification model, wherein the target classification model is used for classifying and retrieving input data.
In another aspect, a data classification apparatus is provided, the apparatus comprising:
an extraction module, configured to extract semantic features of sample data in a sample data set, wherein the semantic features are used for indicating classification semantics corresponding to the sample data;
a clustering module, configured to cluster the sample data set based on the semantic features to obtain a candidate clustering result corresponding to the sample data set, wherein the candidate clustering result is used for aggregating sample data with semantic association relations in the sample data set into the same cluster;
the extraction module being further configured to extract data features of the sample data through a data classification model, wherein the data features are used for indicating data element distribution features corresponding to the sample data;
a determining module, configured to determine a center characterization corresponding to a target cluster in the candidate clustering result based on the data features;
a training module, configured to perform iterative training on the data classification model based on the center characterization to obtain a target classification model, wherein the target classification model is used for classifying and retrieving input data.
In another aspect, a computer device is provided, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, where the at least one instruction, the at least one program, the set of codes, or the set of instructions are loaded and executed by the processor to implement a data classification method according to any one of the embodiments of the application described above.
In another aspect, a computer readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement a data classification method according to any one of the embodiments of the application described above.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the data classification method according to any one of the above embodiments.
The technical solutions provided by the embodiments of the application have at least the following beneficial effects:
In the training process of the target classification model, a candidate clustering result corresponding to the sample data is obtained based on the semantic features of the sample data, the data features of the sample data are extracted through the data classification model, the center characterization corresponding to a target cluster in the candidate clustering result is determined based on the data features, and the center characterization is used as a training parameter for iteratively training the data classification model, finally obtaining the target classification model used for classification retrieval. Because the center characterization of each target cluster is obtained from the semantic features, iteratively training the data classification model on it builds the semantic information into the model during training, thereby improving the accuracy of classification retrieval and the relevance of the retrieval results, and improving the efficiency of classification retrieval.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a process schematic diagram of a semantic feature based image classification method according to an exemplary embodiment of the present application;
FIG. 2 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method of classifying data provided in an exemplary embodiment of the application;
FIG. 4 is a flow chart of a data classification method provided by another exemplary embodiment of the application;
FIG. 5 is a schematic diagram of a ResNet-101 residual module architecture provided by an exemplary embodiment of the present application;
FIG. 6 is a flow chart of a data classification method provided by another exemplary embodiment of the application;
FIG. 7 is a flow chart of a data classification method provided by another exemplary embodiment of the application;
FIG. 8 is a block diagram of a data classification apparatus according to an exemplary embodiment of the present application;
FIG. 9 is a block diagram of a data classification apparatus according to another exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, a brief description will be given of terms involved in the embodiments of the present application.
Artificial Intelligence (AI): a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Computer Vision (CV): a science that studies how to make machines "see"; more specifically, it refers to machine vision in which cameras and computers replace human eyes to recognize, track, and measure targets, with further graphics processing so that the computer produces images more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
Image classification: classification at the image level, in which an object is identified by considering only its category (e.g., person, dog, cat, bird) without considering the specific instance of the object. For example: given an image, the category corresponding to the object in the image is determined (where the categories are designed in advance).
The data classification method provided by the embodiments of the application is mainly applied to data retrieval. Taking image retrieval as an example: vector retrieval using the image embedding-layer features (Embedding) obtained by deep learning training as feature vectors is a common technique in current large-scale image retrieval. That is, semantic features of sample data are extracted from a large sample data set, the center characterizations of the clusters corresponding to the sample data are obtained based on the semantic features, and the data classification model is iteratively trained on them, so that the Embedding learned for image retrieval contains the classification semantics corresponding to the sample data, improving the classification retrieval efficiency and accuracy of the target classification model. Referring schematically to FIG. 1, which shows a process schematic diagram of an image classification method based on semantic features according to an exemplary embodiment of the present application: as shown in FIG. 1, an image 101 is input into a target classification model 102 to extract the data features corresponding to the image 101; the classification category 103 corresponding to the image 101 is determined based on those data features, a similarity threshold being used to judge the similarity between the image 101 and the classification center of the classification category 103; classified images 104 are obtained from the classification category 103; the distance between the data features of the image 101 and the data features of each classified image 104 is calculated; the images are sorted from small to large according to the calculation results; and the retrieval result 105 is determined based on the sorting result.
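As a rough illustration of this retrieval flow, the following sketch ranks the images of the selected category by feature distance. It is a minimal sketch; `class_centers` and `gallery` are hypothetical stand-ins for the per-category classification centers and the per-category image features, which the source does not name.

```python
import numpy as np

def classify_and_retrieve(image_feature, class_centers, gallery, top_k=10):
    """Hypothetical sketch of the FIG. 1 flow: pick the nearest classification
    category, then rank the images in that category by feature distance."""
    # 1. Determine the classification category: nearest classification center.
    dists = np.linalg.norm(class_centers - image_feature, axis=1)
    category = int(np.argmin(dists))
    # 2. Rank the images inside the chosen category from small to large distance.
    candidates = gallery[category]                # shape: (num_images, feat_dim)
    cand_dists = np.linalg.norm(candidates - image_feature, axis=1)
    order = np.argsort(cand_dists)[:top_k]
    return category, order                        # retrieval result: image indices
```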
The target classification model trained by the method comprises at least one of the following scenes when applied.
1. Applied to an image classification retrieval scene. Schematically, a target image containing a cartoon character is input; the image features of the cartoon character are determined through the target classification model; the classification category corresponding to the image features is determined based on those features; and images corresponding to the cartoon character in the input target image are obtained from that classification category. The classification retrieval result can be used in an image-recognition search function. For example: after a user uploads an image containing a cartoon character to a server, the server determines the category corresponding to the input image according to the classification result, acquires images in that category that meet the user's requirements (such as a quantity requirement or a similarity requirement), and feeds them back to the user;
2. Applied to a speech classification retrieval scene. Schematically, after a target speech text is input into the target classification model, feature extraction is performed on it to obtain the corresponding speech features; the speech category corresponding to the target speech text is determined based on the speech features; and speech texts corresponding to the input target speech text are obtained from that category. The classification retrieval result can be used in a speech dialect recognition function. For example: after a user uploads a piece of speech content to a server, the server determines the dialect category corresponding to the input speech content according to the dialect classification result and feeds back speech content in that dialect category that meets the user's requirements, which is convenient for the user to learn the dialect;
3. Applied to a scene of recommending content to a user. Schematically, historical interaction data of the user is obtained, for example, historical image browsing data; feature extraction is performed on the historical image browsing data to obtain the corresponding image features; the image classification category corresponding to the historical image browsing data is determined based on the image features; images relevant to the historical image browsing data are determined from that category as the classification retrieval result; and the classification retrieval result is fed back to the user as recommended content.
It should be noted that the above application scenario is only an illustrative example, and the data classification method provided in the embodiment of the present application may also be applied to other scenarios, such as: scenes in which a plurality of texts are classified and searched, etc., which are not limited in the embodiment of the present application.
Next, an implementation environment according to an embodiment of the present application will be described, schematically, with reference to fig. 2, where a terminal 210 and a server 220 are involved, and the terminal 210 and the server 220 are connected through a communication network 230.
In some embodiments, the terminal 210 is configured to send the search data to be categorized to the server 220. In some embodiments, the terminal 210 has installed therein an application having a search function, illustratively, the terminal 210 has installed therein an application having an image search function; or an application program having a classification function is installed in the terminal 210. Such as: the terminal 210 is installed with a search engine program, an instant messaging application program, a game program, etc., which is not limited in the embodiment of the present application.
The server 220 holds the classification results predicted by the target classification model, retrieves the search data to be classified according to the classification results, and feeds the retrieval result back to the terminal 210 for display.
The target classification model is obtained by classifying and training sample data in a sample data set. Extracting semantic features of sample data in a sample data set, clustering the sample data set based on the semantic features to obtain corresponding candidate clustering results, extracting data features of the sample data through a data classification model, determining center characterization corresponding to target clustering in the candidate clustering results based on the data features, and performing iterative training on the data classification model according to the center characterization to finally obtain a target classification model, so that data retrieval is completed through classification results in the target classification model.
The terminal may be a mobile phone, a tablet computer, a desktop computer, a portable notebook computer, an intelligent television, an intelligent vehicle-mounted terminal device, and the embodiment of the application is not limited thereto.
It should be noted that the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and an artificial intelligence platform.
Cloud technology refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to realize computation, storage, processing, and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like that are applied based on the cloud computing business model; it can form a resource pool, which is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems, such as video websites, picture websites, and other portal websites, require a large amount of computing and storage resources. With the development of the internet industry, each article may have its own identification mark in the future, which will need to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong system backing, which can only be realized through cloud computing.
In some embodiments, the servers described above may also be implemented as nodes in a blockchain system. Blockchain (Blockchain) is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, encryption algorithms, and the like. The blockchain is essentially a decentralised database, and is a series of data blocks which are generated by association by using a cryptography method, and each data block contains information of a batch of network transactions and is used for verifying the validity (anti-counterfeiting) of the information and generating a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The data classification method provided by the application is described below in combination with the above term introduction and application scenarios. Referring to FIG. 3, the method may be executed by a server or a terminal, or by the server and the terminal together; in the embodiment of the present application, execution by the server is taken as an example, and the method includes the following steps:
In step 301, semantic features of sample data in a sample data set are extracted.
The semantic features are used for indicating classification semantics corresponding to the sample data.
In some embodiments, the sample data in the same sample data set are of the same data type, such as: the sample data in the sample data set is image data, text data, voice data, color data, etc., which is not limited in the embodiment of the present application.
Optionally, taking the sample data being image data as an example, the sample data set may be ImageNet, a large-scale open-source data set for general object recognition, or an Open Image data set, which is not limited in this embodiment.
Illustratively, the semantic features of the sample data are used to indicate classification semantic information corresponding to the sample data, and the semantic features contained in the single sample data correspond to one or more classification categories, where the classification categories are already preset, for example: taking image data as an example, when the sample data is an image and the image includes a person and a dog, the image includes semantic features about the person and semantic features about the pet, that is, the image may correspond to image data in a "person" category or may correspond to image data in a "pet" category.
Illustratively, the semantic features corresponding to the sample data are extracted through a preset feature classification model. The feature classification model comprises a plurality of classification categories, and each classification category corresponds to a classification semantic.
Step 302, clustering the sample data set based on the semantic features to obtain candidate clustering results corresponding to the sample data set.
And the candidate clustering result is used for aggregating the sample data with the semantic association relationship in the sample data set to the same cluster.
In some embodiments, the sample data is clustered according to the semantic features, and the manner of selecting the clusters includes at least one of the following:
1. the average clustering method: the sample data is clustered into clusters of equal size, so that the number of sample data in each cluster is equal and the sample data in the same cluster have the same or similar semantic features;
2. clustering based on cluster centers: n cluster centers are selected (n is configurable) by determining the spatial distribution corresponding to the sample data; the distance between each sample data and the cluster centers is calculated, the cluster center closest to the sample data is selected (or a distance threshold is set, and a cluster center whose distance to the sample data meets the threshold is selected), and the sample data is aggregated into that cluster.
It should be noted that the above-mentioned cluster selection manner for the sample data is only an illustrative example, and the specific cluster selection manner for the sample data in this embodiment is not limited.
Illustratively, clustering can be implemented with the K-means clustering method, spectral clustering, mean-shift clustering, and the like, which are not limited in the embodiments of the present application; a K-means sketch is given below.
Optionally, one sample data corresponds to one cluster, or one sample data corresponds to a plurality of clusters, which is not limited herein; the clusters corresponding to the sample data are taken as the candidate clustering result corresponding to the sample data set.
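As a minimal sketch of this step, the following uses scikit-learn's KMeans on the semantic features; the feature matrix here is random stand-in data, and n = 100 is chosen to match the 100 target clusters used later in this embodiment.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the semantic Embedding vectors of the sample data set
# (random here; in the method they come from the feature classification model).
semantic_features = np.random.rand(10000, 2048).astype(np.float32)

n_clusters = 100  # n, the configurable number of cluster centers
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = kmeans.fit_predict(semantic_features)  # cluster id for each sample
centers = kmeans.cluster_centers_               # n x 2048 cluster centers
```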
Step 303, extracting data features of the sample data through the data classification model.
The data characteristics are used for indicating the data element distribution characteristics corresponding to the sample data.
In some embodiments, when the sample data is image data, the data element of the sample data is a pixel point corresponding to the image data, or when the sample data is voice data, the data element of the sample data is a voice frame corresponding to the voice data, or when the sample data is video data, the data element of the sample data is a video frame corresponding to the video data (or is a corresponding pixel point in the video frame), which is not limited.
Schematically, the data classification model is a model to be trained: its model parameters are set to a to-be-learned state and are adjusted during the training process to achieve the training objective.
Illustratively, the data classification model is a model for classifying and retrieving the input sample data, and the manner of extracting the data features of the sample data by the data classification model includes at least one of the following manners:
1. vector analysis is performed on the sample data through the data classification model to obtain the feature vectors corresponding to the sample data, and the data features of the sample data are obtained based on the vector distribution of those feature vectors;
2. element analysis is performed on the sample data through the data classification model to obtain the positions in space of the element points corresponding to the sample data, and the data features of the sample data are determined based on the spatial distribution of those element points.
It should be noted that the above-mentioned data feature extraction method of the sample data is merely an illustrative example, and the specific extraction method of the data feature in the present embodiment is not limited.
Optionally, each sample data corresponds to a respective data feature, where each data feature includes the same feature or a similar feature, or each data feature is a different feature, which is not limited.
In some embodiments, the sample data corresponds to one or more data elements, and the data element distribution feature is used to indicate a distribution feature of the data element corresponding to the sample data on a plane, or a distribution feature in space, which is not limited herein, and the data element distribution feature is used to indicate a distribution commonality of the data element corresponding to the sample data.
And step 304, determining a center representation corresponding to the target cluster in the candidate cluster result based on the data characteristics.
The target cluster is a cluster in the candidate clustering result into which sample data is aggregated.
Optionally, center characterizations corresponding to each cluster in the candidate cluster result are sequentially determined based on the data features corresponding to the sample data.
In some embodiments, for the same target cluster, the center characterization corresponding to the target cluster is determined based on the data features of the sample data, wherein the center characterization is a feature vector and one target cluster corresponds to one center characterization. The manner of determining the center characterization includes at least one of the following (a sketch of manner 1 follows the note below):
1. summing the data features of the sample data in the same target cluster and calculating the average value, and taking the mean value of the data features as the center characterization corresponding to the target cluster;
2. determining the semantic features corresponding to each target cluster, selecting from the target cluster the sample data whose semantic features are the same as, or most similar to, the semantic features corresponding to the target cluster, and taking its data features as the center characterization corresponding to the target cluster;
3. acquiring the data features corresponding to each sample data in the target cluster, obtaining by analysis the distribution commonality of the data element distribution features corresponding to each sample data, generating a commonality vector based on the distribution commonality, and taking the commonality vector as the center characterization corresponding to the target cluster.
It should be noted that the above determination manner of the center characterization is merely an illustrative example, and the specific determination manner of the center characterization in the present embodiment is not limited.
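A minimal sketch of determination manner 1 above (the per-cluster mean of data features); the array names are illustrative.

```python
import numpy as np

def centers_by_mean(data_features, labels, n_clusters):
    """Center characterization per target cluster: the mean of the data
    features of the sample data aggregated into that cluster."""
    dim = data_features.shape[1]
    centers = np.zeros((n_clusters, dim), dtype=data_features.dtype)
    for c in range(n_clusters):
        members = data_features[labels == c]
        if len(members) > 0:                 # leave empty clusters at zero
            centers[c] = members.mean(axis=0)
    return centers                           # one center token per cluster
```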
And 305, performing iterative training on the data classification model based on the center characterization to obtain a target classification model.
The target classification model is used for classifying and searching the input data.
In some embodiments, performing iterative training on the data classification model based on the center characterization means taking the center characterization as a model parameter of the data classification model, so that each round of model training adjusts the center characterization, that is, adjusts the model parameters of the data classification model. Schematically, adjusting the center characterization includes adjusting its feature parameters or adjusting its weight values, which is not limited.
In summary, in the method provided by this embodiment, during the training of the target classification model, the candidate clustering result corresponding to the sample data is obtained based on the semantic features of the sample data, the data features of the sample data are extracted through the data classification model, the center characterization corresponding to a target cluster in the candidate clustering result is determined based on the data features, and the center characterization is used as a training parameter for iteratively training the data classification model, finally obtaining the target classification model used for classification retrieval. Obtaining the center characterization of each target cluster from the semantic features and iteratively training the data classification model on it builds the semantic information into the model during training, thereby improving the accuracy of classification retrieval and the relevance of the retrieval results, and improving the efficiency of classification retrieval.
In an alternative embodiment, the candidate clustering result is determined by cluster centers. Referring schematically to FIG. 4, which shows a flowchart of a data classification method provided by another exemplary embodiment of the present application, the method may be executed by a server or a terminal, or by the server and the terminal together; in the embodiment of the present application, execution by the server is taken as an example. As shown in FIG. 4, the method includes the following steps:
In step 401, semantic features of sample data in a sample data set are extracted.
The semantic features are used for indicating classification semantics corresponding to the sample data.
In some embodiments, the classification extraction of semantic features is performed on sample data in a sample data set by employing a feature classification model.
Optionally, the feature classification model includes an embedding layer (Embedding) and a semantic classification layer, wherein the embedding layer includes embedding layer parameters and the semantic classification layer includes classification parameters.
Optionally, the feature classification model further includes a pre-trained base module, where the base module is configured to perform feature extraction on the input sample data, so as to perform embedding processing on the extracted features through an embedding layer and perform semantic classification through a classification layer.
Here, extraction of the semantic features of the sample data through a ResNet-101 network structure is described as an example; the feature classification model is formed by the structures of Table One and Table Two below.
Illustratively, the Embedding layer extracts the basic feature parameters using a ResNet-101 network, as shown in Table One below (the standard ResNet-101 configuration for a 224×224 input).

Table One

Layer name Output size Layer type
Conv1 112×112 7×7, 64, stride 2
Conv2_x 56×56 3×3 max pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] × 3 blocks
Conv3_x 28×28 [1×1, 128; 3×3, 128; 1×1, 512] × 4 blocks
Conv4_x 14×14 [1×1, 256; 3×3, 256; 1×1, 1024] × 23 blocks
Conv5_x 7×7 [1×1, 512; 3×3, 512; 1×1, 2048] × 3 blocks

Conv1, Conv2_x, and so on are the different network layers in the network structure, and the input of each network layer is the output of the previous network layer; the network is trained by learning a residual function in the ResNet-101 feature network. The residual module (block) structure of each network layer is shown in FIG. 5, which is a schematic diagram of the ResNet-101 residual module structure provided by an exemplary embodiment of the present application. As shown in FIG. 5, the input is a 256-dimensional parameter vector: the first layer structure 501 (a 1×1 convolutional layer, 64 dimensions) is used for down-sampling, the second layer structure 502 (a 3×3 convolutional layer, 64 dimensions) is used for channel convolution, and the third layer structure 503 (a 1×1 convolutional layer, 256 dimensions) is used for up-sampling, so that the number of parameters is first reduced for convenient convolution and the number of output parameters is then restored to be consistent with the number of input parameters. The parameters in 501 to 503 can be adjusted according to the parameters of each network layer, and the number of residual blocks stacked differs from network layer to network layer (for ResNet-101, Conv2_x to Conv5_x stack 3, 4, 23, and 3 blocks, respectively), so that each network layer is formed by stacking its corresponding number of residual blocks.
The semantic classification layer is shown in Table Two below:

Table Two

Layer name Output size Layer type
Pool 1×2048 Max Pool
Fc 1×10000 Fc
The sample data is input into the ResNet-101 network structure of Table One for convolution and pooling, after which the Embedding parameters of the feature extraction model (that is, the result output through the Conv5_x network layer in Table One) are obtained. The Embedding parameter of the feature extraction model in Table Two is a 1×2048-dimensional vector. This vector is used to train the label classification in the semantic classification layer (that is, the 1×10000-dimensional vector in Table Two, where 10000 is the configurable number of label categories); the classification semantics corresponding to the sample data are determined based on the classification information corresponding to the labels, finally obtaining the semantic features corresponding to the sample data, that is, the semantic Embedding of the sample data.
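The structure of Tables One and Two can be sketched in PyTorch as follows. This is an assumed reconstruction (torchvision's stock ResNet-101 with a max-pool head), not necessarily the exact implementation of this embodiment.

```python
import torch.nn as nn
import torchvision.models as models

class FeatureClassificationModel(nn.Module):
    """Sketch of Tables One/Two: ResNet-101 base up to Conv5_x, max pooling
    to a 1x2048 semantic Embedding, and a 10000-class semantic classifier."""
    def __init__(self, num_label_classes=10000):  # configurable class count
        super().__init__()
        backbone = models.resnet101(weights=None)
        self.base = nn.Sequential(*list(backbone.children())[:-2])  # to Conv5_x
        self.pool = nn.AdaptiveMaxPool2d(1)       # Pool: 1x2048
        self.fc = nn.Linear(2048, num_label_classes)

    def forward(self, x):
        feat = self.pool(self.base(x)).flatten(1)  # semantic Embedding (1x2048)
        return feat, self.fc(feat)                 # features, 1x10000 logits
```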
Step 402, determining n cluster centers corresponding to the sample data set based on semantic features of the sample data, wherein n is a positive integer.
The number of clusters is determined according to actual needs; for example, when the number of target clusters is n, there are n corresponding cluster centers.
In some embodiments, the semantic features of each sample data correspond to distribution positions in a two-dimensional plane or a three-dimensional space, and a plane or space division result of the sample data set is obtained based on the distribution positions corresponding to the semantic features of each sample data. The n cluster centers are determined according to the plane or space division result corresponding to the sample features, where the manner of determination includes random determination or determination according to the distribution of the sample features (for example, placing cluster centers where semantic features with the same or similar distribution are dense).
Step 403, determining a cluster center to which the ith sample data belongs based on the semantic features of the ith sample data, where i is a positive integer.
In some embodiments, determining feature distances respectively corresponding to semantic features of the ith sample data and the n cluster centers; and taking the cluster center with the characteristic distance meeting the distance requirement as the cluster center to which the ith sample data belongs.
Optionally, after the n cluster centers are determined in step 402, the feature distance between each sample data and each cluster center is calculated. The calculation methods include the L1 distance, the L2 distance, the Chebyshev distance, and the like, which are not limited herein; in this embodiment, the L2 distance is taken as an example, and its calculation is shown in Formula One:

Formula One: $d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$

where x and y correspond to the semantic features of the sample data and a cluster center, respectively; the feature distance between the semantic features of a sample data and a cluster center is determined by summing the squared differences of their components x_i and y_i.
Optionally, the feature distance between each sample data and each cluster center is determined based on the L2 distance; a cluster center is selected for each sample data, and the sample data is aggregated into the cluster corresponding to that center. The manner of selecting the cluster includes at least one of the following:
1. selecting the cluster center closest to the semantic features of the sample data as the cluster center of the cluster to which the sample data belongs;
2. setting a distance threshold, and selecting a cluster center as the cluster center of the cluster to which the sample data belongs when the feature distance between that cluster center and the semantic features of the sample data meets the distance threshold.
It should be noted that the above manner of selecting the cluster is merely an illustrative example, and the specific selection manner is not limited in this embodiment.
Illustratively, one sample data corresponds to one or more cluster centers, which is not limited herein; a sketch of this assignment step follows.
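A minimal sketch combining Formula One with the two selection modes above; the names and the optional threshold are illustrative.

```python
import numpy as np

def assign_to_clusters(semantic_features, centers, dist_threshold=None):
    """Assign each sample to cluster centers by L2 distance (Formula One).
    Without a threshold, mode 1 applies (nearest center only); with a
    threshold, mode 2 can attach one sample to several clusters."""
    assignments = []
    for x in semantic_features:
        d = np.sqrt(((centers - x) ** 2).sum(axis=1))  # distances to n centers
        if dist_threshold is None:
            assignments.append([int(np.argmin(d))])
        else:
            assignments.append(list(np.where(d <= dist_threshold)[0]))
    return assignments  # list of cluster ids per sample
```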
And step 404, obtaining candidate clustering results corresponding to the sample data set based on the clustering centers respectively corresponding to the sample data.
The cluster corresponding to each sample data is determined according to the cluster center to which that sample data belongs, and the resulting clusters are taken as the candidate clustering result corresponding to the sample data set.
At step 405, data features of the sample data are extracted by the data classification model.
The data characteristics are used for indicating the data element distribution characteristics corresponding to the sample data.
The process of extracting data features of the sample data in step 405 is described in detail in the following embodiments.
And step 406, determining a center representation corresponding to the target cluster in the candidate cluster result based on the data characteristics.
In some embodiments, data features of sample data in a target cluster are acquired; and determining a mean value result corresponding to the data characteristics of the sample data in the target cluster, and taking the mean value result as a center representation corresponding to the target cluster.
Optionally, for the sample data in the same target cluster, the data features of the sample data are averaged, and the obtained mean result is taken as the center characterization corresponding to that target cluster; the center characterizations of different target clusters may be the same or different, which is not limited.
And step 407, performing iterative training on the data classification model based on the center characterization to obtain a target classification model.
The target classification model is used for classifying and searching the input data.
The process of iteratively training the data classification model in step 407 is described in detail in subsequent embodiments.
In summary, in the method provided by this embodiment, during the training of the target classification model, the candidate clustering result corresponding to the sample data is obtained based on the semantic features of the sample data, the data features of the sample data are extracted through the data classification model, the center characterization corresponding to a target cluster in the candidate clustering result is determined based on the data features, and the center characterization is used as a training parameter for iteratively training the data classification model, finally obtaining the target classification model used for classification retrieval. Obtaining the center characterization of each target cluster from the semantic features and iteratively training the data classification model on it builds the semantic information into the model during training, thereby improving the accuracy of classification retrieval and the relevance of the retrieval results, and improving the efficiency of classification retrieval.
In this embodiment, the semantic features of the sample data are extracted with a feature classification model, so that each sample data corresponds to classification semantics; this provides a classification basis for the subsequent iterative training of the data classification model, and sample data containing classification semantics also helps to obtain the corresponding classification categories faster during application, for data retrieval. Moreover, clustering the sample data set avoids manually labeling each sample data, and the clustering determines the center characterization corresponding to each cluster (that is, the classification category information corresponding to the cluster).
In an alternative embodiment, the target classification model is obtained by adjusting the parameters of the data classification model with the prediction loss during training. Referring schematically to FIG. 6, which shows a flowchart of a data classification method provided by another exemplary embodiment of the present application, the method may be executed by a server or a terminal, or by the server and the terminal together; in the embodiment of the present application, execution by the server is taken as an example. As shown in FIG. 6, the method includes the following steps:
In step 601, semantic features of sample data in a sample data set are extracted.
The semantic features are used for indicating classification semantics corresponding to the sample data.
The process of extracting the semantic features in step 601 is already described in detail in step 401 above, and will not be described here again.
Step 602, clustering the sample data set based on the semantic features to obtain candidate clustering results corresponding to the sample data set.
And the candidate clustering result is used for aggregating the sample data with the semantic association relationship in the sample data set to the same cluster.
The process of obtaining the candidate clustering result in step 602 is described in detail in step 302 and step 402, which are not described herein.
In step 603, data characteristics of the sample data are extracted through the data classification model.
The data characteristics are used for indicating the data element distribution characteristics corresponding to the sample data.
Optionally, the data classification model includes an embedding layer (Embedding), a feature extraction layer and a class classification layer, where the embedding layer includes embedding parameters, the feature extraction layer includes feature parameters, and the class classification layer includes class classification parameters.
In some embodiments, the model parameters of the data classification model are first initialized (with the center characterization in the above embodiments as the initialization parameter) and set to the to-be-learned state, and a learning rate is set so that the model parameters can be adjusted during training. In this embodiment, since the task is to train the data classification model by adjusting its model parameters, relatively small learning rates are used: the embedding layer and the feature extraction layer adopt a learning rate of lr1 = 0.0005, and the class classification layer adopts a learning rate of lr2 = 0.005. The classification layer learns the classes and is prone to overfitting, which would affect the classification retrieval effect of the target classification model; the two modules therefore adopt different learning rates, so that in each learning step the gradient returned by the classification has less influence on the embedding layer of the target classification model than on the classification layer.
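In PyTorch, this two-learning-rate scheme would correspond to optimizer parameter groups, sketched below under the assumption that the model exposes its embedding/feature-extraction layers and its class classification layer as submodules (a matching module sketch follows Table Three below); the momentum value is an assumption.

```python
import torch

# `model` is assumed to expose `base`/`embedding` (embedding and feature
# extraction layers) and `fc_class` (class classification layer) submodules.
optimizer = torch.optim.SGD(
    [
        {"params": model.base.parameters(), "lr": 0.0005},       # lr1
        {"params": model.embedding.parameters(), "lr": 0.0005},  # lr1
        {"params": model.fc_class.parameters(), "lr": 0.005},    # lr2
    ],
    lr=0.0005,      # default, overridden per group above
    momentum=0.9,   # assumed; not specified in this embodiment
)
```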
Illustratively, the Embedding layer extracts the basic feature parameters using a ResNet-101 network as shown in Table One.
The process of extracting the basic feature parameters in this embodiment has been described in detail in step 401 above, and is not repeated here.
The feature extraction branch of the Embedding layer is shown in Table Three below.

Table Three

Layer name Output size Layer type
Pool 1×2048 Max Pool
Embedding 1×128 Fc
Fc_class 1×100 Fc
The Pool layer in Table Three carries the Embedding parameter of the feature extraction model (used to obtain the semantic Embedding corresponding to the sample data), while the Embedding layer carries the Embedding parameters of the data classification model, through which the data Embedding of each sample data is obtained. This data Embedding is the feature used for retrieval and is also the feature used for classification; the 1×128 feature vector is the data feature corresponding to the sample data. A sketch of this branch follows.
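A sketch of the Table Three branch, assuming the same ResNet-101 base as the feature classification model; initializing the class classification layer from the 100×128 center characterization follows step 604 below.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DataClassificationModel(nn.Module):
    """Sketch of Table Three: the 1x2048 pooled feature is mapped to a 1x128
    data Embedding (the retrieval feature) and a 1x100 cluster classifier
    whose weights are initialized from the 100x128 center characterization."""
    def __init__(self, center_tokens):  # center_tokens: (100, 128) tensor
        super().__init__()
        backbone = models.resnet101(weights=None)
        self.base = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveMaxPool2d(1)     # Pool: 1x2048
        self.embedding = nn.Linear(2048, 128)   # Embedding: 1x128
        self.fc_class = nn.Linear(128, center_tokens.shape[0], bias=False)
        with torch.no_grad():                   # center tokens as init params
            self.fc_class.weight.copy_(center_tokens)

    def forward(self, x):
        emb = self.embedding(self.pool(self.base(x)).flatten(1))
        return emb, self.fc_class(emb)          # data Embedding, class logits
```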
Step 604, determining a center representation corresponding to the target cluster in the candidate cluster result based on the data features.
Optionally, since the data feature corresponding to each sample data is a 1×128-dimensional feature vector, the data features of the sample data in the same target cluster are averaged, and the mean result over the cluster is taken as the center characterization corresponding to that target cluster; that is, the center characterization of a target cluster is also a 1×128-dimensional feature vector. If the candidate clustering result includes 100 target clusters, the center characterizations corresponding to the candidate clustering result form a 100×128-dimensional matrix, and this matrix is used as the initialization parameter of the class classification layer (Fc_class) in Table Three.
Step 605, determining a prediction loss corresponding to the data classification model based on the center characterization and the data features.
In some embodiments, the predictive loss includes a classification loss, i.e., determining a cluster to which the data feature belongs; acquiring a clustering center of a cluster to which the data features belong; based on the difference between the data characteristics and the clustering center, determining the classification loss corresponding to the data classification model.
Optionally, the cluster to which the data features of the sample data belong is determined according to the candidate clustering result, and the cluster center closest to the semantic features of the sample data is selected, by the L2 distance of Formula One, as the cluster center of the cluster to which the sample data belongs.
Optionally, the classification loss is determined by the difference between the cluster to which the data features belong and the cluster center corresponding to that cluster, and is expressed by a cross-entropy function, as shown in Formula Two:

Formula Two: $L_c = -\sum_{i=1}^{N} y_i \log(p_i)$

where y_i corresponds to the cluster to which the data feature belongs, p_i corresponds to the cluster center of the cluster to which the data feature belongs, and N is the number of clusters; the cluster and the cluster center are compared through the cross-entropy function to determine the classification loss.
In some embodiments, the prediction loss further includes a triplet loss, that is, a sample triplet corresponding to the sample data is constructed, the sample triplet includes anchor point data, positive sample data and negative sample data, the similarity between the anchor point data and the positive sample data meets a similarity condition, and the similarity between the anchor point data and the negative sample data does not meet the similarity condition; and determining the triplet loss corresponding to the data classification model based on the clusters of the anchor point data, the positive sample data and the negative sample data in the sample triplet.
In some embodiments, the sample triplets are constructed from sample pairs labeled in the sample data set, each triplet including anchor data (anchor), positive sample data (positive), and negative sample data (negative), where the anchor data and the positive sample data form a pair of similar samples and the anchor data and the negative sample data form a pair of dissimilar samples.
Optionally, the similarity of the anchor data and the positive sample data is greater than (or equal to) a similarity threshold, and the similarity between the anchor data and the negative sample data is less than (or equal to) the similarity threshold. In the embodiment of the application, the model training is performed by taking the fact that the anchor point data and the positive sample data are in the same cluster and the fact that the anchor point data and the negative sample data are in different clusters as targets, so that the similarity between the sample data is used as one condition of the clusters, and the condition that similar samples are separated into different clusters is avoided.
In this embodiment, two sample data are sampled without replacement from the sample data set each time, and the similarity between the two is calculated; if the two sample data are consistent (or, with a similarity threshold set, when their similarity reaches the threshold), the two sample data are labeled as a positive sample pair. Based on the positive sample pairs, negative sample data also needs to be mined to obtain the sample triplets. For each positive sample pair, the distances between the positive pair and the sample data pairs formed by the remaining sample data in the sample data set are calculated, the distance results are sorted from small to large, and the first 20 (configurable) samples in the sorting are selected as negative sample data to form sample triplets. For m positive sample pairs there are therefore 20 × m sample triplets (m is a positive integer; the larger m is, that is, the more positive sample pairs there are, the better the generalization of the model is ensured).
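One plausible reading of this mining procedure is sketched below; measuring the candidate-negative distances from the anchor of each positive pair is an assumption, since the source does not pin down the exact distance reference.

```python
import numpy as np

def mine_triplets(features, pos_pairs, top_k=20):
    """For each labeled positive pair (a, p), take the top_k nearest other
    samples as negatives, giving top_k triplets per positive pair."""
    triplets = []
    for a, p in pos_pairs:
        others = np.array([i for i in range(len(features)) if i not in (a, p)])
        d = np.linalg.norm(features[others] - features[a], axis=1)
        for n in others[np.argsort(d)[:top_k]]:
            triplets.append((a, p, int(n)))  # (anchor, positive, negative)
    return triplets                          # 20 * m triplets for m pairs
```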
Optionally, the triplet loss is determined by the distance difference within the sample triplet: a first characteristic distance between the anchor point data and the positive sample data is determined, a second characteristic distance between the anchor point data and the negative sample data is determined, and the triplet loss corresponding to the data classification model is determined based on the difference between the first characteristic distance and the second characteristic distance.
For illustration, the triplet loss is calculated as shown in formula three:
Formula 3: L_tri = max(||x_a - x_p|| - ||x_a - x_n|| + α, 0)
wherein x_a denotes the anchor point data, x_p the positive sample data, x_n the negative sample data, and α an edge threshold (set to 6 in this embodiment); ||x_a - x_p|| is the first characteristic distance between the data feature of the anchor point data and that of the positive sample data, ||x_a - x_n|| is the second characteristic distance between the data feature of the anchor point data and that of the negative sample data, and the clipped difference is the triplet loss corresponding to the data classification model.
The purpose of the triplet loss is to ensure that the second characteristic distance exceeds the first characteristic distance by at least the edge threshold, thereby preserving the similarity relations between sample data.
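A direct sketch of formula three, batch-averaged, with alpha = 6 following the edge threshold given above:

    import torch

    def triplet_loss(x_a, x_p, x_n, alpha=6.0):
        # x_a, x_p, x_n: (B, D) data features of anchor, positive and negative samples
        d_ap = torch.norm(x_a - x_p, dim=1)  # first characteristic distance
        d_an = torch.norm(x_a - x_n, dim=1)  # second characteristic distance
        return torch.clamp(d_ap - d_an + alpha, min=0.0).mean()  # formula three, averaged over the batch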
In step 606, the model parameters of the data classification model are adjusted based on the prediction loss to obtain a target classification model.
In some embodiments, a first weight corresponding to the triplet loss is determined, and the product of the first weight and the triplet loss is taken as a first sub-loss; a second weight corresponding to the classification loss is determined, and the product of the second weight and the classification loss is taken as a second sub-loss; the sum of the first sub-loss and the second sub-loss is taken as the prediction loss; and gradient adjustment is performed on the model parameters of the data classification model based on the prediction loss to obtain the target classification model.
For a schematic way of determining the prediction loss, please refer to equation four:
equation four: l total=w1Lt+w2Lc
Wherein, the predicted loss is L total,Lt and is a triplet loss, L c is a classification loss, w 1 is a first weight, w 2 is a second weight, the first weight and the second weight are preset values (in this embodiment, w 1=1,w2 =0.2) and the first weight and the second weight can be adjusted according to the model training requirement.
During the updating of the model parameters, the sample data is iterated over in batches. Illustratively, with N sample triplets in total and 128 triplets per batch, the network processes N/128 batches (each involving forward prediction, backward gradient calculation and network parameter updating); when all batches of the whole sample set have been processed, one stage is completed. The model is trained for M stages in total before iteration stops (M is a preset value, e.g., 10). The network parameters are updated by Stochastic Gradient Descent (SGD): gradients of the prediction loss yield the update values of the model parameters, the parameters of the data classification network are updated, and the target classification model is finally obtained. As shown in table three, the triplet loss is used to adjust the parameters of the Embedding layer in the data classification model, which is trained to extract the data features of the sample data, while the classification loss is used to adjust the parameters of the semantic classification layer, i.e., the weights of the center characterizations.
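Putting the pieces together, the following is a sketch of the update loop described here; it reuses the triplet_loss sketch above, and the model, loader and batch layout are illustrative assumptions rather than the patent's exact implementation:

    import torch
    import torch.nn.functional as F

    def train(model, centers, loader, stages=10, w1=1.0, w2=0.2):
        # model: basic feature module + Embedding layer
        # centers: (N, D) trainable center characterizations (requires_grad=True)
        # loader: yields batches of 128 triplets as (x_a, x_p, x_n, cluster_ids)
        opt = torch.optim.SGD(list(model.parameters()) + [centers], lr=0.01)
        for stage in range(stages):                    # M stages in total
            for x_a, x_p, x_n, cluster_ids in loader:  # N/128 batches per stage
                f_a, f_p, f_n = model(x_a), model(x_p), model(x_n)     # forward prediction
                l_t = triplet_loss(f_a, f_p, f_n)                      # adjusts the Embedding layer
                l_c = F.cross_entropy(f_a @ centers.t(), cluster_ids)  # adjusts the center weights
                loss = w1 * l_t + w2 * l_c             # formula four: L_total = w_1*L_t + w_2*L_c
                opt.zero_grad()
                loss.backward()                        # backward gradient calculation
                opt.step()                             # network parameter updating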
In step 607, the search data to be classified is input into the target classification model, and the result of classification search is output.
Optionally, in the application stage, the search data to be classified is input into the target classification model, and the target classification model extracts the data features of the search data. Based on the data features, the distances between the data features and the cluster centers of all clusters are calculated, and the classification result is determined from these distances: a distance threshold is set, and when the distance between the data features and a certain cluster center reaches the threshold, the search data is considered to belong to that center's cluster (the data features may belong to one or more clusters). The sample data under the cluster(s) to which the data features belong is then obtained, the distance between each such sample and the search data is calculated with the L2 distance formula, and the corresponding sample data are selected as the classification search result according to the calculation results (for example, sorting the samples by distance from small to large and selecting them in order, or selecting the top-K samples; this is not limited here).
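A sketch of this application-stage flow, assuming the query's data feature has already been produced by the target classification model and the indexed samples' features and cluster ids were precomputed; the threshold comparison direction and all names are illustrative assumptions:

    import torch

    def classify_and_search(query_feat, centers, bank_feats, bank_ids, dist_thresh, topk=10):
        # query_feat: (D,) data feature of the search data
        # bank_feats: (M, D) features of the indexed samples; bank_ids: (M,) their cluster ids
        d_centers = torch.norm(centers - query_feat, dim=1)    # distance to every cluster center
        hit = (d_centers <= dist_thresh).nonzero().flatten()   # clusters the query belongs to
        cand = torch.isin(bank_ids, hit).nonzero().flatten()   # samples under those clusters
        d = torch.norm(bank_feats[cand] - query_feat, dim=1)   # L2 distance to each candidate
        order = torch.argsort(d)[:topk]                        # sort small to large, keep top-K
        return cand[order]                                     # indices forming the classification search result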
In summary, in the method provided by this embodiment, during the training of the target classification model, a candidate clustering result for the sample data is obtained based on the semantic features of the sample data; the data features of the sample data are extracted by the data classification model; the center characterizations of the target clusters in the candidate clustering result are determined based on the data features; and the data classification model is iteratively trained with the center characterizations as training parameters, finally yielding the target classification model used for classification retrieval. Because the center characterizations of the target clusters are obtained from the semantic features, semantic information is incorporated into the model during training, which improves the accuracy of classification retrieval, the relevance of the retrieval results and the efficiency of classification retrieval.
In this embodiment, the model parameters of the data classification model are gradient-adjusted by determining the triplet loss and the classification loss, and the Pool layer of the feature extraction model is added into the data classification model so that its parameter adjustment assists the training of the data classification model, allowing the Embedding layer and the Pool layer to be jointly and iteratively optimized.
In an alternative embodiment, the training process of the data classification model further includes a first-stage training of the data classification model. The data classification model adopts the structure formed by table one and table three (shown in the above embodiment); the first-stage training does not involve the center characterizations, so the fc_class layer does not need to be trained: only the sample triplets are required, and the model parameters of the data classification model are gradient-adjusted through the triplet loss alone. The Embedding layer is initialized with normally distributed random numbers with mean 0 and variance 0.1. During this training, all network layers are trained and the model parameters of the data classification model are set to the to-be-learned state, with a learning rate of 0.01 and the SGD gradient update method; after 10 training epochs the learning rate is changed to 0.1, and the data is trained for 20 epochs in total. The rest of the first-stage training process is consistent with the training process described above and is not repeated here.
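A sketch of the stated initialization, assuming the Embedding layer is a fully connected layer; since the text specifies a variance of 0.1, the standard deviation is sqrt(0.1), and the zero bias is an added assumption not given in the text:

    import torch.nn as nn

    def init_embedding(embedding: nn.Linear):
        # normal distribution with mean 0 and variance 0.1, i.e. std = sqrt(0.1)
        nn.init.normal_(embedding.weight, mean=0.0, std=0.1 ** 0.5)
        if embedding.bias is not None:
            nn.init.zeros_(embedding.bias)  # assumption: bias starts at zero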
Referring schematically to fig. 7, which shows a flowchart of a data classification method according to an exemplary embodiment of the present application: sample data 701 is obtained; the semantic features of the sample data 701 are extracted through a feature extraction model 702, and the sample data 701 is clustered to obtain the candidate clustering result corresponding to the sample dataset; the data features of the sample data 701 are extracted through a data classification model 703, which includes a basic feature module 7031 and an Embedding layer 7032, the Embedding layer 7032 being used for metric learning, i.e., producing the Embedding features of the input sample data; the center characterization 704 of the target cluster in the candidate clustering result is determined based on the data features; and the data classification model 703 is iteratively trained based on the center characterization 704: the classification loss 705 is calculated through the center characterization 704, the triplet loss 706 is calculated through the data features, the two are combined into the prediction loss 707, and the model parameters of the data classification model 703 are gradient-adjusted accordingly, finally yielding the target classification model 708 used for classification retrieval of input data.
The scheme of the present application has the following advantages:
1. Classification initialization without explicitly designed semantics: sample data is classification-initialized in advance according to its spatial distribution through the feature extraction model, so that the initial classification both carries semantic association and matches the distribution of the training sample data;
2. Classification and metric learning are mutually compatible, and joint learning improves the classification application effect: the center characterizations are adjusted along with the metric learning process, and the gradient of the downstream classification task is propagated back to the Embedding layer, so that the feature extraction learned by the Embedding layer supports not only its own similarity metric but also the classification task as far as possible, without the classification task harming its convergence;
3. An adjustable classification method: because classification initialization and training depend on the semantic feature distribution of the sample data and on the constraint adjustment of the similarity metric space, under different services, when the sample data set changes or its distribution changes (for example, in cartoon search, the sample data set lies in the cartoon semantic space and search space), the scheme of the present application can re-establish the classification initialization along with the changed sample data and adjust the parameters in the subsequent learning.
The scheme in the embodiment of the application achieves a retrieval-supporting semantic classification effect without manually designed semantic categories: classification with unrelated semantics does not necessarily help similarity retrieval, whereas the jointly learned and adjusted center characterization weights and feature-extraction Embedding layers in this scheme assist each other in classification retrieval; different services can adjust the classification according to their own sample data, thereby meeting the retrieval requirements of different services.
In the embodiment of the application, besides ResNet-101, different network structures and different pre-training model weights can be used as the basic model, such as ResNet-50 or Inception; for retrieval over larger data volumes, a small network such as ResNet-18 can be adopted and the Embedding dimension can be reduced, e.g., to 64 dimensions.
Fig. 8 is a block diagram of a data classifying apparatus according to an exemplary embodiment of the present application, and as shown in fig. 8, the apparatus includes the following parts:
an extracting module 810, configured to extract semantic features of sample data in a sample data set, where the semantic features are used to indicate classification semantics corresponding to the sample data;
A clustering module 820, configured to cluster the sample data sets based on the semantic features, to obtain candidate clustering results corresponding to the sample data sets, where the candidate clustering results are used to aggregate sample data with semantic association relationships in the sample data sets to the same cluster;
the extracting module 810 is further configured to extract, by using a data classification model, data features of the sample data, where the data features are used to indicate data element distribution features corresponding to the sample data;
A determining module 830, configured to determine, based on the data features, a center representation corresponding to a target cluster in the candidate cluster results;
The training module 840 is configured to perform iterative training on the data classification model based on the central token, to obtain a target classification model, where the target classification model is used to perform classification search on input data.
In an optional embodiment, the determining module 830 is further configured to obtain a data feature of the sample data in the target cluster; and determining a mean value result corresponding to the data characteristics of the sample data in the target cluster, and taking the mean value result as a center representation corresponding to the target cluster.
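A sketch of the mean-based center characterization computed by the determining module (all names illustrative):

    import torch

    def center_characterizations(features, cluster_ids, n_clusters):
        # features: (M, D) data features of all sample data; cluster_ids: (M,) cluster of each sample
        centers = torch.zeros(n_clusters, features.size(1))
        for c in range(n_clusters):
            members = features[cluster_ids == c]  # data features of the samples in target cluster c
            if len(members) > 0:
                centers[c] = members.mean(dim=0)  # the mean result is the center characterization
        return centers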
In an alternative embodiment, the clustering module 820, as shown in fig. 9, includes:
A determining unit 821, configured to determine n cluster centers corresponding to the sample data set based on semantic features of the sample data, where n is a positive integer;
the determining unit 821 is further configured to determine a cluster center to which the ith sample data belongs based on semantic features of the ith sample data, where i is a positive integer;
And an obtaining unit 822, configured to obtain candidate clustering results corresponding to the sample data sets based on the clustering centers respectively corresponding to the sample data sets.
In an optional embodiment, the determining unit 821 is further configured to determine the feature distances between the semantic features of the ith sample data and each of the n cluster centers, and to take the cluster center whose feature distance meets the distance requirement as the cluster center to which the ith sample data belongs.
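A sketch of the assignment step performed by the determining unit, reading "meets the distance requirement" as "smallest L2 distance", an assumption consistent with the L2 usage above:

    import torch

    def assign_clusters(semantic_features, centers):
        # semantic_features: (M, D) semantic features; centers: (n, D) cluster centers
        d = torch.cdist(semantic_features, centers)  # (M, n) L2 feature distances to the n centers
        return d.argmin(dim=1)                       # index of the nearest cluster center per sample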
In an alternative embodiment, the training module 840 is further configured to determine a prediction loss corresponding to the data classification model based on the center characterization and the data features; and adjusting model parameters of the data classification model based on the prediction loss to obtain the target classification model.
In an alternative embodiment, the predictive loss includes a classification loss;
The training module 840 is further configured to determine a cluster to which the data feature belongs; acquiring a clustering center of a cluster to which the data feature belongs; and determining the classification loss corresponding to the data classification model based on the difference between the data characteristics and the clustering center.
In an alternative embodiment, the predicted loss includes a triplet loss;
The training module 840 is further configured to construct a sample triplet corresponding to the sample data, where the sample triplet includes anchor point data, positive sample data, and negative sample data, a similarity between the anchor point data and the positive sample data meets a similarity condition, and a similarity between the anchor point data and the negative sample data does not meet the similarity condition; and determining the triplet loss corresponding to the data classification model based on the clusters of the anchor point data, the positive sample data and the negative sample data in the sample triplet.
In an alternative embodiment, the training module 840 is further configured to determine a first feature distance of the anchor point data from the positive sample data, and determine a second feature distance of the anchor point data from the negative sample data; and determining the triplet loss corresponding to the data classification model based on the difference value of the first characteristic distance and the second characteristic distance.
In an alternative embodiment, the predicted loss includes a classification loss and a triplet loss;
The training module 840 is further configured to determine a first weight corresponding to the triplet loss; determine a product of the first weight and the triplet loss as a first sub-loss; determine a second weight corresponding to the classification loss; determine a product of the second weight and the classification loss as a second sub-loss; take the sum of the first sub-loss and the second sub-loss as the predicted loss; and perform gradient adjustment on the model parameters of the data classification model based on the predicted loss to obtain the target classification model.
In an alternative embodiment, the apparatus further comprises:
The input module 850 is configured to input the search data to be classified into the target classification model, and output a classification search result.
In summary, in the device provided by this embodiment, during the training of the target classification model, a candidate clustering result for the sample data is obtained based on the semantic features of the sample data; the data features of the sample data are extracted by the data classification model; the center characterizations of the target clusters in the candidate clustering result are determined based on the data features; and the data classification model is iteratively trained with the center characterizations as training parameters, finally yielding the target classification model used for classification retrieval. Because the center characterizations of the target clusters are obtained from the semantic features, semantic information is incorporated into the model during training, which improves the accuracy of classification retrieval, the relevance of the retrieval results and the efficiency of classification retrieval.
It should be noted that the data classification device provided by the above embodiment is illustrated only by the division into the functional modules described above; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data classification device and the data classification method provided by the foregoing embodiments belong to the same concept; the specific implementation process of the device is detailed in the method embodiments and is not repeated here.
Fig. 10 is a schematic diagram illustrating a structure of a server according to an exemplary embodiment of the present application. The server may be a server as shown in fig. 2.
Specifically, the server 1000 includes a central processing unit (Central Processing Unit, CPU) 1001, a system memory 1004 including a random access memory (Random Access Memory, RAM) 1002 and a read-only memory (Read Only Memory, ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The server 1000 also includes a mass storage device 1006 for storing an operating system 1013, application programs 1014 and other program modules 1015.
The mass storage device 1006 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1006 and its associated computer-readable media provide non-volatile storage for the server 1000. That is, the mass storage device 1006 may include a computer readable medium (not shown) such as a hard disk or compact disc read only memory (Compact Disc Read Only Memory, CD-ROM) drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read Only Memory, EEPROM), flash memory or other solid-state memory technology, CD-ROM, digital versatile disc (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 1004 and the mass storage device 1006 described above may be collectively referred to as memory.
According to various embodiments of the application, the server 1000 may also be operated through a remote computer connected via a network such as the Internet. That is, the server 1000 may be connected to the network 1012 through a network interface unit 1011 connected to the system bus 1005, or the network interface unit 1011 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, which are stored in the memory and configured to be executed by the CPU.
The embodiment of the application also provides a computer device which can be implemented as a terminal or a server as shown in fig. 2. The computer device includes a processor and a memory, where at least one instruction, at least one program, code set, or instruction set is stored, where at least one instruction, at least one program, code set, or instruction set is loaded and executed by the processor to implement the data classification method provided by the above method embodiments.
Embodiments of the present application also provide a computer readable storage medium having stored thereon at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the data classification method provided by the above method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the data classification method according to any one of the above embodiments.
Alternatively, the computer-readable storage medium may include: read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), solid-state drive (SSD, Solid State Drive), optical disc, etc. The random access memory may include resistive random access memory (ReRAM, Resistance Random Access Memory) and dynamic random access memory (DRAM, Dynamic Random Access Memory). The foregoing embodiment numbers of the present application are merely for description and do not indicate the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is merely a preferred embodiment of the present application and is not intended to limit the present application; any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (13)

1. A method of classifying data, the method comprising:
Extracting semantic features of sample data in a sample data set, wherein the semantic features are used for indicating classification semantics corresponding to the sample data;
Clustering the sample data sets based on the semantic features to obtain candidate clustering results corresponding to the sample data sets, wherein the candidate clustering results are used for aggregating sample data with semantic association relations in the sample data sets into the same cluster;
Extracting data characteristics of the sample data through a data classification model, wherein the data characteristics are used for indicating data element distribution characteristics corresponding to the sample data;
Acquiring data characteristics of sample data in a target cluster;
Determining a mean value result corresponding to the data characteristics of the sample data in the target cluster, and taking the mean value result as a center representation corresponding to the target cluster;
And carrying out iterative training on the data classification model based on the central characterization to obtain a target classification model, wherein the target classification model is used for classifying and searching the input data.
2. The method according to claim 1, wherein the clustering the sample data sets based on the semantic features to obtain candidate clustering results corresponding to the sample data sets includes:
Determining n clustering centers corresponding to the sample data set based on semantic features of the sample data, wherein n is a positive integer;
Determining a clustering center to which the ith sample data belongs based on semantic features of the ith sample data, wherein i is a positive integer;
and obtaining candidate clustering results corresponding to the sample data sets based on the clustering centers respectively corresponding to the sample data.
3. The method according to claim 2, wherein the determining a cluster center to which the ith sample data belongs based on the semantic features of the ith sample data includes:
Determining feature distances respectively corresponding to semantic features of the ith sample data and n clustering centers;
And taking the cluster center with the characteristic distance meeting the distance requirement as the cluster center to which the ith sample data belongs.
4. The method of claim 2, wherein the iteratively training the data classification model based on the center characterization to obtain a target classification model comprises:
determining a prediction loss corresponding to the data classification model based on the center characterization and the data features;
and adjusting model parameters of the data classification model based on the prediction loss to obtain the target classification model.
5. The method of claim 4, wherein the predicted loss comprises a classification loss;
the determining, based on the center characterization and the data feature, a prediction loss corresponding to the data classification model includes:
determining a cluster to which the data feature belongs;
acquiring a clustering center of a cluster to which the data feature belongs;
and determining the classification loss corresponding to the data classification model based on the difference between the data characteristics and the clustering center.
6. The method of claim 4, wherein the predicted loss comprises a triplet loss;
the determining, based on the center characterization and the data feature, a prediction loss corresponding to the data classification model includes:
Constructing a sample triplet corresponding to the sample data, wherein the sample triplet comprises anchor point data, positive sample data and negative sample data, the similarity between the anchor point data and the positive sample data accords with a similarity condition, and the similarity between the anchor point data and the negative sample data does not accord with the similarity condition;
And determining the triplet loss corresponding to the data classification model based on the clusters of the anchor point data, the positive sample data and the negative sample data in the sample triplet.
7. The method of claim 6, wherein the determining the triplet loss corresponding to the data classification model based on the clusters to which the anchor point data, the positive sample data, and the negative sample data belong in the sample triplet comprises:
determining a first characteristic distance between the anchor point data and the positive sample data, and determining a second characteristic distance between the anchor point data and the negative sample data;
And determining the triplet loss corresponding to the data classification model based on the difference value of the first characteristic distance and the second characteristic distance.
8. The method according to claim 5 or 6, wherein the predicted loss comprises a classification loss and a triplet loss;
the adjusting the model parameters of the data classification model based on the prediction loss to obtain the target classification model comprises the following steps:
Determining a first weight corresponding to the triplet loss;
Determining a product of the first weight and the triplet loss as a first sub-loss;
Determining a second weight corresponding to the classification loss;
Determining a product of the second weight and the classification loss as a second sub-loss;
Taking the sum of the first sub-loss and the second sub-loss as the predicted loss;
and carrying out gradient adjustment on model parameters of the data classification model based on the prediction loss to obtain the target classification model.
9. The method of claim 1, wherein the iteratively training the data classification model based on the center characterization further comprises, after obtaining a target classification model:
and inputting the search data to be classified into the target classification model, and outputting to obtain a classification search result.
10. A data sorting apparatus, the apparatus comprising:
The extraction module is used for extracting semantic features of sample data in the sample data set, wherein the semantic features are used for indicating classification semantics corresponding to the sample data;
the clustering module is used for clustering the sample data sets based on the semantic features to obtain candidate clustering results corresponding to the sample data sets, and the candidate clustering results are used for aggregating the sample data with semantic association relations in the sample data sets into the same cluster;
The extraction module is further used for extracting data features of the sample data through a data classification model, wherein the data features are used for indicating data element distribution features corresponding to the sample data;
The determining module is used for acquiring the data characteristics of the sample data in the target cluster; determining a mean value result corresponding to the data characteristics of the sample data in the target cluster, and taking the mean value result as a center representation corresponding to the target cluster;
The training module is used for carrying out iterative training on the data classification model based on the center characterization to obtain a target classification model, and the target classification model is used for carrying out classified retrieval on input data.
11. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set, the at least one instruction, at least one program, code set or instruction set being loaded and executed by the processor to implement the data classification method of any of claims 1 to 9.
12. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the data classification method of any of claims 1 to 9.
13. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the data classification method of any of claims 1 to 9.
CN202111232559.3A 2021-10-22 2021-10-22 Data classification method, apparatus, device, storage medium and computer program product Active CN114298122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111232559.3A CN114298122B (en) 2021-10-22 2021-10-22 Data classification method, apparatus, device, storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111232559.3A CN114298122B (en) 2021-10-22 2021-10-22 Data classification method, apparatus, device, storage medium and computer program product

Publications (2)

Publication Number Publication Date
CN114298122A CN114298122A (en) 2022-04-08
CN114298122B true CN114298122B (en) 2024-06-18

Family

ID=80964652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111232559.3A Active CN114298122B (en) 2021-10-22 2021-10-22 Data classification method, apparatus, device, storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN114298122B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913339B (en) * 2022-04-21 2023-12-05 北京百度网讯科技有限公司 Training method and device for feature map extraction model
CN115859128B (en) * 2023-02-23 2023-05-09 成都瑞安信信息安全技术有限公司 Analysis method and system based on interaction similarity of archive data
CN117334186B (en) * 2023-10-13 2024-04-30 北京智诚鹏展科技有限公司 Speech recognition method and NLP platform based on machine learning
CN117349344B (en) * 2023-10-23 2024-03-05 广州欧派创意家居设计有限公司 Intelligent product sales data acquisition method and system based on big data

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059807A (en) * 2019-04-26 2019-07-26 腾讯科技(深圳)有限公司 Image processing method, device and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569856B (en) * 2018-08-24 2020-07-21 阿里巴巴集团控股有限公司 Sample labeling method and device, and damage category identification method and device
CN111767737A (en) * 2019-05-30 2020-10-13 北京京东尚科信息技术有限公司 Text intention similarity determining method and device, electronic equipment and storage medium
CN111860674B (en) * 2020-07-28 2023-09-19 平安科技(深圳)有限公司 Sample category identification method, sample category identification device, computer equipment and storage medium
CN111767405B (en) * 2020-07-30 2023-12-08 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of text classification model
CN111950254B (en) * 2020-09-22 2023-07-25 北京百度网讯科技有限公司 Word feature extraction method, device and equipment for searching samples and storage medium
CN113392866A (en) * 2020-11-19 2021-09-14 腾讯科技(深圳)有限公司 Image processing method and device based on artificial intelligence and storage medium
CN113392179A (en) * 2020-12-21 2021-09-14 腾讯科技(深圳)有限公司 Text labeling method and device, electronic equipment and storage medium
CN113221905B (en) * 2021-05-18 2022-05-17 浙江大学 Semantic segmentation unsupervised domain adaptation method, device and system based on uniform clustering and storage medium
CN113360700B (en) * 2021-06-30 2023-09-29 北京百度网讯科技有限公司 Training of image-text retrieval model, image-text retrieval method, device, equipment and medium
CN113298197B (en) * 2021-07-28 2021-11-02 腾讯科技(深圳)有限公司 Data clustering method, device, equipment and readable storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059807A (en) * 2019-04-26 2019-07-26 腾讯科技(深圳)有限公司 Image processing method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Semantic detection method for Douyin short videos based on latent semantic analysis; Zhao Nan et al.; Science & Technology Information; 2020-02-03; Vol. 18, No. 04; pp. 9-10 *

Also Published As

Publication number Publication date
CN114298122A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN108132968B (en) Weak supervision learning method for associated semantic elements in web texts and images
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN111582409B (en) Training method of image tag classification network, image tag classification method and device
CN109993102B (en) Similar face retrieval method, device and storage medium
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
EP4002161A1 (en) Image retrieval method and apparatus, storage medium, and device
CN111667022A (en) User data processing method and device, computer equipment and storage medium
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
CN113821670B (en) Image retrieval method, device, equipment and computer readable storage medium
CN113177616B (en) Image classification method, device, equipment and storage medium
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN112395487A (en) Information recommendation method and device, computer-readable storage medium and electronic equipment
CN113569895A (en) Image processing model training method, processing method, device, equipment and medium
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN110457523B (en) Cover picture selection method, model training method, device and medium
CN114330514A (en) Data reconstruction method and system based on depth features and gradient information
CN114358109A (en) Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment
CN114330476A (en) Model training method for media content recognition and media content recognition method
CN114329004A (en) Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN114708449B (en) Similar video determination method, and training method and device of example characterization model
CN116958624A (en) Method, device, equipment, medium and program product for identifying appointed material

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40071964

Country of ref document: HK

GR01 Patent grant