WO2022127037A1

WO2022127037A1 - Data classification method and apparatus, and related device

Info

Publication number: WO2022127037A1
Application number: PCT/CN2021/096647
Authority: WO
Inventors: 张楠; 王健宗; 瞿晓阳
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-12-17
Filing date: 2021-05-28
Publication date: 2022-06-23
Also published as: CN112613550A

Abstract

The present application discloses a data classification method, said method comprising: acquiring training data, the training data comprising k categories, and k being a positive integer greater than 1; determining, by means of a WOA, vectors corresponding to n individuals, so as to obtain a first vector set; calculating, by using a target optimization function, a function value corresponding to each vector in the first vector set, so as to obtain an optimal vector; then performing a preset number of update operations on each individual by means of the WOA, and taking the finally obtained optimal vector as a clustering center; and finally completing the classification of voice data to be classified by means of the clustering centers of the k categories. According to the embodiments of the present application, the clustering centers of voice data of various categories can be acquired, and then the voice data to be classified is classified according to the clustering centers and is dispatched to corresponding persons, so that the same batch of annotation personnel only process data of one category as much as possible, improving the efficiency of data annotation, and thus reducing the time of the whole AI project.

Description

A data classification method, device and related equipment

This application claims priority to the Chinese patent application No. 202011503667.5, filed on December 17, 2020 and entitled "A data classification method, device and related equipment", the entire contents of which are incorporated by reference in this application.

Technical Field

The present application relates to the field of data processing, and in particular, to a data classification method, apparatus and related equipment.

Background

The data labeling platform is a very important part of the outbound robot project team. Every day, the actual outbound voice data of the robot will be transferred to the platform for verification and corresponding data labeling, and then sent back to the model for training.

The inventor realized that data labeling, which is the basis of artificial intelligence (AI) projects such as the one above, is usually done manually. High-quality data labeling is time-consuming and labor-intensive, and the processing of massive data consumes most of the time of an entire AI project. Moreover, massive data contains a large number of scenarios and types of data, so certain preprocessing needs to be carried out before the data is dispatched to the corresponding personnel for manual annotation.

SUMMARY OF THE INVENTION

The embodiments of the present application provide a data classification method, device, and related equipment, which can obtain the clustering center of each category of voice data, classify the voice data to be classified into the corresponding category by means of the clustering centers, and then distribute the classified data to the corresponding personnel. This can greatly improve the efficiency of manual annotation.

In a first aspect, the present application provides a data classification method, which includes the following steps:

Acquire training data, wherein the training data includes k categories, and k is a positive integer greater than 1;

The whale optimization algorithm (WOA) is used to determine the vectors corresponding to n individuals from the target training data to obtain a first vector set, wherein the target training data is the training data of any one of the k categories, and n is a positive integer greater than 1;

Use the objective optimization function to calculate the function value corresponding to each vector in the first vector set to obtain n first function values, and take the vector corresponding to the smallest of the n first function values as the optimal vector;

Perform an update operation:

The vectors corresponding to the n individuals are respectively updated through WOA to obtain a second vector set;

Calculate the distance between each vector in the second vector set and the optimal vector, and update the vector corresponding to each individual based on the distance and a first preset condition to obtain a third vector set;

Using the objective optimization function, calculate the function value corresponding to each vector in the second vector set and in the third vector set, and determine the target vector corresponding to each of the n individuals according to a second preset condition to obtain a target vector set;

Use the objective optimization function to calculate the function value corresponding to each target vector in the target vector set to obtain n target function values;

Compare the minimum of the n target function values with the function value corresponding to the optimal vector, and, when the minimum target function value is less than the function value corresponding to the optimal vector, take the target vector corresponding to the minimum target function value as the new optimal vector;

Perform the update operation a preset number of times, and use the new optimal vector obtained by the last update operation as the cluster center of the target training data;

Acquire the voice data to be classified, calculate the distance between the voice data to be classified and the cluster center of each of the k categories, and classify the voice data to be classified into the category corresponding to the cluster center with the smallest distance from it.
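As a minimal sketch of the final classification step, the following Python fragment assigns a feature vector to the category whose cluster center is nearest. The Euclidean distance (`math.dist`), the function name `classify` and the sample values are assumptions made for illustration only; the application does not fix the distance measure.

```python
import math

def classify(sample, centers, distance=math.dist):
    """Return the index of the cluster center closest to `sample`."""
    distances = [distance(sample, center) for center in centers]
    return min(range(len(centers)), key=distances.__getitem__)

# Hypothetical 2-D feature vectors and k = 3 cluster centers.
centers = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
label = classify([4.0, 6.0], centers)   # index of the nearest center
```

Any other distance function with the same two-argument signature can be passed as `distance`.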

In a second aspect, the present application provides a data classification device, the device comprising:

an acquisition module for acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;

The processing module is used to determine the vectors corresponding to n individuals from the target training data through the whale optimization algorithm WOA, and obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;

The processing module is also used for:

Perform an update operation:

Using the objective optimization function, calculate the function value corresponding to each vector in the second vector set and the function value corresponding to each vector in the third vector set, and determine the target vector corresponding to each of the n individuals according to the second preset condition to obtain the target vector set;

In a third aspect, the present application provides a computing device including a processor and a memory, and the processor and the memory can be connected to each other through a bus, or can be integrated together. The processor executes code stored in memory to implement the following methods:

Perform an update operation:

In a fourth aspect, the present application provides a computer-readable storage medium, including a program or an instruction, and when the program or instruction is run on a computer device, the computer device can be made to execute the following method:

Perform an update operation:

Building on the traditional whale optimization algorithm, this application can obtain the clustering center of each category of voice data, classify the voice data to be classified into the corresponding category by means of these clustering centers, and then distribute the data to the corresponding personnel for voice data labeling. The same group of labelers then only processes data of one category, which is more targeted and can greatly improve the efficiency of manual labeling, thereby shortening the duration of the entire AI project.

Description of Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application. Those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a traditional whale optimization algorithm provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a data classification method provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of another data classification method provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a data classification apparatus provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a computing device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.

It should be noted that the terms used in the embodiments of the present application are only for the purpose of describing specific embodiments, and are not intended to limit the present application. As used in the embodiments of this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

The technical solutions of the present application may relate to the technical field of artificial intelligence and/or big data; for example, they may specifically relate to machine learning technology, which may be used in scenarios such as data processing to realize data classification. Optionally, the data involved in this application, such as training data, voice data and/or classification result information, may be stored in a database or in a blockchain (for example, as distributed storage through a blockchain), which is not limited in this application.

To facilitate understanding of the embodiments of the present application, some related algorithms are introduced below.

The Whale Optimization Algorithm (WOA) is a meta-heuristic swarm intelligence algorithm proposed by Mirjalili and Lewis in 2016 and inspired by the hunting behavior of humpback whales. Humpback whales are social animals, and they cooperate with each other to drive and round up their prey when hunting. They have a special hunting method called the bubble-net feeding method, which is carried out by continuously releasing distinctive bubbles along a path shaped like the number "9". The whale optimization algorithm models the hunting behavior of humpback whales mathematically and is used to solve various optimization problems. In the whale optimization algorithm, a whale population consists of multiple whale individuals, which can also be called search agents. Each individual represents a possible solution to the problem to be solved, and the solution is encoded in a computer as a vector. Such a set of possible solutions is called a population, and the population as a whole has a strong diversity of solutions. In the whale algorithm, the position of each whale individual is controlled by three mechanisms: surrounding the prey, the bubble-net attack, and the random search for prey.

1. Surrounding the prey. A humpback whale can identify the position of its prey and surround it. However, since the position of the optimal solution of the problem to be solved (i.e., the target prey) in the search space is not known a priori, the WOA algorithm assumes that the current best whale individual (the best candidate solution so far) is the target prey or is close to the optimal solution. Once the best whale individual is defined, the other whale individuals try to update their positions towards it (the reference whale). The new position of each whale individual can be anywhere between its original position and the position of the current best whale individual. This behavior is represented by equations (1) and (2):

$$D = |C \cdot X^*(t) - X(t)| \tag{1}$$

$$X(t+1) = X^*(t) - A \cdot D \tag{2}$$

where t is the current iteration number, A and C are coefficient vectors, X^* is the position vector of the current best whale individual (the current optimal solution), X is the position vector of the current whale individual, | | denotes the absolute value operation, and · denotes element-wise multiplication. X^* needs to be updated whenever a better solution emerges during an iteration. A and C are computed by equations (3) and (4):

$$A = 2a \cdot r - a \tag{3}$$

$$C = 2 \cdot r \tag{4}$$

where a decreases linearly from 2 to 0 over the iterations and r is a random vector with elements in [0, 1]. The fluctuation range of A shrinks with a; in other words, A takes random values in the interval [-a, a]. Equation (2) allows any whale individual to update its position within the neighborhood of the current optimal solution, thereby simulating the whale's behavior of surrounding its prey.
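Equations (1) to (4) can be sketched in Python as follows. This is only an illustration: the function name `encircle`, the use of plain lists for position vectors, and the choice to draw independent random numbers for A and C in every dimension are assumptions, not details fixed by the application.

```python
import random

def encircle(x, x_best, a, rng=random):
    """One surround-the-prey update of position x toward x_best."""
    new_x = []
    for xi, bi in zip(x, x_best):
        A = 2 * a * rng.random() - a   # equation (3): A lies in [-a, a]
        C = 2 * rng.random()           # equation (4): C lies in [0, 2]
        D = abs(C * bi - xi)           # equation (1)
        new_x.append(bi - A * D)       # equation (2)
    return new_x

# With a = 0 the coefficient A vanishes and the whale lands on the best position.
moved = encircle([1.0, 2.0], [3.0, 4.0], a=0.0)
```

As a approaches 0 the step size A·D shrinks, so the individuals contract around the current best individual.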

2. Bubble-net attack. Humpback whales also constantly update their positions in order to drive their prey with bubble nets. The method first calculates the distance between the position of the whale individual and the position of the prey (i.e., the current best whale individual), and then creates a spiral equation between the whale individual and the prey to mimic the spiral movement of humpback whales. The spiral position update is expressed by equation (5):

$$X(t+1) = D' \cdot e^{bl} \cdot \cos(2\pi l) + X^*(t) \tag{5}$$

where D′ = |X^*(t) - X(t)| represents the distance between the current whale individual and the prey, b is a constant (1 by default) that defines the shape of the logarithmic spiral, and l is a random number in [-1, 1].
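A minimal Python sketch of the spiral update of equation (5) is given below. The function name `spiral_update` and the decision to draw a single random l per update (rather than one per dimension) are assumptions for illustration.

```python
import math
import random

def spiral_update(x, x_best, b=1.0, rng=random):
    """Bubble-net (spiral) update of position x around x_best, equation (5)."""
    l = rng.uniform(-1.0, 1.0)                          # random number in [-1, 1]
    factor = math.exp(b * l) * math.cos(2 * math.pi * l)
    # D' = |X* - X| is computed element by element.
    return [abs(bi - xi) * factor + bi for xi, bi in zip(x, x_best)]

# A whale already at the prey position has D' = 0 and therefore stays put.
unchanged = spiral_update([1.0, 2.0], [1.0, 2.0])
```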

It is worth noting that, during hunting, the shrinking encirclement of the prey and the bubble-net attack along the spiral path are carried out at the same time. To model this simultaneous behavior, it is assumed that a whale individual chooses the shrinking encirclement mechanism or the bubble-net attack to update its position with the same probability of 0.5. The mathematical model can be expressed by equation (6):

$$X(t+1) = \begin{cases} X^*(t) - A \cdot D, & p < 0.5 \\ D' \cdot e^{bl} \cdot \cos(2\pi l) + X^*(t), & p \ge 0.5 \end{cases} \tag{6}$$

where p is a random number in [0, 1]. If the generated random number p < 0.5, the whale individual updates its position by surrounding the prey; if p ≥ 0.5, it updates its position by the bubble-net attack.

3. Random search for prey. In addition to the above two methods, humpback whales also search for prey randomly, again based on the variable vector A. In fact, humpback whales search randomly according to each other's positions, so a random value of A greater than 1 or less than -1 is used to force the current whale individual away from the reference whale. Unlike the previous stages, a whale individual selected at random from the population, rather than the current best whale individual, serves as the reference whale for updating the position of the current whale individual. In the random search mechanism, |A| > 1 emphasizes exploration of the search space and allows the WOA algorithm to perform a global search. The mathematical model is expressed by equations (7) and (8):

$$D = |C \cdot X_{rand} - X| \tag{7}$$

$$X(t+1) = X_{rand} - A \cdot D \tag{8}$$

where X_rand is a position vector selected at random from the current whale population (representing a random whale individual).

To sum up, in each iteration of the WOA algorithm, each individual whale in the whale population selects one of the three methods of encircling the prey, bubble net attack and random search for the prey to update the position. The flowchart of the traditional whale optimization algorithm can be exemplarily shown in Figure 1, and the entire execution process can be simply summarized as the following steps:

S101: Define boundaries and determine algorithm parameters.

S102: Initialize the whale population X _i (i=1, 2, . . . , n), where n is the number of individual whales in the whale population.

S103: Calculate the fitness of each individual whale. The fitness is usually measured by the selected objective optimization function, and the current best individual is marked as X ^* .

S104: WOA algorithm iterative calculation, the pseudo code of this step is as follows:

While (t < maximum number of iterations T)
    for (i = 1 : n)  # for each individual whale
        Update the values of parameters a, A, C, l, p;
        if (p < 0.5)
            if (|A| < 1)
                Surround the prey, and update the current individual's position by formula (2);
            else if (|A| ≥ 1)
                Randomly select a whale individual X_rand;
                Randomly search for prey, and update the position of the current individual through formula (8);
            end if
        else if (p ≥ 0.5)
            Bubble net attack, update the position of the current individual through formula (5);
        end if
    end for
    Calculate the fitness of each individual whale, and update the best individual X^* if a better fitness value is found;
    t = t + 1
end while
return X^*
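The pseudocode of step S104 can be turned into the following runnable Python sketch. It is not the implementation of the application: the function name `woa_minimize`, the scalar A and C drawn once per individual, the clipping of positions to the search boundaries and the sphere test function are all assumptions made for illustration.

```python
import math
import random

def woa_minimize(objective, lower, upper, n=5, max_iter=50, b=1.0, seed=0):
    """Minimise `objective` inside the box [lower, upper] with a basic WOA loop."""
    rng = random.Random(seed)

    def clip(v):
        # Assumption: positions leaving the search space are pulled back to its edge.
        return [min(max(x, lo), hi) for x, lo, hi in zip(v, lower, upper)]

    # S102: initialise the population uniformly inside the boundaries.
    whales = [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)] for _ in range(n)]
    # S103: mark the current best individual X*.
    best = min(whales, key=objective)[:]
    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter            # decreases linearly from 2 toward 0
        for i, x in enumerate(whales):
            A = 2 * a * rng.random() - a
            C = 2 * rng.random()
            l = rng.uniform(-1.0, 1.0)
            p = rng.random()
            if p < 0.5:
                # |A| < 1: surround the prey (2); otherwise random search (8).
                ref = best if abs(A) < 1 else rng.choice(whales)
                new_x = [r - A * abs(C * r - xi) for xi, r in zip(x, ref)]
            else:
                # Bubble-net attack: spiral update (5).
                s = math.exp(b * l) * math.cos(2 * math.pi * l)
                new_x = [abs(bi - xi) * s + bi for xi, bi in zip(x, best)]
            whales[i] = clip(new_x)
        # Keep X* only if a strictly better individual has appeared.
        candidate = min(whales, key=objective)
        if objective(candidate) < objective(best):
            best = candidate[:]
    return best

def sphere(v):
    """Hypothetical test objective with its minimum at the origin."""
    return sum(x * x for x in v)

best = woa_minimize(sphere, lower=[-10.0, -10.0], upper=[10.0, 10.0], n=10, max_iter=100)
```

Because the random generator is seeded, repeated calls with the same arguments return the same result, which makes the sketch easy to experiment with.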

It should be understood that the above description of the traditional whale optimization algorithm is only intended to facilitate understanding of the basic idea of the algorithm, and does not limit the present application. Although the traditional whale optimization algorithm performs well on simple, small-scale problems, on complex, large-scale optimization problems it still suffers from low search accuracy, slow convergence and a tendency to fall into local optima, and therefore needs to be improved.

The Genetic Algorithm (GA) is a computational model of the biological evolution process that simulates the natural selection and genetic mechanisms of Darwin's theory of evolution. The genetic algorithm operates on all individuals in a population, and selection, crossover and mutation constitute its genetic operations. There are many mathematical implementations of the genetic operations, and a suitable one can generally be chosen according to the specific problem.

The selection operation usually selects father and mother individuals at random for crossover. For example, roulette wheel selection, which is commonly used in selection operations, is a selection strategy based on fitness proportion. Fitness can be measured by a fitness function (or objective optimization function) chosen according to the specific problem. The better the fitness of an individual, the greater its probability of being selected; at the same time, individuals with a small probability still have a chance of being selected, which maintains the diversity of the population. Since the father and mother individuals used for crossover are selected at random in roulette wheel selection, it can be regarded as an imperfect selection method. Other selection methods exist and are not introduced here.
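A fitness-proportional selection of this kind can be sketched as follows. The function name `roulette_select` is hypothetical, and the sketch assumes non-negative fitness values where a larger value means a better individual; a minimisation objective would first have to be converted into such a score.

```python
import random

def roulette_select(individuals, fitness, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    pick = rng.uniform(0.0, total)        # where the "wheel" stops
    cumulative = 0.0
    for individual, f in zip(individuals, fitness):
        cumulative += f
        if pick <= cumulative:
            return individual
    return individuals[-1]                # guard against floating-point round-off

# Hypothetical population of three individuals with fitness scores 6, 3 and 1.
parent = roulette_select(["w1", "w2", "w3"], [6.0, 3.0, 1.0])
```

With these scores, "w1" is expected to be chosen in roughly 60% of the draws, while "w3" still has a small chance of being selected.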

The crossover operation uses mathematical methods to simulate the crossover of chromosomes and the exchange of part of the genetic material in natural evolution. On vectors, the crossover operation replaces and recombines the vector elements of the father and mother individuals to generate new offspring individuals. Equation (9) gives one of the crossover methods:

$$x_{i,m} = \begin{cases} x_{r,m}, & \mathrm{rand}_{i,m} < Cr \\ x_{i,m}, & \mathrm{rand}_{i,m} \ge Cr \end{cases} \tag{9}$$

where r ∈ {1, 2, ..., n} and r ≠ i, n is the population size, Cr is the crossover probability, x_{i,m} is the m-th dimension element of the current individual X_i, and rand_{i,m} is a random number corresponding to the element x_{i,m}. To perform the crossover operation of equation (9) on the current individual X_i, a father individual X_r is first selected. If the generated random number rand_{i,m} is less than the crossover probability Cr, the m-th dimension element x_{r,m} of the father individual X_r replaces the m-th dimension element x_{i,m} of the current individual X_i (that is, the mother individual); if rand_{i,m} is greater than or equal to Cr, the m-th dimension element x_{i,m} of the current individual X_i remains unchanged. After the current individual completes the above crossover operation, a new individual is obtained. It should be noted that the above example only cross-replaces the element of one dimension of the vector. The crossover operation can also cross-replace the elements of multiple dimensions, and there are other crossover methods, such as uniform arithmetic crossover. The present application does not limit the specific implementation of the crossover operation.
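The crossover rule described above can be sketched in Python as follows. The function name `crossover` is hypothetical, and the sketch assumes the multi-dimension variant, in which the rule is applied independently to every dimension m.

```python
import random

def crossover(x_i, x_r, cr, rng=random):
    """Replace x_i[m] by x_r[m] whenever the random number for m is below cr."""
    return [xr if rng.random() < cr else xi for xi, xr in zip(x_i, x_r)]

# cr = 1.0 always takes the element of x_r, cr = 0.0 always keeps x_i.
all_from_r = crossover([1, 2, 3], [7, 8, 9], cr=1.0)
all_from_i = crossover([1, 2, 3], [7, 8, 9], cr=0.0)
```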

The mutation operation likewise uses mathematical methods to simulate mutation in nature, that is, the change of some genes in a chromosome with a certain probability. On vectors, the mutation operation changes and adjusts the vector elements of the parent individual. Equation (10) gives one of the mathematical implementations of the mutation operation:

$$x_{i,m} = \begin{cases} x_c, & \mathrm{rand}_{i,m} < Mu \\ x_{i,m}, & \mathrm{rand}_{i,m} \ge Mu \end{cases} \tag{10}$$

where r ∈ {1, 2, ..., n} and r ≠ i, n is the population size, Mu is the mutation probability, x_{i,m} is the m-th dimension element of the current individual X_i, and rand_{i,m} is a random number corresponding to the element x_{i,m}. If the generated random number rand_{i,m} is less than the mutation probability Mu, the m-th dimension element of X_i is changed to x_c, where x_c differs from x_{i,m} and can be any value in the search space; if rand_{i,m} is greater than or equal to Mu, the m-th dimension element x_{i,m} of the current individual X_i remains unchanged. It should be understood that, in addition to the above method, the mutation operation may have other mathematical implementations, which are not specifically limited in this application.
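The mutation rule described above can be sketched in the same way. The function name `mutate` and the choice to draw the replacement value x_c uniformly from the search range of the affected dimension are assumptions; the text only requires x_c to be a different value from the search space.

```python
import random

def mutate(x_i, mu, lower, upper, rng=random):
    """Replace x_i[m] by a random value from [lower[m], upper[m]] with probability mu."""
    return [rng.uniform(lo, hi) if rng.random() < mu else x
            for x, lo, hi in zip(x_i, lower, upper)]

# mu = 0.0 leaves the vector untouched; mu = 1.0 redraws every element.
same = mutate([1.0, 2.0], 0.0, lower=[0.0, 0.0], upper=[5.0, 5.0])
redrawn = mutate([1.0, 2.0], 1.0, lower=[0.0, 0.0], upper=[5.0, 5.0])
```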

The application scenarios involved in this application are described below.

Nowadays, massive amounts of raw data can be obtained everywhere, but in order to use this raw data to train machine learning and deep learning models, it must first be processed in a certain way, that is, labeled. The value of the data can be better released after labeling. For example, the data labeling platform is a very important part of the outbound robot project team. Every day, the voice data of the calls actually made by the robot is transferred to the platform for verification and corresponding data annotation, and then sent back to the model for training.

The quality and quantity of the training data often have a significant impact on a machine learning model: the better the data quality, the more stable the model performance. However, data labeling, which is the basis of artificial intelligence projects, is usually performed by humans, and can be described as the "artificial" behind artificial intelligence. High-quality data labeling is time-consuming and labor-intensive, and data labeling accounts for most of the time of an entire AI project. Moreover, massive raw voice data contains a large number of scenarios and types of data, and a labeler may receive data of multiple types, which affects labeling efficiency. Therefore, if the raw voice data can be preprocessed before it is distributed for annotation, so that data of the same category is grouped together as far as possible and the voice data of each category is then distributed to the corresponding annotators, the same batch of labelers processes only one category of data. This is more targeted and helps improve the efficiency of voice data labeling.

In view of the above problems, the embodiment of the present application discloses a data classification method, which can obtain the clustering center of each category of voice data, classify the voice data to be classified into the corresponding category by means of the clustering centers, and then distribute it to the corresponding personnel. The same group of labelers then processes data of only one category as far as possible, which is more targeted and can improve the efficiency of manual labeling, thereby shortening the duration of the entire AI project.

FIG. 2 is a flowchart of a data classification method provided by an embodiment of the present application, and the method includes the following steps:

S201: The computing device acquires training data.

The training data includes k categories, and k is a positive integer greater than 1. The source of the training data is not limited in this application: it can be obtained by the computing device 500 sending a request to a data server, taken from the data labeling platform, or entered manually.

In a possible embodiment, before acquiring the training data, the computing device extracts the speech feature vector of the training data.

S202: Determine vectors corresponding to n individuals from the target training data through WOA to obtain a first vector set.

The target training data is the training data of any one of the above k categories, each of the n individuals corresponds to a vector in the target training data, and n is a positive integer greater than 1. It should be understood that each individual in the whale population of the WOA algorithm is a possible solution for the clustering center of the category to which the target training data belongs. For the steps of the whale optimization algorithm, refer to FIG. 1 and the related content above; for brevity, they are not repeated here.

For example, assume that the target training data of the first category contains 1000 items whose corresponding vectors are d₁, d₂, ..., d₁₀₀₀. First, define the boundary for the WOA algorithm, that is, determine the search space of the cluster center c₁ of the first category; specifically, the search range of each dimension element of the c₁ vector can be set. Then determine the algorithm parameters, including the whale population size n, the maximum number of iterations T, and so on. The search space and algorithm parameters can be determined manually based on experience. Here, the whale population size n is set to 5 and the maximum number of iterations T is set to 50. A whale population is then initialized.

In the notation for the population, n is the number of whale individuals in the whale population, "0" represents the initial value (the 0th iteration), the superscript "1" represents the first category, and the subscript "i" represents the i-th individual in the population. Each individual is a possible solution of the cluster center c₁.

Five vectors (assume d₃, d₁, d₅, d₁₂ and d₃₀) are randomly selected from the vectors corresponding to the above 1000 items of target training data as the initial vectors of the five individuals in the population. The initialization of the whale population is thus completed, and the first vector set d₃, d₁, d₅, d₁₂, d₃₀ is obtained. It should be understood that only one category is used here as an example, and the same operations are performed for the other categories.
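The initialization in this example can be sketched as follows. The function name `init_population`, the synthetic two-dimensional data and the use of sampling without replacement are assumptions for illustration; the text only states that the vectors are selected at random.

```python
import random

def init_population(data, n, seed=None):
    """Pick n vectors of one category at random as the initial whale positions."""
    rng = random.Random(seed)
    return [list(v) for v in rng.sample(data, n)]

# Hypothetical stand-in for the 1000 feature vectors d1 ... d1000 of one category.
data = [[float(i), float(i % 7)] for i in range(1000)]
first_vector_set = init_population(data, n=5, seed=0)
```

Copying each sampled vector with `list(v)` keeps later position updates from modifying the training data itself.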

S203: Use the objective optimization function to calculate the function value corresponding to each vector in the first vector set to obtain the optimal vector.

Specifically, the objective optimization function is used to calculate the function value corresponding to each vector in the first vector set to obtain n first function values, and the vector corresponding to the smallest of the n first function values is taken as the optimal vector.

In a possible embodiment, the above-mentioned objective optimization function is used to calculate the sum of the distances between the candidate vector and each data in the target training data, wherein the candidate vector is a vector corresponding to any one of the n individuals. It should be understood that the smaller the function value calculated by the objective optimization function, the better the fitness of the individual, and the closer the vector corresponding to the individual is to the optimal solution of the cluster center.

In a possible embodiment, the above distance is any one of a Hamming distance, a Minkowski distance, or a cosine distance. It should be understood that there are many ways to calculate the distance between vectors, and ways other than those mentioned above may also be used to calculate the distance between vectors in this embodiment of the present application.
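The objective optimization function of this embodiment, together with the three distance measures named above, can be sketched as follows. The function names, the default Minkowski order p = 2 and the sample data are assumptions made for illustration.

```python
import math

def hamming(u, v):
    """Number of positions in which the two vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def minkowski(u, v, p=2):
    """Minkowski distance of order p (p = 2 gives the Euclidean distance)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def objective(candidate, data, distance=minkowski):
    """Sum of the distances between a candidate center and every data vector."""
    return sum(distance(candidate, d) for d in data)

def optimal_vector(vectors, data, distance=minkowski):
    """Vector with the smallest objective value, i.e. the best fitness."""
    return min(vectors, key=lambda v: objective(v, data, distance))

# Hypothetical category of three 2-D vectors; each is also tried as a candidate.
data = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
best = optimal_vector(data, data)
```

Here the middle point [1.0, 1.0] has the smallest total distance to the other vectors and is therefore selected.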

S204: Update the vectors corresponding to the n individuals respectively through WOA to obtain a second vector set.

Specifically, through one iteration of the WOA algorithm, the vector corresponding to each of the n individuals is updated, and the updated vectors of the n individuals form the second vector set. For the iterative process of the whale optimization algorithm, refer to the flowchart of the traditional whale optimization algorithm in FIG. 1 and the related description; for brevity, it is not repeated here.

For example, suppose the vector corresponding to the first individual in the first vector set is d₃. One iteration of the WOA algorithm is executed for this individual: the values of p and A are first obtained at random, and, with p < 0.5, the individual is updated according to the value of |A|. Assume that the vector corresponding to the individual changes from the original d₃ to another vector d₈; the vector d₈ is then the vector corresponding to the first individual in the second vector set. The above takes only one individual as an example. The same operation is performed on each individual of each category: the vectors corresponding to the n individuals are updated, and the updated vectors form the second vector set.

S205: Calculate the distance between each vector in the second vector set and the optimal vector respectively, update the vector corresponding to each individual according to the first preset condition, and obtain a third vector set.

Specifically, the distance between each vector in the second vector set and the optimal vector is calculated, and the vector corresponding to each individual is updated according to the distance and the first preset condition to obtain a third vector set, wherein the third vector set includes a third vector corresponding to each of the n individuals.

In a possible embodiment, as shown in FIG. 3, the distance between the vector corresponding to the target individual in the second vector set and the optimal vector is first calculated. When the distance is greater than the first threshold, a crossover operation is performed on the vector corresponding to the target individual in the second vector set and the optimal vector to obtain the third vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals.

In a possible embodiment, as shown in FIG. 3, the distance between the vector corresponding to the target individual in the second vector set and the optimal vector is calculated. When the distance is less than or equal to the first threshold, a mutation operation is performed on the vector corresponding to the target individual in the second vector set to obtain the third vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals.

For example, suppose the current optimal vector is d₅ and the target individual is the first of the n individuals, whose vector in the second vector set is now d₈. The Hamming distance between d₈ and the optimal vector d₅ is first calculated. When this Hamming distance is greater than the first threshold, the target individual's vector d₈ in the second vector set is crossed with the optimal vector d₅ to obtain a new vector (say d₄₃), and d₄₃ is used as the target individual's third vector in the third vector set. When the Hamming distance is less than or equal to the first threshold, a mutation operation is performed on d₈ to obtain a new vector (say d₄₄), and d₄₄ is used as the target individual's third vector in the third vector set. It should be understood that the above is only an example for one individual; the same operation is performed on every individual to obtain the third vector corresponding to each, and these vectors form the third vector set. It should be noted that this application does not specifically limit the mathematical implementation of the crossover and mutation operations. For an introduction to the crossover and mutation operations, refer to the foregoing content, which is not repeated here.
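The first preset condition can be sketched as follows. Since the application does not limit how crossover and mutation are implemented, this example assumes binary vectors stored as lists, single-point crossover, and single-bit-flip mutation; the names `hamming`, `crossover`, `mutate` and `third_vector` are hypothetical.

```python
import random

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def crossover(u, best, rng):
    """Single-point crossover (assumed): head of u joined to the tail of best."""
    cut = rng.randrange(1, len(u))
    return u[:cut] + best[cut:]

def mutate(u, rng):
    """Bit-flip mutation (assumed): flip one randomly chosen bit of u."""
    i = rng.randrange(len(u))
    return u[:i] + [1 - u[i]] + u[i + 1:]

def third_vector(second, best, threshold, rng):
    """Apply the first preset condition to one individual's vector."""
    if hamming(second, best) > threshold:
        return crossover(second, best, rng)   # far from the optimal vector
    return mutate(second, rng)                # close to the optimal vector
```

With `best = [1, 1, 1, 1, 1, 1]` and a threshold of 2, the vector `[0, 0, 0, 0, 0, 0]` would be crossed with `best`, while `[1, 1, 1, 1, 1, 0]` would be mutated.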

S206: Use the objective optimization function to calculate the function value corresponding to each vector in the second vector set and the third vector set, and determine the target vector corresponding to the n individuals according to the second preset condition to obtain the target vector set.

The target vector set includes a target vector corresponding to each of the n individuals.

In a possible embodiment, as shown in FIG. 3, the target optimization function is used to calculate the function value of the vector corresponding to the target individual in the second vector set and the function value of the vector corresponding to the target individual in the third vector set. When the function value of the corresponding vector in the second vector set is greater than the function value of the corresponding vector in the third vector set, the corresponding vector in the third vector set is used as the target vector corresponding to the target individual.

In a possible embodiment, as shown in FIG. 3, the target optimization function is used to calculate the function value of the vector corresponding to the target individual in the second vector set and the function value of the vector corresponding to the target individual in the third vector set. When the function value of the corresponding vector in the second vector set is less than or equal to the function value of the corresponding vector in the third vector set, the traditional krill herd algorithm (KHA) is applied to the vector corresponding to the target individual in the second vector set to obtain the target vector corresponding to the target individual, wherein the target individual is any one of the n individuals.

For example, suppose the target individual is the first of the n individuals, whose vector in the second vector set is d₈ and whose vector in the third vector set is d₄₃. The function values corresponding to d₈ and d₄₃ are calculated with the objective optimization function and compared. When the function value corresponding to d₈ is greater than that corresponding to d₄₃, the vector d₄₃ is used as the target vector corresponding to the target individual. When the function value corresponding to d₈ is less than or equal to that corresponding to d₄₃, the traditional krill herd algorithm (KHA) is applied to the target individual's vector d₈ in the second vector set to obtain the target vector corresponding to the target individual. The above is only an example for one individual; the same operations are performed separately for the other individuals. Finally, a corresponding target vector is determined for each of the n individuals, and the n target vectors are set as the target vector set. Other categories are handled in the same way.
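The second preset condition can be sketched as below. The objective follows the definition given in the claims (the sum of distances between a candidate vector and each piece of target training data), here with Hamming distance as an assumed choice. KHA itself is not implemented: `kha_update` is a hypothetical callable standing in for it, and the other names are likewise illustrative.

```python
def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def make_objective(data):
    """Build the objective: sum of distances from a candidate to every sample."""
    def objective(candidate):
        return sum(hamming(candidate, d) for d in data)
    return objective

def select_target(second, third, objective, kha_update):
    """Choose the target vector for one individual (second preset condition)."""
    if objective(second) > objective(third):
        return third               # the crossed or mutated vector is better
    return kha_update(second)      # otherwise refine the WOA vector (KHA stand-in)
```

For the training samples `[[1, 1, 0], [1, 1, 1], [1, 0, 1]]`, the candidate `[1, 1, 1]` has objective value 2 and `[0, 0, 0]` has value 7, so a second vector of `[0, 0, 0]` would be replaced by a third vector of `[1, 1, 1]`.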

S207: Use the objective optimization function to calculate the function value corresponding to each vector in the objective vector set to obtain n objective function values.

S208: Compare the smallest objective function value among the n objective function values with the function value corresponding to the optimal vector to determine a new optimal vector.

Specifically, the smallest objective function value among the n objective function values is compared with the function value corresponding to the optimal vector. When the smallest objective function value is smaller than the function value corresponding to the optimal vector, the target vector corresponding to the smallest objective function value is used as the new optimal vector.

S209: Use the new optimal vector obtained by the last update as the cluster center of the target training data.

Specifically, the update operation of steps S204 to S208 is performed a preset number of times (that is, the maximum number of iterations T), wherein the new optimal vector obtained in the (t-1)-th update operation is used as the optimal vector when S204 to S208 are executed for the t-th time, and the new optimal vector obtained by the last update operation is used as the cluster center of the target training data.
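The overall iteration can be sketched as a simple loop. In this example the per-individual work of S204 to S206 is condensed into a single hypothetical callable `update_individual(vector, best)`, and `find_cluster_center`, `objective` and `max_iter` (standing for T) are illustrative names rather than identifiers from the application.

```python
def find_cluster_center(vectors, objective, update_individual, max_iter):
    """Run max_iter update operations and return the final optimal vector."""
    # Initial optimal vector: the one with the smallest first function value.
    best = min(vectors, key=objective)
    for _ in range(max_iter):
        # S204 to S206, condensed into one hypothetical callable.
        vectors = [update_individual(v, best) for v in vectors]
        # S207: evaluate the target vectors and take the smallest value.
        candidate = min(vectors, key=objective)
        # S208: replace the optimal vector only on strict improvement.
        if objective(candidate) < objective(best):
            best = candidate
    # S209: the last optimal vector serves as the cluster center.
    return best
```

Because the comparison in S208 is strict, the optimal vector carried into the next iteration never has a larger function value than the one before it.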

S210: Acquire the speech data to be classified, and complete the classification of the speech data to be classified through the clustering center.

Specifically, the speech data to be classified is obtained, the distances between the speech data to be classified and the cluster centers of the k categories are calculated respectively, the cluster center with the smallest distance from the speech data to be classified is found, and the speech data to be classified is classified into the category corresponding to that cluster center. It should be understood that the distance between the speech data to be classified and a cluster center can be used as a measure of similarity between the data: the smaller the distance, the higher the similarity between the speech data to be classified and the data in the category corresponding to that cluster center. Therefore, the speech data to be classified can be assigned to the category of the nearest cluster center to complete its classification.

For example, given a piece of speech data to be classified, its speech feature vector is extracted to obtain d_new, and the distances between d_new and the 10 obtained cluster centers c₁ to c₁₀ are calculated respectively. If the distance between d_new and c₅ is found to be the smallest, d_new is classified into the fifth category, where the cluster center c₅ is located, which completes the classification of this speech data. Other speech data to be classified is classified in the same way.
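The nearest-center assignment of S210 can be sketched in a few lines. Euclidean distance is used here purely as an assumed example of a distance measure, and the function name `classify` is hypothetical; feature extraction from the speech data is taken as already done.

```python
import math

def classify(feature, centers):
    """Return the index of the cluster center nearest to the feature vector."""
    return min(range(len(centers)),
               key=lambda i: math.dist(feature, centers[i]))
```

For instance, with `centers = [[0, 0], [5, 5], [10, 0]]`, the call `classify([4.0, 4.5], centers)` selects index 1, that is, the second category.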

It can be seen that, building on the traditional whale optimization algorithm WOA, the embodiment of the present application can obtain the cluster center of each category of voice data, classify the voice data to be classified into the corresponding categories by means of these cluster centers, and then distribute the data to the corresponding personnel for labeling. In this way, the same group of labelers processes, as far as possible, voice data of only one category, which is more targeted, can improve the efficiency of manual labeling, and thereby shortens the time of the entire AI project.

It should be understood that the above steps S201 to S210 can be used for classification of other types of data, including classification of text data, video data, image data and other types of data, in addition to the classification of voice data. Specifically, corresponding feature extraction can be performed according to the type of data, such as facial feature extraction for video data, semantic feature extraction for text data, etc., which are not specifically limited in this application.

FIG. 4 is a schematic structural diagram of a data classification apparatus 400 provided by an embodiment of the present application. The data classification apparatus includes:

an acquisition module 401, configured to acquire training data, wherein the training data includes k categories, and k is a positive integer greater than 1;

a processing module 402, configured to determine vectors corresponding to n individuals from the target training data through the whale optimization algorithm WOA to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;

The processing module 402 is also used for:

To perform an update operation:

Each module of the data classification apparatus 400 is specifically used to implement steps S201 to S210 in the embodiment of the data classification method in FIG. 2 , and for the sake of brevity of the description, details are not repeated here.

FIG. 5 is a schematic structural diagram of a computing device 500 provided by an embodiment of the present application, and the computing device 500 may be the data classification apparatus 400 in the foregoing content. The computing device may be a notebook computer, a tablet computer, a cloud server and other computing devices, which are not limited in this application. It should be understood that the computing device may also be a computer cluster composed of at least one server, which is not specifically limited in this application. The computing device may include memory and a processor. Optionally, the computing device may also include a communication interface.

For example, the computing device 500 includes: a processor 501, a communication interface 502, and a memory 503, and the computing device is configured to execute the steps in each of the foregoing data classification method embodiments. The processor 501 , the communication interface 502 and the memory 503 can be connected to each other through the internal bus 504 , and can also communicate through other means such as wireless transmission. The embodiment of the present application takes the connection through the bus 504 as an example, and the bus 504 may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an Extended Industry Standard Architecture (Extended Industry Standard Architecture, EISA) bus or the like. The bus 504 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in FIG. 5, but it does not mean that there is only one bus or one type of bus.

The processor 501 may be composed of at least one general-purpose processor, such as a central processing unit (Central Processing Unit, CPU), or a combination of a CPU and a hardware chip. The above-mentioned hardware chip can be an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a programmable logic device (Programmable Logic Device, PLD) or a combination thereof. The above-mentioned PLD can be a complex programmable logic device (Complex Programmable Logic Device, CPLD), a field programmable gate array (Field-Programmable Gate Array, FPGA), a generic array logic (Generic Array Logic, GAL) or any combination thereof. The processor 501 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 503, which enable the computing device 500 to provide various services.

The memory 503 is used to store program code, whose execution is controlled by the processor 501, so as to perform the processing steps in each of the foregoing embodiments of the data classification method. The program code may include one or more software modules, which may be the software modules provided in the embodiment of FIG. 4, such as the acquisition module and the processing module. These modules can be used to implement steps S201 to S210, which are not repeated here.

It should be noted that this embodiment can be implemented by a general physical server, for example, an ARM server or an X86 server, or by a virtual machine built on a general physical server in combination with NFV technology, where a virtual machine is a complete computer system that has complete hardware system functions and runs in a completely isolated environment. This is not specifically limited in this application.

The memory 503 may include a volatile memory (Volatile Memory), such as a random access memory (Random Access Memory, RAM); the memory 503 may also include a non-volatile memory (Non-Volatile Memory), such as a read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), hard disk (Hard Disk Drive, HDD) or solid-state drive (Solid-State Drive, SSD); the memory 503 may also include a combination of the above types. The memory 503 may store program codes, and may specifically include program codes for executing the steps described in the embodiment of FIG. 2, which will not be repeated here.

The communication interface 502 can be an internal interface (such as a Peripheral Component Interconnect express (PCIe) bus interface, a high-speed serial computer expansion bus), a wired interface (such as an Ethernet interface), or a wireless interface (such as a cellular network interface or a wireless local area network interface), and is used to communicate with other devices or modules.

It should be noted that FIG. 5 is only one possible implementation of the embodiment of the present application. In practical applications, the computing device 500 may include more or fewer components, which is not limited here. For content not shown or described in the embodiments of the present application, reference may be made to the relevant descriptions in the foregoing embodiment in FIG. 2, and details are not repeated here.

Embodiments of the present application further provide a computer-readable storage medium in which a program or an instruction is stored; when the program or instruction runs on a processor, the method flow shown in FIG. 2 is implemented.

Optionally, the storage medium involved in this application, such as a computer-readable storage medium, may be non-volatile or volatile.

The embodiment of the present application further provides a computer program product; when the computer program product runs on a processor, the method flow shown in FIG. 2 is implemented.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.

What is disclosed above is only a preferred embodiment of the present application and, of course, cannot be used to limit the scope of the rights of the present application. Those skilled in the art can understand that implementations of all or part of the processes of the above embodiment, and equivalent changes made according to the claims of the present application, still fall within the scope of the application.

Claims

A data classification method, wherein the method comprises:

Acquire training data, wherein the training data includes k categories, and k is a positive integer greater than 1;

The whale optimization algorithm WOA is used to determine the vectors corresponding to n individuals from the target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1 ;

Use the objective optimization function to calculate the function value corresponding to each vector in the first vector set, to obtain n first function values, and use the vector corresponding to the smallest first function value among the n first function values as the optimal vector ;

To perform an update operation:

The vectors corresponding to the n individuals are respectively updated through WOA to obtain a second vector set;

Calculate the distance between each vector in the second vector set and the optimal vector respectively, and update the vector corresponding to each individual based on the distance and the first preset condition to obtain a third vector set;

Using the target optimization function, calculate the function value corresponding to each vector in the second vector set and the third vector set, determine the target vector corresponding to the n individuals by the second preset condition, and obtain the target vector set ;

Using the objective optimization function to calculate the function value corresponding to each objective vector in the objective vector set to obtain n objective function values;

Compare the minimum objective function value among the n objective function values with the function value corresponding to the optimal vector, and when the minimum objective function value is less than the function value corresponding to the optimal vector, use the target vector corresponding to the minimum objective function value as the new optimal vector;

Perform the update operation a preset number of times, and use the new optimal vector obtained by the last update operation as the cluster center of the target training data;

Acquire the speech data to be classified, calculate the distances between the speech data to be classified and the cluster centers of the k categories respectively, and classify the speech data to be classified into the category corresponding to the cluster center with the smallest distance from the speech data to be classified.
The method of claim 1, wherein the third vector set includes a third vector corresponding to each individual;

The updating of the vector corresponding to each individual according to the distance and the first preset condition to obtain a third vector set comprises:

Calculate the distance between the vector corresponding to the target individual in the second vector set and the optimal vector, wherein the target individual is any one of the n individuals;

When the distance is greater than the first threshold, perform a crossover operation on the vector corresponding to the target individual in the second vector set and the optimal vector to obtain the third vector corresponding to the target individual in the third vector set.
The method of claim 2, wherein the method further comprises:

When the distance is less than or equal to the first threshold, a mutation operation is performed on the vector corresponding to the target individual in the second vector set to obtain a third vector corresponding to the target individual in the third vector set .
The method according to any one of claims 1 to 3, wherein the determining the target vectors corresponding to the n individuals by the second preset condition to obtain a target vector set, comprising:

Using the objective optimization function, calculate the function value of the vector corresponding to the target individual in the second vector set and the function value of the vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals;

When the function value of the corresponding vector in the second vector set is greater than the function value of the corresponding vector in the third vector set, the corresponding vector in the third vector set is used as the target vector corresponding to the target individual.
The method of claim 4, wherein the method further comprises:

When the function value of the corresponding vector in the second vector set is less than or equal to the function value of the corresponding vector in the third vector set, the traditional krill herd algorithm KHA is used to obtain the target vector corresponding to the target individual.
The method according to claim 5, wherein the target optimization function is used to calculate the sum of the distances between a candidate vector and each piece of data in the target training data, wherein the candidate vector is the vector corresponding to any one of the n individuals.
The method according to claim 6, wherein the distance is any one of Hamming distance, Minkowski distance or cosine distance.
A data classification device, wherein the device comprises:

an acquisition module for acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;

The processing module is used to determine vectors corresponding to n individuals from the target training data through the whale optimization algorithm WOA, and obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;

The processing module is also used for:

Use the objective optimization function to calculate the function value corresponding to each vector in the first vector set, to obtain n first function values, and use the vector corresponding to the smallest first function value among the n first function values as the optimal vector ;

To perform an update operation:

The vectors corresponding to the n individuals are respectively updated through WOA to obtain a second vector set;

Calculate the distance between each vector in the second vector set and the optimal vector respectively, and update the vector corresponding to each individual based on the distance and the first preset condition to obtain a third vector set;

Using the objective optimization function, calculate the function value corresponding to each vector in the second vector set and the function value corresponding to each vector in the third vector set, determine the target vectors corresponding to the n individuals by the second preset condition, and obtain the target vector set;

Using the objective optimization function to calculate the function value corresponding to each objective vector in the objective vector set to obtain n objective function values;

Compare the minimum objective function value among the n objective function values with the function value corresponding to the optimal vector, and when the minimum objective function value is less than the function value corresponding to the optimal vector, use the target vector corresponding to the minimum objective function value as the new optimal vector;

Performing the update operation for a preset number of times, and using the new optimal vector obtained by the last update operation as the cluster center of the target training data;

Acquire the speech data to be classified, calculate the distances between the speech data to be classified and the cluster centers of the k categories respectively, and classify the speech data to be classified into the category corresponding to the cluster center with the smallest distance from the speech data to be classified.
A computing device including memory and a processor:

the memory for storing computer programs;

The processor is configured to execute a computer program stored in the memory, so that the computing device executes the following methods:

Acquire training data, wherein the training data includes k categories, and k is a positive integer greater than 1;

The whale optimization algorithm WOA is used to determine the vectors corresponding to n individuals from the target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1 ;

Use the objective optimization function to calculate the function value corresponding to each vector in the first vector set, to obtain n first function values, and use the vector corresponding to the smallest first function value among the n first function values as the optimal vector ;

To perform an update operation:

The vectors corresponding to the n individuals are respectively updated through WOA to obtain a second vector set;

Calculate the distance between each vector in the second vector set and the optimal vector respectively, and update the vector corresponding to each individual based on the distance and the first preset condition to obtain a third vector set;

Using the objective optimization function, calculate the function value corresponding to each vector in the second vector set and the third vector set, determine the target vector corresponding to the n individuals by the second preset condition, and obtain the target vector set ;

Using the objective optimization function to calculate the function value corresponding to each objective vector in the objective vector set to obtain n objective function values;

Compare the minimum objective function value among the n objective function values with the function value corresponding to the optimal vector, and when the minimum objective function value is less than the function value corresponding to the optimal vector, use the target vector corresponding to the minimum objective function value as the new optimal vector;

Performing the update operation for a preset number of times, and using the new optimal vector obtained by the last update operation as the cluster center of the target training data;

Acquire the speech data to be classified, calculate the distances between the speech data to be classified and the cluster centers of the k categories respectively, and classify the speech data to be classified into the category corresponding to the cluster center with the smallest distance from the speech data to be classified.
The computing device of claim 9, wherein the third vector set includes a third vector corresponding to each individual;

The updating of the vector corresponding to each individual based on the distance and the first preset condition to obtain a third vector set comprises:

Calculate the distance between the vector corresponding to the target individual in the second vector set and the optimal vector, wherein the target individual is any one of the n individuals;

When the distance is greater than the first threshold, perform a crossover operation on the vector corresponding to the target individual in the second vector set and the optimal vector to obtain the third vector corresponding to the target individual in the third vector set.
The computing device of claim 10, wherein the processor is further configured to perform:

When the distance is less than or equal to the first threshold, a mutation operation is performed on the vector corresponding to the target individual in the second vector set to obtain a third vector corresponding to the target individual in the third vector set .
The computing device according to any one of claims 9 to 11, wherein the determining of the target vectors corresponding to the n individuals by the second preset condition is performed to obtain a target vector set, comprising:

Using the objective optimization function, calculate the function value of the vector corresponding to the target individual in the second vector set and the function value of the vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals;

When the function value of the corresponding vector in the second vector set is greater than the function value of the corresponding vector in the third vector set, the corresponding vector in the third vector set is used as the target vector corresponding to the target individual.
The computing device of claim 12, wherein the processor is further configured to perform:

When the function value of the corresponding vector in the second vector set is less than or equal to the function value of the corresponding vector in the third vector set, the traditional krill herd algorithm KHA is used to obtain the target vector corresponding to the target individual.
The computing device according to claim 13, wherein the target optimization function is used to calculate the sum of the distances between a candidate vector and each piece of data in the target training data, wherein the candidate vector is the vector corresponding to any one of the n individuals;

The distance is any one of Hamming distance, Minkowski distance or cosine distance.
A computer-readable storage medium, comprising a program or an instruction, when the program or instruction is executed on a computer device, the following method is performed:

Acquire training data, wherein the training data includes k categories, and k is a positive integer greater than 1;

The whale optimization algorithm WOA is used to determine the vectors corresponding to n individuals from the target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1 ;

Use the objective optimization function to calculate the function value corresponding to each vector in the first vector set, to obtain n first function values, and use the vector corresponding to the smallest first function value among the n first function values as the optimal vector ;

To perform an update operation:

The vectors corresponding to the n individuals are respectively updated through WOA to obtain a second vector set;

Calculate the distance between each vector in the second vector set and the optimal vector respectively, and update the vector corresponding to each individual based on the distance and the first preset condition to obtain a third vector set;

Using the target optimization function, calculate the function value corresponding to each vector in the second vector set and the third vector set, determine the target vector corresponding to the n individuals by the second preset condition, and obtain the target vector set ;

Using the objective optimization function to calculate the function value corresponding to each objective vector in the objective vector set to obtain n objective function values;

Compare the minimum objective function value among the n objective function values with the function value corresponding to the optimal vector, and determine the minimum objective when the minimum objective function value is less than the function value corresponding to the optimal vector The target vector corresponding to the function value is used as the new optimal vector;

Perform the update operation a preset number of times, and take the new optimal vector obtained by the last update operation as the cluster center of the target training data;

Acquire speech data to be classified, calculate the distance between the speech data to be classified and the cluster center of each of the k categories, and classify the speech data to be classified into the category corresponding to the cluster center with the smallest distance from the speech data to be classified.
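For illustration only, the repeated update operation and the final nearest-center classification recited above could be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `find_cluster_center`, `classify`, `update_step` and `objective` are hypothetical, the speech data is assumed to be already represented as numeric feature vectors, the n individuals are assumed to be initialised by sampling the target training data, the Euclidean distance is an assumed choice of metric, and `update_step` is a caller-supplied placeholder standing in for the WOA update, the distance-based update and the selection step.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_cluster_center(
    data: Sequence[Sequence[float]],
    n: int,
    iterations: int,
    objective: Callable[[Sequence[float]], float],
    update_step: Callable[[List[Vector], Vector], List[Vector]],
    rng: random.Random,
) -> Vector:
    """Return the best vector found after a preset number of update operations."""
    # First vector set: n individuals sampled from one category (assumption).
    population = [list(rng.choice(data)) for _ in range(n)]
    # Optimal vector: the individual with the smallest function value.
    best = list(min(population, key=objective))
    for _ in range(iterations):
        # Hypothetical placeholder for the WOA update, crossover/mutation
        # and selection; it returns the target vector set.
        population = update_step(population, best)
        challenger = min(population, key=objective)
        # Replace the optimal vector only on strict improvement.
        if objective(challenger) < objective(best):
            best = list(challenger)
    return best


def classify(sample: Sequence[float], centers: Dict[str, Sequence[float]]) -> str:
    """Assign the sample to the category whose cluster center is nearest."""
    return min(centers, key=lambda label: euclidean(sample, centers[label]))
```

In such a sketch, `find_cluster_center` would be called once per category to build the `centers` mapping that `classify` then consults.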
The computer-readable storage medium of claim 15, wherein the third vector set includes a third vector corresponding to each of the individuals;

The updating of the vector corresponding to each individual based on the distance and the first preset condition to obtain a third vector set comprises:

Calculate the distance between the vector corresponding to a target individual in the second vector set and the optimal vector, wherein the target individual is any one of the n individuals;

When the distance is greater than a first threshold, perform a crossover operation on the vector corresponding to the target individual in the second vector set and the optimal vector to obtain the third vector corresponding to the target individual in the third vector set.
The computer-readable storage medium of claim 16, wherein, when the program or instruction is executed on a computer device, it is further configured to:

When the distance is less than or equal to the first threshold, perform a mutation operation on the vector corresponding to the target individual in the second vector set to obtain the third vector corresponding to the target individual in the third vector set.
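The threshold rule of the two preceding claims (crossover when the individual is far from the optimal vector, mutation otherwise) might be sketched as below. The claims do not specify the crossover or mutation operators, so uniform crossover and Gaussian perturbation are assumptions made purely for illustration, as are the function names, the `scale` parameter and the use of the Euclidean distance.

```python
import math
import random
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def crossover(vec: Sequence[float], best: Sequence[float], rng: random.Random) -> List[float]:
    """Uniform crossover (assumed operator): each component comes from vec or best."""
    return [b if rng.random() < 0.5 else v for v, b in zip(vec, best)]


def mutate(vec: Sequence[float], rng: random.Random, scale: float) -> List[float]:
    """Gaussian perturbation (assumed operator) with standard deviation `scale`."""
    return [v + rng.gauss(0.0, scale) for v in vec]


def third_vector(
    vec: Sequence[float],
    best: Sequence[float],
    threshold: float,
    rng: random.Random,
    scale: float = 0.1,
) -> List[float]:
    """Build the third vector for one individual from its second vector."""
    if euclidean(vec, best) > threshold:
        # Far from the optimal vector: recombine with it.
        return crossover(vec, best, rng)
    # At or within the threshold: perturb the individual instead.
    return mutate(vec, rng, scale)
```

Under these assumptions, crossover pulls distant individuals toward the optimal vector while mutation keeps nearby individuals from collapsing onto it.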
The computer-readable storage medium according to any one of claims 15 to 17, wherein the determining, by the second preset condition, of the target vectors corresponding to the n individuals to obtain a target vector set comprises:

Using the target optimization function, calculate the function value of the vector corresponding to a target individual in the second vector set and the function value of the vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals;

When the function value of the vector corresponding to the target individual in the second vector set is greater than the function value of the vector corresponding to the target individual in the third vector set, take the vector corresponding to the target individual in the third vector set as the target vector corresponding to the target individual.
The computer-readable storage medium of claim 18, wherein the program or instruction, when executed on a computer device, is further configured to:

When the function value of the vector corresponding to the target individual in the second vector set is less than or equal to the function value of the vector corresponding to the target individual in the third vector set, use the conventional krill herd algorithm (KHA) to obtain the target vector corresponding to the target individual.
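The selection rule of the two preceding claims can be illustrated with a short sketch. The KHA update itself is not detailed in these claims, so it is represented here by a hypothetical caller-supplied callable `kha_update`; applying it to the second vector is an assumption, as are the function and parameter names.

```python
from typing import Callable, List, Sequence


def select_target(
    second: Sequence[float],
    third: Sequence[float],
    objective: Callable[[Sequence[float]], float],
    kha_update: Callable[[Sequence[float]], Sequence[float]],
) -> List[float]:
    """Choose the target vector for one individual."""
    if objective(second) > objective(third):
        # The third vector has the smaller function value, so keep it.
        return list(third)
    # Otherwise (including ties) fall back to the KHA step, passed in as a
    # hypothetical callable because its details are not given here.
    return list(kha_update(second))
```

For example, with `objective = lambda v: sum(x * x for x in v)`, the call `select_target([2.0], [1.0], objective, kha_update)` keeps the third vector `[1.0]` without invoking `kha_update`.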
The computer-readable storage medium of claim 19, wherein the target optimization function is used to calculate the sum of the distances between a candidate vector and each piece of data in the target training data, wherein the candidate vector is the vector corresponding to any one of the n individuals;

The distance is any one of the Hamming distance, the Minkowski distance, or the cosine distance.
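As a hedged illustration of the target optimization function described above, the sketch below sums the distance from a candidate vector to every sample of one category, with the three named distances as interchangeable options. The function names are hypothetical, the samples are assumed to be equal-length numeric vectors, and the cosine distance is assumed to mean one minus the cosine of the included angle, which the claim does not state.

```python
import math
from typing import Callable, Sequence


def hamming(a: Sequence[float], b: Sequence[float]) -> float:
    """Number of positions at which the two vectors differ."""
    return float(sum(x != y for x, y in zip(a, b)))


def minkowski(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance; p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine of the included angle (assumed definition)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def objective(
    candidate: Sequence[float],
    data: Sequence[Sequence[float]],
    distance: Callable[[Sequence[float], Sequence[float]], float] = minkowski,
) -> float:
    """Sum of distances between the candidate vector and each training sample."""
    return sum(distance(candidate, row) for row in data)
```

A smaller value of `objective` means the candidate lies closer, in aggregate, to the samples of that category, which is why the vector with the minimum function value serves as the cluster center.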