WO2022127037A1 - Data classification method and apparatus, and related device - Google Patents

Data classification method and apparatus, and related device

Info

Publication number
WO2022127037A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
target
function value
vector set
individual
Prior art date
Application number
PCT/CN2021/096647
Other languages
French (fr)
Chinese (zh)
Inventor
张楠
王健宗
瞿晓阳
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022127037A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Definitions

  • the present application relates to the field of data processing, and in particular, to a data classification method, apparatus and related equipment.
  • The data labeling platform is a very important part of the outbound-call robot project. Every day, the voice data from the robot's actual outbound calls is transferred to the platform for verification and labeling, and then sent back to the model for training.
  • AI: artificial intelligence
  • The embodiments of the present application provide a data classification method, apparatus, and related device that obtain the cluster center of each category of voice data, classify the voice data to be classified into the corresponding category by means of those cluster centers, and then distribute the data to the corresponding personnel for labeling, which can greatly improve the efficiency of manual annotation.
  • the present application provides a data classification method, which includes the following steps:
  • acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
  • determining, through the whale optimization algorithm (WOA), the vectors corresponding to n individuals from the target training data to obtain a first vector set, wherein the target training data is the training data of any one of the k categories, and n is a positive integer greater than 1;
  • updating the vectors corresponding to the n individuals respectively through WOA to obtain a second vector set;
  • calculating, with the target optimization function, the function value corresponding to each vector in the second vector set and in the third vector set, and determining the target vectors corresponding to the n individuals according to the second preset condition to obtain the target vector set;
  • the present application provides a data classification device, the device comprising:
  • an acquisition module for acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
  • the processing module is used to determine the vectors corresponding to n individuals from the target training data through the whale optimization algorithm WOA, and obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
  • the processing module is also used for:
  • updating the vectors corresponding to the n individuals respectively through WOA to obtain a second vector set;
  • calculating the function value corresponding to each vector in the second vector set and the function value corresponding to each vector in the third vector set, and determining the target vectors corresponding to the n individuals according to the second preset condition to obtain the target vector set;
  • the present application provides a computing device including a processor and a memory, and the processor and the memory can be connected to each other through a bus, or can be integrated together.
  • the processor executes code stored in memory to implement the following methods:
  • acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
  • determining, through the whale optimization algorithm (WOA), the vectors corresponding to n individuals from the target training data to obtain a first vector set, wherein the target training data is the training data of any one of the k categories, and n is a positive integer greater than 1;
  • updating the vectors corresponding to the n individuals respectively through WOA to obtain a second vector set;
  • calculating, with the target optimization function, the function value corresponding to each vector in the second vector set and in the third vector set, and determining the target vectors corresponding to the n individuals according to the second preset condition to obtain the target vector set;
  • the present application provides a computer-readable storage medium, including a program or an instruction, and when the program or instruction is run on a computer device, the computer device can be made to execute the following method:
  • acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
  • determining, through the whale optimization algorithm (WOA), the vectors corresponding to n individuals from the target training data to obtain a first vector set, wherein the target training data is the training data of any one of the k categories, and n is a positive integer greater than 1;
  • updating the vectors corresponding to the n individuals respectively through WOA to obtain a second vector set;
  • calculating, with the target optimization function, the function value corresponding to each vector in the second vector set and in the third vector set, and determining the target vectors corresponding to the n individuals according to the second preset condition to obtain the target vector set;
  • In summary, this application can obtain the cluster center of each category of voice data, classify the voice data to be classified into the corresponding category by means of those cluster centers, and then distribute the data to the corresponding personnel for labeling. The same group of labelers then only processes data in one category, which is more targeted and can greatly improve the efficiency of manual labeling, thereby shortening the duration of the entire AI project.
  • FIG. 1 is a schematic flowchart of a traditional whale optimization algorithm provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a data classification method provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another data classification method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a data classification apparatus provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • the technical solutions of the present application may relate to the technical field of artificial intelligence and/or big data, for example, may specifically relate to machine learning technology, which may be used in scenarios such as data processing to realize data classification.
  • The data involved in this application, such as training data, voice data and/or classification result information, may be stored in a database or in a blockchain (for example, distributed storage through a blockchain), which is not limited in this application.
  • Whale Optimization Algorithm is a meta-heuristic swarm intelligence algorithm proposed by Mirjalili and Lewis in 2016, which is inspired by the hunting behavior of humpback whales.
  • Humpback whales are social animals, and they cooperate with each other to drive and round up their prey when hunting. They have a special hunting method called bubble-net feeding method. It is done by continuously releasing unique bubbles on the path in the shape of the number "9".
  • the whale optimization algorithm mathematically models the hunting behavior of humpback whales and is used to solve various optimization problems.
  • a whale population consists of multiple whale individuals, which can also be called search agents.
  • Each individual represents a possible solution to the problem to be solved, and the solution is encoded in the computer as a vector.
  • Such a set of possible solutions is called a population, and the whole population has a strong diversity of solutions.
  • the position of each individual whale is controlled by three parts: surround prey, bubble net attack and random search for prey.
  • A humpback whale can identify the position of its prey and surround it. However, since the position of the optimal solution to the problem to be solved (i.e., the target prey) in the search space is not known a priori, the WOA algorithm assumes that the current best individual whale (the best candidate solution so far) is the target prey or is close to the optimal solution. After the best whale individual is defined, the other whale individuals try to update their positions towards it (the reference whale). The new position of each whale individual can be anywhere between its original position and the position of the current best whale individual. This behavior is represented by equations (1) and (2):
  • Equation (2) allows any individual whale to update its position within the neighborhood of the current optimal solution, thus simulating the whale's prey-encircling behavior.
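  • The bodies of equations (1) and (2) are not reproduced in this text. In the WOA formulation published by Mirjalili and Lewis, the encircling step is usually written as below. Treating this as the form intended here is an assumption, as are the symbols a and r and the numbering (3) and (4) for the coefficient vectors.

```latex
\begin{align}
\vec{D} &= \left| \vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t) \right| \tag{1} \\
\vec{X}(t+1) &= \vec{X}^{*}(t) - \vec{A} \cdot \vec{D} \tag{2} \\
\vec{A} &= 2\,\vec{a} \cdot \vec{r} - \vec{a} \tag{3} \\
\vec{C} &= 2\,\vec{r} \tag{4}
\end{align}
```

  • Here t is the current iteration, X*(t) is the position vector of the current best individual, X(t) is the position vector of the individual being updated, r is a random vector with elements in [0, 1], and a is decreased linearly from 2 to 0 over the iterations in that formulation.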
  • Humpback whales also constantly update their positions in order to use bubble nets to drive away prey.
  • the method first calculates the distance between the individual whale's position and the position of the prey (i.e., the current best individual whale), and then creates a spiral equation between the individual whale and the prey to mimic the spiral movement of humpback whales. Its spiral position update formula is expressed by formula (5):
  • where D′ is the distance between the individual whale and the prey (the current best individual whale), b is a constant that defines the shape of the logarithmic spiral (generally 1 by default), and l is a random number in [-1, 1].
  • p is a random number in [0, 1]. If the generated random number p < 0.5, the whale individual updates its position by surrounding the prey; if p ≥ 0.5, the bubble net attack method is used to update the position.
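  • The body of formula (5) is not reproduced in this text. In the commonly cited WOA formulation, the spiral update and its combination with the encircling update are written as below. That this matches the form intended here is an assumption, as is the numbering (6) for the combined rule.

```latex
\begin{align}
\vec{X}(t+1) &= \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t) \tag{5} \\
\vec{X}(t+1) &=
\begin{cases}
\vec{X}^{*}(t) - \vec{A} \cdot \vec{D}, & p < 0.5 \\
\vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t), & p \ge 0.5
\end{cases} \tag{6}
\end{align}
```

  • In this form, D′ = |X*(t) − X(t)| is the element-wise distance between the current individual and the current best individual.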
  • In addition to the bubble net attack, humpback whales also search for prey randomly, which is likewise modeled through variation of the vector A.
  • In fact, humpback whales search randomly according to each other's positions, so a random value of A greater than 1 or less than -1 is used to force the current individual whale away from the reference whale.
  • In this case, a randomly selected whale individual in the population is used as the reference whale to update the position of the current whale individual, instead of the current best whale individual.
  • |A| > 1 emphasizes exploration of the search space and allows the WOA algorithm to perform a global search.
  • the mathematical model is expressed by equations (7) (8):
  • X rand is a random position vector selected from the current whale population (representing a random individual whale)
  • each individual whale in the whale population selects one of the three methods of encircling the prey, bubble net attack and random search for the prey to update the position.
  • the flowchart of the traditional whale optimization algorithm can be exemplarily shown in Figure 1, and the entire execution process can be simply summarized as the following steps:
  • S101 Define boundaries and determine algorithm parameters.
  • S103 Calculate the fitness of each individual whale. The fitness is usually measured by the selected objective optimization function, and the current best individual is marked as X * .
  • Bubble net attack: update the position of the current individual through formula (5);
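  • The three position-update methods described above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the applicant's implementation: the name `woa_update` is hypothetical, A and C are drawn as scalars per individual rather than as vectors, and the rule that |A| < 1 selects the best individual as the reference whale follows the commonly cited WOA formulation.

```python
import math
import random


def woa_update(x, best, population, a, b=1.0, rng=random):
    """Return the updated position of individual x for one WOA iteration."""
    p = rng.random()
    if p < 0.5:
        # Scalar coefficients A and C (an assumed simplification of the vector form).
        A = 2.0 * a * rng.random() - a
        C = 2.0 * rng.random()
        # |A| < 1: surround the best individual; otherwise search around a random one.
        ref = best if abs(A) < 1.0 else rng.choice(population)
        return [r_m - A * abs(C * r_m - x_m) for r_m, x_m in zip(ref, x)]
    # Bubble net attack: logarithmic spiral around the best individual.
    l = rng.uniform(-1.0, 1.0)
    spiral = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    return [abs(b_m - x_m) * spiral + b_m for b_m, x_m in zip(best, x)]
```

  • With a = 0 the coefficient A is always 0, so an individual already located at the best position stays there under either branch, which gives a simple sanity check for the sketch.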
  • Genetic Algorithm is a computational model of the biological evolution process that simulates the natural selection and genetic mechanism of Darwin's theory of biological evolution. Genetic algorithm takes all individuals in a population as the object, and selection, crossover and mutation constitute the genetic operation of genetic algorithm. There are many mathematical implementation methods for genetic manipulation, and generally, a suitable mathematical implementation method can be selected according to specific problems.
  • The selection operation usually selects the father and mother individuals for crossover at random.
  • The roulette wheel selection method, which is commonly used in selection operations, is a selection strategy based on the proportion of fitness.
  • The fitness can be measured by choosing an appropriate fitness function (or objective optimization function) for the specific problem. The better the fitness of an individual, the greater its probability of being selected; at the same time, individuals with a small selection probability still have the opportunity to be selected, which maintains the diversity of the population. Since the father and mother individuals used for crossover are selected randomly in the roulette wheel selection method, it is not a perfect selection method. Other selection methods exist and are not introduced here.
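  • Fitness-proportional selection can be sketched as follows. The function name `roulette_select` is hypothetical, and the sketch assumes non-negative scores where a larger score means better fitness; for a minimization objective the function values would first have to be converted into such scores.

```python
import random


def roulette_select(fitness, rng=random):
    """Return the index of one individual, chosen with probability
    proportional to its non-negative fitness score."""
    total = sum(fitness)
    if total <= 0:
        # No usable scores: fall back to a uniform choice.
        return rng.randrange(len(fitness))
    pick = rng.random() * total
    running = 0.0
    for i, f in enumerate(fitness):
        running += f
        if pick < running:
            return i
    # Guard against floating-point rounding at the upper end.
    return len(fitness) - 1
```

  • Calling the function twice yields the two individuals used for crossover; an individual with a small score is selected rarely but not never, which is how the method keeps the population diverse.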
  • The crossover operation uses mathematical methods to simulate the process in natural evolution in which chromosomes cross over and exchange part of their genetic material.
  • On vectors, the crossover operation is implemented by replacing and recombining the vector elements of the father and mother individuals to generate new offspring individuals. Equation (9) gives one of the crossover methods:
  • where r ∈ {1, 2, ..., n} and r ≠ i, n is the population size, Cr is the crossover probability, x i,m is the m-th dimension element of the current individual X i , and rand i,m is a random number corresponding to the element x i,m .
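  • The body of equation (9) is not reproduced in this text. One reading consistent with the variable definitions above is an element-wise crossover in which x i,m is replaced by the matching element of another individual X r whenever rand i,m is less than Cr. The sketch below implements that reading; both the rule and the name `crossover` are assumptions.

```python
import random


def crossover(x_i, x_r, cr, rng=random):
    """Element-wise crossover (assumed form): each element of x_i is replaced
    by the matching element of x_r when a fresh random number is below cr."""
    return [xr_m if rng.random() < cr else xi_m for xi_m, xr_m in zip(x_i, x_r)]
```

  • With cr = 0 the current individual is returned unchanged, and with cr = 1 every element is taken from the other individual.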
  • the mutation operation is also the process of using mathematical methods to simulate the mutation in nature and the change of some genes in the chromosome under a certain probability.
  • the realization of the mutation operation in the vector is to make changes and adjustments to the vector elements of the parent individual. Equation (10) gives one of the mathematical implementations of the mutation operation:
  • where r ∈ {1, 2, ..., n} and r ≠ i, n is the population size, Mu is the mutation probability, x i,m is the m-th dimension element of the current individual X i , and rand i,m is a random number corresponding to the element x i,m . If the generated random number rand i,m is less than the mutation probability Mu, the m-th dimension element of X i is changed to x c , which differs from x i,m and can be any value in the search space; if rand i,m is greater than or equal to Mu, the m-th dimension element x i,m of the current individual X i remains unchanged.
  • the mutation operation may also have other mathematical implementation manners, which are not specifically limited in this application.
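  • The mutation rule described above can be sketched as follows. The name `mutate` and the per-dimension `lower` and `upper` bounds, used here to represent the search space, are assumptions made for illustration.

```python
import random


def mutate(x_i, mu, lower, upper, rng=random):
    """Replace each element with a different value from the search range
    when a fresh random number is below the mutation probability mu."""
    out = []
    for m, x in enumerate(x_i):
        if rng.random() < mu:
            x_c = rng.uniform(lower[m], upper[m])
            # Redraw until the new value differs from the old one.
            while x_c == x and lower[m] != upper[m]:
                x_c = rng.uniform(lower[m], upper[m])
            out.append(x_c)
        else:
            out.append(x)
    return out
```

  • Elements whose random number is greater than or equal to mu are copied through unchanged, matching the second branch of the rule.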
  • Data labeling is a very important part of the outbound-call robot project. Every day, the voice data from the robot's actual outbound calls is transferred to the platform for verification and labeling, and then sent back to the model for training.
  • The original voice data can first be grouped so that, as far as possible, similar data falls into the same category, and the voice data of each category is then distributed to the corresponding annotators. This allows the same batch of labelers to process only one category of data, which is more targeted and helps improve the efficiency of voice data labeling.
  • The embodiment of the present application discloses a data classification method that obtains the cluster center of each category of voice data, classifies the voice data to be classified into the corresponding category by means of those cluster centers, and then distributes the data to the corresponding personnel for labeling.
  • In this way, the same group of labelers processes data in only one category as far as possible, which is more targeted and can improve the efficiency of manual labeling, thereby shortening the duration of the entire AI project.
  • FIG. 2 is a flowchart of a data classification method provided by an embodiment of the present application, and the method includes the following steps:
  • S201 The computing device acquires training data.
  • the training data includes k categories, and k is a positive integer greater than 1.
  • The source of the training data is not limited in this application: it can be obtained by the computing device 500 sending a request to a data server, taken from the data labeling platform, or entered manually.
  • the computing device before acquiring the training data, extracts the speech feature vector of the training data.
  • S202 Determine vectors corresponding to n individuals from the target training data through WOA to obtain a first vector set.
  • the target training data is the training data of any one of the above k categories, each of the n individuals corresponds to a vector in the target training data, and n is a positive integer greater than 1. It should be understood that each individual in the whale population of the WOA algorithm is a possible solution to the clustering center of the category where the target training data is located.
  • The steps of the whale optimization algorithm are described in Figure 1 and the related content above and, for brevity, are not repeated here.
  • For example, suppose the target training data is the training data of the first category and contains 1000 pieces of data, whose corresponding vectors are d 1 , d 2 , ..., d 1000 .
  • First define the boundary for the WOA algorithm, that is, determine the search space of the cluster center c 1 of the first category. Specifically, the search range of each dimension element of the vector c 1 can be set. Then determine the algorithm parameters, including the whale population size n, the maximum number of iterations T, and so on.
  • the search space and algorithm parameters can be determined manually based on experience.
  • the whale population size n is set to 5, and the maximum number of iterations T of the algorithm is 50.
  • In the notation for the initial population, n is the number of whale individuals in the whale population, "0" denotes the initial value (the 0th iteration), the superscript "1" denotes the first category, and the subscript "i" denotes the i-th individual in the population. Each individual is a possible solution of the cluster center c 1 .
  • To initialize the five individuals in the population, 5 vectors (say d 3 , d 1 , d 5 , d 12 , d 30 ) are randomly selected from the vectors corresponding to the above 1000 pieces of target training data as the initial vectors of these 5 individuals. This completes the initialization of the whale population and yields the first vector set d 3 , d 1 , d 5 , d 12 , d 30 . It should be understood that only one category is used here as an example; the same operations are performed for the other categories.
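  • The initialization step can be sketched in a few lines. The name `init_population` is hypothetical, and drawing the n vectors without replacement is an assumption; the text only says that they are selected randomly.

```python
import random


def init_population(vectors, n, rng=random):
    """Pick n training vectors at random (without replacement) as the
    initial vectors of the n individuals, i.e. the first vector set."""
    return [list(v) for v in rng.sample(vectors, n)]
```

  • With 1000 feature vectors and n = 5, the call `init_population(vectors, 5)` returns five copies of training vectors, corresponding to d 3 , d 1 , d 5 , d 12 , d 30 in the example.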
  • S203 Use the objective optimization function to calculate the function value corresponding to each vector in the first vector set to obtain the optimal vector.
  • Specifically, the objective optimization function is used to calculate the function value corresponding to each vector in the first vector set, giving n first function values, and the vector corresponding to the smallest of the n first function values is taken as the optimal vector.
  • the above-mentioned objective optimization function is used to calculate the sum of the distances between the candidate vector and each data in the target training data, wherein the candidate vector is a vector corresponding to any one of the n individuals. It should be understood that the smaller the function value calculated by the objective optimization function, the better the fitness of the individual, and the closer the vector corresponding to the individual is to the optimal solution of the cluster center.
  • The above distance is any one of the Hamming distance, the Minkowski distance, or the cosine distance. It should be understood that there are many ways to calculate the distance between vectors, and methods other than those listed may also be used in this embodiment of the present application.
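  • The objective optimization function and the three named distances can be sketched as follows. The function names are hypothetical, the Minkowski order p = 2 is an arbitrary default, and taking the cosine distance as one minus the cosine similarity is an assumption.

```python
import math


def hamming(u, v):
    """Number of positions at which the two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def minkowski(u, v, p=2):
    """Minkowski distance of order p (p=2 gives the Euclidean distance)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def cosine_distance(u, v):
    """One minus the cosine of the angle; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def objective(candidate, data, dist=minkowski):
    """Sum of the distances between the candidate vector and every data vector."""
    return sum(dist(candidate, d) for d in data)


def best_vector(candidates, data, dist=minkowski):
    """Candidate with the smallest objective value (the optimal vector)."""
    return min(candidates, key=lambda c: objective(c, data, dist))
```

  • A smaller objective value means the candidate lies closer, in total, to the data of its category, which is why the minimum is taken as the current best estimate of the cluster center.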
  • the vectors corresponding to each of the n individuals are respectively updated, and the vectors corresponding to the updated n individuals are set as the second vector set.
  • For the iterative process of the whale optimization algorithm, refer to the flowchart of the traditional whale optimization algorithm in Fig. 1 and the related description; for brevity, it is not repeated here.
  • For example, the vector corresponding to the first individual in the first vector set is d 3 . One iteration of the WOA algorithm is executed for this individual: the values of p and A are first obtained randomly, and the position update method is then selected according to these values.
  • the above content only takes an individual as an example. The same operation is performed on each individual in each category. The corresponding vectors of n individuals are updated, and the updated corresponding vectors of the n individuals are set as the second vector set.
  • S205 Calculate the distance between each vector in the second vector set and the optimal vector respectively, update the vector corresponding to each individual according to the first preset condition, and obtain a third vector set.
  • Specifically, the distance between each vector in the second vector set and the optimal vector is calculated, and the vector corresponding to each individual is updated according to this distance and the first preset condition to obtain a third vector set, wherein the third vector set includes a third vector corresponding to each of the n individuals.
  • In one case, the distance between the vector corresponding to the target individual in the second vector set and the optimal vector is first calculated. When this distance is greater than the first threshold, a crossover operation is performed on the vector corresponding to the target individual in the second vector set and the optimal vector to obtain the third vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals.
  • In the other case, the distance between the vector corresponding to the target individual in the second vector set and the optimal vector is calculated. When this distance is less than or equal to the first threshold, a mutation operation is performed on the vector corresponding to the target individual in the second vector set to obtain the third vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals.
  • For example, suppose the target individual is the first of the n individuals, its corresponding vector in the second vector set is d 8 , and the best vector is d 5 . The Hamming distance between d 8 and d 5 is first calculated.
  • When the Hamming distance is greater than the first threshold, the vector d 8 is crossed with the best vector d 5 to obtain a new vector (say d 43 ), and d 43 is used as the third vector corresponding to the target individual in the third vector set.
  • When the Hamming distance is less than or equal to the first threshold, a mutation operation is performed on the vector d 8 to obtain a new vector (say d 44 ), and d 44 is used as the third vector corresponding to the target individual in the third vector set.
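  • The first preset condition can be sketched as a threshold test that chooses between crossover and mutation. The name `third_vector`, the default probabilities, and the single `bounds` pair standing in for the search space are assumptions; for brevity the mutation here does not enforce that the new value differs from the old one.

```python
import random


def hamming(u, v):
    """Number of positions at which the two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def third_vector(second, best, threshold, cr=0.5, mu=0.1,
                 bounds=(0.0, 1.0), rng=random):
    """Derive the third vector of one individual from its second vector."""
    if hamming(second, best) > threshold:
        # Far from the best vector: cross over with the best vector.
        return [b if rng.random() < cr else s for s, b in zip(second, best)]
    # Close to the best vector: mutate to keep the population diverse.
    lo, hi = bounds
    return [rng.uniform(lo, hi) if rng.random() < mu else s for s in second]
```

  • Individuals far from the best vector are pulled towards it through crossover, while individuals already close to it are perturbed, which reduces the risk of the whole population collapsing onto one point.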
  • S206 Use the objective optimization function to calculate the function value corresponding to each vector in the second vector set and the third vector set, and determine the target vector corresponding to the n individuals according to the second preset condition to obtain the target vector set.
  • the target vector set includes a target vector corresponding to each of the n individuals.
  • Specifically, the target optimization function is used to calculate the function value of the vector corresponding to the target individual in the second vector set and the function value of its corresponding vector in the third vector set.
  • When the function value of the corresponding vector in the second vector set is greater than the function value of the corresponding vector in the third vector set, the corresponding vector in the third vector set is used as the target vector corresponding to the target individual;
  • when the function value of the corresponding vector in the second vector set is less than or equal to the function value of the corresponding vector in the third vector set, the traditional krill swarm algorithm (KHA) is applied to the corresponding vector in the second vector set to obtain the target vector corresponding to the target individual.
  • For example, suppose the vector corresponding to the target individual in the second vector set is d 8 and its corresponding vector in the third vector set is d 43 .
  • The function values corresponding to d 8 and d 43 are calculated through the objective optimization function and compared.
  • When the function value corresponding to d 8 is greater than the function value corresponding to d 43 , the vector d 43 is used as the target vector corresponding to the target individual;
  • when the function value corresponding to d 8 is less than or equal to the function value corresponding to d 43 , the traditional krill swarm algorithm (KHA) is applied to the vector d 8 in the second vector set to obtain the target vector corresponding to the target individual.
  • The smallest of the n objective function values is compared with the function value corresponding to the optimal vector. When the smallest objective function value is smaller than the function value corresponding to the optimal vector, the target vector corresponding to the smallest objective function value is used as the new optimal vector.
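  • The second preset condition and the update of the optimal vector can be sketched as follows. The names `select_targets` and `update_best` are hypothetical, and `kha_update` is a placeholder callable: the krill swarm update itself is not specified in this text, so the caller has to supply it.

```python
def select_targets(second_set, third_set, objective, kha_update):
    """Choose the target vector of every individual."""
    targets = []
    for second, third in zip(second_set, third_set):
        if objective(second) > objective(third):
            # The third vector is better: keep it as the target vector.
            targets.append(third)
        else:
            # Otherwise refine the second vector with the supplied KHA step.
            targets.append(kha_update(second))
    return targets


def update_best(best, targets, objective):
    """Replace the optimal vector when the best target vector improves on it."""
    candidate = min(targets, key=objective)
    return candidate if objective(candidate) < objective(best) else best
```

  • Repeating the update, selection, and `update_best` steps for the preset number of iterations and keeping the final optimal vector yields the cluster center of the category.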
  • S209 Use the new optimal vector obtained by the last update as the cluster center of the target training data.
  • The update operations of steps S204 to S208 are performed a preset number of times, wherein the new optimal vector obtained in the (t-1)-th execution is used as the optimal vector when S204 to S208 are executed for the t-th time, and the new optimal vector obtained by the last update operation is used as the cluster center of the target training data.
  • S210 Acquire the speech data to be classified, and complete the classification of the speech data to be classified through the clustering center.
  • Specifically, the speech data to be classified is obtained, the distances between it and the cluster centers of the k categories are calculated respectively, the cluster center with the smallest distance is found, and the speech data to be classified is assigned to the category corresponding to that cluster center.
  • The distance between the speech data to be classified and a cluster center can be used as a measure of similarity. The smaller the distance, the higher the similarity between the speech data to be classified and the data in the category corresponding to that cluster center. Therefore, the speech data to be classified can be assigned to the category of the nearest cluster center, which completes its classification.
  • For example, given a piece of speech data to be classified, its speech feature vector d new is extracted, and the distances between d new and the 10 obtained cluster centers c 1 to c 10 are calculated. If the distance between d new and c 5 is the smallest, d new is classified into the fifth category, where the cluster center c 5 is located. Other speech data to be classified is handled in the same way.
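  • Nearest-center classification can be sketched as follows. The names `euclidean` and `classify` are hypothetical, and the Euclidean distance is used only as a default; any of the distances mentioned earlier could be passed in instead.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(vector, centers, dist=euclidean):
    """Return the 0-based index of the cluster center nearest to the vector."""
    return min(range(len(centers)), key=lambda j: dist(vector, centers[j]))
```

  • In the example above, `classify(d_new, [c_1, ..., c_10])` would return index 4, the 0-based position of c 5 , so d new is assigned to the fifth category.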
  • In summary, the embodiment of the present application can obtain the cluster center of each category of voice data, classify the voice data to be classified into the corresponding category by means of those cluster centers, and then distribute the data to the corresponding personnel for labeling. The same group of labelers then only processes voice data in one category, which is more targeted and can greatly improve the efficiency of manual labeling, thereby shortening the duration of the entire AI project.
  • steps S201 to S210 can be used for classification of other types of data, including classification of text data, video data, image data and other types of data, in addition to the classification of voice data.
  • corresponding feature extraction can be performed according to the type of data, such as facial feature extraction for video data, semantic feature extraction for text data, etc., which are not specifically limited in this application.
  • FIG. 4 is a schematic structural diagram of a data classification apparatus 400 provided by an embodiment of the present application.
  • the data classification apparatus includes:
  • an acquisition module 401 configured to acquire training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
  • the processing module 402 is used for determining vectors corresponding to n individuals from the target training data through the whale optimization algorithm WOA to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
  • the processing module 402 is also used for:
  • updating, by means of the WOA, the vectors corresponding to the n individuals respectively to obtain a second vector set;
  • calculating, using the target optimization function, the function value corresponding to each vector in the second vector set and the function value corresponding to each vector in the third vector set, and determining the target vectors corresponding to the n individuals according to the second preset condition to obtain the target vector set;
  • Each module of the data classification apparatus 400 is specifically used to implement steps S201 to S210 in the embodiment of the data classification method in FIG. 2 , and for the sake of brevity of the description, details are not repeated here.
  • FIG. 5 is a schematic structural diagram of a computing device 500 provided by an embodiment of the present application, and the computing device 500 may be the data classification apparatus 400 in the foregoing content.
  • the computing device may be a notebook computer, a tablet computer, a cloud server and other computing devices, which are not limited in this application. It should be understood that the computing device may also be a computer cluster composed of at least one server, which is not specifically limited in this application.
  • the computing device may include memory and a processor.
  • the computing device may also include a communication interface.
  • the computing device 500 includes: a processor 501, a communication interface 502, and a memory 503, and the computing device is configured to execute the steps in each of the foregoing data classification method embodiments.
  • the processor 501 , the communication interface 502 and the memory 503 can be connected to each other through the internal bus 504 , and can also communicate through other means such as wireless transmission.
  • the embodiment of the present application takes the connection through the bus 504 as an example, and the bus 504 may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an Extended Industry Standard Architecture (Extended Industry Standard Architecture, EISA) bus or the like.
  • the bus 504 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in FIG. 5, but it does not mean that there is only one bus or one type of bus.
  • the processor 501 may be composed of at least one general-purpose processor, such as a central processing unit (Central Processing Unit, CPU), or a combination of a CPU and a hardware chip.
  • the above-mentioned hardware chip can be an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a programmable logic device (Programmable Logic Device, PLD) or a combination thereof.
  • the above-mentioned PLD can be a complex programmable logic device (Complex Programmable Logic Device, CPLD), a field programmable gate array (Field-Programmable Gate Array, FPGA), a general array logic (Generic Array Logic, GAL) or any combination thereof.
  • Processor 501 executes various types of digitally stored instructions, such as software or firmware programs stored in memory 503, which enable computing device 500 to provide various services.
  • the memory 503 is used to store program codes, and is controlled and executed by the processor 501, so as to execute the processing steps in each of the foregoing embodiments of the data classification methods.
  • the program code may include one or more software modules, and the one or more software modules may be the software modules provided in the embodiment of FIG. 4, such as an acquisition module and a processing module. These modules can be used to implement steps S201 to S210, which will not be repeated here.
  • this embodiment can be implemented by a general physical server, for example, an ARM server or an X86 server, or by a virtual machine implemented on a general physical server in combination with NFV technology. A virtual machine is a complete computer system with complete hardware system functions that runs in a completely isolated environment, which is not specifically limited in this application.
  • the memory 503 may include a volatile memory (Volatile Memory), such as a random access memory (Random Access Memory, RAM); the memory 503 may also include a non-volatile memory (Non-Volatile Memory), such as a read-only memory (Read-Only Memory, ROM), a flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD) or a solid-state drive (Solid-State Drive, SSD); the memory 503 may also include a combination of the above types.
  • the memory 503 may store program codes, and may specifically include program codes for executing the steps described in the embodiment of FIG. 2 , which will not be repeated here.
  • the communication interface 502 can be a wired interface (such as an Ethernet interface), an internal interface (such as a Peripheral Component Interconnect express (PCIe) bus interface), or a wireless interface (such as a cellular network interface or a wireless local area network interface), and is used to communicate with other devices or modules.
  • FIG. 5 is only a possible implementation manner of the embodiment of the present application.
  • the computing device 500 may further include more or fewer components, which is not limited here.
  • Embodiments of the present application further provide a computer-readable storage medium in which a program or an instruction is stored; when the program or instruction runs on a processor, the method flow shown in FIG. 2 is implemented.
  • the storage medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
  • the embodiment of the present application further provides a computer program product, when the computer program product runs on the processor, the method flow shown in FIG. 2 is realized.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.

Abstract

The present application discloses a data classification method, said method comprising: acquiring training data, the training data comprising k categories, and k being a positive integer greater than 1; determining, by means of a WOA, vectors corresponding to n individuals, so as to obtain a first vector set; calculating, by using a target optimization function, a function value corresponding to each vector in the first vector set, so as to obtain an optimal vector; then performing a preset number of update operations on each individual by means of the WOA, and taking the finally obtained optimal vector as a clustering center; and finally completing the classification of voice data to be classified by means of the clustering centers of the k categories. According to the embodiments of the present application, the clustering centers of voice data of various categories can be acquired, and then the voice data to be classified is classified according to the clustering centers and is dispatched to corresponding persons, so that the same batch of annotation personnel only process data of one category as much as possible, improving the efficiency of data annotation, and thus reducing the time of the whole AI project.

Description

A data classification method, apparatus and related device
This application claims priority to Chinese patent application No. 202011503667.5, filed with the China Patent Office on December 17, 2020 and entitled "A data classification method, apparatus and related device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data processing, and in particular to a data classification method, apparatus and related device.
Background
The data labeling platform is a very important link in an outbound-call robot project. Every day, the voice data from the robot's actual outbound calls flows to the platform for verification and corresponding data labeling, and is then sent back to the model for training.
The inventors realized that data labeling, as a foundation of such artificial intelligence (AI) projects, is usually done manually. High-quality data labeling is especially time-consuming and labor-intensive, and the processing of massive data consumes most of the time of an entire AI project. Moreover, massive data contains large batches of data from many scenarios and of many types, so a certain amount of preprocessing is needed before the data is dispatched to the corresponding personnel for manual labeling.
Summary of the Invention
The embodiments of the present application provide a data classification method, apparatus and related device that can obtain the cluster center of each category of voice data, assign the voice data to be classified to the corresponding category by means of the cluster centers, and then dispatch it to the corresponding personnel for voice data labeling, which greatly improves the efficiency of manual labeling.
In a first aspect, the present application provides a data classification method, which includes the following steps:
acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
determining, by means of the whale optimization algorithm (WOA), vectors corresponding to n individuals from target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
calculating, using a target optimization function, the function value corresponding to each vector in the first vector set to obtain n first function values, and taking the vector corresponding to the smallest of the n first function values as the optimal vector;
performing an update operation:
updating, by means of the WOA, the vectors corresponding to the n individuals respectively to obtain a second vector set;
calculating the distance between each vector in the second vector set and the optimal vector, and updating the vector corresponding to each individual according to the distance and a first preset condition to obtain a third vector set;
calculating, using the target optimization function, the function value corresponding to each vector in the second vector set and in the third vector set, and determining the target vectors corresponding to the n individuals according to a second preset condition to obtain a target vector set;
calculating, using the target optimization function, the function value corresponding to each target vector in the target vector set to obtain n target function values;
comparing the smallest of the n target function values with the function value corresponding to the optimal vector, and, when the smallest target function value is less than the function value corresponding to the optimal vector, taking the target vector corresponding to the smallest target function value as the new optimal vector;
performing the update operation a preset number of times, and taking the new optimal vector obtained by the last update operation as the cluster center of the target training data;
acquiring voice data to be classified, calculating the distance between the voice data to be classified and the cluster center of each of the k categories, and assigning the voice data to be classified to the category corresponding to the cluster center at the smallest distance from it.
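The bookkeeping of the optimal vector across the update operations can be sketched as follows. This is only an illustrative skeleton under stated assumptions: `find_cluster_center`, `objective` and `update_step` are hypothetical names; the objective shown (sum of squared distances to the samples of one category) is an assumed stand-in for the target optimization function; and `update_step` is a placeholder for the WOA update together with the first and second preset conditions, which are not reproduced here.

```python
import random

def find_cluster_center(population, objective, update_step, num_updates):
    """Track the optimal vector across a preset number of update operations."""
    # Initial optimal vector: smallest first function value in the first vector set.
    best = min(population, key=objective)
    best_value = objective(best)
    for _ in range(num_updates):
        # update_step stands in for the WOA update and the two preset conditions.
        population = update_step(population, best)
        candidate = min(population, key=objective)
        candidate_value = objective(candidate)
        # Replace the optimal vector only when the new minimum is strictly smaller.
        if candidate_value < best_value:
            best, best_value = candidate, candidate_value
    return best

# Hypothetical objective: sum of squared distances to the samples of one category.
samples = [[1.0, 1.0], [2.0, 0.0], [3.0, 2.0]]

def objective(v):
    return sum(sum((a - b) ** 2 for a, b in zip(v, s)) for s in samples)

# Placeholder update: move each vector halfway toward the optimal vector, with jitter.
def update_step(population, best):
    return [[x + 0.5 * (b - x) + random.uniform(-0.1, 0.1) for x, b in zip(v, best)]
            for v in population]

random.seed(0)
first_set = [[random.uniform(0.0, 4.0) for _ in range(2)] for _ in range(5)]
center = find_cluster_center(first_set, objective, update_step, num_updates=20)
```

Because the optimal vector is replaced only when a strictly smaller function value is found, the returned vector is never worse than the best vector of the first vector set.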
In a second aspect, the present application provides a data classification apparatus, which includes:
an acquisition module, configured to acquire training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
a processing module, configured to determine, by means of the whale optimization algorithm (WOA), vectors corresponding to n individuals from target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
the processing module is further configured to:
calculate, using a target optimization function, the function value corresponding to each vector in the first vector set to obtain n first function values, and take the vector corresponding to the smallest of the n first function values as the optimal vector;
perform an update operation:
update, by means of the WOA, the vectors corresponding to the n individuals respectively to obtain a second vector set;
calculate the distance between each vector in the second vector set and the optimal vector, and update the vector corresponding to each individual according to the distance and a first preset condition to obtain a third vector set;
calculate, using the target optimization function, the function value corresponding to each vector in the second vector set and the function value corresponding to each vector in the third vector set, and determine the target vectors corresponding to the n individuals according to a second preset condition to obtain a target vector set;
calculate, using the target optimization function, the function value corresponding to each target vector in the target vector set to obtain n target function values;
compare the smallest of the n target function values with the function value corresponding to the optimal vector, and, when the smallest target function value is less than the function value corresponding to the optimal vector, take the target vector corresponding to the smallest target function value as the new optimal vector;
perform the update operation a preset number of times, and take the new optimal vector obtained by the last update operation as the cluster center of the target training data;
acquire voice data to be classified, calculate the distance between the voice data to be classified and the cluster center of each of the k categories, and assign the voice data to be classified to the category corresponding to the cluster center at the smallest distance from it.
In a third aspect, the present application provides a computing device including a processor and a memory, which may be connected to each other through a bus or integrated together. The processor executes code stored in the memory to implement the following method:
acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
determining, by means of the whale optimization algorithm (WOA), vectors corresponding to n individuals from target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
calculating, using a target optimization function, the function value corresponding to each vector in the first vector set to obtain n first function values, and taking the vector corresponding to the smallest of the n first function values as the optimal vector;
performing an update operation:
updating, by means of the WOA, the vectors corresponding to the n individuals respectively to obtain a second vector set;
calculating the distance between each vector in the second vector set and the optimal vector, and updating the vector corresponding to each individual according to the distance and a first preset condition to obtain a third vector set;
calculating, using the target optimization function, the function value corresponding to each vector in the second vector set and in the third vector set, and determining the target vectors corresponding to the n individuals according to a second preset condition to obtain a target vector set;
calculating, using the target optimization function, the function value corresponding to each target vector in the target vector set to obtain n target function values;
comparing the smallest of the n target function values with the function value corresponding to the optimal vector, and, when the smallest target function value is less than the function value corresponding to the optimal vector, taking the target vector corresponding to the smallest target function value as the new optimal vector;
performing the update operation a preset number of times, and taking the new optimal vector obtained by the last update operation as the cluster center of the target training data;
acquiring voice data to be classified, calculating the distance between the voice data to be classified and the cluster center of each of the k categories, and assigning the voice data to be classified to the category corresponding to the cluster center at the smallest distance from it.
In a fourth aspect, the present application provides a computer-readable storage medium including a program or instructions which, when run on a computer device, cause the computer device to perform the following method:
acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
determining, by means of the whale optimization algorithm (WOA), vectors corresponding to n individuals from target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
calculating, using a target optimization function, the function value corresponding to each vector in the first vector set to obtain n first function values, and taking the vector corresponding to the smallest of the n first function values as the optimal vector;
performing an update operation:
updating, by means of the WOA, the vectors corresponding to the n individuals respectively to obtain a second vector set;
calculating the distance between each vector in the second vector set and the optimal vector, and updating the vector corresponding to each individual according to the distance and a first preset condition to obtain a third vector set;
calculating, using the target optimization function, the function value corresponding to each vector in the second vector set and in the third vector set, and determining the target vectors corresponding to the n individuals according to a second preset condition to obtain a target vector set;
calculating, using the target optimization function, the function value corresponding to each target vector in the target vector set to obtain n target function values;
comparing the smallest of the n target function values with the function value corresponding to the optimal vector, and, when the smallest target function value is less than the function value corresponding to the optimal vector, taking the target vector corresponding to the smallest target function value as the new optimal vector;
performing the update operation a preset number of times, and taking the new optimal vector obtained by the last update operation as the cluster center of the target training data;
acquiring voice data to be classified, calculating the distance between the voice data to be classified and the cluster center of each of the k categories, and assigning the voice data to be classified to the category corresponding to the cluster center at the smallest distance from it.
Based on the traditional whale optimization algorithm, the present application can obtain the cluster center of each category of voice data, assign the voice data to be classified to the corresponding category by means of those cluster centers, and then dispatch it to the corresponding personnel for voice data labeling. The same batch of labelers thus handles, as far as possible, data of only one category, which is more targeted, can greatly improve the efficiency of manual labeling, and thereby shortens the time of the entire AI project.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a traditional whale optimization algorithm provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data classification method provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of another data classification method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a data classification apparatus provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
It should be noted that the terms used in the embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application. The singular forms "a", "said" and "the" used in the embodiments of this application and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
The technical solutions of the present application may relate to the technical fields of artificial intelligence and/or big data, and may specifically relate to machine learning technology used in scenarios such as data processing to realize data classification. Optionally, the data involved in this application, such as training data, voice data and/or classification result information, may be stored in a database or in a blockchain, for example through blockchain-based distributed storage, which is not limited in this application.
To facilitate understanding of the embodiments of the present application, some related algorithms are introduced below.
The whale optimization algorithm (WOA) is a meta-heuristic swarm intelligence algorithm proposed by Mirjalili and Lewis in 2016 and inspired by the hunting behavior of humpback whales. Humpback whales are social animals that cooperate to drive and round up prey when hunting. They have a special hunting technique, called the bubble-net feeding method, in which they continuously release distinctive bubbles along a circular path or a path shaped like the digit "9". The whale optimization algorithm models this hunting behavior mathematically and is used to solve various optimization problems. In the algorithm, a whale population consists of multiple whale individuals, also called search agents. Each individual represents one possible solution to the problem to be solved, and that solution is encoded in the computer as a vector. The set of such possible solutions is called the population, and the population as a whole has high solution diversity. In the whale algorithm, the position of each whale individual is governed by three mechanisms: encircling prey, bubble-net attack, and random search for prey.
1. Encircling prey. Humpback whales can recognize the location of prey and encircle it. However, because the position of the optimal solution (that is, the target prey) in the search space is not known a priori, the WOA assumes that the current best whale individual (the best candidate solution) is the target prey or is close to the optimal solution. Once the best whale individual has been defined, the other whale individuals try to update their positions toward it (the reference whale). The new position of each whale individual can be anywhere between its original position and the current best whale individual. This behavior is expressed by equations (1) and (2):
$$D = \left| C \cdot X^{*}(t) - X(t) \right| \tag{1}$$
$$X(t+1) = X^{*}(t) - A \cdot D \tag{2}$$
where t is the current iteration number, A and C are coefficient vectors, $X^{*}$ is the position vector of the current best whale individual (the current optimal solution), X is the position vector of the current whale individual, | | denotes the absolute value, and · denotes element-wise multiplication. $X^{*}$ must be updated whenever a better solution appears during an iteration. A and C are calculated by equations (3) and (4):
$$A = 2a \cdot r - a \tag{3}$$
$$C = 2 \cdot r \tag{4}$$
where a decreases linearly from 2 to 0 over the course of the iterations and r is a random vector with components in [0, 1]. The fluctuation range of A shrinks with a; in other words, A is a random value in the interval [-a, a]. Equation (2) allows any whale individual to update its position within the neighborhood of the current optimal solution, which simulates the whales' prey-encircling behavior.
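Equations (1) to (4) can be sketched in code as follows. This is a minimal illustration under assumptions: the function name `encircle_update` is hypothetical, positions are plain lists of floats, and r is drawn independently for A and for C in each component, whereas the text uses a single symbol r for both.

```python
import random

def encircle_update(x, x_best, a):
    """Move x toward x_best per equations (1)-(4), one component at a time."""
    new_x = []
    for xi, bi in zip(x, x_best):
        A = 2 * a * random.random() - a   # equation (3): A lies in [-a, a]
        C = 2 * random.random()           # equation (4): C lies in [0, 2]
        D = abs(C * bi - xi)              # equation (1)
        new_x.append(bi - A * D)          # equation (2)
    return new_x

# As a falls from 2 toward 0, |A| shrinks and the update collapses onto x_best.
moved = encircle_update([1.0, 2.0], [3.0, 4.0], a=1.5)
converged = encircle_update([1.0, 2.0], [3.0, 4.0], a=0.0)
```

When a reaches 0, A is 0 and the new position coincides with the current best individual, which matches the shrinking of the encirclement over the iterations.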
2. Bubble-net attack. A humpback whale also keeps updating its position while it uses a bubble net to drive its prey. The method first computes the distance between the position of the whale individual and the position of the prey (the current best whale individual), and then builds a spiral equation between the two to mimic the spiral movement of the humpback whale. The spiral position update is given by equation (5):
$$X(t+1) = D' \cdot e^{bl} \cdot \cos(2\pi l) + X^{*}(t) \tag{5}$$
where $D' = |X^{*}(t) - X(t)|$ is the distance between the current whale individual and the prey, $b$ is a constant (1 by default) that defines the shape of the logarithmic spiral, and $l$ is a random number in $[-1, 1]$.
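The spiral update of equation (5) can be sketched in the same way. The function name `spiral_update` and the choice of a single scalar $l$ per individual are assumptions of this sketch.

```python
import math
import random


def spiral_update(x, x_best, b=1.0, rng=random):
    """One bubble-net (spiral) update following equation (5)."""
    l = rng.uniform(-1.0, 1.0)                        # random number in [-1, 1]
    factor = math.exp(b * l) * math.cos(2 * math.pi * l)
    # D' = |X*(t) - X(t)| is computed per dimension.
    return [abs(bi - xi) * factor + bi for xi, bi in zip(x, x_best)]
```

If the individual already coincides with the prey, $D'$ is zero and the position does not move.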
Note that during hunting the whale shrinks the encirclement and follows the spiral bubble-net path at the same time. To model this simultaneous behavior, each whale individual is assumed to choose between the shrinking-encirclement mechanism and the bubble-net attack with equal probability 0.5 when updating its position. The mathematical model is given by equation (6):
$$X(t+1) = \begin{cases} X^{*}(t) - A \cdot D, & p < 0.5 \\ D' \cdot e^{bl} \cdot \cos(2\pi l) + X^{*}(t), & p \geq 0.5 \end{cases} \tag{6}$$
where $p$ is a random number in $[0, 1]$. If $p < 0.5$, the whale individual updates its position by encircling the prey; if $p \geq 0.5$, it updates its position by the bubble-net attack.
3. Random search for prey. In addition to the two mechanisms above, humpback whales also search for prey at random, again driven by the variable vector $A$. Humpback whales in fact search randomly according to one another's positions, so values of $A$ greater than 1 or less than $-1$ are used to force the current whale individual away from the reference whale. Unlike the previous phases, the position of the current whale individual is updated with a whale individual chosen at random from the population as the reference whale, rather than with the current best whale individual. The random-search mechanism uses $|A| > 1$, which emphasizes exploration of the search space and lets the WOA perform a global search. The mathematical model is given by equations (7) and (8):
$$D = \left| C \cdot X_{\mathrm{rand}} - X \right| \tag{7}$$
$$X(t+1) = X_{\mathrm{rand}} - A \cdot D \tag{8}$$
where $X_{\mathrm{rand}}$ is a position vector chosen at random from the current whale population (that is, a random whale individual).
In summary, in each iteration of the WOA every whale individual in the population updates its position by one of three mechanisms: encircling the prey, the bubble-net attack, or random search for prey. A flowchart of the conventional whale optimization algorithm is shown by way of example in FIG. 1, and the whole procedure can be summarized in the following steps:
S101: Define the boundaries and determine the algorithm parameters.
S102: Initialize the whale population $X_i$ ($i = 1, 2, \ldots, n$), where $n$ is the number of whale individuals in the population.
S103: Compute the fitness of each whale individual, which is usually measured by the chosen objective optimization function, and mark the current best individual as $X^{*}$.
S104: Run the WOA iterations. The pseudocode for this step is as follows:
while (t < maximum number of iterations T)
    for (i = 1 : n)    # for each whale individual
        update the values of the parameters a, A, C, l, p;
        if1 (p < 0.5)
            if2 (|A| < 1)
                encircle the prey: update the position of the current individual by equation (2);
            else if2 (|A| ≥ 1)
                select a random whale individual X_rand;
                search for prey at random: update the position of the current individual by equation (8);
            end if2
        else if1 (p ≥ 0.5)
            bubble-net attack: update the position of the current individual by equation (5);
        end if1
    end for
    compute the fitness of each whale individual and update the best individual X* with any individual of better fitness;
    t = t + 1
end while
return X*
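The pseudocode above can be turned into a small runnable sketch. The version below is illustrative only and makes several assumptions that the description does not state: the function name `woa_minimize`, a lower fitness value being better, one scalar $A$ and $C$ per individual (so that the test $|A| < 1$ is well defined), uniform random initialization inside the box `bounds`, and clipping of updated positions back into that box.

```python
import math
import random


def woa_minimize(fitness, bounds, n=10, T=100, b=1.0, seed=0):
    """Minimize `fitness` over the box `bounds` = [(lo, hi), ...] with a basic WOA."""
    rng = random.Random(seed)
    # S102: initialize the population inside the boundaries defined in S101.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    # S103: mark the current best individual X*.
    best = min(pop, key=fitness)[:]
    best_f = fitness(best)
    for t in range(T):                      # S104
        a = 2.0 * (1 - t / T)               # decreases linearly from 2 toward 0
        for i, x in enumerate(pop):
            A = 2 * a * rng.random() - a    # eq. (3), scalar per individual
            C = 2 * rng.random()            # eq. (4)
            l = rng.uniform(-1.0, 1.0)
            p = rng.random()
            if p < 0.5:
                # |A| < 1: encircle the best (eq. 2); otherwise explore (eq. 8).
                ref = best if abs(A) < 1 else rng.choice(pop)
                new = [r - A * abs(C * r - xi) for xi, r in zip(x, ref)]
            else:
                # Bubble-net attack (eq. 5).
                k = math.exp(b * l) * math.cos(2 * math.pi * l)
                new = [abs(bj - xi) * k + bj for xi, bj in zip(x, best)]
            # Assumption: keep every coordinate inside its search range.
            pop[i] = [min(max(v, lo), hi) for v, (lo, hi) in zip(new, bounds)]
        cand = min(pop, key=fitness)
        cand_f = fitness(cand)
        if cand_f < best_f:                 # update X* only on improvement
            best, best_f = cand[:], cand_f
    return best, best_f
```

A call such as `woa_minimize(lambda v: sum(c * c for c in v), [(-5.0, 5.0)] * 2)` returns the best position found and its function value.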
It should be understood that the above introduction to the conventional whale optimization algorithm is given only to aid understanding of its basic idea and does not limit the present application. Although the conventional whale optimization algorithm performs well on simple, small-scale problems, on complex, large-scale optimization problems it suffers from low search accuracy, slow convergence, and a tendency to become trapped in local optima, and therefore needs to be improved.
A genetic algorithm (GA) is a computational model of biological evolution that simulates the natural selection and genetic mechanisms of Darwin's theory of evolution. A genetic algorithm operates on all individuals of a population, and its genetic operations are selection, crossover, and mutation. Each genetic operation has several mathematical realizations, and a suitable one can generally be chosen for the problem at hand.
The selection operation usually chooses, at random, the father and mother individuals to be used for crossover. For example, roulette-wheel selection, one of the more common selection methods, is a fitness-proportionate strategy in which fitness is measured by a fitness function (or objective optimization function) chosen for the specific problem. The better an individual's fitness, the higher its probability of being selected; individuals with a low probability can still be selected, which preserves the diversity of the population. Because the father and mother individuals used for crossover are chosen at random in roulette-wheel selection, it can be regarded as a less-than-perfect selection method. Other selection methods exist and are not described further here.
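As a minimal sketch of roulette-wheel selection, the following Python function draws individuals with probability proportional to their fitness. It assumes that a larger fitness value is better and that all fitness values are non-negative; the function name `roulette_select` is hypothetical.

```python
import random


def roulette_select(population, fitness_values, k=2, rng=random):
    """Pick k individuals with probability proportional to fitness."""
    if min(fitness_values) < 0 or sum(fitness_values) <= 0:
        raise ValueError("fitness values must be non-negative with a positive sum")
    # random.choices samples with replacement using the given weights.
    return rng.choices(population, weights=fitness_values, k=k)
```

For a minimization problem, such as the distance-based function used later in this description, the function values would first have to be converted into weights that grow as the value shrinks; that conversion is left open here.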
The crossover operation uses a mathematical method to simulate chromosome crossover in natural evolution, in which part of the genetic material is exchanged. Implemented on vectors, crossover generates a new offspring individual by replacing and recombining the vector elements of the father and mother individuals. Equation (9) gives one such crossover scheme:
$$x_{i,m} \leftarrow \begin{cases} x_{r,m}, & \mathrm{rand}_{i,m} < Cr \\ x_{i,m}, & \mathrm{rand}_{i,m} \geq Cr \end{cases} \tag{9}$$
where $r \in \{1, 2, \ldots, n\}$ and $r \neq i$, $n$ is the population size, $Cr$ is the crossover probability, $x_{i,m}$ is the $m$-th dimension element of the current individual $X_i$, and $\mathrm{rand}_{i,m}$ is a random number associated with the element $x_{i,m}$. To apply the crossover of equation (9) to the current individual $X_i$, a father individual $X_r$ is selected first. If the generated random number $\mathrm{rand}_{i,m}$ is less than the crossover probability $Cr$, the $m$-th dimension element $x_{i,m}$ of the current individual $X_i$ (the mother individual) is replaced by the $m$-th dimension element $x_{r,m}$ of the father individual $X_r$; if $\mathrm{rand}_{i,m}$ is greater than or equal to $Cr$, $x_{i,m}$ is left unchanged. After the crossover operation the current individual becomes a new individual. Note that the example above replaces the element of only one dimension of the vector. Crossover can also replace the elements of several dimensions, and other crossover schemes exist, such as uniform arithmetic crossover; the present application does not limit how the crossover operation is implemented.
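The element-replacement crossover described above might look as follows in Python. This sketch applies the test against $Cr$ independently in every dimension, which is one of the multi-dimension variants the text allows; the function name `crossover` is hypothetical.

```python
import random


def crossover(x_i, x_r, cr, rng=random):
    """Replace each element of x_i by the matching element of x_r with probability cr."""
    return [xr if rng.random() < cr else xi for xi, xr in zip(x_i, x_r)]
```

With `cr = 0` the mother individual is returned unchanged, and with `cr = 1` every element is taken from the father individual.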
The mutation operation likewise uses a mathematical method to simulate mutation in nature, in which some genes of a chromosome change with a certain probability. Implemented on vectors, mutation alters the vector elements of the parent individual. Equation (10) gives one mathematical realization of the mutation operation:
$$x_{i,m} \leftarrow \begin{cases} x_{c}, & \mathrm{rand}_{i,m} < Mu \\ x_{i,m}, & \mathrm{rand}_{i,m} \geq Mu \end{cases} \tag{10}$$
where $r \in \{1, 2, \ldots, n\}$ and $r \neq i$, $n$ is the population size, $Mu$ is the mutation probability, $x_{i,m}$ is the $m$-th dimension element of the current individual $X_i$, and $\mathrm{rand}_{i,m}$ is a random number associated with the element $x_{i,m}$. If the generated random number $\mathrm{rand}_{i,m}$ is less than the mutation probability $Mu$, the $m$-th dimension element of $X_i$ is changed to a value $x_c$ different from $x_{i,m}$, where $x_c$ can be any value in the search space; if $\mathrm{rand}_{i,m}$ is greater than or equal to $Mu$, the $m$-th dimension element $x_{i,m}$ of the current individual $X_i$ is left unchanged. It should be understood that the mutation operation can also be realized mathematically in other ways, which the present application does not specifically limit.
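A corresponding sketch of the mutation operation is given below. It assumes that the search space is a box described by per-dimension `(lo, hi)` ranges and that $x_c$ is drawn uniformly from that range; both the function name `mutate` and these choices are illustrative. A uniform draw could in principle return the old value again, which this sketch does not guard against.

```python
import random


def mutate(x_i, bounds, mu, rng=random):
    """Resample each element of x_i inside its search range with probability mu."""
    return [
        rng.uniform(lo, hi) if rng.random() < mu else xi
        for xi, (lo, hi) in zip(x_i, bounds)
    ]
```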
The application scenario of the present application is described below.
Massive amounts of raw data are now readily available, but before raw data can be used to train machine learning and deep learning models it must be processed, that is, labeled; raw data releases its full value only after labeling. For example, the data labeling platform is an important link in an outbound-call robot project: the voice data from the robot's actual outbound calls flows to the platform every day for verification and labeling, and is then returned to the model for training.
The quality and quantity of the training data often have a major effect on a machine learning model: the better the data quality, the more stable the model's performance. However, data labeling, the foundation of an artificial intelligence project, is usually done by hand; it is the "manual labor" behind artificial intelligence. High-quality data labeling is especially time-consuming and laborious, and data labeling takes up most of the time of an entire AI project. Moreover, massive raw voice data contains large batches of data of many scenarios and types, so a single annotator may receive several types of data, which lowers labeling efficiency. Therefore, if the raw voice data can be preprocessed before it is dispatched for labeling, so that it is sorted into categories as far as possible and each category is dispatched to the corresponding annotators, each batch of annotators handles data of only one category as far as possible. This makes the work more targeted and helps improve the efficiency of voice data labeling.
To address the above problem, an embodiment of the present application discloses a data classification method that obtains the cluster center of each category of voice data, assigns the voice data to be classified to the corresponding category by means of the cluster centers, and then dispatches the data to the corresponding personnel for labeling. Each batch of annotators then handles data of only one category as far as possible, which is more targeted, improves the efficiency of manual labeling, and shortens the overall AI project.
FIG. 2 is a flowchart of a data classification method provided by an embodiment of the present application. The method includes the following steps:
S201: A computing device acquires training data.
The training data includes k categories, where k is a positive integer greater than 1. The source of the training data is not limited: it may be obtained by the computing device 500 sending a request to a data server, retrieved from a data labeling platform, or entered directly by hand, and the present application places no limitation on this.
In a possible embodiment, before acquiring the training data, the computing device extracts the speech feature vectors of the training data.
S202: Determine, by means of the WOA, the vectors corresponding to n individuals from the target training data to obtain a first vector set.
The target training data is the training data of any one of the k categories, each of the n individuals corresponds to one vector in the target training data, and n is a positive integer greater than 1. It should be understood that each individual in the WOA whale population is a possible solution for the cluster center of the category to which the target training data belongs. For the steps of the whale optimization algorithm, see FIG. 1 and the related description above; they are not repeated here for brevity.
For example, suppose the first category contains 1000 items of target training data whose vectors are $d_1, d_2, \ldots, d_{1000}$. First, the boundaries of the WOA are defined, that is, the search space of the cluster center $c_1$ of the first category is determined; specifically, a search range can be set for the element of each dimension of the vector $c_1$. Then the algorithm parameters are determined, including the whale population size $n$ and the maximum number of iterations $T$. Both the search space and the algorithm parameters can be set manually from experience; here the whale population size $n$ is set to 5 and the maximum number of iterations $T$ to 50. A whale population [Figure PCTCN2021096647-appb-000004] is then initialized, where $n$ is the number of whale individuals in the population, "0" denotes the initial value (iteration 0), the superscript "1" denotes the first category, and the subscript "i" denotes the $i$-th individual of the population; each individual is a possible solution for the cluster center $c_1$. The five individuals of the population are [Figure PCTCN2021096647-appb-000005] [Figure PCTCN2021096647-appb-000006]. Five vectors are selected at random from the vectors of the 1000 items of target training data (say $d_3$, $d_1$, $d_5$, $d_{12}$, and $d_{30}$) as the initial vectors of the five individuals, that is, [Figure PCTCN2021096647-appb-000007] [Figure PCTCN2021096647-appb-000008]. This completes the initialization of the whale population and yields the first vector set $d_3$, $d_1$, $d_5$, $d_{12}$, $d_{30}$. It should be understood that a single category is used here only as an example; the same operation is performed for the other categories.
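The initialization in this example can be sketched as a random draw of n vectors from the training vectors of one category. Sampling without replacement and the function name `init_population` are assumptions of this sketch; the description only says that the vectors are selected at random.

```python
import random


def init_population(vectors, n, seed=None):
    """Pick n distinct training vectors as the initial individuals (first vector set)."""
    rng = random.Random(seed)
    # Copy each chosen vector so later updates do not modify the training data.
    return [list(v) for v in rng.sample(vectors, n)]
```

With 1000 training vectors and `n = 5`, the call returns five of them, playing the role of $d_3$, $d_1$, $d_5$, $d_{12}$, $d_{30}$ in the example.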
S203: Use the objective optimization function to compute the function value of each vector in the first vector set and obtain the best vector.
Specifically, the objective optimization function is used to compute the function value of each vector in the first vector set, which yields n first function values, and the vector corresponding to the smallest of the n first function values is taken as the best vector.
In a possible embodiment, the objective optimization function computes the sum of the distances between a candidate vector and each item of the target training data, where the candidate vector is the vector corresponding to any one of the n individuals. It should be understood that the smaller the function value computed by the objective optimization function, the better the fitness of the individual and the closer its vector is to the optimal solution for the cluster center.
In a possible embodiment, the distance is any one of the Hamming distance, the Minkowski distance, or the cosine distance. It should be understood that there are many ways to compute the distance between vectors, and embodiments of the present application may also use ways other than those listed above.
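A minimal sketch of such an objective optimization function, together with the three distances named above, is given below. The function names are hypothetical, the Minkowski order defaults to 2 (the Euclidean case), and the cosine distance is taken as one minus the cosine of the included angle, which is one common convention and an assumption here.

```python
import math


def hamming(u, v):
    """Number of positions in which the two vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def minkowski(u, v, p=2):
    """Minkowski distance of order p (p=2 gives the Euclidean distance)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)


def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


def objective(candidate, data, dist=minkowski):
    """Sum of the distances from the candidate vector to every training vector."""
    return sum(dist(candidate, d) for d in data)


def best_vector(vector_set, data, dist=minkowski):
    """S203: the vector of the set with the smallest function value."""
    return min(vector_set, key=lambda v: objective(v, data, dist))
```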
S204: Update the vectors corresponding to the n individuals by means of the WOA to obtain a second vector set.
Specifically, one iteration of the WOA is used to update the vector corresponding to each of the n individuals, and the updated vectors of the n individuals form the second vector set. For the iteration procedure of the whale optimization algorithm, see the flowchart of the conventional whale optimization algorithm in FIG. 1 and the related description; it is not repeated here for brevity.
For example, the first individual [Figure PCTCN2021096647-appb-000009] corresponds to the vector $d_3$ in the first vector set. One iteration of the WOA is performed on the individual [Figure PCTCN2021096647-appb-000010]: the values of $p$ and $A$ are first drawn at random; if $p < 0.5$ and $|A| < 1$, the encircling-prey operation is performed by equation (2) and the individual [Figure PCTCN2021096647-appb-000011] is updated. Suppose the vector corresponding to the individual [Figure PCTCN2021096647-appb-000012] changes from $d_3$ to another vector $d_8$; the vector $d_8$ is then the vector corresponding to the first individual [Figure PCTCN2021096647-appb-000013] in the second vector set. The above takes a single individual as an example. The same operation is performed on every individual of every category, so that the vectors of all n individuals are updated, and the updated vectors of the n individuals form the second vector set.
S205: Compute the distance between each vector in the second vector set and the best vector, and update the vector corresponding to each individual according to a first preset condition to obtain a third vector set.
Specifically, the distance between each vector in the second vector set and the best vector is computed, and the vector corresponding to each individual is updated according to that distance and the first preset condition to obtain the third vector set, where the third vector set includes a third vector for each of the n individuals.
In a possible embodiment, as shown in FIG. 3, the distance between the vector corresponding to a target individual in the second vector set and the best vector is computed first. When the distance is greater than a first threshold, a crossover operation is performed on the vector corresponding to the target individual in the second vector set and the best vector to obtain the third vector corresponding to the target individual in the third vector set, where the target individual is any one of the n individuals.
In a possible embodiment, as shown in FIG. 3, the distance between the vector corresponding to a target individual in the second vector set and the best vector is computed. When the distance is less than or equal to the first threshold, a mutation operation is performed on the vector corresponding to the target individual in the second vector set to obtain the third vector corresponding to the target individual in the third vector set, where the target individual is any one of the n individuals.
For example, suppose the current best vector is $d_5$ and the target individual is the first of the n individuals, [Figure PCTCN2021096647-appb-000014]. At this point the vector corresponding to [Figure PCTCN2021096647-appb-000015] in the second vector set is $d_8$. The Hamming distance between $d_8$ and the best vector $d_5$ is computed first. When the Hamming distance is greater than the first threshold, a crossover operation is performed on the vector $d_8$ corresponding to the target individual [Figure PCTCN2021096647-appb-000016] in the second vector set and the best vector $d_5$ to obtain a new vector (say $d_{43}$), and $d_{43}$ is taken as the third vector corresponding to the target individual [Figure PCTCN2021096647-appb-000017] in the third vector set. When the Hamming distance is less than or equal to the first threshold, a mutation operation is performed on the vector $d_8$ corresponding to the target individual [Figure PCTCN2021096647-appb-000018] in the second vector set to obtain a new vector (say $d_{44}$), and $d_{44}$ is taken as the third vector corresponding to the target individual [Figure PCTCN2021096647-appb-000019] in the third vector set. It should be understood that the above takes a single individual as an example; the same operation is performed on every individual, which yields the third vector of each individual, and these vectors form the third vector set. Note that the present application does not specifically limit the mathematical realization of the crossover and mutation operations; for an introduction to them, see the description above, which is not repeated here.
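The branching of step S205 can be sketched as follows. The distance function and the crossover and mutation operators are passed in by the caller, since the description leaves their realization open; the function name `build_third_set` and the parameter names are hypothetical.

```python
def build_third_set(second_set, best, threshold, dist, cross, mutate):
    """S205: crossover with the best vector when far from it, mutation when close."""
    third_set = []
    for v in second_set:
        if dist(v, best) > threshold:
            third_set.append(cross(v, best))   # far from the best vector: crossover
        else:
            third_set.append(mutate(v))        # within the threshold: mutation
    return third_set
```

One reading of this design is that individuals far from the best vector are pulled toward it by crossover, while individuals already near it are perturbed by mutation to keep the population diverse.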
S206: Use the objective optimization function to compute the function value of each vector in the second vector set and the third vector set, and determine the target vectors corresponding to the n individuals according to a second preset condition to obtain a target vector set.
The target vector set includes a target vector for each of the n individuals.
In a possible embodiment, as shown in FIG. 3, the objective optimization function is used to compute the function value of the vector corresponding to a target individual in the second vector set and the function value of its vector in the third vector set. When the function value of the vector in the second vector set is greater than the function value of the vector in the third vector set, the vector in the third vector set is taken as the target vector corresponding to the target individual.
In a possible embodiment, as shown in FIG. 3, the objective optimization function is used to compute the function value of the vector corresponding to a target individual in the second vector set and the function value of its vector in the third vector set. When the function value of the vector in the second vector set is less than or equal to the function value of the vector in the third vector set, the conventional krill herd algorithm (KHA) is applied to the vector corresponding to the target individual in the second vector set to obtain the target vector corresponding to the target individual, where the target individual is any one of the n individuals.
For example, suppose the target individual is the first of the n individuals, [Figure PCTCN2021096647-appb-000020]. At this point the vector corresponding to the target individual [Figure PCTCN2021096647-appb-000021] in the second vector set is $d_8$, and the vector corresponding to [Figure PCTCN2021096647-appb-000022] in the third vector set is $d_{43}$. The objective optimization function is used to compute the function values of $d_8$ and $d_{43}$, and the two values are compared. When the function value of $d_8$ is greater than that of $d_{43}$, the vector $d_{43}$ is taken as the target vector corresponding to the target individual [Figure PCTCN2021096647-appb-000023]. When the function value of $d_8$ is less than or equal to that of $d_{43}$, the conventional krill herd algorithm (KHA) is applied to the vector $d_8$ corresponding to the target individual [Figure PCTCN2021096647-appb-000024] in the second vector set to obtain the target vector corresponding to the target individual [Figure PCTCN2021096647-appb-000025]. The above takes a single individual as an example; the same operations are performed for the other individuals [Figure PCTCN2021096647-appb-000026] [Figure PCTCN2021096647-appb-000027], so that a target vector is finally determined for each of the n individuals, and these n target vectors form the target vector set. The other categories are handled in the same way.
S207: Use the objective optimization function to compute the function value of each vector in the target vector set to obtain n target function values.
S208: Compare the smallest of the n target function values with the function value of the best vector to determine a new best vector.
Specifically, the smallest of the n target function values is compared with the function value of the best vector. When the smallest target function value is less than the function value of the best vector, the target vector corresponding to the smallest target function value is taken as the new best vector.
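Steps S206 to S208 above can be sketched together. The krill herd step is represented only by a caller-supplied function `kha_step`, because the description does not detail it here; that parameter and the function names `select_targets` and `update_best` are hypothetical. Keeping the previous best vector when no target vector improves on it is the reading assumed by this sketch.

```python
def select_targets(second_set, third_set, f, kha_step):
    """S206: keep the third vector if it is better, else refine the second one."""
    targets = []
    for v2, v3 in zip(second_set, third_set):
        if f(v2) > f(v3):
            targets.append(v3)             # the third vector has the smaller value
        else:
            targets.append(kha_step(v2))   # placeholder for the KHA refinement
    return targets


def update_best(targets, best, f):
    """S207-S208: replace the best vector only if some target vector beats it."""
    candidate = min(targets, key=f)        # smallest of the n target function values
    return candidate if f(candidate) < f(best) else best
```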
S209: Use the new optimal vector obtained by the last update as the cluster center of the target training data.
Specifically, the update operation of steps S204 to S208 is performed a preset number of times (that is, the maximum number of iterations T). In the t-th iteration, the new optimal vector obtained in the (t-1)-th iteration is used as the optimal vector when S204 to S208 are performed. The new optimal vector obtained by the last update operation is taken as the cluster center of the target training data.
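The iteration scheme above amounts to carrying the optimal vector from one pass to the next. A minimal sketch, assuming a hypothetical callable `update_once` that stands for one complete pass of steps S204 to S208 and returns the new optimal vector:

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def find_cluster_center(
    initial_best: Vector,
    update_once: Callable[[Vector], Vector],
    max_iterations: int,
) -> Vector:
    """Run the update operation T times and return the final optimal vector.

    update_once is a hypothetical placeholder for one pass of S204 to S208.
    """
    best = initial_best
    for _ in range(max_iterations):
        # The result of iteration t-1 is the optimal vector used in iteration t.
        best = update_once(best)
    return best
```

In this sketch the stopping rule is only the fixed iteration count T, matching the description; no convergence test is assumed.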
S210: Acquire the speech data to be classified, and classify it using the cluster centers.
Specifically, the speech data to be classified is acquired, and the distance between it and the cluster center of each of the k categories is calculated to find the cluster center at the smallest distance from the speech data to be classified. The speech data to be classified is then assigned to the category corresponding to that cluster center. It should be understood that the distance between the speech data to be classified and a cluster center can serve as a measure of similarity between data: the smaller the distance, the more similar the speech data to be classified is to the data in the category of that cluster center. Therefore, the speech data to be classified can be assigned to the category of its nearest cluster center, which completes the classification of the speech data to be classified.
For example, given a piece of speech data to be classified, its speech feature vector is extracted to obtain d_new, and the distances between d_new and the 10 obtained cluster centers c_1 to c_10 are calculated. If d_new is found to be closest to c_5, d_new is assigned to the fifth category, to which cluster center c_5 belongs, completing the classification of this speech data. Other speech data to be classified is classified in the same way.
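The nearest-center assignment in this example can be sketched as follows. The function name `classify` is hypothetical, the Euclidean distance is an illustrative choice, and the returned index is zero-based, so the fifth category corresponds to index 4.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def classify(feature: Vector, centers: List[Vector]) -> int:
    """Return the zero-based index of the cluster center nearest to the feature vector."""
    distances = [math.dist(feature, center) for center in centers]
    return min(range(len(centers)), key=distances.__getitem__)
```

A feature vector is extracted once per item, and the same list of k cluster centers is reused for every item to be classified.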
It can be seen that, based on the traditional whale optimization algorithm (WOA), the embodiment of the present application obtains the cluster center of each category of speech data and then uses these cluster centers to assign the speech data to be classified to the corresponding categories. The data is then dispatched to the corresponding personnel for annotation, so that the same group of annotators handles, as far as possible, speech data of only one category. This more targeted workflow can greatly improve the efficiency of manual annotation and thereby shorten the duration of the entire AI project.
It should be understood that, in addition to speech data classification, the above steps S201 to S210 can also be used to classify other types of data, including text data, video data, and image data. Feature extraction can be performed according to the type of data, for example facial feature extraction for video data and semantic feature extraction for text data, which is not specifically limited in this application.
FIG. 4 is a schematic structural diagram of a data classification apparatus 400 provided by an embodiment of the present application. The data classification apparatus includes:
an acquisition module 401, configured to acquire training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
a processing module 402, configured to determine, through the whale optimization algorithm (WOA), vectors corresponding to n individuals from target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
The processing module 402 is further configured to:
use an objective optimization function to calculate the function value of each vector in the first vector set to obtain n first function values, and take the vector corresponding to the smallest of the n first function values as the optimal vector;
perform an update operation:
update the vectors corresponding to the n individuals through WOA to obtain a second vector set;
calculate the distance between each vector in the second vector set and the optimal vector, and update the vector corresponding to each individual according to the distance and a first preset condition to obtain a third vector set;
use the objective optimization function to calculate the function value of each vector in the second vector set and of each vector in the third vector set, and determine the target vectors corresponding to the n individuals according to a second preset condition to obtain a target vector set;
use the objective optimization function to calculate the function value of each target vector in the target vector set to obtain n target function values;
compare the smallest of the n target function values with the function value of the optimal vector, and, when the smallest target function value is less than the function value of the optimal vector, take the target vector corresponding to the smallest target function value as the new optimal vector;
perform the update operation a preset number of times, and take the new optimal vector obtained by the last update operation as the cluster center of the target training data;
acquire speech data to be classified, calculate the distance between the speech data to be classified and the cluster center of each of the k categories, and assign the speech data to be classified to the category corresponding to the cluster center at the smallest distance from it.
Each module of the data classification apparatus 400 is specifically configured to implement steps S201 to S210 in the embodiment of the data classification method of FIG. 2. For brevity, details are not repeated here.
FIG. 5 is a schematic structural diagram of a computing device 500 provided by an embodiment of the present application. The computing device 500 may be the data classification apparatus 400 described above. The computing device may be a notebook computer, a tablet computer, a cloud server, or another computing device, which is not limited in this application. It should be understood that the computing device may also be a computer cluster composed of at least one server, which is not specifically limited in this application. The computing device may include a memory and a processor. Optionally, the computing device may further include a communication interface.
For example, the computing device 500 includes a processor 501, a communication interface 502, and a memory 503, and is configured to perform the steps in the foregoing data classification method embodiments. The processor 501, the communication interface 502, and the memory 503 may be connected to one another through an internal bus 504, or may communicate by other means such as wireless transmission. The embodiment of the present application takes connection through the bus 504 as an example. The bus 504 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 504 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 5, but this does not mean that there is only one bus or only one type of bus.
The processor 501 may consist of at least one general-purpose processor, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The processor 501 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 503, which enable the computing device 500 to provide a variety of services.
The memory 503 is configured to store program code, whose execution is controlled by the processor 501, so as to perform the processing steps in the foregoing data classification method embodiments. The program code may include one or more software modules, which may be the software modules provided in the embodiment of FIG. 4, such as the acquisition module and the processing module. Each module may be specifically configured to perform steps S201 to S210 in the embodiment of FIG. 2, which are not repeated here.
It should be noted that this embodiment may be implemented by a general-purpose physical server, for example an ARM server or an x86 server, or by a virtual machine implemented on a general-purpose physical server in combination with network functions virtualization (NFV) technology. A virtual machine is a complete computer system that is simulated by software, has the functions of a complete hardware system, and runs in a completely isolated environment. This is not specifically limited in this application.
The memory 503 may include volatile memory, such as random access memory (RAM). The memory 503 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). The memory 503 may further include a combination of the above types. The memory 503 may store program code, which may specifically include program code for performing the steps described in the embodiment of FIG. 2, which are not repeated here.
The communication interface 502 may be an internal interface (for example, a Peripheral Component Interconnect Express (PCIe) bus interface), a wired interface (for example, an Ethernet interface), or a wireless interface (for example, a cellular network interface or a wireless local area network interface), and is used to communicate with other devices or modules.
It should be noted that FIG. 5 is only one possible implementation of the embodiment of the present application. In practical applications, the computing device 500 may include more or fewer components, which is not limited here. For content not shown or described in the embodiments of the present application, reference may be made to the relevant descriptions in the foregoing embodiment of FIG. 2, which are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium storing a program or instructions which, when run on a processor, implement the method flow shown in FIG. 2.
Optionally, the storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
An embodiment of the present application further provides a computer program product. When the computer program product runs on a processor, the method flow shown in FIG. 2 is implemented.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is merely a preferred embodiment of the present application and of course cannot be used to limit the scope of the rights of the present application. A person of ordinary skill in the art can understand that implementations of all or part of the processes of the above embodiment, and equivalent changes made according to the claims of the present application, still fall within the scope covered by the application.

Claims (20)

  1. A data classification method, wherein the method comprises:
    acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
    determining, through the whale optimization algorithm (WOA), vectors corresponding to n individuals from target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
    using an objective optimization function to calculate the function value of each vector in the first vector set to obtain n first function values, and taking the vector corresponding to the smallest of the n first function values as the optimal vector;
    performing an update operation:
    updating the vectors corresponding to the n individuals through WOA to obtain a second vector set;
    calculating the distance between each vector in the second vector set and the optimal vector, and updating the vector corresponding to each individual according to the distance and a first preset condition to obtain a third vector set;
    using the objective optimization function to calculate the function value of each vector in the second vector set and in the third vector set, and determining the target vectors corresponding to the n individuals according to a second preset condition to obtain a target vector set;
    using the objective optimization function to calculate the function value of each target vector in the target vector set to obtain n target function values;
    comparing the smallest of the n target function values with the function value of the optimal vector, and, when the smallest target function value is less than the function value of the optimal vector, taking the target vector corresponding to the smallest target function value as the new optimal vector;
    performing the update operation a preset number of times, and taking the new optimal vector obtained by the last update operation as the cluster center of the target training data;
    acquiring speech data to be classified, calculating the distance between the speech data to be classified and the cluster center of each of the k categories, and assigning the speech data to be classified to the category corresponding to the cluster center at the smallest distance from it.
  2. The method according to claim 1, wherein the third vector set includes a third vector corresponding to each individual;
    the updating of the vector corresponding to each individual according to the distance and the first preset condition to obtain the third vector set comprises:
    calculating the distance between the vector corresponding to a target individual in the second vector set and the optimal vector, wherein the target individual is any one of the n individuals;
    when the distance is greater than a first threshold, performing a crossover operation on the vector corresponding to the target individual in the second vector set and the optimal vector to obtain the third vector corresponding to the target individual in the third vector set.
  3. The method according to claim 2, wherein the method further comprises:
    when the distance is less than or equal to the first threshold, performing a mutation operation on the vector corresponding to the target individual in the second vector set to obtain the third vector corresponding to the target individual in the third vector set.
  4. The method according to any one of claims 1 to 3, wherein the determining of the target vectors corresponding to the n individuals according to the second preset condition to obtain the target vector set comprises:
    using the objective optimization function to calculate the function value of the vector corresponding to the target individual in the second vector set and the function value of the vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals;
    when the function value of the corresponding vector in the second vector set is greater than the function value of the corresponding vector in the third vector set, taking the corresponding vector in the third vector set as the target vector corresponding to the target individual.
  5. The method according to claim 4, wherein the method further comprises:
    when the function value of the corresponding vector in the second vector set is less than or equal to the function value of the corresponding vector in the third vector set, using the traditional krill herd algorithm (KHA) to obtain the target vector corresponding to the target individual.
  6. The method according to claim 5, wherein the objective optimization function is used to calculate the sum of the distances between a candidate vector and each data item in the target training data, wherein the candidate vector is the vector corresponding to any one of the n individuals.
  7. The method according to claim 6, wherein the distance is any one of Hamming distance, Minkowski distance, or cosine distance.
  8. A data classification apparatus, wherein the apparatus comprises:
    an acquisition module, configured to acquire training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
    a processing module, configured to determine, through the whale optimization algorithm (WOA), vectors corresponding to n individuals from target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
    the processing module being further configured to:
    use an objective optimization function to calculate the function value of each vector in the first vector set to obtain n first function values, and take the vector corresponding to the smallest of the n first function values as the optimal vector;
    perform an update operation:
    update the vectors corresponding to the n individuals through WOA to obtain a second vector set;
    calculate the distance between each vector in the second vector set and the optimal vector, and update the vector corresponding to each individual according to the distance and a first preset condition to obtain a third vector set;
    use the objective optimization function to calculate the function value of each vector in the second vector set and of each vector in the third vector set, and determine the target vectors corresponding to the n individuals according to a second preset condition to obtain a target vector set;
    use the objective optimization function to calculate the function value of each target vector in the target vector set to obtain n target function values;
    compare the smallest of the n target function values with the function value of the optimal vector, and, when the smallest target function value is less than the function value of the optimal vector, take the target vector corresponding to the smallest target function value as the new optimal vector;
    perform the update operation a preset number of times, and take the new optimal vector obtained by the last update operation as the cluster center of the target training data;
    acquire speech data to be classified, calculate the distance between the speech data to be classified and the cluster center of each of the k categories, and assign the speech data to be classified to the category corresponding to the cluster center at the smallest distance from it.
  9. A computing device, comprising a memory and a processor, wherein:
    the memory is configured to store a computer program;
    the processor is configured to execute the computer program stored in the memory, so that the computing device performs the following method:
    acquiring training data, wherein the training data includes k categories, and k is a positive integer greater than 1;
    determining, through the whale optimization algorithm (WOA), vectors corresponding to n individuals from target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
    using an objective optimization function to calculate the function value of each vector in the first vector set to obtain n first function values, and taking the vector corresponding to the smallest of the n first function values as the optimal vector;
    performing an update operation:
    updating the vectors corresponding to the n individuals through WOA to obtain a second vector set;
    calculating the distance between each vector in the second vector set and the optimal vector, and updating the vector corresponding to each individual according to the distance and a first preset condition to obtain a third vector set;
    using the objective optimization function to calculate the function value of each vector in the second vector set and in the third vector set, and determining the target vectors corresponding to the n individuals according to a second preset condition to obtain a target vector set;
    using the objective optimization function to calculate the function value of each target vector in the target vector set to obtain n target function values;
    comparing the smallest of the n target function values with the function value of the optimal vector, and, when the smallest target function value is less than the function value of the optimal vector, taking the target vector corresponding to the smallest target function value as the new optimal vector;
    performing the update operation a preset number of times, and taking the new optimal vector obtained by the last update operation as the cluster center of the target training data;
    acquiring speech data to be classified, calculating the distance between the speech data to be classified and the cluster center of each of the k categories, and assigning the speech data to be classified to the category corresponding to the cluster center at the smallest distance from it.
  10. The computing device according to claim 9, wherein the third vector set includes a third vector corresponding to each individual;
    the updating of the vector corresponding to each individual according to the distance and the first preset condition to obtain the third vector set comprises:
    calculating the distance between the vector corresponding to a target individual in the second vector set and the optimal vector, wherein the target individual is any one of the n individuals;
    when the distance is greater than a first threshold, performing a crossover operation on the vector corresponding to the target individual in the second vector set and the optimal vector to obtain the third vector corresponding to the target individual in the third vector set.
  11. The computing device according to claim 10, wherein the processor is further configured to perform:
    when the distance is less than or equal to the first threshold, performing a mutation operation on the vector corresponding to the target individual in the second vector set to obtain the third vector corresponding to the target individual in the third vector set.
  12. The computing device according to any one of claims 9 to 11, wherein the determining of the target vectors corresponding to the n individuals according to the second preset condition to obtain the target vector set comprises:
    using the objective optimization function to calculate the function value of the vector corresponding to the target individual in the second vector set and the function value of the vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals;
    when the function value of the corresponding vector in the second vector set is greater than the function value of the corresponding vector in the third vector set, taking the corresponding vector in the third vector set as the target vector corresponding to the target individual.
  13. The computing device according to claim 12, wherein the processor is further configured to perform:
    when the function value of the corresponding vector in the second vector set is less than or equal to the function value of the corresponding vector in the third vector set, obtaining the target vector of the target individual using the conventional krill herd algorithm (KHA).
  14. The computing device according to claim 13, wherein the objective optimization function is used to calculate the sum of the distances between a candidate vector and each item of data in the target training data, wherein the candidate vector is the vector corresponding to any one of the n individuals; and
    the distance is any one of a Hamming distance, a Minkowski distance, or a cosine distance.
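The objective optimization function recited in claim 14 can be illustrated with a minimal sketch. This is not the applicant's implementation: the function names are hypothetical, vectors are assumed to be plain Python lists of numbers, and the cosine distance is assumed to mean one minus the cosine of the included angle, which the claims do not specify.

```python
import math

def hamming(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=2 is Euclidean, p=1 is Manhattan)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def cosine_distance(a, b):
    """Assumed form: 1 - cos(angle); neither vector may be all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def objective(candidate, class_data, dist=minkowski):
    """Sum of distances from the candidate vector to every sample of one category."""
    return sum(dist(candidate, x) for x in class_data)
```

A smaller value means the candidate lies closer, in aggregate, to the samples of the category, which is why the claims select the vector with the smallest function value as the optimal vector.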
  15. A computer-readable storage medium comprising a program or instructions which, when executed on a computer device, perform the following method:
    acquiring training data, wherein the training data comprises k categories, and k is a positive integer greater than 1;
    determining, by the whale optimization algorithm (WOA), vectors corresponding to n individuals from target training data to obtain a first vector set, wherein the target training data is any one of the k categories, and n is a positive integer greater than 1;
    calculating, using an objective optimization function, the function value corresponding to each vector in the first vector set to obtain n first function values, and taking the vector corresponding to the smallest of the n first function values as an optimal vector;
    performing an update operation:
    updating, by WOA, the vectors corresponding to the n individuals respectively to obtain a second vector set;
    calculating the distance between each vector in the second vector set and the optimal vector, and updating the vector corresponding to each individual according to the distance and a first preset condition to obtain a third vector set;
    calculating, using the objective optimization function, the function value corresponding to each vector in the second vector set and in the third vector set, and determining, according to a second preset condition, the target vectors corresponding to the n individuals to obtain a target vector set;
    calculating, using the objective optimization function, the function value corresponding to each target vector in the target vector set to obtain n target function values;
    comparing the smallest of the n target function values with the function value corresponding to the optimal vector, and, when the smallest target function value is less than the function value corresponding to the optimal vector, taking the target vector corresponding to the smallest target function value as a new optimal vector;
    performing the update operation a preset number of times, and taking the new optimal vector obtained by the last update operation as the cluster center of the target training data; and
    acquiring speech data to be classified, calculating the distances between the speech data to be classified and the cluster centers of the k categories respectively, and assigning the speech data to be classified to the category whose cluster center is closest to the speech data to be classified.
  16. The computer-readable storage medium according to claim 15, wherein the third vector set comprises a third vector corresponding to each individual;
    updating the vector corresponding to each individual according to the distance and the first preset condition to obtain the third vector set comprises:
    calculating the distance between the vector corresponding to a target individual in the second vector set and the optimal vector, wherein the target individual is any one of the n individuals; and
    when the distance is greater than a first threshold, performing a crossover operation on the vector corresponding to the target individual in the second vector set and the optimal vector, to obtain the third vector corresponding to the target individual in the third vector set.
  17. The computer-readable storage medium according to claim 16, wherein the program or instructions, when executed on a computer device, are further used to perform:
    when the distance is less than or equal to the first threshold, performing a mutation operation on the vector corresponding to the target individual in the second vector set, to obtain the third vector corresponding to the target individual in the third vector set.
  18. The computer-readable storage medium according to any one of claims 15 to 17, wherein determining, according to the second preset condition, the target vectors corresponding to the n individuals to obtain the target vector set comprises:
    calculating, using the objective optimization function, the function value of the vector corresponding to a target individual in the second vector set and the function value of the vector corresponding to the target individual in the third vector set, wherein the target individual is any one of the n individuals; and
    when the function value of the corresponding vector in the second vector set is greater than the function value of the corresponding vector in the third vector set, taking the corresponding vector in the third vector set as the target vector of the target individual.
  19. The computer-readable storage medium according to claim 18, wherein the program or instructions, when executed on a computer device, are further used to perform:
    when the function value of the corresponding vector in the second vector set is less than or equal to the function value of the corresponding vector in the third vector set, obtaining the target vector of the target individual using the conventional krill herd algorithm (KHA).
  20. The computer-readable storage medium according to claim 19, wherein the objective optimization function is used to calculate the sum of the distances between a candidate vector and each item of data in the target training data, wherein the candidate vector is the vector corresponding to any one of the n individuals; and
    the distance is any one of a Hamming distance, a Minkowski distance, or a cosine distance.
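One pass of the update operation in claims 15 to 19, followed by the nearest-center classification step, might be sketched as below. This is an illustration under stated assumptions rather than the claimed implementation: the claims do not define the crossover or mutation operators, so uniform crossover and Gaussian mutation are assumed; the Euclidean distance (Minkowski with p = 2) is assumed throughout; and `woa_update` and `fallback_update` are hypothetical caller-supplied callables standing in for the WOA position update and for the KHA-based fallback of claim 19, neither of which is implemented here.

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def objective(candidate, class_data):
    """Sum of distances from the candidate to every sample of one category."""
    return sum(euclidean(candidate, x) for x in class_data)

def crossover(vec, best, rng):
    # Assumed operator: uniform crossover, each component taken from either parent.
    return [b if rng.random() < 0.5 else v for v, b in zip(vec, best)]

def mutate(vec, rng, sigma=0.1):
    # Assumed operator: additive Gaussian noise on every component.
    return [v + rng.gauss(0.0, sigma) for v in vec]

def update_once(population, best, class_data, woa_update, fallback_update,
                threshold, rng):
    """One update operation; returns the target vector set and the optimal vector."""
    # Second vector set: every individual moved by the (caller-supplied) WOA step.
    second = [woa_update(v, best) for v in population]
    # Third vector set: crossover when far from the optimal vector, else mutation.
    third = [crossover(v, best, rng) if euclidean(v, best) > threshold
             else mutate(v, rng) for v in second]
    # Target vector set: keep the third vector when it improves on the second,
    # otherwise use the fallback update.
    targets = []
    for v2, v3 in zip(second, third):
        if objective(v2, class_data) > objective(v3, class_data):
            targets.append(v3)
        else:
            targets.append(fallback_update(v2, best))
    # Replace the optimal vector only when a target vector is strictly better.
    candidate = min(targets, key=lambda v: objective(v, class_data))
    if objective(candidate, class_data) < objective(best, class_data):
        best = candidate
    return targets, best

def classify(sample, centers):
    """centers maps a category label to its cluster center; returns the nearest label."""
    return min(centers, key=lambda label: euclidean(sample, centers[label]))
```

Calling `update_once` a preset number of times for each of the k categories, and storing each final optimal vector as that category's cluster center, yields the `centers` mapping consumed by `classify`. The sketch feeds the target vector set back in as the next population; the claims leave that choice open.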
PCT/CN2021/096647 2020-12-17 2021-05-28 Data classification method and apparatus, and related device WO2022127037A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011503667.5 2020-12-17
CN202011503667.5A CN112613550A (en) 2020-12-17 2020-12-17 Data classification method, device and related equipment

Publications (1)

Publication Number Publication Date
WO2022127037A1 (en) 2022-06-23

Family

ID=75241078

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096647 WO2022127037A1 (en) 2020-12-17 2021-05-28 Data classification method and apparatus, and related device

Country Status (2)

Country Link
CN (1) CN112613550A (en)
WO (1) WO2022127037A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115689389A (en) * 2022-11-21 2023-02-03 黑龙江省水利科学研究院 Cold region river and lake health evaluation method and device based on whale algorithm and projection pursuit

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613550A (en) * 2020-12-17 2021-04-06 平安科技(深圳)有限公司 Data classification method, device and related equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389211A (en) * 2018-03-16 2018-08-10 西安电子科技大学 Based on the image partition method for improving whale Optimization of Fuzzy cluster
CN112613550A (en) * 2020-12-17 2021-04-06 平安科技(深圳)有限公司 Data classification method, device and related equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263834B (en) * 2019-06-13 2022-07-05 东华大学 Method for detecting abnormal value of new energy power quality
CN110989342B (en) * 2019-11-19 2021-04-16 华北电力大学 Real-time T-S fuzzy modeling method for combined cycle unit heavy-duty gas turbine
CN111506729B (en) * 2020-04-17 2023-08-29 腾讯科技(深圳)有限公司 Information processing method, device and computer readable storage medium
CN112070418A (en) * 2020-09-21 2020-12-11 大连大学 Weapon target allocation method of multi-target whale optimization algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN, YAHUAN: "Research on Hybrid Swarm Intelligence Algorithm and Its Application in Clustering Analysis", WANFANG SCIENCE PERIODICAL DATABASE, 3 June 2019 (2019-06-03), pages 1 - 85, XP055944474 *
MINDFLOW: "AI Pre-Labeling", 27 April 2020 (2020-04-27), CN, pages 1 - 2, XP009538517, Retrieved from the Internet <URL:https://zhidao.baidu.com/question/2145030700660072908> [retrieved on 20220825] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115689389A (en) * 2022-11-21 2023-02-03 黑龙江省水利科学研究院 Cold region river and lake health evaluation method and device based on whale algorithm and projection pursuit
CN115689389B (en) * 2022-11-21 2023-07-14 黑龙江省水利科学研究院 Cold region river and lake health evaluation method and device based on whale algorithm and projection pursuit

Also Published As

Publication number Publication date
CN112613550A (en) 2021-04-06

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21904972; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21904972; Country of ref document: EP; Kind code of ref document: A1)