CN111241106A - Approximate data processing method, device, medium and electronic equipment - Google Patents

Approximate data processing method, device, medium and electronic equipment

Info

Publication number
CN111241106A
CN111241106A (application CN202010044200.2A; granted as CN111241106B)
Authority
CN
China
Prior art keywords
data
processed
coverage
vector
hash function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010044200.2A
Other languages
Chinese (zh)
Other versions
CN111241106B (en
Inventor
冯晨 (Feng Chen)
王健宗 (Wang Jianzong)
彭俊清 (Peng Junqing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010044200.2A priority Critical patent/CN111241106B/en
Priority to PCT/CN2020/093165 priority patent/WO2021143016A1/en
Publication of CN111241106A publication Critical patent/CN111241106A/en
Application granted granted Critical
Publication of CN111241106B publication Critical patent/CN111241106B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Abstract

The disclosure relates to the field of data processing, and discloses an approximate data processing method, apparatus, medium, and electronic device. The method comprises the following steps: acquiring data to be processed; acquiring a vector corresponding to the data to be processed; performing a hash operation on the vector of the data to be processed with each locality-sensitive hash function in a locality-sensitive hash function family, to obtain mapping values corresponding to the vector; repeatedly executing a coverage-group construction step a first predetermined number of times to obtain a plurality of coverage groups, where the construction step builds a coverage group based on the mapping values corresponding to the vectors and on the locality-sensitive hash functions used to hash them; and integrating the plurality of coverage groups to obtain the final coverage of each piece of data to be processed, where data belonging to the same final coverage is approximate data. Under this method, unstable time consumption when processing a large amount of approximate data is avoided, and data processing efficiency is improved overall.

Description

Approximate data processing method, device, medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method, an apparatus, a medium, and an electronic device for processing approximate data.
Background
At present, when performing data processing, a commonly used scheme for quickly finding data similar to a given item is Locality-Sensitive Hashing (LSH), which maps high-dimensional data to low-dimensional data and maps similar data into the same bucket: two data points adjacent in the original data space remain adjacent in the mapped space with high probability, while two non-adjacent points become adjacent in the mapped space with very low probability. However, using the LSH algorithm requires setting a number of hyper-parameters. When the mapping result of the LSH algorithm is used for subsequent data processing tasks over a large amount of data, this causes a certain instability. On the one hand, if the amount of data in a bucket is too large, the efficiency gain of the LSH algorithm is greatly discounted; on the other hand, for the same set of data, the time taken to perform a data processing task is uncertain, being affected by the amount of data in each bucket.
Disclosure of Invention
In the field of data processing technology, to solve the above technical problem, an object of the present disclosure is to provide an approximate data processing method, apparatus, medium, and electronic device.
According to an aspect of the present disclosure, there is provided an approximate data processing method, the method including:
acquiring a plurality of data to be processed;
obtaining a vector corresponding to the data to be processed;
performing hash operation on the vector of the data to be processed by using each position sensitive hash function in a preset position sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position sensitive hash function family comprises a plurality of position sensitive hash functions;
repeatedly executing the step of constructing the coverage group for a first predetermined number of times to obtain a plurality of coverage groups, wherein the step of constructing the coverage groups comprises constructing the coverage groups based on the mapping values corresponding to the vectors of the data to be processed and a position-sensitive hash function for performing hash operation on the vectors of the data to be processed, and each coverage group comprises at least one coverage, and each coverage comprises at least one data to be processed;
and integrating the plurality of coverage groups to obtain the final coverage to which each data to be processed belongs, wherein the data to be processed belonging to the same final coverage is approximate data.
According to another aspect of the present disclosure, there is provided an approximate data processing apparatus, the apparatus including:
the device comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is configured to acquire a plurality of data to be processed;
a second obtaining module configured to obtain a vector corresponding to the data to be processed;
the hash module is configured to perform hash operation on the vector of the data to be processed by using each position-sensitive hash function in a preset position-sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position-sensitive hash function family comprises a plurality of position-sensitive hash functions;
a repeated execution module configured to repeatedly execute the step of constructing the coverage group for a first predetermined number of times, resulting in a plurality of coverage groups, the step of constructing the coverage group including constructing a coverage group based on the mapping value corresponding to the vector of the data to be processed and a location-sensitive hash function that performs a hash operation on the vector of the data to be processed, the coverage group including at least one coverage, each of the coverage including at least one of the data to be processed;
and the integration module is configured to integrate the plurality of coverage groups to obtain the final coverage to which each piece of data to be processed belongs, wherein the data to be processed belonging to the same final coverage is approximate data.
According to another aspect of the present disclosure, there is provided a computer readable program medium storing computer program instructions which, when executed by a computer, cause the computer to perform the method as previously described.
According to another aspect of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method as previously described.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the approximate data processing method provided by the present disclosure includes the following steps: acquiring a plurality of data to be processed; obtaining a vector corresponding to the data to be processed; performing hash operation on the vector of the data to be processed by using each position sensitive hash function in a preset position sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position sensitive hash function family comprises a plurality of position sensitive hash functions; repeatedly executing the step of constructing the coverage group for a first predetermined number of times to obtain a plurality of coverage groups, wherein the step of constructing the coverage groups comprises constructing the coverage groups based on the mapping values corresponding to the vectors of the data to be processed and a position-sensitive hash function for performing hash operation on the vectors of the data to be processed, and each coverage group comprises at least one coverage, and each coverage comprises at least one data to be processed; and integrating the plurality of coverage groups to obtain the final coverage to which each data to be processed belongs, wherein the data to be processed belonging to the same final coverage is approximate data.
Under the method, the covering groups are constructed for many times and are integrated, so that the time consumption of approximate data processing can be stabilized in a smaller range while the accuracy of data processing is maintained, the conditions that a large amount of approximate data is processed, the time consumption is unstable and the time consumption is overlarge are avoided, and the data processing efficiency is integrally improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a system architecture diagram illustrating a method of approximate data processing in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of approximating data processing in accordance with an exemplary embodiment;
FIG. 3 is a detailed flowchart of step 220 according to an embodiment corresponding to FIG. 2;
FIG. 4 is a schematic view of one of the coverage groups shown in accordance with an exemplary embodiment;
FIG. 5 is a flowchart of the steps of constructing a coverage group when the data to be processed is voiceprint data, according to an embodiment corresponding to FIG. 2;
FIG. 6 is a block diagram illustrating an approximation data processing apparatus according to an exemplary embodiment;
FIG. 7 is a block diagram illustrating an example of an electronic device implementing the above-described approximate data processing method in accordance with one illustrative embodiment;
fig. 8 is a diagram illustrating a computer-readable storage medium implementing the above-described approximate data processing method according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
The present disclosure first provides an approximate data processing method. The data may be any data that can be converted into vectors, such as audio, text, image, etc. The approximate data processing method is a method for classifying more similar data in a plurality of data into one class, and after classifying the data which can be approximated into one class, the data can be used for executing tasks such as searching, further accurate classification and the like.
The implementation terminal of the present disclosure may be any device having computing and processing functions that can be connected to an external device to receive or send data. It may be a portable mobile device, such as a smart phone, tablet computer, notebook computer, or PDA (Personal Digital Assistant); a fixed device, such as a computer device, field terminal, desktop computer, server, or workstation; or a set of multiple devices, such as the physical infrastructure of cloud computing or a server cluster.
Preferably, the implementation terminal of the present disclosure may be a physical infrastructure of a server or cloud computing.
FIG. 1 is a system architecture diagram illustrating a method of approximating data processing, according to an example embodiment. As shown in fig. 1, the system architecture includes a server 110 and a user terminal 120, where the server 110 is connected to the user terminal 120 through a communication link, and can receive data sent by the user terminal 120 and send data to the user terminal 120, and in this embodiment, the server 110 is an implementation terminal of the present disclosure. After the user uses the user terminal 120 to send a plurality of data to the server 110, the server 110 may classify the received data by executing the approximate data processing method provided by the present disclosure, so as to classify the possibly similar data into one class, thereby providing support for data classification results for executing other tasks such as searching, accurate classification, and the like.
It is worth mentioning that fig. 1 is only one embodiment of the present disclosure. Although the implementation terminal in this embodiment is a server, in other embodiments, the implementation terminal may be various terminals or devices as described above; although in the present embodiment, the data for data processing is sent from only one terminal, in other embodiments or specific applications, the data for data processing may be obtained from multiple terminals, for example, the Server and the user terminal are in a C/S (Client/Server) architecture or a B/S (Browser/Server) architecture, the multiple user terminals use the Client or Browser installed thereon to send data to the Server, and the data sources on the user terminals may be various. The present disclosure is not intended to be limited thereby, nor should the scope of the present disclosure be limited thereby.
FIG. 2 is a flow diagram illustrating a method of approximating data processing according to an exemplary embodiment. The approximate data processing method of the present embodiment may be executed by a server, as shown in fig. 2, and includes the following steps:
step 210, obtaining a plurality of data to be processed.
As described above, the data to be processed may be various types of data, such as image data, voice data, text data, and the like.
Step 220, a vector corresponding to the data to be processed is obtained.
In one embodiment, the data to be processed is image data, and the image data is converted into vectors according to pixel values of pixel points included in each image data.
In an embodiment, each of the data to be processed corresponds to a vector, the data to be processed is voiceprint data, and the specific step of step 220 may be as shown in fig. 3. FIG. 3 is a detailed flow chart of step 220 according to one embodiment shown in a corresponding embodiment of FIG. 2, including the steps of:
and step 221, acquiring a mel frequency cepstrum coefficient characteristic value of the data to be processed.
In one embodiment, Mel-Frequency Cepstral Coefficient (MFCC) feature values are voice feature values obtained by applying to the voice data a series of processes such as pre-emphasis, framing, windowing, Fourier transform, and inverse Fourier transform.
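The processing chain above can be sketched end to end with NumPy alone. This is a simplified illustration: the frame length, hop size, filterbank size, and the DCT-based final step are common textbook choices assumed here, not parameters specified by the patent.

```python
import numpy as np

def mfcc_sketch(signal, sample_rate=16000, frame_len=400, hop=160,
                n_mels=26, n_ceps=13, pre_emph=0.97, n_fft=512):
    """Simplified MFCC pipeline: pre-emphasis, framing, windowing,
    FFT power spectrum, mel filterbank, log, and DCT."""
    # 1. Pre-emphasis: boost high frequencies.
    sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2. Split into overlapping frames.
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop: i * hop + frame_len] for i in range(n_frames)])
    # 3. Hamming window, then power spectrum via the FFT.
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Triangular mel filterbank spaced evenly on the mel scale.
    hz_to_mel = lambda hz: 2595 * np.log10(1 + hz / 700.0)
    mel_to_hz = lambda mel: 700 * (10 ** (mel / 2595.0) - 1)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    feat = np.log(power @ fbank.T + 1e-10)
    # 5. DCT-II to decorrelate; keep the first n_ceps coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return feat @ basis.T

rng = np.random.default_rng(0)
mfcc = mfcc_sketch(rng.standard_normal(16000))  # one second of noise at 16 kHz
print(mfcc.shape)
```

Each row of the result is the cepstral feature vector of one frame; these per-frame features are what a GMM-UBM/JFA front end would consume.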
Step 222, inputting the mel-frequency cepstrum coefficient characteristic value of each piece of data to be processed into a pre-trained gaussian mixture-general background model combined with a joint factor analysis model, and obtaining an identity confirmation vector corresponding to each piece of data to be processed.
A Joint Factor Analysis (JFA) model separately models inter-speaker differences and channel differences, so that the interference component introduced by the channel can be removed and the voiceprint features in the voiceprint data can be extracted more accurately.
A Gaussian mixture-Universal background model (GMM-UBM) is a model that can recognize similar speech data, and training of the GMM-UBM model refers to a process of determining parameters of the model.
In one embodiment, training of the GMM-UBM model is achieved by inputting a plurality of voiceprint data pre-labeled with a corresponding speaker into the GMM-UBM model.
An Identity-confirmation Vector, I-Vector, is a Vector that records voiceprint feature information specific to a spoken speaker.
And step 230, performing hash operation on the vector of the data to be processed by using each position sensitive hash function in a preset position sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed.
The preset position-sensitive hash function family comprises a plurality of position-sensitive hash functions.
A locality-sensitive hashing (LSH) function is a function that maps high-dimensional data to a low-dimensional space such that data points adjacent in the original data space remain adjacent after mapping with high probability, while points non-adjacent in the original space remain non-adjacent with high probability.
In one embodiment, each location-sensitive hash function in the preset location-sensitive hash function family is established by the following formula:
h(x) = ⌊(a · x + b) / r⌋,
wherein a is a random number sequence, b is a random number in (0, r), r is the difference between the maximum value and the minimum value over the features of the identity confirmation vectors of the data to be processed, and x is the identity confirmation vector of the data to be processed. A preset locality-sensitive hash function family containing a plurality of locality-sensitive hash functions is established by varying the two parameters a and b.
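As an illustrative sketch of such a family (the Gaussian draw for a, the dimension, and the bucket width are demonstration assumptions, not the patent's parameters):

```python
import numpy as np

def make_pstable_lsh_family(dim, r, n_funcs, seed=0):
    """Family of p-stable LSH functions h(x) = floor((a·x + b) / r).
    Each function draws its own random projection a and offset b in (0, r)."""
    rng = np.random.default_rng(seed)
    family = []
    for _ in range(n_funcs):
        a = rng.standard_normal(dim)   # random projection direction
        b = rng.uniform(0, r)          # random offset within one bucket width
        family.append(lambda x, a=a, b=b: int(np.floor((a @ x + b) / r)))
    return family

family = make_pstable_lsh_family(dim=8, r=4.0, n_funcs=10)
x = np.ones(8)
y = np.ones(8) + 0.01            # a close neighbour of x
far = -100.0 * np.ones(8)        # a distant point
close_matches = sum(h(x) == h(y) for h in family)
far_matches = sum(h(x) == h(far) for h in family)
print(close_matches, far_matches)
```

Nearby vectors fall into the same bucket under most functions of the family, while distant vectors rarely do, which is the property the mapping values in step 230 rely on.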
In one embodiment, each location-sensitive hash function in the preset location-sensitive hash function family is established by the following formula:
h(v) = sgn(v · r),
wherein r is a random hyperplane, v is an identity confirmation vector of the data to be processed, and sgn is a sign function.
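A minimal sketch of this random-hyperplane variant (the dimension, function count, and the batched matrix form are illustrative assumptions):

```python
import numpy as np

def make_hyperplane_lsh(dim, n_funcs, seed=1):
    """Random-hyperplane (sign) LSH: h(v) = sgn(<v, r>) for a random normal r.
    Vectors pointing in similar directions tend to land on the same side
    of each hyperplane, so their sign codes agree on most coordinates."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_funcs, dim))  # one hyperplane normal per row
    return lambda v: np.sign(planes @ v)          # one sign bit per function

lsh = make_hyperplane_lsh(dim=8, n_funcs=16)
v = np.ones(8)
codes_same = lsh(v) == lsh(1.001 * v)  # a positively scaled copy of v
print(codes_same.all())
```

Because a positive scaling never changes the side of any hyperplane, the two codes here agree exactly; genuinely different directions would disagree on a fraction of bits proportional to the angle between them.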
In one embodiment, each location-sensitive hash function in the preset location-sensitive hash function family is established by the following formula:
h(v) = E8((v_8 + b) / w),
wherein E8 is the E8 lattice decoding function, v_8 is 8-dimensional data randomly taken from the vector v, the vector v is the identity confirmation vector of the data to be processed, b is an 8-dimensional random offset vector, and w is a normalization factor.
And 240, repeatedly executing the step of constructing the covering groups for the first predetermined number of times to obtain a plurality of covering groups, wherein the step of constructing the covering groups comprises constructing the covering groups based on the mapping values corresponding to the vectors of the data to be processed and the position sensitive hash function for carrying out hash operation on the vectors of the data to be processed.
The coverage group comprises at least one coverage, and each coverage comprises at least one to-be-processed data.
An overlay (sphere) is essentially a collection of data to be processed; each piece of data may belong to at least one overlay, and data classified into the same overlay is considered approximate. A coverage group is a collection of overlays that includes at least one overlay. If each piece of data to be processed is uniquely identified by an index, an overlay may be represented as a collection of indexes, since each index corresponds uniquely to one piece of data. For example, if the indexes of all the data to be processed are {1, 2, 3, 4}, one obtainable coverage group is { [1,2], [3], [4] }, where [1,2], [3], and [4] are each an overlay; another obtainable coverage group is { [1,2,3], [3,4], [4] }, where [1,2,3], [3,4], and [4] are each an overlay.
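The index representation described above can be illustrated directly; the container shapes and the helper name here are assumptions for demonstration:

```python
# Coverage groups as lists of index sets; an index may appear in several covers.
coverage_group_a = [{1, 2}, {3}, {4}]          # a partition-like coverage group
coverage_group_b = [{1, 2, 3}, {3, 4}, {4}]    # overlays may overlap (3 and 4 repeat)

def covers_of(index, coverage_group):
    """Return the positions of every overlay containing the given index."""
    return [i for i, cover in enumerate(coverage_group) if index in cover]

print(covers_of(3, coverage_group_b))
```

In the second group, index 3 belongs to two overlays at once, which is exactly the overlapping situation depicted in FIG. 4.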
The first predetermined number may be any number greater than 2, set empirically; for example, it may be 10. Because the coverage groups are constructed multiple times, the coverage group obtained in one construction may be the same as or different from the others, so two or more coverage groups may coincide. If identical coverage groups are counted as one, the number of distinct coverage groups obtained by repeating the construction step a first predetermined number of times is less than or equal to the first predetermined number.
FIG. 4 is a schematic diagram illustrating one coverage group according to an exemplary embodiment. As shown in fig. 4, the overlay set includes a first overlay 410, a second overlay 420 and a third overlay 430, wherein a black dot in each overlay represents data to be processed belonging to the overlay, and it can be seen that each overlay includes at least one data to be processed, for example, the first data to be processed 440 is data to be processed belonging to the second overlay 420. The first overlay 410 and the second overlay 420 have an intersecting part, which means that the second pending data 450 belonging to the part belongs to more than one overlay, i.e. to both the first overlay 410 and the second overlay 420.
Fig. 5 is a flowchart of steps of constructing a coverage group when the data to be processed is voiceprint data according to an embodiment shown in a corresponding embodiment of fig. 2. Referring to fig. 5, the method comprises the following steps:
step 510, construct an integer set comprising 1, the number of dimensions of the identity confirmation vector, and all integers in between.
The identity confirmation Vector, i.e., I-Vector, is a Vector extracted from the voiceprint data, and may be the same identity confirmation Vector as in the foregoing embodiment.
For example, if the number of dimensions of the identity confirmation vector is 8, the set of integers of the final structure is {1,2,3,4,5,6,7,8 }.
Step 520, establish an initial coverage group and set the counter to 1.
Wherein the initial coverage group is an empty set.
The coverage groups may be recorded in various data structures, such as arrays, and when an initial coverage group is recorded using an array, the array corresponding to the initial coverage group is a null array.
The counter is a module or component with a counting function embedded in the implementation terminal of the present disclosure.
Step 530, determine whether the integer set is an empty set.
The integer set is an empty set, i.e. it does not contain any elements.
If the integer set is an empty set, the step of constructing the coverage group is terminated directly.
In the case that the integer set is not an empty set, the steps 540 and the following steps are repeatedly executed until the constructing of the coverage group is finished when the integer set is an empty set.
And 540, randomly taking one element from the integer set as a target element.
The set of integers of the construct includes a plurality of integers, each integer being an element.
Step 550, for each locality-sensitive hash function, obtain the indexes of all identity confirmation vectors whose output under that hash function is equal to the output obtained by inputting the identity confirmation vector whose index is the target element.
The index of the identity confirmation vector is the unique identifier of the identity confirmation vector, is an integer in the integer set, and each identity confirmation vector is uniquely corresponding to one index.
In this step, the retrieved target element is fixed, and therefore, the identity confirmation vector indexed to the target element is also fixed, and this step is performed based on the fixed identity confirmation vector.
And when the output result obtained by inputting the other identity confirmation vectors into the position sensitive hash function is the same as the output result obtained by inputting the identity confirmation vector with the index as the target element into the same position sensitive hash function, the identity confirmation vectors and the identity confirmation vector with the index as the target element are considered to be similar.
Step 560, add the intersection of the union of all the obtained identity confirmation vector indexes with the integer set to the initial coverage group, as a coverage whose index is the current value of the counter.
Taking the union of all the obtained identity confirmation vector indexes ensures that a repeated index is kept only once, avoiding the situation where one coverage includes several identical identity confirmation vectors.
The index, the identity confirmation vector, and the data to be processed are all in one-to-one correspondence, so building the coverage group from indexes effectively partitions the corresponding identity confirmation vectors and data to be processed.
As coverages are constructed, the elements in the integer set are gradually reduced. The union of all the obtained identity confirmation vector indexes may therefore contain elements that should no longer be included in the coverage established in this round; taking the intersection with the integer set removes from this coverage any element no longer present in the integer set.
Step 570, determine a similarity score for each identity confirmation vector in the coverage indexed by the value of the counter, where the similarity score is equal to the ratio of the number of locality-sensitive hash functions under which the index of that identity confirmation vector was obtained to the total number of locality-sensitive hash functions.
Since the index of a hash function corresponds uniquely to the hash function, the number of hash functions equals the number of hash function indexes.
The coverage indexed by the value of the counter is the coverage being constructed in the current round.
As mentioned above, the union keeps only one copy of each element in the coverage, but the number of repetitions still carries information: for a given identity confirmation vector, the more locality-sensitive hash functions whose outputs for it equal their outputs for the identity confirmation vector indexed by the target element, the more similar the two vectors are. This ratio can therefore serve as the similarity score.
Step 580, remove all indexes of identity confirmation vectors from the integer set whose similarity score is greater than or equal to a predetermined similarity threshold.
When the similarity score between an identity confirmation vector and the identity confirmation vector indexed by the target element is large enough, the two are similar. Removing the index of such a vector from the integer set prevents an index that should exist in only one overlay from also becoming an element of other overlays, which ensures the accuracy of the overlay division.
Step 590, increment the counter by 1.
As mentioned above, in step 560 the index of the constructed coverage is equal to the value of the counter, so the counter both counts the constructed coverages and assigns each its index. Incrementing the counter by 1 marks the end of the current coverage's construction within this round of building the coverage group; the next coverage is then constructed using the new counter value.
It is easy to understand that both the data to be processed and the identity confirmation vectors correspond one-to-one with the unique index identifiers, so a coverage may record the data to be processed, the identity confirmation vectors, or the indexes; whichever is recorded, the purpose of classifying the data to be processed is achieved.
If two identity confirmation vectors are mapped to the same result by the same position-sensitive hash function, the two vectors can be considered similar with high probability.
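As a hedged sketch (illustrative names and parameters, not the patent's implementation), the position-sensitive hash family and the agreement-ratio similarity score can be written as follows, assuming the standard p-stable form h(x) = ⌊(a·x + b)/r⌋, which matches the parameters a, b, and r described for the hash family in this document:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_lsh_family(dim, r, num_funcs):
    """Family of p-stable position-sensitive hash functions of the form
    h(x) = floor((a . x + b) / r), with a drawn from a normal distribution
    and b uniform in (0, r) -- an assumed standard construction."""
    params = [(rng.standard_normal(dim), rng.uniform(0.0, r)) for _ in range(num_funcs)]
    return [lambda x, a=a, b=b: int(np.floor((a @ x + b) / r)) for a, b in params]

def similarity_score(x, target, family):
    """Ratio of hash functions whose output for x equals their output for
    the target vector, over the number of all hash functions."""
    return sum(h(x) == h(target) for h in family) / len(family)

family = make_lsh_family(dim=8, r=4.0, num_funcs=32)
v = rng.standard_normal(8)
near = v + 0.01 * rng.standard_normal(8)   # a slightly perturbed copy of v
```

An identical vector always scores 1.0; a slightly perturbed copy usually scores close to 1.0, while an unrelated vector tends to score near 0.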
Step 250, integrating the plurality of coverage groups to obtain the final coverage of each data to be processed.
Wherein, the data to be processed belonging to the same final coverage is approximate data.
Integrating the multiple coverage groups is the process of consolidating the distribution of the data to be processed across the multiple coverage groups into one coverage group. The finally formed coverage group comprises multiple final coverages, and each piece of data to be processed belongs to at least one final coverage.
For each coverage group, the data to be processed within each coverage of that group can be considered approximate. The final coverage is the final classification result of similar data to be processed, and it reflects, as a whole, the distribution of approximate data to be processed across the coverage groups.
In one embodiment, the integrating the plurality of coverage groups to obtain the final coverage to which each of the to-be-processed data belongs includes:
if the data to be processed belongs to the same coverage in coverage groups with more than a second preset number, classifying the data to be processed into one coverage, wherein the second preset number is smaller than the first preset number;
and respectively removing the data to be processed which do not belong to the same coverage in more than a second preset number of coverage groups from the coverage and classifying all the data to be processed removed from the coverage into one coverage.
In one embodiment, if the data to be processed belongs to the same coverage in more than a second predetermined number of coverage groups, classifying the data to be processed as one coverage includes:
for each index combination in each coverage containing at least two indexes, determining the number of coverage groups containing the index combination in the coverage;
and when the number is larger than a second preset number, classifying the data to be processed, of which the index belongs to the index combination, into a cover.
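The integration rule above can be sketched as follows. This is an illustrative reading, not the patent's implementation: pairs of indexes that share a coverage in more than a second predetermined number of coverage groups are merged (via union-find), and all remaining data are gathered into one leftover coverage.

```python
from collections import defaultdict
from itertools import combinations

def integrate(coverage_groups, second_threshold):
    """Hedged sketch of the integration step: two indexes end up in the same
    final coverage when more than `second_threshold` coverage groups place
    them in a common coverage; data left alone form one leftover coverage.
    Each coverage group is a list of coverages; each coverage is a set of
    indexes of data to be processed."""
    pair_counts = defaultdict(int)
    all_indexes = set()
    for group in coverage_groups:
        for cover in group:
            all_indexes |= cover
            for pair in combinations(sorted(cover), 2):
                pair_counts[pair] += 1

    # union-find over index pairs that co-occur often enough
    parent = {i: i for i in all_indexes}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for (i, j), count in pair_counts.items():
        if count > second_threshold:
            parent[find(i)] = find(j)

    merged = defaultdict(set)
    for i in all_indexes:
        merged[find(i)].add(i)
    covers = [c for c in merged.values() if len(c) > 1]
    singles = [c for c in merged.values() if len(c) == 1]
    if singles:   # data removed from coverages are classified into one coverage
        covers.append(set().union(*singles))
    return covers
```

For example, three coverage groups that repeatedly place indexes {1, 2, 3} and {4, 5} together would yield those two sets as final coverages.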
After the data to be processed is classified into the final overlays, the approximate data to be processed in each final overlay can be applied to perform other data processing tasks.
In one embodiment, after integrating the plurality of coverage groups to obtain the final coverage to which each of the data to be processed belongs, the method further includes:
clustering the data to be processed based on the obtained final coverage containing the data to be processed so as to divide the data to be processed into a plurality of classes.
This embodiment mainly performs a further clustering task on the data to be processed. Since most of the data to be processed in each established final coverage are already similar and nearly form one class, further clustering on this basis greatly shortens the clustering time and improves clustering efficiency. Because most similar data to be processed are already classified into one coverage, further clustering on this basis also greatly reduces the probability of clustering errors and thus improves clustering accuracy. In addition, since the final coverages are obtained by integrating multiple coverage groups, the time consumed by the classification task over the whole data to be processed is more stable.
In one embodiment, the data to be processed in each final coverage is clustered using a k-means algorithm.
In one embodiment, the clustering the to-be-processed data based on the obtained final coverage including the to-be-processed data to divide the to-be-processed data into a plurality of classes includes:
taking each data to be processed as a class, and determining an initial class interval between the classes based on the final coverage of each data to be processed;
repeatedly executing a classification process until the class distance between two classes with the minimum class distance reaches a preset class distance threshold value or all the data to be processed are merged into one class, wherein the classification process comprises the following steps:
merging two types with the minimum class spacing;
and updating the class spacing between the classes.
In the embodiment, clustering is performed iteratively according to the class intervals, so that clustering reliability is guaranteed.
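The iterative classification process above can be sketched as follows. The names and the example interval function are illustrative assumptions; the embodiment's actual class interval is built from normalized similarity scores as described below in this document.

```python
def agglomerate(items, interval, threshold):
    """Hedged sketch of the classification loop: repeatedly merge the two
    classes with the smallest class interval until that smallest interval
    reaches the threshold or all items are merged into one class.
    `interval(c1, c2)` is assumed given."""
    classes = [{i} for i in items]
    while len(classes) > 1:
        # find the pair of classes with the smallest class interval
        i, j = min(
            ((a, b) for a in range(len(classes)) for b in range(a + 1, len(classes))),
            key=lambda p: interval(classes[p[0]], classes[p[1]]),
        )
        if interval(classes[i], classes[j]) >= threshold:
            break                      # smallest interval reached the threshold
        classes[i] |= classes[j]       # merge the two closest classes
        del classes[j]                 # class intervals are re-derived next pass
    return classes
```

With four items at positions 0.0, 0.1, 5.0, 5.1 and an average pairwise distance as the interval, a threshold of 1.0 yields the two classes {0, 1} and {2, 3}.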
In one embodiment, the merging of the two classes with the smallest class interval includes:
acquiring a class interval between any pair of classes containing data to be processed which belongs to the first final coverage, taking the class interval as a class interval of an initial class pair, and marking the class interval of the initial class pair as a minimum class interval;
a judging step: for each pair of classes, other than the initial class pair, that contains the data to be processed belonging to the final coverage with the smallest index and that has not been marked as judged, judging whether the class interval between the two classes of the pair is smaller than the current minimum class interval, and marking each class of the pair as judged;
a cancel-and-mark step: if so, canceling the most recent minimum-class-interval mark and marking the class interval of this pair as the minimum class interval;
repeatedly executing the judging step and the cancel-and-mark step until the class interval of the pair marked as the minimum class interval no longer changes;
and merging the pair of classes marked as having the minimum class interval as the two classes with the minimum class interval.
In one embodiment, the determining the initial class interval between the classes based on the final coverage to which each to-be-processed data belongs, with each to-be-processed data as a class, includes:
obtaining the similarity score of each data to be processed according to a linear discriminant analysis model based on probability;
normalizing each similarity score to be between [0,1] to obtain a normalized similarity score;
for any pair of the data to be processed, if the corresponding classes belong to the same coverage, setting the class interval between the classes corresponding to the pair to the difference between 1 and the normalized similarity score corresponding to the pair; if the corresponding classes belong to different coverages, setting the class interval to 1;
the updating of the class spacing between classes includes:
for each pair of classes corresponding to each merged class, acquiring the sum of the normalized similarity scores of each to-be-processed data belonging to the first class and each to-be-processed data belonging to the second class in the pair of classes;
and taking the ratio of the sum to the number of data pairs corresponding to all the data to be processed belonging to the first class and all the data to be processed belonging to the second class in the pair as the class interval between the two classes in the pair.
This embodiment realizes clustering of voiceprint data. In many scenarios the same user generates multiple pieces of voiceprint data, and when a large number of users provide a large amount of voiceprint data, the mixed voiceprint data need to be classified by user; this embodiment enables efficient, accurate, and stable classification of the voiceprint data.
The first class and the second class refer to the two different classes within a pair. For example, if the first class contains three pieces of data to be processed, A, B, and C, and the second class contains two pieces, D and E, the pair corresponds to six combinations of data to be processed: AD, AE, BD, BE, CD, and CE. Each combination has a normalized similarity score; the sum of these six scores is the sum referred to above, and the ratio of that sum to 6 is the class interval between the two classes.
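The class interval update above reduces to a short averaging step. This is a minimal sketch with assumed names; `norm_score(i, j)` stands for the normalized similarity score of a pair of data to be processed.

```python
def class_interval(class_a, class_b, norm_score):
    """Class interval between the two classes of a pair, per the update rule
    above: the sum of normalized similarity scores over all cross pairs of
    data to be processed, divided by the number of such pairs.
    `norm_score(i, j)` is assumed symmetric with values in [0, 1]."""
    pairs = [(i, j) for i in class_a for j in class_b]
    return sum(norm_score(i, j) for i, j in pairs) / len(pairs)
```

Reproducing the A, B, C versus D, E example: with six pair scores summing to 3.0, the class interval is 3.0 / 6 = 0.5.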
In one embodiment, the obtaining the similarity score of each data to be processed according to the probability-based linear discriminant analysis model includes:
obtaining a vector of each data to be processed representing speaker information by using a probability-based linear discriminant analysis model;
and aiming at each data to be processed, obtaining the log-likelihood ratio of the vector representing the speaker information corresponding to each data to be processed in the data to be processed, and obtaining the similarity score of the data to be processed.
A Probabilistic Linear Discriminant Analysis (PLDA) model is a channel compensation algorithm that can suppress the effect of channel noise on the recorded speaker information.
In this embodiment, a probability-based linear discriminant analysis model is used to obtain a vector representing speaker information of each piece of data to be processed, and a log-likelihood ratio of the vectors representing speaker information corresponding to two pieces of data to be processed is used as a score between the two pieces of data to be processed. The class intervals are obtained by using the calculated scores, and clustering is performed according to the class intervals, so that the dynamic fault-tolerant capability during clustering is improved, and the condition of wrong classification caused by errors generated by a coverage integration method is avoided.
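The pairwise scoring step can be sketched as below. Note the hedging: training and scoring an actual PLDA model are out of scope here, so `cosine` is only an illustrative stand-in for the PLDA log-likelihood ratio the embodiment uses; the structure shown is just how pairwise scores feed the class intervals.

```python
import numpy as np

def pairwise_scores(vectors, score_fn):
    """Symmetric matrix of pairwise scores between identity confirmation
    vectors. In the embodiment the score is a PLDA log-likelihood ratio;
    `score_fn` here is a stand-in for that trained scorer."""
    n = len(vectors)
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            scores[i, j] = scores[j, i] = score_fn(vectors[i], vectors[j])
    return scores

def cosine(x, y):
    # illustrative stand-in only -- NOT an actual PLDA log-likelihood ratio
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

The resulting matrix is then normalized (see the formula below in this document) before class intervals are derived from it.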
In one embodiment, the normalizing each similarity score to [0,1] to obtain a normalized similarity score includes:
normalizing each similarity score to be between [0,1] using the following formula to obtain a normalized similarity score:
$$\tilde{X} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
wherein $X$ is the similarity score to be normalized, $X_{\max}$ is the maximum value among the similarity scores of each pair of data to be processed, $X_{\min}$ is the minimum value among the similarity scores of each pair of data to be processed, and $\tilde{X}$ is the normalized similarity score.
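The min-max normalization described here is a one-liner in practice; this minimal sketch assumes at least two distinct score values (otherwise the denominator is zero).

```python
def normalize(scores):
    """Min-max normalization of similarity scores into [0, 1]:
    (x - min) / (max - min). Assumes max(scores) != min(scores)."""
    lo, hi = min(scores), max(scores)
    return [(x - lo) / (hi - lo) for x in scores]
```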
In one embodiment, after clustering the to-be-processed data based on the obtained final coverage including the to-be-processed data to divide the to-be-processed data into a plurality of classes, the method further comprises:
when a request for acquiring data similar to target data is received, the data similar to the target data is acquired based on the class into which each piece of data to be processed is classified.
In this embodiment, the data to be processed within each class obtained after coverage integration and clustering are usually highly similar. On this basis, when data similar to target data must be found among all the data, it can be obtained directly from the class the target belongs to, which improves retrieval efficiency.
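The retrieval shortcut reduces to an inverted index over class labels. A minimal sketch, with assumed names (`labels` maps each data id to its class label):

```python
def build_class_index(labels):
    """Invert a {data_id: class_label} mapping into {class_label: set of ids}."""
    index = {}
    for data_id, label in labels.items():
        index.setdefault(label, set()).add(data_id)
    return index

def similar_to(target_id, labels, index):
    """Data similar to the target: everything classified into the same class,
    per the retrieval shortcut described above."""
    return index[labels[target_id]] - {target_id}
```

Lookup is then a single dictionary access rather than a scan over all data, which is where the efficiency gain comes from.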
In summary, in the approximate data processing method of the embodiment of fig. 2, the coverage groups are constructed multiple times and then integrated. This keeps the time consumed by approximate data processing within a small, stable range while maintaining the accuracy of data processing, avoids the unstable and potentially excessive time consumption of processing a large amount of approximate data, and thus improves data processing efficiency as a whole.
The present disclosure also provides an approximate data processing apparatus, and the following are apparatus embodiments of the present disclosure.
FIG. 6 is a block diagram illustrating an approximate data processing apparatus according to an exemplary embodiment. As shown in fig. 6, the apparatus 600 includes:
a first obtaining module 610 configured to obtain a plurality of data to be processed;
a second obtaining module 620 configured to obtain a vector corresponding to the data to be processed;
a hashing module 630, configured to perform a hashing operation on the vector of the to-be-processed data by using each location-sensitive hashing function in a preset location-sensitive hashing function family to obtain a mapping value corresponding to the vector of the to-be-processed data, where the preset location-sensitive hashing function family includes a plurality of location-sensitive hashing functions;
a repeated execution module 640 configured to repeatedly execute the step of constructing the coverage group for a first predetermined number of times, resulting in a plurality of coverage groups, the step of constructing the coverage group including constructing a coverage group based on the mapping value corresponding to the vector of the data to be processed and a location-sensitive hash function that performs a hash operation on the vector of the data to be processed, the coverage group including at least one coverage, each of the coverage including at least one of the data to be processed;
an integrating module 650 configured to integrate the plurality of coverage groups to obtain a final coverage to which each of the to-be-processed data belongs, wherein the to-be-processed data belonging to the same final coverage is approximate data.
According to a third aspect of the present disclosure, there is also provided an electronic device capable of implementing the above method.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit", "module", or "system".
An electronic device 700 according to this embodiment of the invention is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, electronic device 700 is embodied in the form of a general purpose computing device. The components of the electronic device 700 may include, but are not limited to: the at least one processing unit 710, the at least one memory unit 720, and a bus 730 that couples various system components including the memory unit 720 and the processing unit 710.
Wherein the storage unit stores program code that can be executed by the processing unit 710 such that the processing unit 710 performs the steps according to various exemplary embodiments of the present invention described in the section "example methods" above in this specification.
The storage unit 720 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)721 and/or a cache memory unit 722, and may further include a read only memory unit (ROM) 723.
The memory unit 720 may also include programs/utilities 724 having a set (at least one) of program modules 725, such program modules 725 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 730 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The electronic device 700 may also communicate with one or more external devices 900 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 700, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 700 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 750. Also, the electronic device 700 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 760. As shown, the network adapter 760 communicates with the other modules of the electronic device 700 via the bus 730. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
According to a fourth aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-mentioned method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
Referring to fig. 8, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method of approximate data processing, the method comprising:
acquiring a plurality of data to be processed;
obtaining a vector corresponding to the data to be processed;
performing hash operation on the vector of the data to be processed by using each position sensitive hash function in a preset position sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position sensitive hash function family comprises a plurality of position sensitive hash functions;
repeatedly executing the step of constructing the coverage group for a first predetermined number of times to obtain a plurality of coverage groups, wherein the step of constructing the coverage groups comprises constructing the coverage groups based on the mapping values corresponding to the vectors of the data to be processed and a position-sensitive hash function for performing hash operation on the vectors of the data to be processed, and each coverage group comprises at least one coverage, and each coverage comprises at least one data to be processed;
and integrating the plurality of coverage groups to obtain the final coverage to which each data to be processed belongs, wherein the data to be processed belonging to the same final coverage is approximate data.
2. The method according to claim 1, wherein each of the data to be processed corresponds to a vector, the data to be processed is voiceprint data, and the obtaining the vector corresponding to the data to be processed comprises:
acquiring a Mel cepstrum coefficient characteristic value of the data to be processed;
and inputting the Mel cepstrum coefficient characteristic value of each piece of data to be processed into a pre-trained Gaussian mixture-general background model combined with a joint factor analysis model to obtain an identity confirmation vector corresponding to each piece of data to be processed.
3. The method of claim 1, wherein each location-sensitive hash function in the predetermined family of location-sensitive hash functions is established by the following formula:
$$h_{a,b}(x) = \left\lfloor \frac{a \cdot x + b}{r} \right\rfloor$$
wherein a is a random number sequence, b is a random number in (0, r), r is the difference between the maximum value and the minimum value in each feature of the identity confirmation vector of the data to be processed, x is the identity confirmation vector of the data to be processed, and the establishment of a preset position-sensitive hash function family containing a plurality of position-sensitive hash functions is realized by adjusting two parameters a and b.
4. The method of claim 2, wherein constructing a coverage group based on the mapped value corresponding to the vector of data to be processed and a location sensitive hash function that hashes the vector of data to be processed comprises:
constructing an integer set comprising 1, the dimension number of the identity confirmation vector and all integers between the two;
establishing an initial coverage group and setting a counter to be 1, wherein the initial coverage group is an empty set;
repeatedly executing a construction covering process until the integer set is an empty set, wherein the construction covering process comprises the following steps:
randomly taking an element from the integer set as a target element;
for each position-sensitive hash function, acquiring an index of an output result obtained by using the position-sensitive hash function, which is equal to an identity confirmation vector of an output result obtained by inputting an identity confirmation vector of the target element by the position-sensitive hash function;
adding the intersection of the union set of all the obtained identity confirmation vector indexes and the integer set into the initial covering group as a covering with the index being the value of the counter;
determining the index as a similarity score for each identity confirmation vector in the coverage of the counter value, the similarity score being equal to a ratio of a number of indices of the location sensitive hash function used to determine the index of the identity confirmation vector to a number of all location sensitive hash functions;
removing from the integer set all indices of identity confirmation vectors having a similarity score greater than or equal to a predetermined similarity threshold;
the counter is incremented by 1.
5. The method of claim 1, wherein said integrating the plurality of coverage groups to obtain a final coverage to which each of the to-be-processed data belongs comprises:
if the data to be processed belongs to the same coverage in coverage groups with more than a second preset number, classifying the data to be processed into one coverage, wherein the second preset number is smaller than the first preset number;
and respectively removing the data to be processed which do not belong to the same coverage in more than a second preset number of coverage groups from the coverage and classifying all the data to be processed removed from the coverage into one coverage.
6. The method of claim 1, wherein after integrating the plurality of coverage groups to obtain a final coverage to which each of the data to be processed belongs, the method further comprises:
clustering the data to be processed based on the obtained final coverage containing the data to be processed so as to divide the data to be processed into a plurality of classes.
7. The method of claim 6, wherein the clustering the to-be-processed data based on the obtained final coverage containing the to-be-processed data to classify the to-be-processed data into a plurality of classes comprises:
taking each data to be processed as a class, and determining an initial class interval between the classes based on the final coverage of each data to be processed;
repeatedly executing a classification process until the class distance between two classes with the minimum class distance reaches a preset class distance threshold value or all the data to be processed are merged into one class, wherein the classification process comprises the following steps:
merging two types with the minimum class spacing;
and updating the class spacing between the classes.
8. An approximate data processing apparatus, characterized in that the apparatus comprises:
the device comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is configured to acquire a plurality of data to be processed;
a second obtaining module configured to obtain a vector corresponding to the data to be processed;
the hash module is configured to perform hash operation on the vector of the data to be processed by using each position-sensitive hash function in a preset position-sensitive hash function family to obtain a mapping value corresponding to the vector of the data to be processed, wherein the preset position-sensitive hash function family comprises a plurality of position-sensitive hash functions;
a repeated execution module configured to repeatedly execute the step of constructing the coverage group for a first predetermined number of times, resulting in a plurality of coverage groups, the step of constructing the coverage group including constructing a coverage group based on the mapping value corresponding to the vector of the data to be processed and a location-sensitive hash function that performs a hash operation on the vector of the data to be processed, the coverage group including at least one coverage, each of the coverage including at least one of the data to be processed;
and the integration module is configured to integrate the plurality of coverage groups to obtain the final coverage to which each piece of data to be processed belongs, wherein the data to be processed belonging to the same final coverage is approximate data.
9. A computer-readable program medium, characterized in that it stores computer program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.
10. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of claims 1 to 7.
CN202010044200.2A 2020-01-15 2020-01-15 Approximation data processing method, device, medium and electronic equipment Active CN111241106B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010044200.2A CN111241106B (en) 2020-01-15 2020-01-15 Approximation data processing method, device, medium and electronic equipment
PCT/CN2020/093165 WO2021143016A1 (en) 2020-01-15 2020-05-29 Approximate data processing method and apparatus, medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010044200.2A CN111241106B (en) 2020-01-15 2020-01-15 Approximation data processing method, device, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111241106A true CN111241106A (en) 2020-06-05
CN111241106B CN111241106B (en) 2023-08-29

Family

ID=70864985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010044200.2A Active CN111241106B (en) 2020-01-15 2020-01-15 Approximation data processing method, device, medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN111241106B (en)
WO (1) WO2021143016A1 (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915347A (en) * 2012-09-26 2013-02-06 中国信息安全测评中心 Distributed data stream clustering method and system
CN103744934A (en) * 2013-12-30 2014-04-23 南京大学 Distributed index method based on LSH (Locality Sensitive Hashing)
US20150213112A1 (en) * 2014-01-24 2015-07-30 Facebook, Inc. Clustering using locality-sensitive hashing with improved cost model
US20160092566A1 (en) * 2014-09-29 2016-03-31 International Business Machines Corporation Clustering repetitive structure of asynchronous web application content
CN109730678A (en) * 2019-01-28 2019-05-10 常州大学 A kind of multilayer cerebral function network module division methods
CN110083762A (en) * 2019-03-15 2019-08-02 平安科技(深圳)有限公司 Source of houses searching method, device, equipment and computer readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035949B (en) * 2013-12-10 2017-05-10 南京信息工程大学 Similarity data retrieval method based on locality sensitive hashing (LASH) improved algorithm
US9697245B1 (en) * 2015-12-30 2017-07-04 International Business Machines Corporation Data-dependent clustering of geospatial words
CN106228035B (en) * 2016-07-07 2019-03-01 清华大学 Efficient clustering method based on local sensitivity Hash and imparametrization bayes method
CN109829066B (en) * 2019-01-14 2023-03-21 南京邮电大学 Local sensitive Hash image indexing method based on hierarchical structure


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393107A (en) * 2021-06-07 2021-09-14 东方电气集团科学技术研究院有限公司 Incremental calculation method for state parameter reference value of power generation equipment
CN113393107B (en) * 2021-06-07 2022-08-12 东方电气集团科学技术研究院有限公司 Incremental calculation method for state parameter reference value of power generation equipment
CN116757558A (en) * 2023-08-16 2023-09-15 山东嘉隆新能源股份有限公司 Alcohol refining process quality prediction method and system based on data mining
CN116757558B (en) * 2023-08-16 2023-11-17 山东嘉隆新能源股份有限公司 Alcohol refining process quality prediction method and system based on data mining

Also Published As

Publication number Publication date
CN111241106B (en) 2023-08-29
WO2021143016A1 (en) 2021-07-22

Similar Documents

Publication Publication Date Title
US11216510B2 (en) Processing an incomplete message with a neural network to generate suggested messages
WO2019101083A1 (en) Voice data processing method, voice-based interactive device, and storage medium
EP2657884B1 (en) Identifying multimedia objects based on multimedia fingerprint
CN112528025A (en) Text clustering method, device and equipment based on density and storage medium
CN110415679B (en) Voice error correction method, device, equipment and storage medium
US20210065694A1 (en) Voice Interaction Method, System, Terminal Device and Medium
CN114443891B (en) Encoder generation method, fingerprint extraction method, medium, and electronic device
CN110727769B (en) Corpus generation method and device and man-machine interaction processing method and device
CN111241106B (en) Approximation data processing method, device, medium and electronic equipment
CN111062431A (en) Image clustering method, image clustering device, electronic device, and storage medium
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN114494709A (en) Feature extraction model generation method, image feature extraction method and device
CN111460117B (en) Method and device for generating intent corpus of conversation robot, medium and electronic equipment
CN116703659A (en) Data processing method and device applied to engineering consultation and electronic equipment
JP2017162230A (en) Information processor, similar data search method and similar data search program
CN114860575B (en) Test data generation method and device, storage medium and electronic equipment
CN115455142A (en) Text retrieval method, computer device and storage medium
CN111444319B (en) Text matching method and device and electronic equipment
CN113239149B (en) Entity processing method, device, electronic equipment and storage medium
US20220309292A1 (en) Growing labels from semi-supervised learning
CN111460214B (en) Classification model training method, audio classification method, device, medium and equipment
CN110442714B (en) POI name normative evaluation method, device, equipment and storage medium
CN114579883A (en) Address query method, method for obtaining address vector representation model and corresponding device
CN113889081A (en) Speech recognition method, medium, device and computing equipment
US20220156618A1 (en) Ensemble classification algorithms having subclass resolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030930

Country of ref document: HK

GR01 Patent grant