CN114781496A - Optimizing sampling method and device and electronic equipment - Google Patents
- Publication number
- CN114781496A (application number CN202210349040.1A)
- Authority
- CN
- China
- Prior art keywords
- difference
- value
- pair
- similarity
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/22—Matching criteria, e.g. proximity measures
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00—Computing arrangements based on biological models › G06N3/02—Neural networks › G06N3/04—Architecture, e.g. interconnection topology › G06N3/045—Combinations of networks
Abstract
The disclosure provides an optimizing sampling method, an optimizing sampling apparatus, and an electronic device, relating to the technical field of data processing and in particular to the technical field of machine learning. The implementation scheme is as follows: acquire the object features of a plurality of objects to be grouped; for each feature value pair consisting of feature values of the same feature dimension in every two object features, perform a weighted summation over all the object features to obtain the absolute difference of the two feature values in the pair; calculate the similarity between every two objects to be grouped from the absolute differences; and select at least one pair of objects to be grouped whose similarity is greater than a preset similarity threshold as an optimizing object pair. This improves the efficiency of optimizing sampling.
Description
Technical Field
The present disclosure relates to the field of data processing technology, and more particularly, to the field of machine learning technology.
Background
In some application scenarios, the objects involved in a test, such as test participants or sample data, are divided into an experimental group and a control group. Experiments are run on the two groups separately, with a variable introduced in the tests to determine the effect of that variable on the experimental results. This experimental procedure is called an A/B experiment.
Disclosure of Invention
The disclosure provides a method, apparatus, device and storage medium for optimizing sampling.
According to an aspect of the present disclosure, there is provided an optimal sampling method, including:
acquiring respective object characteristics of a plurality of objects to be grouped;
for each feature value pair consisting of feature values of the same feature dimension in every two object features, performing a weighted summation over all the object features to obtain the absolute difference of the two feature values in the pair;
calculating the similarity between the characteristics of every two objects to be grouped according to the absolute value of the difference;
and selecting at least one pair of the objects to be grouped with the similarity larger than a preset similarity threshold value as an optimizing object pair.
According to a second aspect of the present disclosure, there is provided an optimized sampling apparatus comprising:
the characteristic acquisition module is used for acquiring the object characteristics of a plurality of objects to be grouped;
the difference solving module is used for weighting and summing all the object features aiming at a feature value pair consisting of feature values of the same feature dimension in every two object features to obtain the absolute value of the difference between the two feature values in the feature value pair;
the similarity solving module is used for calculating the similarity between the characteristics of every two objects to be grouped according to the absolute value of the difference;
and the object pair screening module is used for selecting at least one pair of the objects to be grouped with the similarity larger than a preset similarity threshold value as an optimizing object pair.
According to a third aspect provided by the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect provided by the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the first aspect described above.
According to a fifth aspect provided by the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow diagram of an optimized sampling method provided by the present disclosure;
FIG. 2 is a schematic diagram of an architecture for a feedforward neural network for implementing the optimizing sampling provided by the present disclosure;
FIG. 3 is a schematic diagram of a difference solution unit in a feedforward neural network provided by the present disclosure;
FIG. 4 is another schematic flow diagram of the optimized sampling method provided by the present disclosure;
fig. 5 is a schematic diagram of another structure of the optimized sampling device provided by the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing the method of optimizing sampling of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
For a clearer explanation of the optimizing sampling method provided by the present disclosure, an exemplary application scenario is described below. It should be understood that the following example is only one possible application scenario of the method; in other possible embodiments the method may also be applied elsewhere, and the following example does not limit it in any way.
An application developer develops a new function for an application and, in order to decide whether to put the new function online, can determine the influence of the new function on user retention through an A/B experiment. Illustratively, a number of test users are divided into a control group and an experimental group: the control group uses the application without the new function, while the experimental group uses the application with the new function. The user retention rates of the control group and the experimental group are counted separately, yielding the influence of the new function on user retention.
To avoid the control group and the experimental group interfering with each other, the same test person cannot belong to both the control group and the experimental group. And in order to improve the accuracy of the experimental results, the variables between the control group and the experimental group should be reduced as much as possible, so it is required that the test persons in the control group are as similar as possible to the test persons in the experimental group.
It can be seen that if one tester is assigned to the control group, another tester sufficiently similar to that tester needs to be assigned to the experimental group. Therefore, it is common in the related art to divide the testers into a plurality of tester pairs, each consisting of two sufficiently similar testers (or groups of testers), and, for each pair, to assign one to the control group and the other to the experimental group. This process is called optimizing sampling.
In order to accurately divide the testers into a plurality of tester pairs, it is necessary to determine whether each two testers are similar enough, that is, the similarity between each two testers needs to be calculated. When the number of testers is large, the similarity needing to be calculated is large, and the efficiency of optimizing sampling is low.
In the related art, to improve the efficiency of optimizing sampling, similarities are calculated between only some of the testers, and tester pairs are extracted from those calculated similarities. However, because only some of the similarities are calculated, the testers in the extracted pairs may not be sufficiently similar; that is, the accuracy of the optimizing sampling is low.
Based on this, the present disclosure provides an optimized sampling method, as shown in fig. 1, comprising:
s101, obtaining respective object characteristics of a plurality of objects to be grouped.
S102, weighting and summing all object features according to feature value pairs formed by feature values of the same feature dimension in every two object features to obtain the absolute value of the difference value of the two feature values in the feature value pairs.
And S103, calculating the similarity between the characteristics of every two objects to be grouped according to the absolute value of the difference.
And S104, selecting at least one pair of objects to be grouped with the similarity larger than a preset similarity threshold value as an optimizing object pair.
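The four steps S101 to S104 can be sketched end to end. The following Python sketch is illustrative only: the use of the negative mean absolute difference as the similarity, and the greedy disjoint-pair screening, are our assumptions, since the steps above leave the exact similarity measure and selection order open.

```python
import numpy as np

def optimize_sampling(features, threshold):
    """Illustrative sketch of S101-S104 for an (N, M) feature array.

    Similarity is taken as the negative mean absolute difference
    (an assumption), so larger values mean more similar.
    """
    n = len(features)
    # S102/S103: absolute differences per dimension, reduced to a similarity.
    sim = -np.mean(np.abs(features[:, None, :] - features[None, :, :]), axis=2)
    # S104: greedily pick disjoint pairs whose similarity exceeds the threshold.
    order = [(i, j) for i in range(n) for j in range(i + 1, n)]
    order.sort(key=lambda ij: sim[ij], reverse=True)
    pairs, used = [], set()
    for i, j in order:
        if sim[i, j] > threshold and i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs
```

A usage example: with two tight clusters of objects, the sketch pairs each object with its near-duplicate and never reuses an object across pairs.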
With this embodiment, the similarity calculation is decomposed into absolute-difference calculations, and the absolute difference of the two feature values in each feature value pair is obtained by a weighted summation over all the object features. Every absolute difference is thus computed in the same way, with all the object features as input.
On the other hand, the optimizing sampling method provided by the disclosure can calculate the absolute value of the difference between the two characteristic values in all the characteristic value pairs in batch, so that the similarity between every two objects to be grouped can be calculated, sufficiently similar objects to be grouped can be accurately determined to serve as the optimizing object pair, and the accuracy of optimizing sampling can be effectively improved.
The aforementioned objects to be grouped may be different types of objects according to different application scenarios, including but not limited to people, images, text, and the like. Each object to be grouped may be a single individual or a set composed of multiple individuals, for example, each object to be grouped may be a person or a traffic bucket composed of multiple persons.
The object features are vectors for characterizing the objects to be grouped, and each object feature is an M-dimensional vector. In a possible embodiment, in order to enable the feature vector to accurately characterize the objects to be grouped, the objects to be grouped are taken as a granularity, data generated by each object to be grouped within a preset time length are collected, and features of the data are extracted as object features of the objects to be grouped.
The magnitude of the similarity herein refers to the magnitude of the degree of similarity indicated by the similarity, not the magnitude on the similarity value. According to different application scenes, the similarity degree represented by the similarity degree and the similarity degree value can be positively correlated or negatively correlated.
For example, if the similarity between two objects to be grouped is represented by the cosine distance between the object features of the two objects to be grouped, the greater the numerical value of the similarity, the greater the degree of similarity represented by the similarity. If the similarity between two objects to be grouped is represented by the Euclidean distance, the larger the numerical value of the similarity is, the smaller the similarity represented by the similarity is.
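The two conventions can be made concrete with a small illustrative sketch (the function names are ours): under cosine similarity a larger value means more similar, while under Euclidean distance a larger value means less similar.

```python
import numpy as np

def cosine_sim(u, v):
    # larger value -> greater degree of similarity
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclid_dist(u, v):
    # larger value -> smaller degree of similarity
    return float(np.linalg.norm(u - v))
```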
It can be understood that the foregoing S102 requires multiple weighted summations, each taking all the object features as input. Since the operation performed by a fully-connected layer is precisely to apply multiple weighted summations to its input values, the foregoing S102 may be implemented using a fully-connected layer.
Illustratively, the foregoing S102 is implemented by:
for each feature value pair consisting of feature values of the same feature dimension in every two object features, inputting the feature value pair into a preset fully-connected layer to obtain the absolute difference of the two feature values in the pair as output by the fully-connected layer,
where the fully-connected layer is configured to compute, for each feature value pair, the absolute difference between the two feature values in the pair and to output that absolute difference.
By adopting the embodiment, the calculation of the absolute value of the difference value can be realized by utilizing the full connection layer, on one hand, the structure of the full connection layer is simpler, and the realization difficulty of the optimization sampling method provided by the disclosure can be reduced. On the other hand, the framework of the full connection layer is suitable for parallel computing and GPU computing, so that the efficiency of determining the absolute value of the difference value can be further improved by means of parallel computing, GPU computing and the like, namely the efficiency of the optimal sampling method provided by the disclosure is further improved.
The full connection layer may be an independent full connection layer, or may be a full connection layer in a preset feedforward neural network, where the preset feedforward neural network includes an input layer, a full connection layer, and an output layer.
For the situation that the full connection layer is in the preset feedforward neural network, the input of the preset feedforward neural network is all object features, and the output can be the absolute value of the difference between two feature values in each feature value pair, or the similarity between every two objects to be grouped.
In order to more clearly explain the optimal sampling method provided by the present disclosure, the structure of the feedforward neural network will be explained below by taking the output of the preset feedforward neural network as the similarity between every two objects to be grouped as an example:
referring to fig. 2, fig. 2 is a schematic structural diagram of a feedforward neural network provided in an embodiment of the present invention, in this example, the feedforward neural network includes an input layer, a fully-connected layer, and an output layer. And, the full connection layer includes a plurality of difference solving units.
The input layer is used for inputting the characteristic value of the same characteristic dimension in every two object characteristics as a characteristic value pair to the difference solving unit corresponding to the characteristic value pair, wherein different characteristic value pairs correspond to different difference solving units.
For example, suppose there are three objects to be grouped together, which are respectively denoted as objects to be grouped 1-3, where the object features of the object to be grouped 1 are { x11, x12, x13}, x11 is the feature value of the object to be grouped 1 in feature dimension 1, x12 is the feature value of the object to be grouped 1 in feature dimension 2, and x13 is the feature value of the object to be grouped 1 in feature dimension 3. The object characteristics of the object to be grouped 2 are { x21, x22, x23}, and the object characteristics of the object to be grouped 3 are { x31, x32, x33 }.
Then in this embodiment there are a total of 9 feature value pairs: {x11, x21}, {x11, x31}, {x21, x31}, {x12, x22}, {x12, x32}, {x22, x32}, {x13, x23}, {x13, x33}, {x23, x33}. The input layer therefore feeds these 9 feature value pairs to 9 different difference solving units.
The difference solving unit is used for calculating the difference absolute value of two characteristic values in the characteristic value pairs input to the difference solving unit and inputting the difference absolute value to the output layer. For example, assuming that { x11, x21} is input to the difference solving unit 1, the difference solving unit 1 calculates the absolute value of the difference between x11 and x21 and inputs the calculated absolute value of the difference to the output layer.
By adopting the embodiment, the full connection layer is unitized, so that the full connection layer can be designed conveniently according to a specific application scene, and the adaptability of the optimizing sampling method is effectively improved.
The input layer may be one layer or a plurality of layers, and the input of each neuron in the first input layer should be one feature value in one object feature, and the feature values input to different neurons should be different. The last input layer is connected to the fully connected layer, so that each neuron in the fully connected layer is connected to all neurons in the last input layer.
And the difference solving unit belongs to a full connection layer, so each neuron in the last layer of input layer is connected with each difference solving unit. Therefore, by reasonably setting the weight of the feedforward neural network, the characteristic value of the same characteristic dimension in every two object characteristics can be used as a characteristic value pair by the output layer and input to the difference solving unit corresponding to the characteristic value pair.
For example, if the input of each neuron in the last input layer is one feature value, the value passed from the i-th input-layer neuron to the j-th difference solving unit is α_ij · x_i, where x_i is the feature value input to the i-th input-layer neuron and α_ij is the weight between the i-th input-layer neuron and the j-th difference solving unit. If the feature value pair corresponding to the j-th difference solving unit includes x_i, then α_ij = 1; otherwise α_ij = 0. For example, if x_i = x12 and the feature value pair corresponding to the j-th difference solving unit is {x12, x32}, then α_ij = 1; if x_i = x12 and the pair is {x22, x32}, then α_ij = 0. It can be understood that when α_ij = 1 the contribution of the i-th neuron is x_i, so x_i is effectively input to the j-th difference solving unit, whereas when α_ij = 0 the contribution is 0, so x_i is not input to the j-th difference solving unit.
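Under the weight scheme just described, the routing and the differencing can be folded into a single signed 0/1 weight matrix. The sketch below is our illustrative reading of the Fig. 3 style embodiment; the flattened input layout (index = object * n_dims + dimension) and the function names are assumptions.

```python
import numpy as np
from itertools import combinations

def build_difference_layer(n_objects, n_dims):
    """First fully-connected layer: two difference neurons per
    (object pair, dimension), computing (x_a - x_b) and (x_b - x_a)."""
    pairs = list(combinations(range(n_objects), 2))
    w = np.zeros((n_objects * n_dims, 2 * len(pairs) * n_dims))
    units = ((p, d) for p in pairs for d in range(n_dims))
    for u, ((a, b), d) in enumerate(units):
        w[a * n_dims + d, 2 * u] = 1.0       # first difference neuron: x_a - x_b
        w[b * n_dims + d, 2 * u] = -1.0
        w[a * n_dims + d, 2 * u + 1] = -1.0  # second difference neuron: x_b - x_a
        w[b * n_dims + d, 2 * u + 1] = 1.0
    return w, pairs

def abs_differences(features, w):
    """Apply the layer, the thresholding activation, and the summing neurons."""
    h = np.maximum(features.reshape(1, -1) @ w, 0.0)  # keep only positive differences
    return h.reshape(-1, 2).sum(axis=1)               # sum the two neurons per unit
```

With 3 objects of 3 dimensions this produces exactly the 9 difference solving units of the example above, one absolute difference per (pair, dimension).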
For the case that the fully-connected layer is an independent fully-connected layer, the fully-connected layer may also include a plurality of difference solving units, and the principle of the difference solving unit in this case is completely the same as that of the fully-connected layer in the preset feedforward neural network, so the difference solving unit in this case may refer to the foregoing related description, and is not described herein again.
The structure of the difference solving unit may differ across application scenarios. For example, in one possible embodiment, the difference solving unit consists of a single neuron whose inputs are the two feature values in the unit's corresponding feature value pair. The neuron computes the difference Δ between the two feature values, determines the larger of Δ and -Δ, and outputs that larger value as the absolute difference of the two feature values. For example, if the feature values input to the neuron are 1 and 2, the neuron computes 1 - 2 = -1, compares -1 and 1 to obtain the larger value 1, and outputs 1 as the absolute difference between 1 and 2.
In another possible embodiment, as shown in fig. 3, the difference solving unit includes: a first difference neuron, a second difference neuron, and a summing neuron. In this embodiment, the fully-connected layer is two layers, and the first difference neuron and the second difference neuron in all the difference solving units constitute the first-layer fully-connected layer, and the summing neuron in all the difference solving units constitutes the second-layer fully-connected layer.
In this example, the step of inputting the feature value pair to a preset full-connected layer to obtain an absolute value of a difference between two feature values in the feature value pair output by the full-connected layer is implemented by:
and S1021, inputting the characteristic value pair into a first difference neuron and a second difference neuron in a difference solving unit corresponding to the characteristic value pair respectively to obtain a first output value output by the first difference neuron and a second output value output by the second difference neuron.
And S1022, inputting the first output value and the second output value to a summing neuron in the difference solving unit corresponding to the characteristic value pair to obtain an output value output by the summing neuron, wherein the output value is used as an absolute value of the difference between the two characteristic values in the characteristic value pair.
In each difference solving unit, the inputs of the first difference neuron and the second difference neuron are: and inputting two eigenvalues in the eigenvalue pair of the difference solving unit.
The first difference neuron subtracts the second feature value from the first to obtain a first difference, judges whether the first difference is greater than 0, outputs the first difference if so, and outputs 0 otherwise.
The second difference neuron subtracts the first feature value from the second to obtain a second difference, judges whether the second difference is greater than 0, outputs the second difference if so, and outputs 0 otherwise.
And the summing neuron is used for calculating a summing result of the first output value and the second output value and outputting the summing result.
Illustratively, assuming that the two eigenvalues are 1 and 2, respectively, the first difference neuron calculates 1-2, resulting in-1 being the first difference, and since-1 is not greater than 0, 0 is output, i.e., the first output value is 0. The second difference neuron calculates 2-1, which results in a second difference of 1, and outputs 1 because 1 is greater than 0, i.e., the second output value is 1. The summing neuron calculates 0+1, obtains a summing result as 1, and outputs the summing result, so that the absolute value of the difference between the two characteristic values is 1.
It can be understood that outputting different results according to a comparison with a preset threshold is an operation an activation function can implement (referred to here as an activation function operation). With this embodiment, the absolute difference can therefore be computed using only additions (subtraction being a special case of addition) and activation function operations. This makes full use of the fact that a fully-connected layer implements additions and activation functions efficiently, improving the efficiency of the difference solving unit and, in turn, the efficiency of the optimizing sampling.
The similarity calculated in S103 may be represented in different forms, such as an array, a matrix, and the like, in different application scenarios. For convenience of description, the following matrix is taken as an example:
in a possible embodiment, the foregoing S103 is implemented by:
and generating a similarity matrix according to the absolute value of the difference, wherein the similarity matrix is an N-by-N dimensional matrix, and N is the number of the objects to be grouped. The element of the ith row and the jth column in the similarity matrix is the similarity between the ith object to be grouped and the jth object to be grouped, and i and j are positive integers with the value range of [1, N ]. For example, assuming that there are a total of 4 objects to be grouped, which are respectively denoted as S1-4, the similarity matrix is:
<S1,S1> <S1,S2> <S1,S3> <S1,S4>
<S2,S1> <S2,S2> <S2,S3> <S2,S4>
<S3,S1> <S3,S2> <S3,S3> <S3,S4>
<S4,S1> <S4,S2> <S4,S3> <S4,S4>
wherein, < S1, S1> is the similarity between S1 and S1, < S1, S2> is the similarity between S1 and S2, and so on.
With this embodiment, the similarity between every two objects to be grouped is represented in matrix form, which facilitates subsequent batch processing of the similarities using matrix operations.
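A similarity matrix of the kind described can be assembled in one batched operation. The sketch below assumes, for illustration only, a similarity defined as the negative sum of absolute differences, under which the matrix is symmetric and its diagonal is 0, the maximal similarity under that convention.

```python
import numpy as np

def similarity_matrix(features):
    """N x N matrix whose (i, j) element is the similarity between
    objects i and j; features is an (N, M) array."""
    diffs = np.abs(features[:, None, :] - features[None, :, :])  # (N, N, M)
    return -diffs.sum(axis=2)  # assumed similarity: negative summed abs. difference
```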
It can be understood that the element in the ith row and the jth column in the similarity matrix and the element in the jth row and the ith column are the similarity between the ith object to be grouped and the jth object to be grouped. Therefore, theoretically, any side of the diagonal line in the similarity matrix comprises the similarity between every two objects to be grouped, and the optimal object pair can be determined by using the similarity of any side of the diagonal line in the similarity matrix.
The diagonal of the similarity matrix here refers to the diagonal whose end points are the element in row 1, column 1 and the element in row N, column N. For example, still taking the above similarity matrix as an example, <S1, S2>, <S1, S3>, <S1, S4>, <S2, S3>, <S2, S4>, <S3, S4> are the similarities on one side of the diagonal, and <S2, S1>, <S3, S1>, <S3, S2>, <S4, S1>, <S4, S2>, <S4, S3> are the similarities on the other side.
The optimizing object pairs are determined from the similarities on either side of the diagonal of the similarity matrix as follows. First, all the similarities on one side of the diagonal are sorted from high to low to obtain a similarity sequence. Then, each similarity in the sequence is considered in order from front to back: if neither of the two objects to be grouped corresponding to that similarity belongs to any optimizing object pair yet, those two objects are taken as a new pair of optimizing objects. This continues until a preset termination condition is reached. The preset termination condition includes, but is not limited to, any of the following: the number of determined optimizing object pairs reaches a preset number threshold, the number of iterations reaches a preset count threshold, and the like.
Illustratively, still taking the above similarity matrix as an example, assume that the termination condition is that the number of determined optimizing object pairs reaches 2, and that the similarities on one side of the diagonal, sorted from high to low, are: <S1, S2>, <S1, S3>, <S1, S4>, <S2, S3>, <S2, S4>, <S3, S4>. First, for <S1, S2>: since no optimizing object pair has been determined yet, neither S1 nor S2 belongs to any optimizing object pair, so <S1, S2> is taken as a pair of optimizing objects. Next, <S1, S3> and <S1, S4> are skipped because S1 already belongs to an optimizing object pair, and <S2, S3> and <S2, S4> are skipped because S2 already belongs to an optimizing object pair. Finally, regarding <S3, S4>: since neither S3 nor S4 belongs to any optimizing object pair, <S3, S4> is taken as a pair of optimizing objects. At this point two pairs have been determined and the termination condition is reached.
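The greedy selection procedure above can be sketched as follows (a sketch under the assumption, stated later in this document, that a lower similarity value means a more similar pair; the function name and the `max_pairs` parameter, which plays the role of the preset termination condition, are illustrative):

```python
import numpy as np

def select_pairs(similarity: np.ndarray, max_pairs: int):
    """Greedily pick disjoint object pairs from one side of the diagonal."""
    n = similarity.shape[0]
    i_idx, j_idx = np.triu_indices(n, k=1)        # one side of the diagonal
    order = np.argsort(similarity[i_idx, j_idx])  # ascending: most similar first
    used, pairs = set(), []
    for t in order:
        i, j = int(i_idx[t]), int(j_idx[t])
        if i in used or j in used:                # object already in a pair: skip
            continue
        pairs.append((i, j))
        used.update((i, j))
        if len(pairs) == max_pairs:               # preset termination condition
            break
    return pairs

sim = np.array([
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 4.0, 7.0],
    [5.0, 4.0, 0.0, 2.0],
    [6.0, 7.0, 2.0, 0.0],
])
print(select_pairs(sim, 2))  # [(0, 1), (2, 3)]
```

Because each selected object is marked as used, no two optimizing object pairs can share an object.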
By adopting this embodiment, different optimizing object pairs never share an object, so the control group and the experimental group divided on the basis of the optimizing object pairs do not intersect. This effectively prevents the control-group and experimental-group experiments from interfering with each other and improves the reliability of the A/B experiment.
The optimized sampling method provided by the present disclosure will be described below with reference to a specific example, in which there are N objects to be grouped and the object feature of each object to be grouped is an M-dimensional feature vector, see fig. 4.
Firstly, the object features of the objects to be grouped are spliced into an N × M feature matrix, where the element in the ith row and jth column of the feature matrix is the jth feature value (hereinafter denoted xij) of the object feature of the ith object to be grouped. The feature matrix is then input into a preset feedforward neural network.
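The splicing step can be sketched as follows (the feature values are illustrative):

```python
import numpy as np

# Hypothetical M = 3 dimensional feature vectors for N = 2 objects to be grouped.
object_features = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]

# Row i of the feature matrix holds the object feature of the ith object,
# so element (i, j) is xij, the jth feature value of the ith object.
feature_matrix = np.stack(object_features)

assert feature_matrix.shape == (2, 3)  # N-by-M
assert feature_matrix[1, 2] == 6.0     # x_{1,2} in zero-based indexing
```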
In this example, the fully-connected layer is located in a preset feedforward neural network, and the output of the preset feedforward neural network is the similarity between every two objects to be grouped. The preset feedforward neural network has two input layers. The first input layer comprises N × M neurons; the input of each neuron is one feature value in the feature matrix, and different neurons receive different feature values. For convenience of description, the neuron in the first input layer that receives the feature value xij is denoted neuron ij.
Each neuron in the first input layer corresponds to two neurons in the second input layer, which are referred to as positive neurons and negative neurons, respectively, for convenience of description. For convenience of description, a positive neuron corresponding to the neuron ij is referred to as a positive neuron ij, and a negative neuron corresponding to the neuron ij is referred to as a negative neuron ij. Neuron ij is used to input xij to positive neuron ij and-xij to negative neuron ij.
In this example, the fully-connected layer includes N × (N − 1) × M / 2 difference solving units, one per feature value pair. For convenience of description, the difference solving unit corresponding to the feature value pair {xij, xkj} is denoted difference solving unit ikj, where k is a positive integer in the range [1, N] and k ≠ i.
For the difference solving unit ikj, the input of its first difference neuron is xij output by the positive neuron ij and −xkj output by the negative neuron kj. The input of its second difference neuron is xkj output by the positive neuron kj and −xij output by the negative neuron ij.
The first difference neuron sums its inputs, i.e. computes xij − xkj. If xij − xkj is greater than 0, the first difference neuron passes xij − xkj to the summing neuron; otherwise it passes 0.
Similarly, the second difference neuron sums its inputs, i.e. computes xkj − xij. If xkj − xij is greater than 0, the second difference neuron passes xkj − xij to the summing neuron; otherwise it passes 0.
The summing neuron sums its inputs and outputs the result to the output layer. It can be understood that if xij > xkj, the inputs to the summing neuron are xij − xkj and 0, so the summation result is xij − xkj, i.e. the absolute value of the difference between xij and xkj. Similarly, if xij ≤ xkj, the inputs to the summing neuron are 0 and xkj − xij, so the summation result is xkj − xij, again the absolute value of the difference between xij and xkj. In this example, the difference solving unit therefore computes exactly the absolute value of the difference between the two feature values in a feature value pair.
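The behavior of one difference solving unit, i.e. the identity |a − b| = ReLU(a − b) + ReLU(b − a), can be sketched as follows (the function names are illustrative, not from the source):

```python
def relu(x: float) -> float:
    """Rectified linear activation, as used by the two difference neurons."""
    return x if x > 0 else 0.0

def difference_unit(x_ij: float, x_kj: float) -> float:
    # The positive neuron passes x and the negative neuron passes -x, so the
    # first difference neuron sees x_ij + (-x_kj) and the second x_kj + (-x_ij).
    first = relu(x_ij - x_kj)   # non-zero only when x_ij > x_kj
    second = relu(x_kj - x_ij)  # non-zero only when x_kj > x_ij
    return first + second       # summing neuron: exactly |x_ij - x_kj|

assert difference_unit(3.0, 1.5) == 1.5
assert difference_unit(1.5, 3.0) == 1.5
assert difference_unit(2.0, 2.0) == 0.0
```

Since at most one of the two ReLU terms is non-zero, the sum always equals the absolute difference.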
In this example, the input of the output layer is the absolute value of the difference output by each difference solving unit, and the output layer is used for determining the similarity matrix according to the absolute value of the difference. In this example, the similarity matrix is a matrix of dimensions N × N, and for the similarity matrix, reference may be made to the foregoing related description, which is not repeated herein.
All the similarities on one side of the diagonal in the similarity matrix are extracted and sorted to obtain a similarity sequence, as shown in fig. 4. In this example, the similarity value is negatively correlated with actual similarity (a smaller value means more similar objects), so the similarities are sorted in ascending order of value.
The optimizing object pairs are then determined based on the similarity sequence, and one object to be grouped in each optimizing pair is divided into the experimental group while the other is divided into the control group. For how to determine the optimizing object pairs based on the similarity sequence, reference may be made to the foregoing related description, which is not repeated here.
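The final division of the selected pairs between the two groups can be sketched as follows (which member of a pair goes to which group is an arbitrary choice here, not specified by the source):

```python
def split_groups(pairs):
    # One member of each optimizing pair goes to the experimental group,
    # the other to the control group, so the two groups never intersect.
    experimental = [a for a, _ in pairs]
    control = [b for _, b in pairs]
    return experimental, control

exp, ctl = split_groups([(0, 1), (2, 3)])
assert exp == [0, 2]
assert ctl == [1, 3]
assert not set(exp) & set(ctl)  # disjoint groups: no mutual interference
```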
Corresponding to the foregoing optimized sampling method, the present disclosure also provides an optimized sampling apparatus, as shown in fig. 5, including:
a feature obtaining module 501, configured to obtain object features of multiple objects to be grouped;
a difference solving module 502, configured to perform weighted summation on all the object features for a feature value pair composed of feature values of the same feature dimension in every two object features, to obtain an absolute difference value between two feature values in the feature value pair;
a similarity solving module 503, configured to calculate a similarity between each two objects to be grouped according to the absolute difference value;
and the object pair screening module 504 is configured to select at least one pair of the objects to be grouped, of which the similarity is greater than a preset similarity threshold, as an optimization object pair.
In a possible embodiment, the difference solving module 502 performs weighted summation on all the object features for a feature value pair composed of feature values of the same feature dimension in every two object features, so as to obtain an absolute value of a difference between two feature values in the feature value pair, including:
and inputting the characteristic value pair to a preset full-connection layer aiming at the characteristic value pair consisting of the characteristic values of the same characteristic dimension in every two object characteristics to obtain the absolute value of the difference value of the two characteristic values in the characteristic value pair output by the full-connection layer.
And the full connection layer is used for calculating the difference absolute value of two characteristic values in the characteristic value pairs aiming at each characteristic value pair and outputting the difference absolute value.
In one possible embodiment, the fully-connected layer includes a plurality of difference solution units;
the difference solving module 502 inputs the feature value pairs to a preset full-link layer to obtain the absolute value of the difference between two feature values in the feature value pairs output by the full-link layer, and includes:
inputting the feature value pairs into the difference solving unit corresponding to the feature value pairs to obtain the absolute value of the difference between two feature values in the feature value pairs output by the difference solving unit;
the different feature value pairs correspond to different difference solving units, and each difference solving unit is used for calculating a difference absolute value of two feature values in the feature value pairs input to the difference solving unit and outputting the difference absolute value.
In a possible embodiment, the difference solution unit comprises a first difference neuron, a second difference neuron, and a summing neuron;
the difference solving module 502 inputs the feature value pairs to the difference solving unit corresponding to the feature value pairs to obtain the absolute value of the difference between two feature values in the feature value pairs output by the difference solving unit, and the method includes:
inputting the feature value pairs into a first difference neuron and a second difference neuron in a difference solving unit corresponding to the feature value pairs respectively to obtain a first output value output by the first difference neuron and a second output value output by the second difference neuron;
inputting the first output value and the second output value into a summing neuron in a difference solving unit corresponding to the feature value pair to obtain an output value output by the summing neuron, wherein the output value is used as a difference absolute value of two feature values in the feature value pair;
the first difference neuron is configured to subtract one eigenvalue from another eigenvalue in the eigenvalue pair input to the difference solving unit to obtain a first difference; if the first difference is larger than 0, outputting the first difference, and if the first difference is not larger than 0, outputting 0;
the second difference neuron is used for subtracting the characteristic value from the other characteristic value to obtain a second difference; if the second difference is larger than 0, outputting the second difference, and if the second difference is not larger than 0, outputting 0;
and the summing neuron is used for calculating a summing result of the first output value and the second output value and outputting the summing result.
In a possible embodiment, the similarity solving module 503 calculates the similarity between each two objects to be grouped according to the absolute difference value, including:
and generating a similarity matrix according to the difference absolute value, wherein the element of the ith row and the jth column of the similarity matrix is the similarity between the ith object to be grouped and the jth object to be grouped, i and j are positive integers with the value range of [1, N ], and N is the number of the objects to be grouped.
In a possible embodiment, the object pair screening module 504 selects at least one pair of the objects to be grouped that meet a preset screening condition as an optimized object pair according to the similarity, including:
sequencing all the similarities on any side of the diagonal in the similarity matrix according to the sequence from high to low to obtain a similarity sequence;
according to the sequence from front to back, sequentially aiming at each similarity in the similarity sequence, if each object to be grouped corresponding to the similarity does not belong to any optimizing object pair, taking two objects to be grouped corresponding to the similarity as an optimizing object pair until a preset termination condition is reached.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
It should be noted that, in some application scenarios, the object to be grouped in the embodiment of the present disclosure may be a human head model, and the human head model in the embodiment is not a human head model for a certain specific user, and cannot reflect personal information of the certain specific user.
It should be noted that in other application scenarios, the object to be grouped in the embodiment of the present disclosure may be a two-dimensional face image, and the two-dimensional face image in the embodiment is derived from a public data set.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Claims (15)
1. A method of optimizing samples, comprising:
acquiring respective object characteristics of a plurality of objects to be grouped;
for each two characteristic value pairs consisting of characteristic values of the same characteristic dimension in the object characteristics, performing weighted summation on all the object characteristics to obtain a difference absolute value of the two characteristic values in the characteristic value pairs;
calculating the similarity between every two objects to be grouped according to the absolute difference value;
and selecting at least one pair of the objects to be grouped with the similarity larger than a preset similarity threshold value as an optimizing object pair.
2. The method according to claim 1, wherein the weighted summation is performed on all the object features for each pair of feature values consisting of feature values of the same feature dimension in each two object features, so as to obtain an absolute value of a difference between the two feature values in the pair of feature values, and the method comprises:
inputting the characteristic value pair to a preset full-connection layer aiming at a characteristic value pair consisting of characteristic values of the same characteristic dimension in every two object characteristics to obtain a difference absolute value of the two characteristic values in the characteristic value pair output by the full-connection layer;
the full connection layer is used for calculating the absolute value of the difference between two characteristic values in each characteristic value pair and outputting the absolute value of the difference.
3. The method of claim 2, wherein the fully-connected layer comprises a plurality of difference solution units;
inputting the feature value pair into a preset full-connection layer to obtain the absolute value of the difference value between two feature values in the feature value pair output by the full-connection layer, wherein the method comprises the following steps:
inputting the characteristic value pair to the difference solving unit corresponding to the characteristic value pair to obtain the absolute value of the difference between two characteristic values in the characteristic value pair output by the difference solving unit;
the different feature value pairs correspond to different difference solving units, and each difference solving unit is used for calculating a difference absolute value of two feature values in the feature value pairs input to the difference solving unit and outputting the difference absolute value.
4. The method of claim 3, wherein the difference solution unit comprises a first difference neuron, a second difference neuron, and a summing neuron;
the inputting the feature value pair into the difference solving unit corresponding to the feature value pair to obtain the absolute value of the difference between two feature values in the feature value pair output by the difference solving unit includes:
inputting the feature value pairs into a first difference neuron and a second difference neuron in a difference solving unit corresponding to the feature value pairs respectively to obtain a first output value output by the first difference neuron and a second output value output by the second difference neuron;
inputting the first output value and the second output value to a summing neuron in a difference solving unit corresponding to the characteristic value pair to obtain an output value output by the summing neuron as a difference absolute value of two characteristic values in the characteristic value pair;
the first difference neuron is configured to subtract one eigenvalue of the eigenvalue pair input to the difference solving unit from another eigenvalue of the eigenvalue pair to obtain a first difference; if the first difference is larger than 0, outputting the first difference, and if the first difference is not larger than 0, outputting 0;
the second difference neuron is used for subtracting the characteristic value from the other characteristic value to obtain a second difference; if the second difference is larger than 0, outputting the second difference, and if the second difference is not larger than 0, outputting 0;
and the summing neuron is used for calculating a summing result of the first output value and the second output value and outputting the summing result.
5. The method according to claim 1, wherein the calculating the similarity between every two objects to be grouped according to the absolute difference value comprises:
and generating a similarity matrix according to the difference absolute value, wherein the element of the ith row and the jth column of the similarity matrix is the similarity between the ith object to be grouped and the jth object to be grouped, i and j are positive integers with the value range of [1, N ], and N is the number of the objects to be grouped.
6. The method according to claim 5, wherein selecting at least one pair of the objects to be grouped meeting a preset screening condition as an optimal object pair according to the similarity comprises:
sequencing all the similarities on any side of the diagonal in the similarity matrix according to the sequence from high to low to obtain a similarity sequence;
and sequentially aiming at each similarity in the similarity sequence according to the sequence from front to back, and if each object to be grouped corresponding to the similarity does not belong to any optimizing object pair, taking two objects to be grouped corresponding to the similarity as a pair of optimizing object pairs until a preset termination condition is reached.
7. An optimized sampling apparatus comprising:
the characteristic acquisition module is used for acquiring the object characteristics of a plurality of objects to be grouped;
the difference solving module is used for weighting and summing all the object features aiming at a feature value pair consisting of feature values of the same feature dimension in every two object features to obtain the absolute value of the difference between the two feature values in the feature value pair;
the similarity solving module is used for calculating the similarity between every two objects to be grouped according to the absolute value of the difference;
and the object pair screening module is used for selecting at least one pair of the objects to be grouped with the similarity larger than a preset similarity threshold as an optimizing object pair.
8. The apparatus according to claim 7, wherein the difference solving module performs weighted summation on all the object features for each two feature value pairs composed of feature values of the same feature dimension in the object features to obtain an absolute difference value between the two feature values in the feature value pairs, and includes:
inputting the feature value pairs into a preset full-connection layer aiming at feature value pairs consisting of feature values of the same feature dimension in every two object features to obtain a difference absolute value of two feature values in the feature value pairs output by the full-connection layer;
and the full connection layer is used for calculating the difference absolute value of two characteristic values in the characteristic value pairs aiming at each characteristic value pair and outputting the difference absolute value.
9. The apparatus of claim 8, wherein the fully-connected layer comprises a plurality of difference solution units;
the difference solving module inputs the feature value pairs to a preset full-connection layer to obtain the absolute value of the difference between two feature values in the feature value pairs output by the full-connection layer, and the difference solving module comprises:
inputting the feature value pairs into the difference solving unit corresponding to the feature value pairs to obtain the absolute value of the difference between two feature values in the feature value pairs output by the difference solving unit;
the different feature value pairs correspond to different difference solving units, and each difference solving unit is used for calculating a difference absolute value of two feature values in the feature value pairs input to the difference solving unit and outputting the difference absolute value.
10. The apparatus of claim 9, wherein the difference solution unit comprises a first difference neuron, a second difference neuron, and a summing neuron;
the difference solving module inputs the feature value pairs to the difference solving unit corresponding to the feature value pairs to obtain the absolute value of the difference between two feature values in the feature value pairs output by the difference solving unit, and the method comprises the following steps:
inputting the feature value pairs into a first difference neuron and a second difference neuron in a difference solving unit corresponding to the feature value pairs respectively to obtain a first output value output by the first difference neuron and a second output value output by the second difference neuron;
inputting the first output value and the second output value into a summing neuron in a difference solving unit corresponding to the feature value pair to obtain an output value output by the summing neuron, wherein the output value is used as a difference absolute value of two feature values in the feature value pair;
the first difference neuron is configured to subtract one eigenvalue from another eigenvalue in the eigenvalue pair input to the difference solving unit to obtain a first difference; if the first difference is larger than 0, outputting the first difference, and if the first difference is not larger than 0, outputting 0;
the second difference neuron is used for subtracting the characteristic value from the other characteristic value to obtain a second difference; if the second difference is larger than 0, outputting the second difference, and if the second difference is not larger than 0, outputting 0;
and the summing neuron is used for calculating a summing result of the first output value and the second output value and outputting the summing result.
11. The apparatus according to claim 7, wherein the similarity solving module calculates the similarity between each two objects to be grouped according to the absolute difference value, and includes:
and generating a similarity matrix according to the difference absolute value, wherein the element of the ith row and the jth column of the similarity matrix is the similarity between the ith object to be grouped and the jth object to be grouped, i and j are positive integers with the value range of [1, N ], and N is the number of the objects to be grouped.
12. The apparatus of claim 11, wherein the object pair selection module selecting at least one pair of objects to be grouped that satisfies a preset selection condition as an optimizing object pair according to the similarities includes:
sorting all the similarities on either side of the diagonal of the similarity matrix in descending order to obtain a similarity sequence;
and traversing the similarity sequence from front to back; for each similarity, if neither of the two objects to be grouped corresponding to that similarity already belongs to an optimizing object pair, taking the two objects as an optimizing object pair, until a preset termination condition is reached.
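The selection procedure of claim 12 is a greedy matching over one triangle of the symmetric similarity matrix. A sketch in which `max_pairs` stands in for the unspecified "preset termination condition":

```python
import numpy as np

def select_pairs(sim: np.ndarray, max_pairs: int) -> list:
    """Greedy pairing: sort the upper-triangle similarities in descending
    order, then accept a pair only if neither object is already paired.
    `max_pairs` is a hypothetical stand-in for the termination condition."""
    n = sim.shape[0]
    i_idx, j_idx = np.triu_indices(n, k=1)   # one side of the diagonal
    order = np.argsort(-sim[i_idx, j_idx])   # indices sorted high to low
    used, pairs = set(), []
    for k in order:
        i, j = int(i_idx[k]), int(j_idx[k])
        if i in used or j in used:
            continue  # an object may belong to at most one optimizing pair
        pairs.append((i, j))
        used.update((i, j))
        if len(pairs) >= max_pairs:          # preset termination condition
            break
    return pairs
```

Since each accepted pair removes both of its objects from further consideration, the procedure yields a maximal greedy matching rather than a globally optimal one; the claim does not require global optimality.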
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210349040.1A CN114781496B (en) | 2022-04-01 | 2022-04-01 | Optimizing sampling method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114781496A true CN114781496A (en) | 2022-07-22 |
CN114781496B CN114781496B (en) | 2023-11-07 |
Family
ID=82428200
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210349040.1A Active CN114781496B (en) | 2022-04-01 | 2022-04-01 | Optimizing sampling method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114781496B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109522921A (en) * | 2018-09-18 | 2019-03-26 | 义语智能科技(上海)有限公司 | Statement similarity method of discrimination and equipment |
CN111309994A (en) * | 2020-01-22 | 2020-06-19 | 北京三快在线科技有限公司 | User matching method and device, electronic equipment and readable storage medium |
CN111353052A (en) * | 2020-02-17 | 2020-06-30 | 北京达佳互联信息技术有限公司 | Multimedia object recommendation method and device, electronic equipment and storage medium |
CN111428133A (en) * | 2020-03-19 | 2020-07-17 | 腾讯科技(北京)有限公司 | Artificial intelligence based recommendation method and device, electronic equipment and storage medium |
CN112687257A (en) * | 2021-03-11 | 2021-04-20 | 北京新唐思创教育科技有限公司 | Sentence similarity judging method and device, electronic equipment and readable storage medium |
CN112907128A (en) * | 2021-03-23 | 2021-06-04 | 百度在线网络技术(北京)有限公司 | Data analysis method, device, equipment and medium based on AB test result |
CN113313192A (en) * | 2021-06-15 | 2021-08-27 | 成都恒创新星科技有限公司 | Image similarity judgment method and system |
WO2021223165A1 (en) * | 2020-05-07 | 2021-11-11 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for object evaluation |
CN113742600A (en) * | 2021-11-05 | 2021-12-03 | 北京达佳互联信息技术有限公司 | Resource recommendation method and device, computer equipment and medium |
Non-Patent Citations (2)
Title |
---|
WENTAO OUYANG et al.: "Deep Spatio-Temporal Neural Networks for Click-Through Rate Prediction", arXiv, pages 1 - 9 *
SHI Nianyun et al.: "Recommendation Algorithm Based on Comprehensive Similarity and Social Tags", Computer Systems & Applications, vol. 26, no. 10, pages 178 - 183 *
Also Published As
Publication number | Publication date |
---|---|
CN114781496B (en) | 2023-11-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2024000852A1 (en) | Data processing method and apparatus, device, and storage medium | |
US20190171935A1 (en) | Robust gradient weight compression schemes for deep learning applications | |
CN110472268B (en) | Bridge monitoring data modal identification method and device | |
CN112949767A (en) | Sample image increment, image detection model training and image detection method | |
CN113808044B (en) | Encryption mask determining method, device, equipment and storage medium | |
CN114781650A (en) | Data processing method, device, equipment and storage medium | |
CN113486302A (en) | Data processing method and device | |
CN115631381A (en) | Classification model training method, image classification device and electronic equipment | |
CN115690443A (en) | Feature extraction model training method, image classification method and related device | |
CN110337636A (en) | Data transfer device and device | |
CN114120454A (en) | Training method and device of living body detection model, electronic equipment and storage medium | |
CN114359993A (en) | Model training method, face recognition device, face recognition equipment, face recognition medium and product | |
CN112561061A (en) | Neural network thinning method, apparatus, device, storage medium, and program product | |
CN114781496B (en) | Optimizing sampling method and device and electronic equipment | |
CN114973333B (en) | Character interaction detection method, device, equipment and storage medium | |
CN114792097B (en) | Method and device for determining prompt vector of pre-training model and electronic equipment | |
CN114943995A (en) | Training method of face recognition model, face recognition method and device | |
CN115688742A (en) | User data analysis method and AI system based on artificial intelligence | |
CN112749707A (en) | Method, apparatus, and medium for object segmentation using neural networks | |
Moiseeva et al. | Mathematical model of parallel retrial queueing of multiple requests | |
CN112784967A (en) | Information processing method and device and electronic equipment | |
CN113255512A (en) | Method, apparatus, device and storage medium for living body identification | |
CN118260744B (en) | Method and system for detecting mining program of capsule network model based on deep learning | |
JP2020030702A (en) | Learning device, learning method, and learning program | |
CN114693950B (en) | Training method and device of image feature extraction network and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||