CN110750897A - DDS automatic discovery method based on threshold bloom filter

Info

Publication number: CN110750897A
Application number: CN201910986341.3A
Authority: CN
Inventors: 樊智勇; 腾达; 刘哲旭
Original assignee: Civil Aviation University of China
Current assignee: Civil Aviation University of China
Priority date: 2019-10-17
Filing date: 2019-10-17
Publication date: 2020-02-04
Anticipated expiration: 2039-10-17
Also published as: CN110750897B

Abstract

The invention discloses a DDS automatic discovery method based on a threshold bloom filter (TBF). The method comprises the following steps, carried out in sequence: designing the threshold bloom filter, combining the threshold bloom filter with the DDS automatic discovery mechanism, and determining the optimal threshold. ① The method stores the endpoint description information of the DDS discovery stage in a threshold bloom filter, which reduces memory consumption and network data transmission and improves the real-time performance of DDS in distributed simulation. ② The method provides a threshold optimization approach that achieves a lower false alarm rate and higher accuracy; the effect is most pronounced when the number of DDS endpoints is large and the number of TBF bits is small. ③ The method provides a new idea for improving the DDS automatic discovery process.

Description

DDS automatic discovery method based on threshold bloom filter

Technical Field

The present invention relates to a distributed simulation technology, and in particular, to a DDS (data distribution service) automatic discovery method based on a threshold bloom filter.

Background

Distributed simulation decomposes a large simulation computing task into a number of small tasks that are shared among several computers, and it is widely applied in fields such as military affairs, traffic, power systems and medical treatment. DDS is a data-centric publish/subscribe communication model specification established by the Object Management Group (OMG). Because of its efficiency and real-time performance, more and more distributed simulation systems adopt DDS to transmit data. The existing DDS standard automatic discovery method is based on the Simple Discovery Protocol and works well in small and medium-sized distributed simulation systems. However, as the simulation system grows, a large amount of data must be exchanged frequently and in real time, and the existing DDS standard automatic discovery method then incurs high memory consumption and heavy network data transmission, so it is not suitable for large-scale distributed simulation systems.

Disclosure of Invention

In order to solve the above problems, an object of the present invention is to provide a DDS automatic discovery method based on a threshold bloom filter that achieves low memory consumption, low network transmission volume and a low false alarm rate in a large distributed simulation system.

In order to achieve this object, the invention adopts the following technical scheme: a DDS automatic discovery method based on a threshold bloom filter, characterized by comprising the following steps in sequence: (1) designing a threshold bloom filter; (2) combining the threshold bloom filter with the DDS automatic discovery mechanism; (3) determining an optimal threshold.

The design steps of the threshold bloom filter are as follows: the threshold bloom filter uses a one-dimensional vector of m positions to store information, and the initial value of every position is 0. When an endpoint information element is stored, it is mapped into the threshold bloom filter vector, denoted TBF(1), through k different hash functions; the result of each hash function lies in the range [1, m]. According to the k mapping results, the values of the corresponding k positions in the vector TBF(1) are changed from 0 to 1. When a plurality of endpoint information elements are stored, the mapping results are superposed. Thus, when the threshold bloom filter stores n different endpoint information elements x_i, the vector TBF(1) is obtained as the sum of the mapping results of the individual endpoint information elements, i.e.

$$\mathrm{TBF}(1)=\sum_{i=1}^{n}\mathbf{h}(x_i) \qquad (1)$$

where $\mathbf{h}(x_i)$ denotes the m-position mapping vector of x_i, with 1 at its k hashed positions and 0 elsewhere.

For the set S = {x_1, …, x_n}, each endpoint information element x_i is mapped into the vector TBF(1) through k different hash functions. When querying whether an endpoint information element x_i belongs to the set S, the decision is made by setting a binarization threshold θ and a decision threshold T, where 0 ≤ θ ≤ k and 0 ≤ T ≤ k. First, every position in the vector TBF(1) whose value is less than or equal to θ is set to 0 and every remaining position is set to 1; the vector TBF(1) thereby becomes a new vector TBF(2). Then the decision is made using the decision threshold T: when the dot product of the mapping result of the endpoint information element x_i and the vector TBF(2) is greater than or equal to the decision threshold T, the endpoint information element x_i is judged to belong to the set S.
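The storage and query procedure described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the class name `ThresholdBloomFilter`, the use of salted SHA-256 as the k hash functions, and the 0-based positions are all choices made here, since the source does not name concrete hash functions.

```python
import hashlib


class ThresholdBloomFilter:
    """Counting-style sketch of a threshold bloom filter (TBF)."""

    def __init__(self, m: int, k: int) -> None:
        if not 0 < k <= m:
            raise ValueError("need 0 < k <= m")
        self.m = m
        self.k = k
        self.counts = [0] * m  # TBF(1): every position starts at 0

    def positions(self, element: str) -> list[int]:
        """Return k distinct 0-based positions for an element.

        Salted SHA-256 is an assumption; the source does not name the hashes.
        """
        found: list[int] = []
        salt = 0
        while len(found) < self.k:
            digest = hashlib.sha256(f"{salt}:{element}".encode()).digest()
            pos = int.from_bytes(digest[:8], "big") % self.m
            if pos not in found:  # keep the k positions distinct
                found.append(pos)
            salt += 1
        return found

    def add(self, element: str) -> None:
        """Superpose the mapping result of one element onto TBF(1)."""
        for pos in self.positions(element):
            self.counts[pos] += 1

    def binarize(self, theta: int) -> list[int]:
        """TBF(2): positions with value <= theta become 0, the rest 1."""
        return [1 if value > theta else 0 for value in self.counts]

    def query(self, element: str, theta: int, t: int) -> bool:
        """Judge membership: dot product with TBF(2) must reach t."""
        tbf2 = self.binarize(theta)
        dot = sum(tbf2[pos] for pos in self.positions(element))
        return dot >= t
```

With θ = 0 and T = k the sketch behaves like an ordinary bloom filter: every stored element is found, and raising θ or lowering T trades recall against false alarms.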

The threshold bloom filter is combined with the DDS automatic discovery mechanism as follows:

Let data be sent from node A to node B, with nodes A and B defined as the local participant and the remote participant, respectively. In the participant discovery phase, the description information of the local participant's endpoints, namely its data writers and data readers, is stored in the vector TBF(2) and sent to the other remote participants within the local participant data packet; the description information of an endpoint is a key unique to each local participant, usually the topic name. In the endpoint discovery phase, when an endpoint of a remote participant subscribes to one or more topics, the remote participant first queries whether the subscribed topics exist in the vector TBF(2). If so, the remote participant sends the subscription information for the existing topics to the local participant. The local participant then sends the topic data packets relevant to the remote participant for quality of service (QoS) matching, and if the matching succeeds, the local participant establishes communication with the remote participant.
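The two discovery phases can be sketched as a short Python flow. This is a simplified model, not a DDS implementation: `LocalParticipant`, `discover` and the QoS labels are hypothetical names, a plain set stands in for the vector TBF(2), and QoS matching is reduced to label equality, whereas real DDS QoS matching compares requested and offered policies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LocalParticipant:
    # topic name -> offered QoS label (hypothetical labels)
    topics: dict[str, str]

    def announce(self) -> Callable[[str], bool]:
        """Participant discovery: hand out a membership test for topic names.

        A plain set stands in for the vector TBF(2) to keep the flow short.
        """
        names = frozenset(self.topics)
        return lambda topic: topic in names

    def match(self, topic: str, requested_qos: str) -> bool:
        """Endpoint discovery: simplified QoS matching by label equality."""
        return self.topics.get(topic) == requested_qos


def discover(local: LocalParticipant, subscriptions: dict[str, str]) -> list[str]:
    """Return the topics for which communication would be established."""
    might_contain = local.announce()
    # The remote participant first filters its subscriptions locally ...
    candidates = [topic for topic in subscriptions if might_contain(topic)]
    # ... and only the surviving topics go through QoS matching.
    return [t for t in candidates if local.match(t, subscriptions[t])]
```

Because the QoS step consults the local participant's real topic table, a false alarm from the filter costs one extra exchange but does not establish a wrong connection in this sketch.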

The determination process of the optimal threshold value comprises the following steps:

Assume that the results of mapping an endpoint information element x_i through the k hash functions are all different, that is, the k hash functions map the element to k different positions. The mapping of the endpoint information element x_i is then regarded as a single realization of a hypergeometric distribution, in which the population size is w, equal to the number of bits m of the vector TBF(1), and the number of bits set to 1 is r, equal to the number of hash functions k. Therefore, in the mapping of the endpoint information element x_i, the probability p₁ that a specific position of the vector TBF(1) is set to 1 is:

$$p_1=\frac{r}{w}=\frac{k}{m} \qquad (2)$$

The value of the i-th position of the vector TBF(1), denoted I, is regarded as a discrete random variable, and the mapping result of the n endpoint information elements at the i-th position of the vector TBF(1) obeys the binomial distribution B(n, p₁). If the i-th position of the vector TBF(1) is mapped f times, the value v of that position equals the number of mappings f. Therefore, the probability P(I = v) that the i-th position of the vector TBF(1) takes the value v is:

$$P(I=v)=\binom{n}{v}\,p_1^{\,v}\,(1-p_1)^{\,n-v} \qquad (3)$$

Then the expected number L(v) of positions in the vector TBF(1) whose value is v is obtained:

$$L(v)=m\,P(I=v) \qquad (4)$$

By setting the binarization threshold θ, the vector TBF(2) is obtained from the vector TBF(1). According to formula (3), the probability P₀ that a position in the vector TBF(2) has the value 0 is:

$$P_0=\sum_{v=0}^{\theta}P(I=v) \qquad (5)$$

In the vector TBF(2), the probability P₁ that a position has the value 1 is:

$$P_1=1-P_0=\sum_{v=\theta+1}^{n}P(I=v) \qquad (6)$$
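The occupancy quantities used so far can be evaluated numerically with a few lines of Python. The sketch assumes the binomial model B(n, p₁) stated above with p₁ = k/m; the function names are chosen here for illustration only.

```python
from math import comb


def position_pmf(m: int, n: int, k: int, v: int) -> float:
    """P(I = v): a position of TBF(1) holds the value v after n insertions."""
    p1 = k / m  # probability that one element hits a given position
    return comb(n, v) * p1**v * (1.0 - p1) ** (n - v)


def expected_count(m: int, n: int, k: int, v: int) -> float:
    """L(v): expected number of positions of TBF(1) whose value is v."""
    return m * position_pmf(m, n, k, v)


def zero_probability(m: int, n: int, k: int, theta: int) -> float:
    """P0: a position of TBF(2) is 0, i.e. its TBF(1) value is <= theta."""
    top = min(theta, n)  # a position can never exceed n
    return sum(position_pmf(m, n, k, v) for v in range(top + 1))


def one_probability(m: int, n: int, k: int, theta: int) -> float:
    """P1: a position of TBF(2) is 1."""
    return 1.0 - zero_probability(m, n, k, theta)
```

For example, `one_probability(20, 4, 3, 1)` gives the expected fraction of positions that survive binarization with θ = 1 in a 20-position filter holding four elements with three hash functions.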

For an endpoint information element x_i in the set S, the dot product of its mapping result and the vector TBF(2) is d_x, and the expected value E(d_x) of the dot product is obtained from formula (4). When θ = 0, E(d_x) = k, that is, the dot product value of any endpoint information element in the set S is k. When θ > 0, every position with value v ≤ θ is set to 0, and each such position removes one hit from each of the v elements mapped to it; averaged over the n elements, the expected value of the dot product is therefore

$$E(d_x)=k-\frac{1}{n}\sum_{v=1}^{\theta}v\,L(v) \qquad (7)$$

Similarly, for an endpoint information element y that does not belong to the set S, the dot product of its mapping result and the vector TBF(2) is d_y. The expected value E(d_y) of the dot product is calculated from the number of non-zero positions in the vector TBF(2), as follows:

$$E(d_y)=k\cdot\frac{m-\sum_{v=0}^{\theta}L(v)}{m}=k\,P_1 \qquad (8)$$

Thus, the probability characteristics of the dot product values d_x and d_y can be calculated. Both are discrete random variables that follow binomial distributions: d_x ~ B(k, p_x), d_y ~ B(k, p_y). The probability that the product at a given position is 1 during the dot product, that is, the probabilities p_x and p_y, is obtained from formula (7) and formula (8):

$$p_x=\frac{E(d_x)}{k} \qquad (9)$$

$$p_y=\frac{E(d_y)}{k} \qquad (10)$$

Two parameters are introduced to describe the optimization target: the recall rate TPR and the false alarm rate FPR. The recall rate TPR represents the probability that a topic name query succeeds, that is, that an endpoint information element x_i is correctly judged to belong to the set S; its range is 0 ≤ TPR ≤ 1. Conversely, the false alarm rate FPR represents the probability that a topic name query fails, that is, that an endpoint information element y is wrongly judged to belong to the set S; its range is 0 ≤ FPR ≤ 1. Taking the decision threshold T into account, the recall rate TPR is obtained from the probability mass function of the dot product value d_x:

$$TPR=P(d_x\ge T)=\sum_{j=T}^{k}\binom{k}{j}\,p_x^{\,j}\,(1-p_x)^{\,k-j} \qquad (11)$$

Similarly, the false alarm rate FPR is obtained from the probability mass function of the dot product value d_y:

$$FPR=P(d_y\ge T)=\sum_{j=T}^{k}\binom{k}{j}\,p_y^{\,j}\,(1-p_y)^{\,k-j} \qquad (12)$$

Finally, the transmission accuracy ACC is taken as the optimization objective function:

$$ACC=\frac{TPR+(1-FPR)}{2} \qquad (13)$$

The constraint conditions of formula (13) are formula (2), formula (3), formula (9), formula (10), formula (11) and formula (12). Maximizing formula (13) is a nonlinear problem and is solved by a genetic algorithm.
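The threshold selection can be illustrated with a small Python sketch. Several points are assumptions made here rather than statements of the source: the expected dot products are modelled as E(d_x) = k − (1/n)·Σ v·L(v) over v ≤ θ and E(d_y) = k·P₁, ACC is taken as (TPR + 1 − FPR)/2 (a form consistent with the reported SDPBloom figures TPR = 1, FPR = 0.93, ACC = 0.54), and an exhaustive search over the (k + 1)² threshold pairs is swapped in for the genetic algorithm because the grid is small.

```python
from math import comb


def binom_pmf(trials: int, p: float, j: int) -> float:
    """Probability of exactly j successes in a binomial B(trials, p)."""
    return comb(trials, j) * p**j * (1.0 - p) ** (trials - j)


def upper_tail(trials: int, p: float, t: int) -> float:
    """P(X >= t) for X ~ B(trials, p)."""
    return sum(binom_pmf(trials, p, j) for j in range(t, trials + 1))


def rates(m: int, n: int, k: int, theta: int, t: int) -> tuple[float, float, float]:
    """Return (TPR, FPR, ACC) under the assumed occupancy model."""
    p1 = k / m
    top = min(theta, n)
    pmf = [binom_pmf(n, p1, v) for v in range(top + 1)]
    p_zero = sum(pmf)  # probability that a TBF(2) position is 0
    # Hits lost per stored element: a zeroed position of value v
    # removes one hit from each of the v elements mapped to it.
    lost = m * sum(v * pmf[v] for v in range(top + 1)) / n
    p_x = min(1.0, max(0.0, (k - lost) / k))
    p_y = min(1.0, max(0.0, 1.0 - p_zero))
    tpr = upper_tail(k, p_x, t)
    fpr = upper_tail(k, p_y, t)
    return tpr, fpr, (tpr + 1.0 - fpr) / 2.0


def best_thresholds(m: int, n: int, k: int) -> tuple[int, int]:
    """Exhaustive search for (theta, T) maximizing ACC (not a genetic algorithm)."""
    grid = [(theta, t) for theta in range(k + 1) for t in range(k + 1)]
    return max(grid, key=lambda pair: rates(m, n, k, pair[0], pair[1])[2])
```

A call such as `best_thresholds(500, 100, 30)` returns one (θ, T) pair; the result depends on the modelling assumptions listed above and need not coincide with the thresholds reported in the experiments.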

Compared with the prior art, the method has the following advantages:

① The method stores the endpoint description information of the DDS discovery stage in a threshold bloom filter, which reduces memory consumption and network data transmission and improves the real-time performance of DDS in distributed simulation.

② The method provides a threshold optimization approach that achieves a lower false alarm rate and higher accuracy; the effect is most pronounced when the number of DDS endpoints is large and the number of TBF bits is small.

③ the method provides a new idea for improving the DDS automatic discovery process.

Drawings

Fig. 1 is a flowchart of a DDS auto-discovery method based on a threshold bloom filter according to the present invention.

FIG. 2 is an example analysis diagram of a configuration of a threshold bloom filter;

FIG. 3 is a diagram of a process for implementing the threshold bloom filter auto-discovery method;

FIG. 4 is a flow diagram of a local participant establishing communication with a remote participant;

FIG. 5 is a schematic diagram of an example of threshold selection for a DDS auto-discovery method based on a threshold bloom filter;

FIG. 6 is a distribution diagram of the recall rate TPR for different values of the binarization threshold θ and the decision threshold T;

FIG. 7 is a distribution diagram of the false alarm rate FPR for different values of the binarization threshold θ and the decision threshold T;

FIG. 8 is a distribution diagram of the transmission accuracy ACC for different values of the binarization threshold θ and the decision threshold T;

FIG. 9 is a graph of the recall rate TPR of TBFAD and SDPBloom as the number n of stored endpoint information elements varies;

FIG. 10 is a graph of the false alarm rate FPR of TBFAD and SDPBloom as the number n of stored endpoint information elements varies;

FIG. 11 is a graph of the transmission accuracy ACC of TBFAD and SDPBloom as the number n of stored endpoint information elements varies;

FIG. 12 is a graph of the recall rate TPR of TBFAD and SDPBloom as the number of bits m of the threshold bloom filter varies;

FIG. 13 is a graph of the false alarm rate FPR of TBFAD and SDPBloom as the number of bits m of the threshold bloom filter varies;

FIG. 14 is a graph of the transmission accuracy ACC of TBFAD and SDPBloom as the number of bits m of the threshold bloom filter varies;

FIG. 15 is a graph of the recall rate TPR of TBFAD and SDPBloom as the number of hash functions k varies;

FIG. 16 is a graph of the false alarm rate FPR of TBFAD and SDPBloom as the number of hash functions k varies;

FIG. 17 is a graph of the transmission accuracy ACC of TBFAD and SDPBloom as the number of hash functions k varies.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1, the DDS auto-discovery method based on threshold bloom filter provided by the present invention includes the following steps performed in sequence:

(1) design of threshold bloom filters

When the threshold bloom filter stores n different endpoint information elements x_i, the vector TBF(1) is obtained as the sum of the mapping results of the individual endpoint information elements, i.e.

$$\mathrm{TBF}(1)=\sum_{i=1}^{n}\mathbf{h}(x_i) \qquad (1)$$

where $\mathbf{h}(x_i)$ denotes the m-position mapping vector of x_i, with 1 at its k hashed positions and 0 elsewhere.

FIG. 2 shows an example configuration of the threshold bloom filter, in which the number of bits m of the vector TBF is 20 and the number of hash functions k is 3. For the set S = {x₁, …, x₄}, each endpoint information element is mapped to 3 different positions by the 3 different hash functions.
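As a worked illustration of this configuration, and assuming the binomial occupancy model B(n, p₁) with p₁ = k/m that is used in the threshold analysis, the expected occupancy of the 20-position vector after storing four elements can be estimated as follows. The rounded figures are computed here from the model and are not read from FIG. 2.

```latex
\begin{aligned}
p_1 &= \frac{k}{m} = \frac{3}{20} = 0.15 \\
P(I=0) &= (1-p_1)^{4} = 0.85^{4} \approx 0.522 \\
P(I=1) &= \binom{4}{1}\, 0.15 \cdot 0.85^{3} \approx 0.368 \\
m\,P(I=0) &\approx 10.4, \qquad m\,P(I=1) \approx 7.4 \\
m\,\bigl(1 - P(I=0) - P(I=1)\bigr) &\approx 2.2
\end{aligned}
```

Under these assumptions about 10 positions are expected to stay empty, about 7 to hold the value 1, and only about 2 to hold a value of 2 or more, which are the positions that would survive a binarization threshold of θ = 1.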

When querying whether an endpoint information element x_i belongs to the set S, the decision is made by setting a binarization threshold θ and a decision threshold T, where 0 ≤ θ ≤ k and 0 ≤ T ≤ k. First, every position in the vector TBF(1) whose value is less than or equal to θ is set to 0 and every remaining position is set to 1; the vector TBF(1) thereby becomes a new vector TBF(2). Then the decision is made using the decision threshold T: when the dot product of the mapping result of the endpoint information element x_i and the vector TBF(2) is greater than or equal to the decision threshold T, the endpoint information element x_i is judged to belong to the set S.

(2) Combination of threshold bloom filters with DDS auto discovery mechanism

As shown in FIG. 3, a simple example of communication between two nodes is used to illustrate how the threshold bloom filter automatic discovery method is implemented. Data is sent from node A to node B, with nodes A and B defined as the local participant and the remote participant, respectively.

As shown in FIG. 4, in the participant discovery phase, the description information of the local participant's endpoints, namely its data writers and data readers, is stored in the vector TBF(2) and sent to the other remote participants within the local participant data packet; the description information of an endpoint is a key unique to each local participant, usually the topic name. In the endpoint discovery phase, when an endpoint of a remote participant subscribes to one or more topics, the remote participant first queries whether the subscribed topics exist in the vector TBF(2). If so, the remote participant sends the subscription information for the existing topics to the local participant. The local participant then sends the topic data packets relevant to the remote participant for quality of service (QoS) matching, and if the matching succeeds, the local participant establishes communication with the remote participant.

(3) Determination of optimal threshold

FIG. 5 illustrates that the selection of the binarization threshold θ and the decision threshold T is a key part of the threshold bloom filter DDS automatic discovery method. The specific automatic discovery process is described as follows:

On the basis of the vector TBF(1) shown in FIG. 2, the threshold bloom filter DDS automatic discovery method is used to query whether an endpoint information element y belongs to the set S = {x₁, …, x₄}. In this example, the vector y₁ denotes the mapping result of the endpoint information element y in the TBF.

FIGS. 5(b) and 5(c) show the cases in which the binarization threshold θ of formula (5) is 0 and 1, respectively. When θ = 0, the dot product value d of the endpoint information element y and the vector TBF(2) according to formula (8) is 3. Since k = 3, the decision threshold T lies in the range [0, k], so d ≥ T always holds, and whatever value the decision threshold T takes, the endpoint information element y is judged to belong to the set S. In fact, however, the endpoint information element y is not a member of the set S. The query result is therefore wrong, which causes the automatic discovery process to fail.

When the binarization threshold θ is 1, the dot product value d of the endpoint information element y and the vector TBF(2) is 1. If the decision threshold satisfies T ≤ 1, then d ≥ T, and the endpoint information element y is still judged to belong to the set S, which is a wrong judgment. If the decision threshold satisfies T ≥ 2, a correct judgment is produced, namely that the endpoint information element y does not belong to the set S, and the automatic discovery process finally succeeds.
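The effect of the two thresholds in this kind of example can be reproduced with a short Python sketch. The positions below are hypothetical: they are chosen here so that the query element y reaches a dot product of 3 at θ = 0 and of 1 at θ = 1, and they are not the positions drawn in FIG. 2 or FIG. 5.

```python
M = 20  # number of positions in the vector
K = 3   # number of hash functions

# Hypothetical 0-based positions of the four stored elements.
STORED = {
    "x1": [1, 5, 9],
    "x2": [1, 5, 12],
    "x3": [9, 12, 16],
    "x4": [9, 12, 18],
}
# Hypothetical positions of the query element y, which is NOT stored.
Y = [5, 16, 18]

# Build TBF(1) by superposing the mapping results.
tbf1 = [0] * M
for positions in STORED.values():
    for pos in positions:
        tbf1[pos] += 1


def dot(positions: list[int], theta: int) -> int:
    """Dot product of a mapping result with TBF(2) for a given theta."""
    tbf2 = [1 if value > theta else 0 for value in tbf1]
    return sum(tbf2[pos] for pos in positions)


def belongs(positions: list[int], theta: int, t: int) -> bool:
    """Membership decision with binarization threshold theta and decision threshold t."""
    return dot(positions, theta) >= t
```

In this hypothetical layout `belongs(Y, 0, 3)` is a false alarm, while `belongs(Y, 1, 2)` rejects y and still accepts all four stored elements. With other layouts a higher threshold can also reject stored elements, which is the recall loss that the threshold optimization has to balance.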

The above discussion shows that the selection of the thresholds is very important to the automatic discovery process. In the present invention, the optimal thresholds are obtained by the following method. The transmission accuracy ACC is taken as the optimization objective function:

$$ACC=\frac{TPR+(1-FPR)}{2} \qquad (13)$$

The constraint conditions are formula (2), formula (3), formula (9), formula (10), formula (11) and formula (12). The maximum of the transmission accuracy ACC is searched iteratively by a genetic algorithm, and the binarization threshold θ and the decision threshold T at which ACC reaches its maximum are obtained.

In order to verify the effectiveness of the DDS automatic discovery method based on the threshold bloom filter, the invention was tested experimentally as follows:

To further verify the performance of the method, four groups of comparison experiments were carried out against SDPBloom, a currently recognized improved automatic discovery method.

In the experiments, the recall rate TPR of SDPBloom is always 1, and its false alarm rate FPR_SB is:

$$FPR_{SB}=\left(1-e^{-kn/m}\right)^{k} \qquad (14)$$

The transmission accuracy ACC of SDPBloom is likewise given by formula (13).

When the number of bits m of the threshold bloom filter, the number n of stored endpoint information elements and the number of hash functions k are constant, the method and SDPBloom compare as follows:

In the experiment, Windows 10 was used as the operating system of the experiment platform, and the main hardware parameters were: CPU model Intel CPU Core i5-3210, CPU clock frequency 2.5 GHz, memory 12.0 GB. With m = 500, n = 1000 and k = 30, the threshold bloom filter DDS automatic discovery method (TBFAD) yields an optimal binarization threshold θ = 5 and an optimal decision threshold T = 20. The comparison with SDPBloom is shown in Table 1.

Table 1. Performance comparison of SDPBloom and TBFAD

With the TBFAD method, the false alarm rate FPR is reduced from 0.93 to 0.24 and the transmission accuracy ACC is increased from 0.54 to 0.81, which indicates that the data transmission error rate is greatly reduced. Although the recall rate TPR is slightly reduced, the communication performance of the overall process is significantly improved.

FIG. 6 shows the distribution of the recall rate TPR for different values of the binarization threshold θ and the decision threshold T, which verifies the effectiveness of the method in finding the optimal thresholds.

FIG. 7 shows the distribution of the false alarm rate FPR for different values of the binarization threshold θ and the decision threshold T, which verifies the effectiveness of the method in finding the optimal thresholds.

FIG. 8 shows the distribution of the transmission accuracy ACC for different values of the binarization threshold θ and the decision threshold T, which verifies the effectiveness of the method in finding the optimal thresholds.

In addition, the same experiment was carried out for other values of the number of bits m of the threshold bloom filter, the number n of stored endpoint information elements and the number of hash functions k, and similar conclusions were obtained.

When the number n of stored endpoint information elements changes, TBFAD and SDPBloom compare as follows:

As shown in FIGS. 9, 10 and 11, on the basis of the above experiment, the number of bits m of the threshold bloom filter and the number of hash functions k are kept unchanged while the number n of stored endpoint information elements is varied from 10 to 150. Analyzing the recall rate TPR, the false alarm rate FPR and the transmission accuracy ACC of TBFAD and SDPBloom leads to the following conclusions:

(1) The recall rate TPR of both methods stays at its maximum of 1 until n increases to 30. Thereafter the recall rate TPR of TBFAD decreases, but the decrease stays within 0.2 up to n = 150.

(2) The false alarm rate FPR of both methods stays at its minimum of 0 until n increases to 30. Thereafter the false alarm rate FPR of both methods increases, but the increase for TBFAD (0.28 at n = 150) is significantly smaller than for SDPBloom (1 at n = 150).

(3) The transmission accuracy ACC is calculated from the recall rate TPR and the false alarm rate FPR. For SDPBloom, ACC starts to drop rapidly when n exceeds 30 and falls to 0.5 at n = 150. For TBFAD, ACC starts to fall slowly only when n exceeds 70, and at n = 150 it has fallen only to 0.77.

The number n of stored endpoint information elements is determined by the number of local participant endpoints, and in large distributed simulation systems n is typically large. According to the above conclusions, when the number of local participant endpoints is large, the TBFAD method has clear advantages over SDPBloom and safeguards the correctness of data transmission in large-scale distributed simulation.
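The trend reported for SDPBloom can be illustrated with a short Python sketch. It assumes that SDPBloom behaves like a classical bloom filter whose false alarm rate follows the textbook approximation (1 − e^(−kn/m))^k; this is an assumption made for illustration and is not taken from the SDPBloom implementation. The values m = 500 and k = 30 are the ones used in the experiments, and the function name is hypothetical.

```python
from math import exp


def classic_bloom_fpr(m: int, n: int, k: int) -> float:
    """Textbook approximation of a bloom filter's false-positive rate."""
    return (1.0 - exp(-k * n / m)) ** k


def fpr_curve(m: int, k: int, counts: list[int]) -> dict[int, float]:
    """False-positive rate for each number n of stored elements."""
    return {n: classic_bloom_fpr(m, n, k) for n in counts}


if __name__ == "__main__":
    for n, fpr in fpr_curve(500, 30, [10, 30, 70, 110, 150]).items():
        print(f"n = {n:3d}  FPR = {fpr:.3f}")
```

Under this assumption the rate stays close to 0 up to about n = 30 and approaches 1 at n = 150, which matches the qualitative behaviour described for SDPBloom above.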

When the number of bits m of the threshold bloom filter changes, TBFAD and SDPBloom compare as follows:

As shown in FIGS. 12, 13 and 14, the number n of stored endpoint information elements and the number of hash functions k are kept unchanged while the number of bits m of the threshold bloom filter is varied from 100 to 1000. Analyzing the recall rate TPR, the false alarm rate FPR and the transmission accuracy ACC of TBFAD and SDPBloom leads to the following conclusions:

(1) For 100 ≤ m ≤ 1000, the recall rate TPR of TBFAD is always slightly lower than that of SDPBloom, but it increases as m increases. When m = 1000, the recall rate TPR of TBFAD reaches 0.9.

(2) The false alarm rate FPR of both methods decreases as m increases, but the false alarm rate FPR of TBFAD is far smaller than that of SDPBloom.

(3) The transmission accuracy ACC of both methods increases as m increases, but the transmission accuracy ACC of TBFAD is always higher than that of SDPBloom. When m is small, the transmission accuracy ACC of both methods is low, but TBFAD is significantly better than SDPBloom.

In the participant discovery phase of DDS, a shorter vector reduces bandwidth usage. In the experiment, the TBFAD method provided by the invention performs well when the number of bits m of the bloom filter is small. Thus, in a large-scale distributed simulation, the amount of data transferred between the local participant and the remote participants can be kept small.

When the number of hash functions k changes, TBFAD and SDPBloom compare as follows:

As shown in FIGS. 15, 16 and 17, the number of bits m of the threshold bloom filter and the number n of stored endpoint information elements are kept unchanged while the number of hash functions k is varied from 5 to 50. Analyzing the recall rate TPR, the false alarm rate FPR and the transmission accuracy ACC of TBFAD and SDPBloom leads to the following conclusions:

(1) When k ≤ 10, the recall rate TPR of both TBFAD and SDPBloom is 1. Thereafter the recall rate TPR of TBFAD starts to decrease, but the change stays within 0.25 until k increases to 50.

(2) When k ≤ 10, the false alarm rates FPR of TBFAD and SDPBloom follow the same trend. As k increases further, the false alarm rate FPR of TBFAD remains within the small range [0.1, 0.25], while the false alarm rate FPR of SDPBloom rises rapidly.

(3) For SDPBloom, the transmission accuracy ACC decreases rapidly as k increases, and at k = 50 it is 0.5. The transmission accuracy ACC of TBFAD decreases slowly, and at k = 50 it has decreased only to 0.8.

The number of hash functions determines the computational complexity of constructing the vector TBF. The results show that in a distributed simulation system, choosing a smaller number of hash functions k reduces the amount of computation, and the performance of TBFAD is better than or equal to that of SDPBloom.

Claims

1. A DDS automatic discovery method based on a threshold bloom filter is characterized by comprising the following steps in sequence: 1. designing a threshold bloom filter; 2. the combination of a threshold bloom filter and a DDS automatic discovery mechanism; 3. determining an optimal threshold value;

the design steps of the threshold bloom filter are as follows: the threshold bloom filter uses a one-dimensional vector of m positions to store information, and the initial value of every position is 0; when a participant endpoint information element is stored, the endpoint information element is mapped into the threshold bloom filter vector, denoted TBF(1), through k different hash functions; the result of each hash function lies in the range [1, m]; according to the k mapping results, the values of the corresponding k positions in the vector TBF(1) are changed from 0 to 1; when a plurality of endpoint information elements are stored, the mapping results are superposed; thus, when the threshold bloom filter stores n different endpoint information elements x_i, the vector TBF(1) is obtained as the sum of the mapping results of the individual endpoint information elements x_i, i.e.

For the set S = {x₁, …, x_n}, each endpoint information element x_i is mapped into the vector TBF(1) through k different hash functions; when querying whether an endpoint information element x_i belongs to the set S, the decision is made by setting a binarization threshold θ and a decision threshold T, where 0 ≤ θ ≤ k and 0 ≤ T ≤ k; first, every position in the vector TBF(1) whose value is less than or equal to θ is set to 0; the vector TBF(1) thereby becomes a new vector TBF(2); then the decision is made using the decision threshold T, namely, when the dot product of the mapping result of the endpoint information element x_i and the vector TBF(2) is greater than or equal to the decision threshold T, the endpoint information element x_i is judged to belong to the set S.

2. The DDS auto-discovery method based on threshold bloom filter as claimed in claim 1, wherein the combining step of the threshold bloom filter and DDS auto-discovery mechanism is:

3. The DDS auto-discovery method based on threshold bloom filter as claimed in claim 1, wherein the determining process of the optimal threshold is as follows:

assume that the results of mapping an endpoint information element x_i through the k hash functions are all different, that is, the k hash functions map the element to k different positions; the mapping of the endpoint information element x_i is regarded as a single realization of a hypergeometric distribution, in which the population size is w, equal to the number of bits m of the vector TBF(1), and the number of bits set to 1 is r, equal to the number of hash functions k, so that in the mapping of the endpoint information element x_i, the probability p₁ that a specific position of the vector TBF(1) is set to 1 is:

the value of the i-th position of the vector TBF(1), denoted I, is regarded as a discrete random variable, and the mapping result of the n endpoint information elements at the i-th position of the vector TBF(1) obeys the binomial distribution B(n, p₁); if the i-th position of the vector TBF(1) is mapped f times, the value v of that position equals the number of mappings f; therefore, the probability P(I = v) that the i-th position of the vector TBF(1) takes the value v is:

Similarly, for an endpoint information element y not belonging to the set S, the dot product of its mapping result and the vector TBF(2) is d_y, and the expected value of the dot product is calculated from the number of non-zero positions in the vector TBF(2) as follows: