CN117807434A - Communication data set processing method and device - Google Patents

Communication data set processing method and device

Info

Publication number
CN117807434A
Authority
CN
China
Prior art keywords
data set
data
samples
sample
determining
Prior art date
Legal status
Pending
Application number
CN202311662402.3A
Other languages
Chinese (zh)
Inventor
王志勤
江甲沫
杜滢
徐明枫
李阳
闫志宇
沈霞
刘晓峰
魏贵明
徐菲
Current Assignee
China Academy of Information and Communications Technology CAICT
Original Assignee
China Academy of Information and Communications Technology CAICT
Priority date
Filing date
Publication date
Application filed by China Academy of Information and Communications Technology (CAICT)
Priority to CN202311662402.3A
Publication of CN117807434A

Landscapes

  • Complex Calculations (AREA)

Abstract

The application discloses a communication data set processing method in which two data samples are arbitrarily taken from a communication data set and used as a first reference sample and a second reference sample, respectively; the included angles between each data sample in a first data set and the reference samples in the vector space are determined to generate a first angle parameter set; the included angles between each data sample in a second data set and the reference samples in the vector space are determined to generate a second angle parameter set; a first position set and a second position set are generated from the angle parameters in the first angle parameter set and the second angle parameter set; and the data set corresponding to the position set with the larger degree of dispersion among the first position set and the second position set is determined and placed into a target data set for training a neural network model. The embodiment of the application also provides a device for implementing the method. The method and the device address the problem that communication data sets used for model training have quality defects.

Description

Communication data set processing method and device
Technical Field
The present disclosure relates to the field of mobile communications and computer technologies, and in particular, to a method and an apparatus for processing a communication data set.
Background
With the dramatic improvement in the computing capacity of equipment, neural network models with huge numbers of parameters can complete training in a short time and achieve performance exceeding that of traditional methods, which has triggered a surge of research on related algorithms in the industry. Data sets, as the basis of algorithm research, play an increasingly prominent role in promoting deep exploration of machine/deep learning algorithms, and the development of the two has formed a complementary relationship. On the one hand, researchers develop neural network models and algorithms on existing data sets to achieve high-precision performance; on the other hand, when performance on the original data set saturates, new models and algorithms push the data sets to evolve toward more complex and larger-scale forms. Taking the development of data sets in the image classification field as an example, the early ImageNet data set contained only 20 classes of images, and even a simple neural network model or a traditional algorithm could achieve excellent performance; as the ImageNet data set expanded to 1000 classes and millions of images, the original models could no longer deliver high-precision performance, driving researchers to design more advanced models to improve algorithm accuracy. In summary, data sets and algorithms advance each other and continuously mature together, promoting the vigorous development of artificial intelligence technology.
To facilitate the tight coupling of the mobile communication field with artificial intelligence technology, it is essential to construct high-quality wireless communication data sets. On the basis of communication data sets, classical neural network architectures from the image, text and speech fields can be transferred and applied, alleviating problems such as the high complexity and limited accuracy of traditional communication algorithms; furthermore, such data sets can help researchers explore and mine the characteristics of communication data, promote the construction of novel neural network structures that better fit these characteristics (just as convolutional neural networks emerged from the image field and recurrent neural networks from the text field), and encourage cross-domain fusion with other fields.
A neural network model is a fitting model with massive parameters, and its training effect depends heavily on the richness of the samples in the data set. When the data set is small, the trained model is usually over-fitted and its robustness is poor. In addition, the presence of low-quality samples (e.g., mislabeled, blurred or outlier samples) in the data set can also significantly reduce the test performance of the model. Therefore, a high-quality data set is an important premise and guarantee for realizing a high-precision algorithm. In relatively mature artificial intelligence fields such as image, text and speech, there have already been related studies on data set testing: for example, resolution, peak signal-to-noise ratio and structural similarity are widely adopted in the image field, segmental signal-to-noise ratio in the speech field, and parameters such as information entropy, which describes diversity by counting sample categories through labels.
However, the above test methods are not suited to quality testing of communication data sets, for the following reasons: on the one hand, communication data are not directly observed by humans, so a given piece of communication data cannot be judged a good sample by visual comparison; on the other hand, communication data are highly variable and lack manual labels, so it is difficult to determine whether two unlabeled pieces of communication data belong to the same category. Testing the quality of a communication data set therefore presents a challenge, and there is currently no quality evaluation theory and method matched to communication data sets.
Disclosure of Invention
The application provides a communication data set processing method and device, which address the problem that communication data sets used for model training have quality defects.
In a first aspect, an embodiment of the present application proposes a method for processing a communication data set, where the communication data set includes one or more of wireless air interface data, signaling data, or network data, and the communication data set includes a first data set and a second data set, and the method includes the following steps:
taking 2 data samples in the communication data set as a first reference sample and a second reference sample, respectively, to form a target plane in a vector space;
determining the included angles between each data sample in the first data set and the 2 reference samples in the vector space, and generating a first angle parameter set;
determining the included angles between each data sample in the second data set and the 2 reference samples in the vector space, and generating a second angle parameter set;
projecting each angle parameter in the first angle parameter set and the second angle parameter set onto the target plane to generate a first position set and a second position set, respectively;
and determining the data set corresponding to the position set with the larger degree of dispersion among the first position set and the second position set, and placing it into a target data set for training the neural network model.
In one embodiment of the present application, the first reference sample belongs to a first data set and the second reference sample belongs to a second data set.
In one embodiment of the present application, determining the angle between any one of the data samples and the reference sample includes: in a vector space, calculating cosine similarity between any one data sample and the reference sample; and determining the included angle of any data sample and the reference sample in a vector space according to the cosine similarity.
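As an illustration of this embodiment, the angle computation can be sketched as follows (a minimal Python/NumPy sketch; the function name is illustrative, and complex-valued communication samples are assumed to have been flattened into real feature vectors, e.g. by stacking real and imaginary parts):

```python
import numpy as np

def angle_to_reference(x, x_ref):
    """Included angle (radians) between a data sample and a reference sample,
    obtained from their cosine similarity in the vector space."""
    x = np.ravel(x).astype(float)           # e.g. a symbol-by-subcarrier sample flattened to a vector
    x_ref = np.ravel(x_ref).astype(float)
    cos_sim = np.dot(x, x_ref) / (np.linalg.norm(x) * np.linalg.norm(x_ref))
    cos_sim = np.clip(cos_sim, -1.0, 1.0)   # guard against floating-point overshoot
    return float(np.arccos(cos_sim))
```

The same routine applies to the two reference samples themselves, giving the included angle between them.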
In one embodiment of the application, polar coordinates are established in the target plane, and a first reference position and a second reference position are determined according to a vector included angle between the first reference sample and the second reference sample; the first position set is generated by superposing the projection of the first angle parameter set on the target plane according to the first reference position; the second set of positions is generated by superimposing a projection of the second set of angular parameters onto the target plane according to the second reference position.
In one embodiment of the present application, determining the degree of discretization of the first set of locations includes: determining a distribution area of a first position set as a first distribution area on a target plane, and dividing the first distribution area by using a dividing angle step length to obtain a first distribution area division set; calculating the probability of the first position set in different partition areas in the first distribution area partition set;
in one embodiment of the present application, determining the degree of discretization of the second set of locations includes: determining a distribution area of the second position set as a second distribution area on the target plane, and dividing the second distribution area by using a dividing angle step length to obtain a second distribution area division set; the probability of the second set of locations being within different partitioned areas in the second set of partitioned areas is calculated.
In one embodiment of the present application, the second data set is obtained by expanding the first data set, including:
training a target generative artificial intelligence model using the N data samples in the first data set;
combining the N data samples in the first data set with M-N virtual samples constructed using the target generative artificial intelligence model, where M > N, to generate the second data set comprising M data samples.
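A minimal sketch of this expansion step, assuming a generative model object exposing a hypothetical sample(n) method has already been trained on the first data set (the interface and function names are assumptions of this sketch):

```python
import numpy as np

def expand_dataset(first_dataset, generator, M):
    """Combine the N real samples with M-N virtual samples drawn from a trained
    generative model to build the second (expanded) data set of M samples."""
    first = np.asarray(first_dataset)
    N = len(first)
    assert M > N, "the expanded data set must be larger than the original one"
    virtual = np.asarray(generator.sample(M - N))   # hypothetical generator interface
    return np.concatenate([first, virtual], axis=0)
```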
In one embodiment of the present application, a first entropy value and a second entropy value are calculated according to statistical properties of the first set of locations and the second set of locations, respectively; and determining a data set corresponding to the position set with a larger entropy value.
In one embodiment of the present application, the JS divergence between the first position set and the second position set is calculated according to the statistical properties of the first position set and the second position set.
In one embodiment of the present application, in response to the JS divergence being greater than a set threshold, one of the data sets is included in the target data set and data samples are added in a set proportion, the added data samples being drawn from the other data set.
In one embodiment of the application, the first data set and the second data set are mixed in the target data set in response to the degree of discretization of both the first set of locations and the second set of locations being less than a set threshold.
In a second aspect, an embodiment of the present application further proposes a communication data set processing device, configured to implement a method according to any embodiment of the first aspect of the present application, including:
the acquisition module is used for acquiring a communication data set and first and second reference samples;
the generating module is used for generating the first angle parameter set and the second angle parameter set; and generating a first set of locations and a second set of locations;
and the determining module is used for determining a target data set, wherein the target data set comprises data sets corresponding to position sets with larger discrete degrees in the first position set and the second position set.
In a third aspect, embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements a method according to any of the embodiments of the first aspect of the present application.
In a fourth aspect, embodiments of the present application further provide an electronic device, including a memory, a processor, and a computer program stored on the memory and executable by the processor, the processor implementing a method according to any one of the embodiments of the first aspect of the present application when the computer program is executed by the processor.
At least one of the technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:
The invention designs a processing method for data sets in a mobile communication network; integrating artificial intelligence technology into the mobile communication network plays an important role in improving network performance. A communication data set is a collection of a large number of samples of air interface data, signaling data or network data of a communication system (e.g., channel state information data, scheduling information data, communication quality data, etc.). Communication data sets are indispensable when applying artificial intelligence technology to a communication system: a usable artificial intelligence model is obtained by executing a training process over a large amount of data, the quality of the data set has an important influence on the training effect of the model, and data set testing methods are an important research topic and basic tool in the artificial intelligence field. At present, because mobile communication network data sets have complex statistical characteristics and uncertain labels, the industry has no relatively complete testing method for them. The invention can be used to expand mobile communication network data sets, to analyze their characteristics, to classify and label them, and to analyze their statistical characteristics; it can be used to judge the usability of such data sets and to improve their diversity and other characteristics, thereby improving the performance gain brought by applying artificial intelligence technology to the mobile communication network.
On the one hand, if the validity of a communication data set can be tested before artificial intelligence algorithm model training and testing are performed, the cost overhead caused by such training and testing is significantly reduced. On the other hand, quality analysis and comparison of different communication data sets, and improvement schemes proposed on that basis, can remarkably improve the generalization and accuracy of the artificial intelligence algorithm model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a data space diagram of a communication data set;
FIG. 2 is a schematic diagram of the target plane formed by the reference samples;
FIG. 3 is a schematic diagram of the angular projection of data samples of a data set onto the target plane;
FIG. 4 is a schematic diagram of the first and second position sets generated from the first and second data sets;
FIG. 5 is a schematic diagram of the angle segmentation set and sample distribution statistics;
FIG. 6 is a flow chart of an embodiment of a communication dataset processing method of the present application;
FIG. 7 is a flow chart of another embodiment of a communication dataset processing method of the present application;
Fig. 8 is a schematic diagram of an embodiment of a communication data set processing apparatus.
Detailed Description
To make the purposes, technical solutions and advantages of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and the corresponding drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the present disclosure without creative effort fall within the scope of protection of the present disclosure.
Because no standard communication data set is available for comparison in communication data set research, for a given communication data set under test its characteristics can be learned with artificial intelligence technology, its original data can then be expanded to construct an expanded data set, and the original data set can be tested by comparing the characteristics of the original data set and the expanded data set. In this scheme, a generative artificial intelligence model is used to expand the original data set under test and to construct a virtual data set and an expanded data set; the original data set, the virtual data set and the expanded data set are then analyzed with evaluation indicators such as diversity, similarity and characteristic values, and the test result of the original data set is finally obtained.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a data space diagram of a communication data set. Communication data sets may be divided according to types such as wireless air interface data, signaling data and network data; common data sets include channel data sets, modulation and coding data sets, network traffic data, network resource management data sets, channel/network time-series prediction data sets, etc. Taking a channel data set as an example, the data contain information such as the fading and multipath propagation experienced by a wireless signal during transmission through physical space, embodied in forms such as signal strength and phase. As shown in fig. 1, an example of a channel strength data set is given, in which the horizontal axis represents the symbol index and the vertical axis represents the subcarrier index.
In an embodiment of the present application, the communication data set comprises a first data set (i.e. data set Ω below) and a second data set (i.e. data set E below).
Fig. 2 is a schematic diagram of the target plane formed by the reference samples. Two data samples are taken from the communication data set as the first reference sample x_{R,1} and the second reference sample x_{R,2}, respectively, to form a target plane in the vector space. For example, the first reference sample belongs to the first data set and the second reference sample belongs to the second data set.
Polar coordinates are established in the target plane, and a first reference position and a second reference position are determined according to the vector included angle α_{1,2} between the first reference sample and the second reference sample. The reference positions can also serve as coordinate axes: a first reference coordinate axis and a second reference coordinate axis are obtained from the first reference sample and the second reference sample, and a target coordinate system is established using the first reference coordinate axis and the second reference coordinate axis.
After the first reference sample and the second reference sample are normalized by their respective modulus values, the first reference position vector is x_{R,1}/||x_{R,1}|| and the second reference position vector is x_{R,2}/||x_{R,2}||, and the included angle α_{1,2} between the two forms a circumferential angle. That is, the first reference coordinate axis is calculated from the first reference sample, i.e., the first reference sample divided by its modulus length; and the second reference coordinate axis is calculated from the second reference sample, i.e., the second reference sample divided by its modulus length.
To obtain the included angle between the two, one method is to calculate the cosine similarity between the first reference sample and the second reference sample, where the cosine similarity is obtained by taking the dot product of the first reference sample and the second reference sample and dividing it by the modulus length of the first reference sample and by the modulus length of the second reference sample.
Further, an included angle between the first reference sample and the second reference sample is obtained according to cosine similarity between the first reference sample and the second reference sample.
FIG. 3 is a schematic diagram of the angular projection of data samples of a data set onto the target plane. The included angles between each data sample in the first data set and the 2 reference samples in the vector space are determined to generate a first angle parameter set, and the included angles between each data sample in the second data set and the 2 reference samples in the vector space are determined to generate a second angle parameter set. Each angle parameter in the first angle parameter set and the second angle parameter set is projected onto the target plane to generate a first position set and a second position set, respectively. In the planar polar coordinate space shown in figs. 2-3, the first position set is generated by superimposing the projection of the first angle parameter set onto the target plane according to the first reference position, and the second position set is generated by superimposing the projection of the second angle parameter set onto the target plane according to the second reference position. When the data samples are normalized by their vector lengths, both the first position set and the second position set lie on the circumference shown in figs. 2-3.
The angle parameter in the present application may be an angle value or a trigonometric function value of the angle value. For example, a specific process may be as follows: in the vector space, the cosine similarity between each data sample in the first data set and the first reference sample is calculated one by one to obtain a 'one-mapping-one' parameter set, where the cosine similarity is obtained by taking the dot product of the data sample and the reference sample and dividing it by the modulus length of the data sample and by the modulus length of the reference sample;
similarly, the cosine similarity between each data sample in the first data set and the second reference sample is calculated one by one to obtain a 'one-mapping-two' parameter set;
similarly, the cosine similarity between each data sample in the second data set and the first reference sample is calculated one by one to obtain a 'two-mapping-one' parameter set;
similarly, the cosine similarity between each data sample in the second data set and the second reference sample is calculated one by one to obtain a 'two-mapping-two' parameter set.
Obtaining the position sets from the parameter sets includes: converting the one-mapping-one, one-mapping-two, two-mapping-one and two-mapping-two parameter sets into a one-mapping-one angle set, a one-mapping-two angle set, a two-mapping-one angle set and a two-mapping-two angle set; and mapping all data samples in the first data set and the second data set into the target coordinate system according to these angle sets, correspondingly obtaining a one-mapping-one position set, a one-mapping-two position set, a two-mapping-one position set and a two-mapping-two position set. The one-mapping-one position set and the one-mapping-two position set form the first position set, and the two-mapping-one position set and the two-mapping-two position set form the second position set.
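The four parameter sets can be computed in bulk; the sketch below assumes each data set is an (n_samples x feature_dim) real-valued array and returns the corresponding angle sets (function and variable names are illustrative, not part of the application):

```python
import numpy as np

def angle_sets(dataset, x_ref1, x_ref2):
    """Angles (radians) of every sample in `dataset` to the two reference samples,
    i.e. the 'x-mapping-one' and 'x-mapping-two' angle sets."""
    X = np.asarray(dataset, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)     # normalise each sample by its modulus
    r1 = np.ravel(x_ref1) / np.linalg.norm(x_ref1)
    r2 = np.ravel(x_ref2) / np.linalg.norm(x_ref2)
    cos1 = np.clip(X @ r1, -1.0, 1.0)                    # cosine similarities to the first reference
    cos2 = np.clip(X @ r2, -1.0, 1.0)                    # cosine similarities to the second reference
    return np.arccos(cos1), np.arccos(cos2)

# theta_1, theta_2 = angle_sets(first_dataset, x_ref1, x_ref2)    # one-mapping-one / one-mapping-two
# beta_1,  beta_2  = angle_sets(second_dataset, x_ref1, x_ref2)   # two-mapping-one / two-mapping-two
```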
As shown in fig. 3, the samples x_{Ω,i} in data set Ω and x_{E,j} in data set E are mapped, according to their angle values, onto the arc whose axes are the reference samples. Taking sample x_{Ω,1} as an example, according to the relation between its included angles θ_{1,1} and θ_{1,2} with the reference samples x_{R,1} and x_{R,2} and the included angle α_{1,2} between the two reference samples themselves, the relative position of x_{Ω,1} in the coordinate system falls into one of the following three cases:
when θ_{1,1} + α_{1,2} = θ_{1,2}, x_{Ω,1} lies outside the arc, near the reference sample x_{R,1}, as shown in fig. 3 (a);
when θ_{1,1} + θ_{1,2} = α_{1,2}, x_{Ω,1} lies on the arc between the two reference axes, as shown in fig. 3 (b);
when θ_{1,2} + α_{1,2} = θ_{1,1}, x_{Ω,1} lies outside the arc, near the reference sample x_{R,2}, as shown in fig. 3 (c).
FIG. 4 is a schematic diagram of the first and second position sets generated from the first and second data sets. The distribution area of the first position set on the target plane is determined as a first distribution area R_Ω, and the distribution area of the second position set on the target plane is determined as a second distribution area R_E.
Specifically, a one-mapping-one angle set and a one-mapping-two angle set are obtained from the one-mapping-one parameter set (i.e., the one-mapping-one cosine value set) and the one-mapping-two parameter set (i.e., the one-mapping-two cosine value set), respectively. Using the one-mapping-one angle set and the one-mapping-two angle set, the data samples in the first data set are mapped into the target coordinate system to obtain the corresponding one-mapping-one position set and one-mapping-two position set, collectively called the first position set. Similarly, the data samples in the second data set are mapped into the target coordinate system to obtain the corresponding two-mapping-one position set and two-mapping-two position set, collectively called the second position set.
Using the position sets, the statistical properties of the first data set and the second data set are obtained as follows:
obtaining, from the first position set and the second position set, the first distribution area and the second distribution area of all data samples in the first data set and the second data set on the target coordinate plane;
dividing the first distribution area and the second distribution area into divided areas with a dividing angle step, obtaining a first distribution area division set and a second distribution area division set;
assigning all data samples in the first data set and the second data set to the divided areas of the first distribution area division set and the second distribution area division set using the one-mapping-one, one-mapping-two, two-mapping-one and two-mapping-two position sets;
and calculating the distribution of all data samples in the first data set and the second data set over the different divided areas of the first distribution area division set and the second distribution area division set, thereby obtaining the statistical characteristics of the first data set and the statistical characteristics of the second data set.
Fig. 5 shows the angle-split set and sample distribution statistics. Dividing the first distribution area by using the dividing angle step length to obtain a first distribution area dividing set; the probability of the first set of locations being within different partitioned areas in the first set of distributed area partitions is calculated. Dividing the second distribution area by using the dividing angle step length to obtain a second distribution area dividing set; the probability of the second set of locations being within different partitioned areas in the second set of partitioned areas is calculated.
As shown in fig. 5, the first distribution area and the second distribution area are divided into divided areas with the dividing angle step; the resulting first distribution area division set contains L_Ω divided areas and the second distribution area division set contains L_E divided areas.
Since the samples in the communication data set correspond to the elements in the position sets, the probability computation may also be performed as follows: the number of data samples of the first data set falling into each divided area of the first distribution area division set is counted and divided by the total number of data samples in the first data set, giving the probability that a data sample of the first data set falls into each divided area of the first distribution area division set. Similarly, the probability that a data sample of the second data set falls into each divided area of the second distribution area division set is obtained.
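A sketch of this counting step, assuming the polar positions of the samples of one data set have already been computed as above (the angle step value and the names are illustrative):

```python
import numpy as np

def region_probabilities(positions, angle_step):
    """Divide the distribution area of a position set into angular sub-regions of width
    `angle_step` and return the probability of samples falling into each divided area."""
    positions = np.asarray(positions, dtype=float)
    lo, hi = positions.min(), positions.max()
    n_bins = max(1, int(np.ceil((hi - lo) / angle_step)))            # number of divided areas
    counts, edges = np.histogram(positions, bins=n_bins,
                                 range=(lo, lo + n_bins * angle_step))
    return counts / positions.size, edges                            # count / total number of samples
```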
Further, diversity and similarity of the data sets may be analyzed based on the probability of distribution of the data samples in the set of segmentation angles.
Regarding diversity analysis, for example, a first statistical characteristic of the first data set (e.g., the information entropy of the first data set) is derived from the probabilities that the data samples of the first data set fall into the divided areas of the first distribution area division set, and a second statistical characteristic (e.g., the information entropy of the second data set) is derived from the probabilities that the data samples of the second data set fall into the divided areas of the second distribution area division set; the first data set and the second data set are then tested using the first statistical characteristic and the second statistical characteristic.
Testing the diversity of the first data set and the second data set according to the first statistical characteristic and the second statistical characteristic includes:
when the first statistical characteristic is greater than the second statistical characteristic, the diversity of the first data set is better than that of the second data set;
when the first statistical characteristic is less than the second statistical characteristic, the diversity of the first data set is worse than that of the second data set;
when the first statistical characteristic is equal to the second statistical characteristic, the diversity of the first data set and that of the second data set are the same.
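With the divided-area probabilities above, the diversity comparison reduces to a few lines (a sketch; the application does not fix the logarithm base, so the natural logarithm is assumed here):

```python
import numpy as np

def entropy(probs):
    """Information entropy of a divided-area probability vector."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                       # empty divided areas contribute nothing
    return float(-np.sum(p * np.log(p)))

# H_first  = entropy(probs_first)     # first statistical characteristic
# H_second = entropy(probs_second)    # second statistical characteristic
# the data set whose entropy is larger is the one with the better diversity
```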
Regarding similarity analysis, the union of the first distribution area and the second distribution area is taken to obtain a third distribution area, and a third distribution area division set is obtained from the first distribution area division set and the second distribution area division set, where the overlapping part of the first distribution area and the second distribution area is not counted twice.
The number of data samples of the first data set in each divided area of the third distribution area division set is counted to obtain the probability that a data sample of the first data set falls into each divided area of the third distribution area division set, i.e., the number of data samples of the first data set in each divided area of the third distribution area division set divided by the number of data samples in the first data set.
Similarly, the probability that a data sample of the second data set falls into each divided area of the third distribution area division set is obtained.
The similarity of the first data set and the second data set is tested using the probability that a data sample of the first data set falls into each divided area of the third distribution area division set and the probability that a data sample of the second data set falls into each divided area of the third distribution area division set.
A third statistical characteristic between the first data set and the second data set is obtained from these two sets of probabilities; the third statistical characteristic is the JS divergence between the first data set and the second data set.
The similarity of the first data set and the second data set is tested according to the third statistical characteristic, whose value range is [0, 1]: the closer to 0, the higher the similarity of the first data set and the second data set; the closer to 1, the lower the similarity.
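The similarity test can be sketched as follows, assuming both data sets have been binned over the same third (union) distribution area division set so that the two probability vectors are aligned; the base-2 logarithm keeps the result in [0, 1]:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two aligned probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                   # 0 * log(0/x) is taken as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```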
FIG. 6 is a flow chart of an embodiment of a communication data set process of the present application.
Step 61: the samples in the communication data set to be tested Φ are data processed using a predefined method.
For example, data cleansing is performed on samples in the communication data set Φ to be tested, and invalid data samples are screened out.
For another example, format specification operation (such as adjusting data dimension, clipping, normalizing, etc.) is performed on the data samples in the to-be-tested communication data set Φ after the invalid data samples are screened, so as to obtain an original data set Ω after data processing.
Step 62: determining a first data set and a second data set to be tested, and distinguishing two cases:
case 1: a comparison test was performed on 2 known data sets (data set Ω and data set E).
Case 2 (see step 72 of the embodiment for details): Test 1 known data set (data set Ω). A generative artificial intelligence model is trained using N data samples of data set Ω; after model training is completed, the trained generative artificial intelligence model is used to generatively expand data set Ω, producing M-N new samples which, together with the original N samples of data set Ω, form data set E.
Step 63: and analyzing the statistical properties of the data set omega and the data set E to be tested.
Step 631: and calculating the angular similarity among the samples in the data set, taking cosine similarity as an example. Arbitrarily selecting two data samples x from the data set omega and the data set E R,1 And x R,2 As a reference sample, and calculate sample x in dataset Ω Ω,i I=1,..n and reference sample x R,k Angular similarity cos θ between k=1, 2 i,k Sample x in dataset E was calculated with i=1,.. E,j J=1,..m and reference sample x R,k Angular similarity cos β between k=1, 2 j,k ,j=1,...,M,k=1,2。
Step 632: a new coordinate system is established. Calculating x R,1 And x R,2 Inter-angulation similarity cos alpha 1,2 And converted into an angle value alpha 1,2 Build up of reference sample x R,1 /||x R,1 I and x R,2 /||x R,2 And I is a new coordinate system of two-dimensional plane coordinate axes.
Step 633: the positions of all samples in dataset Ω and dataset E in the new coordinate system are determined. For ease of description, assume that all data samples lie on the target plane, the angular similarity cos θ calculated in step 631 is calculated i,k And cos beta j,k Converted into an angle value theta i,k And beta j,k And sample x in dataset Ω Ω,i And sample x in dataset E E,j According to the angle value theta i,k And beta j,k Corresponding to the reference sample x R,1 And x R,2 The first position set and the second position set are respectively formed on the circular arc of the shaft. After mapping, the samples in the data set omega and the data set E are respectively concentrated and distributed in different sector areas R Ω And R is E Is a kind of medium.
Step 634: the statistical properties of dataset Ω and dataset E are calculated. First at an angleDividing the sample distribution region of the data set Ω and the data set E into L sub-regions for the segmentation step, wherein the sample distribution region R of the data set Ω Ω Comprises L Ω A sub-region. Next, the number of samples in each sub-region is counted as +.>And calculate the probability of occurrence of the sample in each sub-region +.>Analogically statistical sample distribution region R of data set E E Comprising L E Sample number in individual subregion +.>Re-calculating to obtain sample occurrence probability +.>According toAnd->The statistical properties of dataset Ω and dataset E are calculated.
Step 64: optionally, a diversity test is performed on dataset Ω and dataset E. Using the statistical properties of data set Ω and data set E obtained in step 63, entropy values H (Ω) and H (E) are calculated:
the test dataset Ω and dataset E are analyzed for diversity by comparing the entropy values H (Ω) and H (E) of dataset Ω and dataset E based on predefined evaluation rules.
Entropy values represent the degree of dispersion of data. And determining a data set corresponding to the position set with larger discrete degree in the first position set and the second position set, and entering a target data set for training the neural network model.
Step 65: optionally, similarity tests are performed on dataset Ω and dataset E. Using the statistical properties of dataset Ω and dataset E obtained in step 63, a JS divergence JS (P Ω |P E ) The value range of JS divergence is [0,1 ]]The closer to 0 means the higher the similarity of the two data sets, calculated as follows:
the diversity of the test dataset Ω and dataset E is analyzed by comparing JS divergence for dataset Ω and dataset E based on predefined evaluation rules.
FIG. 7 is a flow chart of another embodiment of the communication data set process of the present application.
Step 71: the samples in the communication data set to be tested Φ are data processed using a predefined method.
Step 711: and (3) cleaning the data of the samples in the communication data set phi to be tested, and screening out invalid data samples.
Step 712: and performing format specification operation (such as data dimension adjustment, amplitude limiting, normalization and the like) on the data samples in the communication data set phi to be tested after the invalid data samples are screened out, so as to obtain an original data set omega after data processing.
Step 72: the original dataset Ω is expanded using a generative artificial intelligence model.
Step 721: the raw dataset Ω is used to train the generative artificial intelligence model.
Step 722: M-N virtual samples are generated using the trained generative artificial intelligence model, where M > N.
Step 723: and constructing a virtual data set psi with the sample number of M-N by using the generated M-N virtual samples, and constructing an extended data set E with the sample number of M by using the original data set omega with the sample number of N and the virtual data set psi with the sample number of M-N together.
Step 73: the original dataset Ω and the extended dataset E are subjected to a diversity analysis using predefined evaluation indicators.
Step 731: and quantitatively measuring the similarity among the samples in the data set. Selecting two data samples x from the original data set Ω and the extended data set E R,1 And x R,2 As a reference sample, and calculate sample x in dataset Ω Ω,i I=1,.. E,j J=1,..m and reference sample x R,k K=1, 2 similarity. Cosine similarity is taken as an example, and the calculation mode is adoptedThe following is shown:
step 732: a new coordinate system is established. Calculating x R,1 And x R,2 Cosine similarity cos alpha between 1,2 And converted into an angle value alpha 1,2 Build up of reference sample x R,1 /||x R,1 I and x R,2 /||x R,2 The i is the reference frame of the two-dimensional plane coordinate axis, as shown in fig. 2.
Step 733: the positions of all samples in dataset Ω and dataset E in the new coordinate system are determined. Cosine similarity cos θ calculated in step 731 i,k And cos beta j,k Converted into an angle value theta i,k And beta j,k And sample x in dataset Ω Ω,i And sample x in dataset E E,j The angle values are mapped to arcs around the reference sample as axes, as shown in fig. 3.
The mapped effects are shown in FIG. 4, in which the samples in dataset Ω and dataset E are each centrally distributed in different sector areas R Ω And R is E Is a kind of medium.
Step 734: the diversity of the dataset Ω is evaluated with the entropy value H (Ω) as an index. As shown in fig. 5, first at an angleDividing the sample distribution region of the data set Ω and the data set E into L sub-regions for the segmentation step, wherein the sample distribution region R of the data set Ω Ω Comprises L Ω A sub-region. Then, the number of samples in each sub-region is counted as n Ω,l ,l=1,...,L Ω And calculate that the sample appears in the first sub-region R Ω,l The probability of p Ω,l =n Ω,l N. The entropy value is used as an index for measuring diversity, and the larger the entropy value isThe better the diversity of the data set, the following way of calculation is:
step 735: the diversity of the data set E was evaluated using the entropy value H (E) as an index. The process is the same as step 734, in which the sample distribution region R of the data set E is counted E Comprising L E Number of samples n in each sub-region E,l ,l=1,...,L E Then calculate the probability p of sample occurrence in each sub-region E,l =n E,l And (n+m), and finally calculating the entropy value of the data set E according to the following formula:
it should be noted that, step 735 and step 734 are parallel alternative solutions.
Step 736: based on a predefined evaluation rule, the diversity between the original data set omega and the extended data set E is analyzed, and then the diversity difference between the original data set omega and the extended data set E is judged. Comparing the entropy values H (omega) and H (E) of the data set omega and the data set E calculated in the steps 734-735, and when H (omega) is more than H (E), the diversity of the data set omega is better; conversely, the diversity of the data set E is better.
Step 74: the original dataset Ω and the extended dataset E are subjected to a similarity analysis using predefined evaluation indicators.
Step 741: mapping all samples in dataset Ω and dataset E into the new coordinate system, and the specific process is the same as in steps 731-733.
Step 742: r is calculated Ω ∪R E The probability of sample distribution in the subareas of (a) is p Ω,l And p E,l L=1,.. the specific process of L is the same as steps 734-735.
Step 743: with JS scattering JS (P) Ω |P E ) As an index, the similarity of the data set Ω and the data set E was evaluated. The value range of JS divergence is [0,1 ]]The closer to 0 means the higher the similarity of the two data sets, calculated as follows:
Wherein P is Ω And P E Representing the distribution laws of dataset Ω and dataset E, respectively.
Step 744: based on predefined evaluation rules, the similarity between the original dataset Ω and the expanded dataset E is analyzed.
Fig. 8 is a schematic diagram of an embodiment of a communication data set processing apparatus.
The embodiment of the application also provides a communication data set processing device, which is used for implementing the method of any embodiment of the first aspect of the application, and comprises the following steps:
an acquisition module 81 is configured to acquire the communication data set and the first and second reference samples. The functions implemented by the acquisition module may specifically include the functions described in steps 61 to 62 and steps 71 to 72 of the embodiment.
A generating module 82, configured to generate the first angle parameter set and the second angle parameter set; and is further configured to generate a first set of locations and a second set of locations. The functions implemented by the generating module may specifically include the functions described in the embodiment steps 631 to 633 and 731 to 733.
A determining module 83, configured to determine a target data set, where the target data set includes a data set corresponding to a location set with a greater degree of dispersion in the first location set and the second location set. The functions implemented by the determining module may specifically also include the functions described in the embodiment steps 634, 64, 734 to 736.
In one embodiment, the determining module is further configured to determine a target data set, where the target data set includes a data set corresponding to the position sets when the similarity between the first position set and the second position set is high, preferably when the JS divergence between the first position set and the second position set is less than a set threshold. The functions of the determining module may in particular also include the functions described in steps 65 and 74.
It should be further noted that, based on the embodiments shown in fig. 1 to 8 of the present application, a target data set is generated, and in one embodiment of the present application, a first entropy value and a second entropy value are calculated according to statistical characteristics of the first location set and the second location set, respectively; and determining a data set corresponding to the position set with a larger entropy value. Alternatively, in one embodiment of the present application, the first data set and the second data set are blended in the target data set in response to the degree of discretization of both the first set of locations and the second set of locations being less than a set threshold. In the preferred embodiment of the present application, the degree of dispersion, i.e., diversity, is expressed in terms of entropy.
To generate the target data set, in another embodiment of the application, the JS divergence between the first position set and the second position set is calculated according to their statistical properties, and a data set whose JS divergence is less than the set threshold is determined; or, in one embodiment of the present application, in response to the JS divergence being greater than the set threshold, either data set is included in the target data set and data samples are added in a set proportion, the added data samples being drawn from the other data set. In the preferred embodiment of the present application, the JS divergence is used to represent similarity. 'Either data set' here refers to the first data set or the second data set.
Specifically, for example: when the conditions H(Ω) ≥ H(E), H(Ω) ≥ H_set and JS(P_Ω|P_E) ≤ JS_set are satisfied, where H_set and JS_set are the preset thresholds of the entropy value and the JS divergence respectively, data set Ω is similar to data set E, and the diversity of data set Ω is higher than that of data set E and meets the threshold requirement; data set Ω is then used as the target data set for model training;
as another example, when the conditions H(E) ≥ H(Ω), H(E) ≥ H_set and JS(P_Ω|P_E) ≤ JS_set are satisfied, data set E is used as the target data set for model training;
as another example, when the conditions H(Ω) ≥ H(E), H(Ω) ≥ H_set and JS(P_Ω|P_E) > JS_set are satisfied, the diversity of data set Ω is higher than that of data set E and meets the threshold requirement, but the similarity between the two data sets is low; in this case the target data set mainly consists of data set Ω, data samples are extracted from data set E in the proportion γ and added to the target data set, and model training is then performed;
as another example, when the conditions H(E) ≥ H(Ω), H(E) ≥ H_set and JS(P_Ω|P_E) > JS_set are satisfied, the target data set mainly consists of data set E, and data samples are extracted from data set Ω in the proportion γ and added to the target data set used in the model training process;
as another example, when the conditions H(E) ≤ H_set and H(Ω) ≤ H_set are satisfied, neither data set meets the requirement, and the entropy of the mixture of the two data sets, i.e. H(Ω ∪ E), is calculated. When H(Ω ∪ E) < H_set, neither data set is used for model training; when H(Ω ∪ E) ≥ H_set, model training is performed on the target data set generated by mixing the two data sets.
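A sketch of the selection rules above; the thresholds H_set and JS_set, the mixing proportion γ and the helper names are configuration parameters or assumptions of this sketch (in particular, the base of the proportion γ and the way H(Ω ∪ E) is obtained are not fixed by the application):

```python
import numpy as np

def select_target_dataset(ds_omega, ds_e, H_omega, H_e, H_union, js,
                          H_set, JS_set, gamma, rng=None):
    """Assemble the target data set for model training according to the
    entropy/JS-divergence rules; returns None when no data set qualifies."""
    rng = rng or np.random.default_rng()

    def mix(main, other):
        # add a proportion gamma of samples drawn from the other data set
        k = min(int(gamma * len(other)), len(other))
        idx = rng.choice(len(other), size=k, replace=False)
        return np.concatenate([np.asarray(main), np.asarray(other)[idx]], axis=0)

    if H_omega >= H_e and H_omega >= H_set:
        return np.asarray(ds_omega) if js <= JS_set else mix(ds_omega, ds_e)
    if H_e >= H_omega and H_e >= H_set:
        return np.asarray(ds_e) if js <= JS_set else mix(ds_e, ds_omega)
    if H_union >= H_set:                     # neither alone qualifies: test the mixture
        return np.concatenate([np.asarray(ds_omega), np.asarray(ds_e)], axis=0)
    return None                              # no data set (nor the mixture) meets H_set
```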
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Accordingly, the present application also proposes a computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements a method as described in any of the embodiments of the present application.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Further, the present application also proposes an electronic device (or computing device) comprising a memory, a processor and a computer program stored on the memory and executable by the processor, said processor implementing a method according to any of the embodiments of the present application when said computer program is executed.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media. Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (12)

1. A method of processing a communication data set, the communication data set comprising one or more of wireless air interface data, signaling data, or network data, the communication data set comprising a first data set and a second data set, comprising:
taking any 2 data samples in the communication data set as a first reference sample and a second reference sample, respectively, the two reference samples forming a target plane in a vector space;
determining the included angles between each data sample in the first data set and the 2 reference samples in the vector space, respectively, and generating a first angle parameter set;
determining the included angles between each data sample in the second data set and the 2 reference samples in the vector space, respectively, and generating a second angle parameter set;
projecting each angle parameter in the first angle parameter set and the second angle parameter set onto the target plane to generate a first position set and a second position set, respectively;
and determining the data set corresponding to whichever of the first position set and the second position set has the greater degree of dispersion, and adding that data set to a target data set for training a neural network model.
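The flow of claim 1 can be illustrated with a minimal Python sketch. The helper names (angle_pair, positions, dispersion, select_target), the use of variance as the spread measure, and the treatment of the two included angles as plane coordinates are assumptions made here for illustration, not the claimed implementation; claims 5 and 7 describe the claimed dispersion measure.

import numpy as np

def angle_pair(x, ref1, ref2):
    # Included angles (radians) between sample x and the two reference samples.
    cos1 = np.dot(x, ref1) / (np.linalg.norm(x) * np.linalg.norm(ref1))
    cos2 = np.dot(x, ref2) / (np.linalg.norm(x) * np.linalg.norm(ref2))
    return np.arccos(np.clip(cos1, -1.0, 1.0)), np.arccos(np.clip(cos2, -1.0, 1.0))

def positions(dataset, ref1, ref2):
    # Treat each sample's pair of angles as its coordinates on the target plane.
    return np.array([angle_pair(x, ref1, ref2) for x in dataset])

def dispersion(pos):
    # One simple spread measure: mean variance of the plane coordinates.
    return float(np.var(pos, axis=0).mean())

def select_target(ds1, ds2, ref1, ref2):
    # Keep the data set whose projected positions are more dispersed.
    pos1, pos2 = positions(ds1, ref1, ref2), positions(ds2, ref1, ref2)
    return ds1 if dispersion(pos1) >= dispersion(pos2) else ds2
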
2. The communication data set processing method of claim 1, wherein the first reference sample belongs to the first data set and the second reference sample belongs to the second data set.
3. The communication data set processing method of claim 1, wherein determining an included angle between any data sample and a reference sample comprises: in the vector space, calculating the cosine similarity between the data sample and the reference sample; and determining the included angle between the data sample and the reference sample in the vector space according to the cosine similarity.
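A minimal sketch of claim 3 in Python, assuming samples are real-valued vectors; the clamping step is a numerical-safety detail assumed here rather than stated in the claim.

import numpy as np

def angle_from_cosine(sample, reference):
    # Cosine similarity between the two vectors in the shared vector space.
    cos_sim = np.dot(sample, reference) / (np.linalg.norm(sample) * np.linalg.norm(reference))
    # Clamp against floating-point drift, then convert the similarity to the included angle.
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))
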
4. The communication data set processing method of claim 1, wherein polar coordinates are established in the target plane, and a first reference position and a second reference position are determined according to the vector included angle between the first reference sample and the second reference sample; the first position set is generated by superimposing the projection of the first angle parameter set onto the target plane according to the first reference position; and the second position set is generated by superimposing the projection of the second angle parameter set onto the target plane according to the second reference position.
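One possible reading of claim 4, sketched in Python: the included angle between the two reference samples fixes a first and a second reference position in polar coordinates on the target plane, and each data set's angle parameters are superimposed around its own reference position. The choice of which angle serves as the radius and which as the polar-angle offset is an assumption made for illustration only.

import numpy as np

def reference_positions(ref1, ref2):
    # Polar angles of the two reference positions, separated by their vector included angle.
    cos_sim = np.dot(ref1, ref2) / (np.linalg.norm(ref1) * np.linalg.norm(ref2))
    included = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    return 0.0, included

def position_set(angle_params, reference_angle):
    # angle_params: list of (angle to own reference, angle to other reference) pairs.
    # Superimpose each pair on the reference position and return (radius, polar angle)
    # pairs: polar angle = reference angle plus the offset, radius = angle to the other
    # reference (an illustrative convention).
    return [(a_other, reference_angle + a_own) for a_own, a_other in angle_params]
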
5. The communication data set processing method of claim 1, wherein,
determining a degree of dispersion of the first position set, comprising: determining, on the target plane, a distribution area of the first position set as a first distribution area, and dividing the first distribution area by a partition angle step to obtain a first distribution area partition set; and calculating the probabilities of the first position set falling within the different partitioned areas in the first distribution area partition set;
determining a degree of dispersion of the second position set, comprising: determining, on the target plane, a distribution area of the second position set as a second distribution area, and dividing the second distribution area by the partition angle step to obtain a second distribution area partition set; and calculating the probabilities of the second position set falling within the different partitioned areas in the second distribution area partition set.
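A sketch of the dispersion measure in claim 5, assuming each position is summarized by a polar angle on the target plane; the default partition angle step and the normalized-histogram probability estimate are assumptions for illustration.

import numpy as np

def partition_probabilities(polar_angles, step=np.pi / 36):
    # Divide the occupied angular range by a fixed angle step and estimate the
    # probability of the position set falling within each partitioned area.
    angles = np.asarray(polar_angles, dtype=float)
    lo, hi = float(angles.min()), float(angles.max())
    n_bins = max(1, int(np.ceil((hi - lo) / step)))
    counts, _ = np.histogram(angles, bins=n_bins, range=(lo, hi))
    return counts / counts.sum()
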
6. The communication data set processing method of claim 1, wherein the second data set is obtained by expanding the first data set, comprising:
training a target generative artificial intelligence model using N data samples in the first data set;
combining the N data samples in the first data set with M-N virtual samples constructed using the target generative artificial intelligence model to generate the second data set comprising M data samples, where M > N.
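A schematic of the expansion in claim 6; the trained generative model is represented here by a hypothetical generate callable, since the claim does not fix a particular model architecture.

def expand_dataset(first_dataset, generate, m):
    # Combine the N real samples with M - N virtual samples drawn from the trained
    # generative model to form the second data set of M samples (requires M > N).
    n = len(first_dataset)
    if m <= n:
        raise ValueError("M must be greater than N")
    virtual_samples = [generate() for _ in range(m - n)]
    return list(first_dataset) + virtual_samples

Here generate() might wrap sampling from, for example, a GAN, VAE, or diffusion model trained on the N samples of the first data set.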
7. The communication data set processing method of claim 1, wherein,
calculating a first entropy value and a second entropy value according to the statistical characteristics of the first position set and the second position set, respectively;
and determining the data set corresponding to the position set with the larger entropy value.
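A sketch for claim 7, computing Shannon entropy over per-partition probabilities such as those produced by the claim 5 procedure; the base-2 logarithm is an assumption.

import numpy as np

def shannon_entropy(probabilities):
    # Shannon entropy of a discrete probability vector; empty partitions are skipped.
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# The data set whose position set yields the larger entropy enters the target data set:
# target = first_data_set if shannon_entropy(p1) >= shannon_entropy(p2) else second_data_set
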
8. The communication data set processing method of claim 1, wherein,
calculating the JS divergence between the first position set and the second position set according to their statistical characteristics;
and in response to the JS divergence being greater than a set threshold, including either one of the data sets in the target data set and adding data samples in a set proportion, the added data samples being drawn from the other data set.
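A sketch of the JS-divergence branch in claim 8, assuming the two position sets have been binned onto a common partition; the threshold value, the mixing proportion, and the random selection of the added samples are illustrative assumptions.

import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions on the same partitions.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * np.log2(a / b)).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def build_target(ds1, ds2, p1, p2, threshold=0.1, proportion=0.2, seed=0):
    # If the two distributions differ strongly, keep one data set and add a set
    # proportion of samples drawn from the other data set.
    if js_divergence(p1, p2) <= threshold:
        return list(ds1)
    rng = np.random.default_rng(seed)
    k = min(len(ds2), int(proportion * len(ds1)))
    picked = rng.choice(len(ds2), size=k, replace=False)
    return list(ds1) + [ds2[i] for i in picked]
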
9. The communication data set processing method of claim 1, wherein,
in response to the degrees of dispersion of both the first position set and the second position set being less than a set threshold, blending the first data set and the second data set into the target data set.
10. A communication data set processing device for implementing the method of any one of claims 1 to 9, comprising:
an acquisition module, configured to acquire the communication data set and the first and second reference samples;
a generating module, configured to generate the first angle parameter set and the second angle parameter set, and to generate the first position set and the second position set;
and a determining module, configured to determine the target data set, wherein the target data set comprises the data set corresponding to whichever of the first position set and the second position set has the greater degree of dispersion.
11. A computer readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 9.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-9 when executing the computer program.
CN202311662402.3A 2023-12-06 2023-12-06 Communication data set processing method and device Pending CN117807434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311662402.3A CN117807434A (en) 2023-12-06 2023-12-06 Communication data set processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311662402.3A CN117807434A (en) 2023-12-06 2023-12-06 Communication data set processing method and device

Publications (1)

Publication Number Publication Date
CN117807434A true CN117807434A (en) 2024-04-02

Family

ID=90424453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311662402.3A Pending CN117807434A (en) 2023-12-06 2023-12-06 Communication data set processing method and device

Country Status (1)

Country Link
CN (1) CN117807434A (en)

Patent Citations (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273926A (en) * 2017-06-12 2017-10-20 大连海事大学 A kind of linear discriminant analysis dimension reduction method weighted based on cosine similarity
US20210174149A1 (en) * 2018-11-20 2021-06-10 Xidian University Feature fusion and dense connection-based method for infrared plane object detection
CN109657615A (en) * 2018-12-19 2019-04-19 腾讯科技(深圳)有限公司 A kind of training method of target detection, device and terminal device
CN112529025A (en) * 2019-09-17 2021-03-19 华为技术有限公司 Data processing method and device
CN112529172A (en) * 2019-09-18 2021-03-19 华为技术有限公司 Data processing method and data processing apparatus
CN112949780A (en) * 2020-04-21 2021-06-11 佳都科技集团股份有限公司 Feature model training method, device, equipment and storage medium
CN111582477A (en) * 2020-05-09 2020-08-25 北京百度网讯科技有限公司 Training method and device of neural network model
WO2021238281A1 (en) * 2020-05-26 2021-12-02 华为技术有限公司 Neural network training method, image classification system, and related device
CN113743426A (en) * 2020-05-27 2021-12-03 华为技术有限公司 Training method, device, equipment and computer readable storage medium
US20220051106A1 (en) * 2020-08-12 2022-02-17 Inventec (Pudong) Technology Corporation Method for training virtual animal to move based on control parameters
CN112884040A (en) * 2021-02-19 2021-06-01 北京小米松果电子有限公司 Training sample data optimization method and system, storage medium and electronic equipment
CN113902022A (en) * 2021-10-14 2022-01-07 安徽江淮汽车集团股份有限公司 Method for acquiring similarity of data samples of automatic driving scene
CN116086370A (en) * 2021-11-08 2023-05-09 北京遥感设备研究所 Fastening point positioning method and device based on three-dimensional space
CN114445562A (en) * 2022-02-17 2022-05-06 北京大甜绵白糖科技有限公司 Three-dimensional reconstruction method and device, electronic device and storage medium
WO2023165361A1 (en) * 2022-03-02 2023-09-07 华为技术有限公司 Data processing method and related device
CN117033995A (en) * 2022-08-26 2023-11-10 腾讯科技(深圳)有限公司 Training method, device, equipment, medium and product of prediction model
CN115687924A (en) * 2022-10-31 2023-02-03 中国农业银行股份有限公司 Model training method and device, electronic equipment and storage medium
CN115797632A (en) * 2022-12-01 2023-03-14 北京科技大学 Image segmentation method based on multi-task learning
CN116124047A (en) * 2023-01-19 2023-05-16 西北工业大学 Roughness anisotropic parameter measurement method based on normal vector statistical characteristics
CN116304710A (en) * 2023-03-22 2023-06-23 平安科技(深圳)有限公司 Complementary sample generation method, device, equipment and storage medium
CN116485633A (en) * 2023-04-10 2023-07-25 北京城市网邻信息技术有限公司 Point cloud display diagram generation method and device, electronic equipment and storage medium
CN116109799A (en) * 2023-04-13 2023-05-12 深圳思谋信息科技有限公司 Method, device, computer equipment and storage medium for training adjustment model
CN116719934A (en) * 2023-05-23 2023-09-08 中国人民解放军国防科技大学 Method for extracting small sample relation under continuous learning based on prompt contrast learning
CN116775790A (en) * 2023-06-21 2023-09-19 杭州电子科技大学信息工程学院 Self-adaptive coordinate conversion algorithm based on accurate point positioning of model
CN116596044A (en) * 2023-07-18 2023-08-15 华能山东发电有限公司众泰电厂 Power generation load prediction model training method and device based on multi-source data
CN117033956A (en) * 2023-07-20 2023-11-10 深圳市大数据研究院 Data processing method, system, electronic equipment and medium based on data driving
CN116975651A (en) * 2023-07-24 2023-10-31 腾讯科技(深圳)有限公司 Similarity determination model processing method, target object searching method and device
CN117056771A (en) * 2023-07-26 2023-11-14 杭州电子科技大学 Migration learning method combining intra-class feature alignment and flexible super-parameter measurement learning
CN117077018A (en) * 2023-10-12 2023-11-17 微网优联科技(成都)有限公司 Data processing method, device and storage medium based on machine learning

Similar Documents

Publication Publication Date Title
Kasparick et al. Kernel-based adaptive online reconstruction of coverage maps with side information
CN102938071B (en) Fuzzy clustering analysis method for detecting synthetic aperture radar (SAR) image changes based on non-local means
CN110555841B (en) SAR image change detection method based on self-attention image fusion and DEC
CN112508044A (en) Artificial intelligence AI model evaluation method, system and equipment
CN107545000A (en) The information-pushing method and device of knowledge based collection of illustrative plates
CN112218330B (en) Positioning method and communication device
RU2008105393A (en) COMPUTER-BASED APPLICATION FORMATION AND VERIFICATION OF TRAINING IMAGES INTENDED FOR MULTI-POINT THEOSTATISTIC ANALYSIS
CN108446616B (en) Road extraction method based on full convolution neural network ensemble learning
CN112216356A (en) High-entropy alloy hardness prediction method based on machine learning
Redondi Radio map interpolation using graph signal processing
CN109002792B (en) SAR image change detection method based on layered multi-model metric learning
Chin et al. Intelligent indoor positioning based on artificial neural networks
CN107908807B (en) Small subsample reliability evaluation method based on Bayesian theory
CN109874104A (en) User location localization method, device, equipment and medium
CN108846381A (en) SAR image change detection based on maximal margin metric learning
EP3902313A1 (en) Field strength testing method
Zhang et al. Floor recognition based on SVM for WiFi indoor positioning
CN112867021B (en) Improved TrAdaBoost-based indoor positioning method for transfer learning
Shodamola et al. Towards addressing the spatial sparsity of MDT reports to enable zero touch network automation
Si et al. An efficient deep convolutional neural network with features fusion for radar signal recognition
Qin et al. A wireless sensor network location algorithm based on insufficient fingerprint information
CN113947280A (en) Combined evaluation method based on feedback adjustment weight
CN116528282B (en) Coverage scene recognition method, device, electronic equipment and readable storage medium
CN117807434A (en) Communication data set processing method and device
CN115424000A (en) Pointer instrument identification method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination