CN117614845A

CN117614845A - Communication information processing method and device based on big data analysis

Info

Publication number: CN117614845A
Application number: CN202311513146.1A
Authority: CN
Inventors: 康波峰; 黄明金; 周雯; 熊刚
Original assignee: Weichuang Software Wuhan Co ltd
Current assignee: Weichuang Software Wuhan Co ltd
Priority date: 2023-11-13
Filing date: 2023-11-13
Publication date: 2024-02-27
Anticipated expiration: 2043-11-13
Also published as: CN117614845B

Abstract

The invention discloses a communication information processing method and a device based on big data analysis, which relate to the technical field of communication information processing, and the method comprises the following steps: collecting initial communication data of a user, and storing the initial communication data into a communication database according to a data type; preprocessing the acquired initial communication data to obtain standard communication data; designing a feature combination of standard communication data, and extracting features of the standard communication data according to the feature combination to obtain a first feature set; screening the first feature set according to a screening rule to obtain a second feature set; acquiring a communication task list, and analyzing and processing the second feature set according to the communication tasks in the communication task list to obtain analysis results corresponding to the communication tasks; and visualizing the analysis result by using a visualization tool. The invention utilizes the advantage of big data analysis to carry out personalized analysis on the communication data, and provides personalized, intelligent and accurate service for users.

Description

Communication information processing method and device based on big data analysis

Technical Field

The invention belongs to the technical field of communication information processing, and particularly relates to a communication information processing method and device based on big data analysis.

Background

With the development and popularization of communication technology, a great deal of data is generated by the communication behavior of people, including short messages, telephone call records, social media messages and the like. The communication data contains rich information, and can be used in various fields such as personal communication habit analysis, social relationship mining, business marketing and the like. However, conventional communication data processing methods often face many challenges and limitations, including large data volume, various data types, and uneven data quality.

At present, conventional database management systems and data mining technologies are often used for processing communication data. These methods perform well in processing structured data, but have limitations in processing unstructured communication data. For example, conventional database management systems may not be able to efficiently process large-scale text data, while data mining techniques may face inefficiencies in feature extraction and task analysis. In addition, the traditional communication data processing method often lacks deep understanding and mining of the communication behaviors of the user, and is difficult to provide personalized and accurate service for the user.

The invention patent with Chinese application number 202210491734.9 discloses a big-data-based automatic communication information analysis system and equipment, in which the fields of particular interest to a user are summarized from the user's search history, and a search information analysis unit then searches with a search engine according to the previously summarized fields of interest and the keywords input by the user. In the prior art, coarse recommendations are made from the user's local files, online browsing records, search interests and the like, without deep mining and analysis of the communication information; user feedback usually has to be obtained through irregular pushes before subsequent recommendations can be optimized, which is not intelligent enough and gives a poor user experience.

Disclosure of Invention

In view of the above, the invention provides a communication information processing method and device based on big data analysis, which performs personalized analysis on communication data by utilizing the advantages of big data analysis, provides personalized, intelligent and precise services for users, improves the efficiency and quality of communication data processing, realizes deeper and comprehensive analysis on the communication data, and improves the utilization value of the communication data.

The technical purpose of the invention is realized as follows:

In one aspect, the invention provides a communication information processing method based on big data analysis, which comprises the following steps:

S1, collecting initial communication data of a user, and storing the initial communication data into a communication database according to data type;

S2, preprocessing the collected initial communication data to obtain standard communication data, and storing the standard communication data into the communication database;

S3, designing feature combinations of the standard communication data, and performing feature extraction on the standard communication data according to the feature combinations to obtain a first feature set;

S4, screening the first feature set according to a screening rule to obtain a second feature set;

S5, acquiring a communication task list, and analyzing and processing the second feature set according to the communication tasks in the communication task list to obtain analysis results corresponding to the communication tasks;

S6, visualizing the analysis results by using a visualization tool, and displaying the results to the user according to user requirements.

Based on the above technical solution, preferably, step S2 includes:

S21, encoding the collected initial communication data to obtain a unique identifier of the initial communication data;

S22, judging whether the initial communication data is duplicated according to the unique identifier, and if the unique identifier is duplicated, merging the corresponding initial communication data to obtain first communication data;

S23, setting a first threshold δ₁, a second threshold δ₂ and a third threshold δ₃, calculating the number N₁ of missing characters in the first communication data, and comparing N₁ with the first threshold δ₁, the second threshold δ₂ and the third threshold δ₃:

if the number of missing characters of the first communication data satisfies δ₁ < N₁ ≤ δ₂, classifying the first communication data as first data to be repaired;

if the number of missing characters of the first communication data satisfies δ₂ < N₁ ≤ δ₃, classifying the first communication data as second data to be repaired;

if the number of missing characters of the first communication data satisfies δ₃ < N₁, classifying the first communication data as third data to be repaired;

S24, respectively processing the first data to be repaired, the second data to be repaired and the third data to be repaired to obtain second communication data;

S25, performing anomaly identification on the second communication data by using an anomaly detection method to obtain abnormal data, and repairing the abnormal data to obtain third communication data;

S26, performing format conversion and normalization on the third communication data to obtain standard communication data.

Based on the above technical solution, preferably, step S24 includes:

deleting the first data to be repaired from the first communication data;

acquiring the time stamp of the second data to be repaired as a target time stamp; taking the target time stamp as the origin, searching forward in the first communication data for Y non-missing communication data and searching backward for Y non-missing communication data; calculating a weighted average of the forward-searched and backward-searched non-missing communication data, and filling the second data to be repaired with the weighted average to obtain second repair data, wherein the weight of the forward-searched data is less than the weight of the backward-searched data;

predicting the missing value of the third data to be repaired by adopting a pre-trained random forest model, and filling the missing value in the third data to be repaired according to the prediction result of the random forest model to obtain third repair data;

and updating the first communication data by using the second repair data and the third repair data to obtain second communication data.

On the basis of the above technical solution, preferably, step S25 includes:

S251, traversing the second communication data, calculating a first distance between each second communication data and the other second communication data, forming a distance sorting table by arranging the first distances from small to large, selecting the first m communication data as the neighbor set of the current second communication data, and taking the second communication data and the corresponding neighbor set as a relation set;

S252, traversing the relation set, calculating a second distance d₂ between each second communication data and its neighbors according to the first distance, and updating the second distance into the relation set, wherein the calculation formula of the second distance is as follows:

where d₂ is the second distance between the i-th neighbor in the neighbor set of the current second communication data and the current second communication data; the formula takes as inputs the first distance between that i-th neighbor and the current second communication data, and the first distance between that i-th neighbor and the m-th second communication data in the distance sorting table of the current second communication data;

S253, traversing the relation set, calculating the distance density of each second communication data according to the second distance, and updating the distance density into the relation set, wherein the calculation formula of the distance density is as follows:

where ρ_d is the distance density of the current second communication data, Σd₂ is the sum of the second distances of the current second communication data, and m is the number of neighbors in the neighbor set of the current second communication data;

S254, setting a density threshold, calculating the local density of each second communication data according to the distance density, and taking the second communication data whose local density is larger than the density threshold as abnormal data, wherein the calculation formula of the local density is as follows:

where the quantity being defined is the local density of the current second communication data, Σρ_d is the sum of the distance densities of all the second communication data, and A is the number of second communication data;

S255, acquiring the time stamp of the abnormal data, setting a time interval, sequentially acquiring, going forward from the time stamp of the abnormal data as the origin, normal data of n time intervals in the second communication data, and repairing the abnormal data by using the normal data of the n time intervals to obtain third communication data.

Based on the above technical solution, preferably, in step S255, the formula of anomaly repair is as follows:

where the left-hand side represents the repaired abnormal data, x^k(t) represents the normal data of the k-th time interval, and λ_k is the weighting coefficient of the normal data of the k-th time interval;

where λ_k is calculated as follows:

where the formula comprises the initially assigned weight of the normal data of the k-th time interval and a time attenuation term, r is the attenuation factor, t_k is the time value of the normal data of the k-th time interval, t_x is the time value corresponding to the abnormal data, and g_{k~x} is the correlation coefficient between the normal data of the k-th time interval and the corresponding abnormal data.
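As an illustration only, the weighted repair of step S255 can be sketched as follows. The sketch assumes that λ_k is the product of the initially assigned weight, an exponential time attenuation term exp(−r·|t_x − t_k|) and the correlation coefficient g_{k~x}, that the coefficients are normalized to sum to 1, and that the repaired value is the λ-weighted sum of the normal data; these forms, the function name `repair_value` and the sample values are assumptions, not the formulas of the invention.

```python
import math


def repair_value(normal, t_x, r=0.1):
    """Repair one abnormal value from normal data of n time intervals.

    normal: list of (t_k, x_k, w_k, g_k) tuples, where w_k is the initially
    assigned weight and g_k the correlation coefficient (hypothetical layout).
    """
    # Assumed form: lambda_k = w_k * exp(-r * |t_x - t_k|) * g_k
    raw = [w * math.exp(-r * abs(t_x - t)) * g for t, _, w, g in normal]
    total = sum(raw)
    if total == 0:
        raise ValueError("all weighting coefficients are zero")
    # Assumed normalization so that the coefficients sum to 1.
    lambdas = [v / total for v in raw]
    repaired = sum(lam * x for lam, (_, x, _, _) in zip(lambdas, normal))
    return repaired, lambdas


# Hypothetical normal data from the three intervals preceding t_x = 10.
normal_data = [(9, 5.0, 1.0, 0.9), (8, 5.0, 1.0, 0.9), (7, 5.0, 1.0, 0.9)]
repaired, lambdas = repair_value(normal_data, t_x=10)
```

With equal initial weights and correlation coefficients, the attenuation term alone makes intervals closer to the abnormal data count more.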

Based on the above technical solution, preferably, step S3 includes:

S31, determining B feature types according to the characteristics of the standard communication data, and performing feature recombination according to the determined B feature types to obtain C feature combinations, wherein the feature recombination modes comprise feature operation, feature crossing and feature transformation;

S32, according to the designed C feature combinations and the determined B feature types, performing the corresponding feature extraction on each standard communication data, namely extracting B+C features from each standard communication data, taking the B+C features as first features, and forming a first feature set from the first features of all the standard communication data.

Based on the above technical solution, preferably, step S4 includes:

S41, forming the first features into a matrix with B+C dimensions as a first feature matrix;

S42, performing redundancy removal on the first feature matrix by using a selection function to obtain second features, wherein the selection function is as follows:

F = Sigmoid(conv(fc(AP(X))))

where F is the feature screening function, Sigmoid denotes the activation function, conv denotes convolution processing, fc denotes fully connected layer processing, AP denotes adaptive pooling processing, and X is the first feature matrix;

S43, forming all the second features into a second feature set.
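The composition F = Sigmoid(conv(fc(AP(X)))) can be illustrated with a small pure-Python sketch. It assumes that AP averages each feature column over all samples, that fc is a single linear layer, that conv is a one-dimensional convolution over the feature axis with zero padding and an odd kernel length, and that features whose gate exceeds 0.5 are kept. The weights below are fixed, hypothetical values (in practice they would be learned), and the keep rule is an assumption.

```python
import math


def adaptive_avg_pool(X):
    """AP: average every feature column over the samples (output size 1)."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]


def fully_connected(v, W, b):
    """fc: one linear layer, W is a list of weight rows, b the bias."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def conv1d_same(v, kernel):
    """conv: 1-D convolution over the feature axis, zero padded, odd kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(v) + [0.0] * pad
    return [sum(k * padded[i + t] for t, k in enumerate(kernel))
            for i in range(len(v))]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def selection_gate(X, W, b, kernel):
    """F = Sigmoid(conv(fc(AP(X)))): one gate value per feature."""
    return sigmoid(conv1d_same(fully_connected(adaptive_avg_pool(X), W, b), kernel))


# Hypothetical first feature matrix: 2 samples x 3 features.
X = [[1.0, 0.0, -2.0], [3.0, 0.0, -4.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity, for clarity
b = [0.0, 0.0, 0.0]
kernel = [0.0, 1.0, 0.0]                                   # pass-through kernel
gate = selection_gate(X, W, b, kernel)
kept = [j for j, g in enumerate(gate) if g > 0.5]          # assumed keep rule
```

With identity weights the gate reduces to the sigmoid of each column mean, which makes the data flow through the four stages easy to follow.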

Based on the above technical solution, preferably, step S5 includes:

S51, acquiring a communication task list, wherein the communication tasks comprise classification tasks, clustering tasks and recommendation tasks;

S52, when the communication task is a classification task, determining a classification target, selecting the features corresponding to the required data types from the second feature set according to the classification target as classification features, and performing classification prediction on the classification features by using a plurality of SVMs to obtain a classification result;

S53, when the communication task is a clustering task, determining a clustering target, determining the number of clusters according to the clustering target, selecting the features corresponding to the required data types from the second feature set as clustering features, performing k-means++ clustering analysis on the clustering features according to the number of clusters to obtain the cluster label of each cluster, and taking the clusters and their cluster labels as the clustering result;

S54, when the communication task is a recommendation task, determining a recommendation target, selecting the features corresponding to the required data types from the second feature set according to the recommendation target as recommendation features, and processing the recommendation features with a recommendation algorithm to obtain a recommendation result.
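For the clustering task of step S53, a minimal pure-Python sketch of k-means with k-means++ seeding is given below. The sample points, the fixed iteration count and the function names are assumptions made for illustration; a production system would more likely rely on an existing library implementation.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each further center is drawn with probability
    proportional to its squared distance from the nearest chosen center.
    Requires at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Hypothetical clustering features: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

The cluster labels returned here are integer indices; mapping them to meaningful names depends on the clustering target.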

Based on the above technical solution, preferably, in step S54, the recommendation algorithm includes:

step one, setting each recommendation feature as a node of the recommendation algorithm and putting it into an open list, and evaluating the nodes in the open list according to the correlation index of the recommendation features to obtain evaluation scores, wherein the correlation index of a recommendation feature is obtained by mining the correlation between the recommendation feature and adjacent recommendation features according to association rules;

step two, sorting the evaluation scores in descending order, selecting the node with the highest evaluation score as the starting node of the recommendation algorithm, and putting the starting node into a closed list;

step three, calculating, for each node in the open list, a weighted value of its evaluation score and those of all nodes in the closed list, selecting the node in the open list with the largest weighted value, putting it into the closed list, and updating the weighted values of the open list;

step four, repeating step three until the open list is empty.
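The four steps above can be sketched as a greedy ordering over an open list and a closed list. How the weighted value of step three combines the scores is not fixed by the text, so the sketch assumes it is a blend of the node's own evaluation score and its mean association with the nodes already in the closed list; the blend factor `alpha`, the `assoc` table and the sample scores are hypothetical.

```python
def recommend_order(scores, assoc, alpha=0.5):
    """Return nodes in the order they enter the closed list.

    scores: {node: evaluation score}
    assoc:  {frozenset((a, b)): association strength between a and b}
    """
    if not scores:
        return []
    open_list = dict(scores)
    # Step two: the node with the highest evaluation score starts the list.
    start = max(open_list, key=open_list.get)
    closed = [start]
    del open_list[start]
    # Steps three and four: move the best open node until none remain.
    while open_list:
        def weighted(node):
            link = sum(assoc.get(frozenset((node, c)), 0.0) for c in closed)
            # Assumed weighting: own score blended with mean association.
            return alpha * open_list[node] + (1 - alpha) * link / len(closed)

        best = max(open_list, key=weighted)
        closed.append(best)
        del open_list[best]
    return closed


# Hypothetical evaluation scores and pairwise associations.
scores = {"A": 0.9, "B": 0.6, "C": 0.7}
assoc = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.5,
}
order = recommend_order(scores, assoc)
```

In this example B is placed before C although C has the higher evaluation score, because B is more strongly associated with the starting node A.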

In another aspect, the present invention also provides a communication information processing apparatus based on big data analysis, the apparatus being configured to perform the method of any one of the above, the apparatus comprising:

the data acquisition module is internally provided with a communication database and is used for acquiring initial communication data of a user and storing the initial communication data into the communication database according to data types;

the data processing module is used for preprocessing initial communication data to obtain standard communication data, and storing the standard communication data into a communication database, wherein the preprocessing comprises repeated data deletion, missing data identification, missing value filling, abnormal data identification and repair;

the feature storage module is used for designing a feature combination of the standard communication data, extracting features of the standard communication data according to the feature combination to obtain a first feature set, screening the first feature set according to a screening rule to obtain a second feature set, and storing the first feature set and the second feature set;

the data analysis module is used for acquiring a communication task list, analyzing and processing the second feature set according to the communication tasks in the communication task list to obtain analysis results corresponding to the communication tasks, and storing the analysis results;

The visual display module is used for visualizing the analysis result and displaying the visualized analysis result and corresponding communication data to the user according to the user demand.

Compared with the prior art, the method has the following beneficial effects:

(1) According to the method, the communication data is deeply preprocessed, the usability of the communication data is improved, so that the communication data with low quality originally can be used after being processed, the waste of resources is avoided, customized analysis is performed based on task driving, corresponding analysis results are obtained according to different communication tasks, the pertinence and the practicability of the analysis are improved, visual display is performed, and a user is helped to better use the communication data;

(2) According to the invention, the standard communication data is finally obtained by encoding, de-duplication, missing value processing, anomaly detection and repair of the initial communication data, and the standard communication data is stored in the communication database, so that the quality and consistency of the data are improved, the integrity and accuracy of the communication data are improved, and the reliability of subsequent analysis and application is ensured;

(3) According to the invention, through feature recombination and extraction, feature types can be determined according to the characteristics of standard communication data, and features are recombined to obtain a new feature combination, which is helpful for extracting more representative and effective features, so that the features and modes of the communication data are better represented, and then redundancy removal is performed on the first feature matrix by using a selection function to obtain a second feature, which is helpful for reducing the dimension of the feature matrix, removing redundant information, improving the compactness and effectiveness of the features, and facilitating subsequent data processing and analysis;

(4) According to the invention, classification prediction, cluster analysis and recommendation effects are carried out according to different communication task types, so that a plurality of different data processing and analysis modes are provided for users, the application field of communication data is expanded, and the application value and practicability of the data are improved;

(5) The recommendation algorithm provided by the invention is beneficial to realizing personalized recommendation, selecting high-value nodes, dynamically updating the recommendation process and optimizing the recommendation result, improves the recommendation efficiency and accuracy, and enhances the satisfaction degree of users on the recommendation result, thereby improving the application value and user experience of data;

(6) The processing device provided by the invention utilizes the advantages of big data analysis to perform personalized analysis on the communication data, and provides personalized and accurate service for users, so that the device is expected to improve the efficiency of communication data processing, improve the processing capacity of unstructured data, and provide more intelligent and personalized communication service for users.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method according to an embodiment of the present invention;

FIG. 2 is a block diagram of an apparatus according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will clearly and fully describe the technical aspects of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.

As shown in FIG. 1, the present invention first provides a communication information processing method based on big data analysis, which includes the following steps:

Specifically, in an embodiment of the present invention, step S1 includes:

Collecting initial communication data of a user means collecting the user's various communication activities on communication devices, including telephone call records, short messages, e-mails, social media messages and the like. The data may take the form of text, audio, video and so on, and must be stored according to its specific data type.

The communication database is a database specially used for storing user communication data, and needs a reasonable data structure to store different types of communication data. For example, the phone call record may include information such as call time, call duration, opposite party number, etc.; the short message can comprise information such as sending time, receiving time, content and the like; mail may include information about sender, recipient, subject, text, etc.

When initial communication data of a user is collected, the integrity and the accuracy of the data are required to be ensured, and meanwhile, the privacy information of the user is required to be protected, so that relevant laws and regulations and privacy policies are complied with. When storing communication data, the backup and recovery of the data, and the safety and reliability of the data are also required to be considered.

Specifically, in an embodiment of the present invention, step S2 includes:

S21, the collected initial communication data is encoded to obtain a unique identifier of the initial communication data.

Specifically, the unique identifier is a combination of the device ID that collected the initial communication data and the time stamp at the time of collection.

S22, judging whether the initial communication data is repeated or not according to the unique identifier, and if the unique identifier is repeated, merging the corresponding initial communication data to obtain the first communication data.

Whether duplicate data exists is determined by comparing the combination of the device ID and the time stamp. If the unique identifier is repeated, the fact that repeated communication data exists is indicated, and corresponding data are needed to be combined to obtain first communication data.
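Steps S21 and S22 can be sketched as follows, using the pair (device ID, time stamp) as the unique identifier. The `Record` layout, the field names and the merge policy (keep existing values and fill only missing ones) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    device_id: str   # ID of the collecting device
    timestamp: str   # collection time, e.g. ISO 8601
    fields: dict     # payload; field names here are hypothetical


def merge_duplicates(records):
    """Merge records that share the same (device_id, timestamp) identifier."""
    merged = {}
    for rec in records:
        uid = (rec.device_id, rec.timestamp)
        if uid not in merged:
            merged[uid] = Record(rec.device_id, rec.timestamp, dict(rec.fields))
            continue
        # Assumed merge policy: keep existing values, fill only missing ones.
        target = merged[uid].fields
        for key, value in rec.fields.items():
            if target.get(key) in (None, ""):
                target[key] = value
    return list(merged.values())


records = [
    Record("dev-1", "2023-11-13T08:00:00", {"peer": "13800000000", "text": None}),
    Record("dev-1", "2023-11-13T08:00:00", {"peer": "13800000000", "text": "hello"}),
    Record("dev-2", "2023-11-13T08:00:00", {"peer": "13900000000", "text": "hi"}),
]
first_communication_data = merge_duplicates(records)
```

The first two records share an identifier and are merged into one, so the missing text of the first is filled from the second.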

If the number of missing characters of the first communication data satisfies δ₃ < N₁, the first communication data is classified as third data to be repaired.

Specifically, δ₁ < δ₂ < δ₃. The three thresholds are set according to the specific situation and are usually determined as a proportion of the total number of characters of the communication data; for example, the first threshold δ₁ corresponds to 10%, the second threshold δ₂ to 30%, and the third threshold δ₃ to 50%.
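The threshold comparison can be sketched as below, using the example proportions of 10%, 30% and 50% of the total character count. The function name and the "none" class for records at or below δ₁ are assumptions; the text does not state how such records are labeled.

```python
def classify_by_missing(total_chars, missing_chars, d1=0.10, d2=0.30, d3=0.50):
    """Return the repair class of one record from its missing-character ratio."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    ratio = missing_chars / total_chars
    if ratio > d3:
        return "third"    # delta3 < N1
    if ratio > d2:
        return "second"   # delta2 < N1 <= delta3
    if ratio > d1:
        return "first"    # delta1 < N1 <= delta2
    return "none"         # assumed: at or below delta1, no repair needed


classes = [classify_by_missing(100, n) for n in (5, 20, 40, 60)]
```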

S24, the first data to be repaired, the second data to be repaired and the third data to be repaired are respectively processed to obtain second communication data.

Specifically, step S24 includes:

deleting the first data to be repaired from the first communication data;

Specifically, for the first data to be repaired, the proportion of missing data is small and has little influence on subsequent analysis, so the records containing the missing data can be deleted directly.

For the second data to be repaired, the proportion of missing data is moderate and has a certain influence on subsequent analysis. In this case a filling value, which may be a weighted average, is calculated from adjacent non-missing communication data. Communication data is usually time-stamped, because it records when a communication event occurred. The time stamp may be a combination of date and time marking the point in time at which the communication event occurred, and can be used to analyze patterns of communication behavior, timing, association with other variables, and the like. Because timeliness is important for communication data, non-missing data whose time stamp is later than the target time stamp should be given a larger weight.
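A minimal sketch of this filling rule for a numeric field is shown below. It assumes that the Y values on each side are first averaged and the two averages are then combined; the weights 0.4 and 0.6 and the sample series are hypothetical, chosen only so that the later data weighs more.

```python
def fill_missing(series, target_index, y=2, w_before=0.4, w_after=0.6):
    """Estimate the missing value at target_index.

    series: list of (timestamp, value) sorted by timestamp; None marks a
    missing value. Up to y non-missing values are taken on each side.
    """
    before = [v for _, v in reversed(series[:target_index]) if v is not None][:y]
    after = [v for _, v in series[target_index + 1:] if v is not None][:y]
    if not before and not after:
        raise ValueError("no non-missing neighbours available")
    if not before:
        return sum(after) / len(after)
    if not after:
        return sum(before) / len(before)
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    # Later data receives the larger weight (w_after > w_before).
    return (w_before * mean_before + w_after * mean_after) / (w_before + w_after)


series = [(1, 10.0), (2, 12.0), (3, None), (4, 20.0), (5, 22.0)]
filled = fill_missing(series, target_index=2)
```

Here the earlier values average 11 and the later values average 21, so the filled value lies closer to 21.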

For the case of more missing data or irregular pattern of missing data, i.e. the third data to be repaired, a pre-trained random forest model is used to predict the missing value, which may be implemented as follows:

Data preparation: first, a data set containing communication data is prepared. Ensure that both the feature values and the target values in the data set are numerical. The data set is divided into a training set and a test set.

Feature selection and engineering: suitable features are selected for model training according to the characteristics of the communication data and domain knowledge. Feature engineering, such as feature scaling, feature combination and feature dimension reduction, can be performed to improve the performance of the model.

Model training: a random forest model is trained on the training set. A random forest is an ensemble learning algorithm consisting of multiple decision trees. During training, the random forest randomly selects features and samples so as to improve the generalization capability of the model.

Model evaluation: the trained random forest model is evaluated on the test set. The predictive performance of the model may be evaluated using metrics such as mean square error and accuracy. If the model performs poorly, the parameters of the model are adjusted.

Missing value prediction: the missing values in the communication data are predicted using the trained random forest model. For each missing value, the information of other relevant fields is used as input, and an estimate of the missing value is predicted by the random forest model.

S25, carrying out anomaly identification on the second communication data by adopting an anomaly detection method to obtain anomaly data, and repairing the anomaly data to obtain third communication data.

Specifically, step S25 includes:

S251, traversing the second communication data, calculating a first distance between each second communication data and the other second communication data, forming a distance sorting table by arranging the first distances from small to large, selecting the first m communication data as the neighbor set of the current second communication data, and taking the second communication data and the corresponding neighbor set as a relation set.

Specifically, the first distance may be derived from the Jaccard similarity, a commonly used measure that is suitable for set-valued data. For set-type features in the communication data, the Jaccard similarity may be used to calculate the distance between data points.
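Step S251 can be sketched as follows for set-valued records. The sketch assumes the first distance is the Jaccard distance, that is, 1 minus the Jaccard similarity, so that smaller values mean closer records; the sample items are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def build_relation_set(items, m):
    """Map each item index to its m nearest neighbours as (distance, index)."""
    relation = {}
    for i, current in enumerate(items):
        # Distance sorting table: first distances in ascending order.
        table = sorted(
            (jaccard_distance(current, other), j)
            for j, other in enumerate(items) if j != i
        )
        relation[i] = table[:m]
    return relation


# Hypothetical set features, e.g. contacts seen in each record.
items = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c", "d"}, {"x", "y"}]
relation = build_relation_set(items, m=2)
```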

where d₂ is the second distance between the i-th neighbor in the neighbor set of the current second communication data and the current second communication data; the formula takes as inputs the first distance between that i-th neighbor and the current second communication data, and the first distance between that i-th neighbor and the m-th second communication data in the distance sorting table of the current second communication data;

where the quantity being defined is the local density of the current second communication data, Σρ_d is the sum of the distance densities of all the second communication data, and A is the number of second communication data.

Specifically, the larger the local density of a second communication data, the farther it lies from the other data; if the local density exceeds the density threshold, the second communication data is considered to be abnormal data.
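Steps S251 to S254 can be illustrated with a sketch in the style of the local outlier factor. The exact forms used below are assumptions: the second distance is taken as the larger of the two first distances named above, the distance density as m divided by the sum of second distances, and the local density as the mean distance density divided by the point's own distance density, so that larger values indicate more isolated data. The one-dimensional points and the threshold are hypothetical.

```python
def detect_anomalies(points, m, threshold, dist=lambda a, b: abs(a - b)):
    """Return (indices of abnormal points, local density of every point)."""
    n = len(points)
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < number of points")
    # S251: distance sorting table and the m nearest neighbours of each point.
    tables = []
    for i in range(n):
        table = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        tables.append(table[:m])
    densities = []
    for i in range(n):
        m_th = tables[i][-1][1]  # m-th entry of the current sorting table
        # S252 (assumed form): max of the neighbour's first distance to the
        # current point and its first distance to the m-th entry.
        second = [max(d, dist(points[j], points[m_th])) for d, j in tables[i]]
        # S253 (assumed form): distance density = m / sum of second distances.
        densities.append(len(second) / max(sum(second), 1e-12))
    # S254 (assumed form): local density = mean density / own density.
    mean_density = sum(densities) / n
    local = [mean_density / rho for rho in densities]
    return [i for i, s in enumerate(local) if s > threshold], local


points = [1.0, 1.1, 0.9, 1.2, 10.0]
abnormal, local_density = detect_anomalies(points, m=2, threshold=2.0)
```

Only the isolated value 10.0 exceeds the threshold here; the four clustered values all score below 1.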

S255, acquiring the time stamp of the abnormal data, setting a time interval, sequentially acquiring, going forward from the time stamp of the abnormal data as the origin, normal data of n time intervals in the second communication data, and repairing the abnormal data by using the normal data of the n time intervals to obtain third communication data. The formula for anomaly repair is as follows:

where λ_k is calculated as follows:

And formatting the data according to the requirements to ensure the consistency and normalization of the data and obtain standard communication data. Such standard communication data can be more easily stored, analyzed and applied.

Specifically, in an embodiment of the present invention, step S3 includes:

S31, determining B feature types according to the characteristics of the standard communication data, and performing feature recombination according to the determined B feature types to obtain C feature combinations, wherein the feature recombination modes comprise feature operation, feature crossing and feature transformation.

Taking b=4 and c=3 as an example, step S3 will be described:

determining the feature type: in determining the feature type, a proper feature type needs to be selected according to the characteristics and requirements of the communication data. In this embodiment, the feature types include:

Time domain features: features extracted from the time-series information of the communication data. The extraction methods include: Mean: averaging the time series of the communication data. Variance: computing the variance of the time series of the communication data. Maximum: finding the maximum value in the time series. Minimum: finding the minimum value in the time series.

Frequency domain features: features extracted from the spectral information of the communication data. The extraction methods include: Spectral energy: converting the communication data into the frequency domain by Fourier transform and calculating the spectral energy. Spectral mean: calculating the mean value of the spectrum. Spectral peak: finding the peak frequency in the spectrum and the corresponding energy value.

Statistical features: features extracted from the statistical distribution information of the communication data. Statistical features such as the mean, variance, skewness and kurtosis can be calculated directly from the statistical distribution of the communication data.

Filtering features: features extracted from the filtering result of the communication data. The communication data is first filtered, and the corresponding statistical features, such as the filtered mean and variance, are then calculated.
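The four feature types above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the naive discrete Fourier transform, the use of population moments, and the moving-average filter with a window of 3 are all assumptions chosen to keep the example dependency-free.

```python
import cmath
import math
import statistics


def time_domain_features(series):
    """Mean, variance, maximum and minimum of the time series."""
    return {
        "mean": statistics.fmean(series),
        "variance": statistics.pvariance(series),
        "max": max(series),
        "min": min(series),
    }


def frequency_domain_features(series):
    """Spectral energy, spectral mean and spectral peak via a naive O(n^2) DFT."""
    n = len(series)
    magnitudes = []
    # Real-valued input: the non-negative frequency bins 0..n//2 are enough.
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))
        magnitudes.append(abs(coeff))
    energies = [m * m for m in magnitudes]
    peak_bin = max(range(len(energies)), key=energies.__getitem__)
    return {
        "spectral_energy": sum(energies),
        "spectral_mean": statistics.fmean(magnitudes),
        "peak_bin": peak_bin,
        "peak_energy": energies[peak_bin],
    }


def statistical_features(series):
    """Mean, variance, skewness and excess kurtosis (population moments)."""
    mean = statistics.fmean(series)
    variance = statistics.pvariance(series)
    if variance == 0:
        return {"mean": mean, "variance": 0.0, "skewness": 0.0, "kurtosis": 0.0}
    std = math.sqrt(variance)
    z = [(x - mean) / std for x in series]
    return {
        "mean": mean,
        "variance": variance,
        "skewness": statistics.fmean([v ** 3 for v in z]),
        "kurtosis": statistics.fmean([v ** 4 for v in z]) - 3.0,
    }


def filtering_features(series, window=3):
    """Mean and variance after a moving-average filter (assumed filter type)."""
    smoothed = [statistics.fmean(series[i:i + window])
                for i in range(len(series) - window + 1)]
    return {
        "filtered_mean": statistics.fmean(smoothed),
        "filtered_variance": statistics.pvariance(smoothed),
    }


# Hypothetical signal: a cosine with a period of 4 samples, 16 samples long.
series = [1.0, 0.0, -1.0, 0.0] * 4
features = {
    "time": time_domain_features(series),
    "frequency": frequency_domain_features(series),
    "statistical": statistical_features(series),
    "filtering": filtering_features(series),
}
```

For this test signal the spectral peak falls in bin 4 (16 samples divided by a period of 4), which is a quick sanity check that the transform is wired correctly.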

Suitable features are then selected from the four determined feature types for feature recombination, specifically:

Feature operation: dividing the communication frequency feature by the communication duration feature to obtain an average communication frequency feature.

Feature crossover: crossing the communication frequency feature with the communication duration feature to obtain a total communication quantity feature.

Feature transformation: applying a logarithmic transformation to the communication frequency feature to obtain a logarithmic communication frequency feature.

The 7 extracted features can be combined into a feature vector, which serves as the first feature of one standard communication data item; the first features of all standard communication data form the first feature set.
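The recombination for B=4 and C=3 can be sketched as follows. Several points are assumptions: the "communication frequency feature" is treated as a communication count, feature crossover is taken to be a product, the logarithm is computed as log(1 + x) so that a zero count is handled, and the four base values are made up for illustration.

```python
import math


def recombine(comm_count, comm_duration):
    """Return the C = 3 recombined features for one record."""
    # Feature operation: count divided by duration (guarding against zero).
    average_frequency = comm_count / comm_duration if comm_duration else 0.0
    # Feature crossover: assumed here to be the product of the two features.
    total_quantity = comm_count * comm_duration
    # Feature transformation: log(1 + x), an assumed variant that accepts 0.
    log_frequency = math.log1p(comm_count)
    return [average_frequency, total_quantity, log_frequency]


# Hypothetical B = 4 base features for one record.
base_features = [0.2, 1.5, 12.0, 300.0]
# First feature of the record: a (B + C) = 7-dimensional vector.
first_feature = base_features + recombine(comm_count=12.0, comm_duration=300.0)
```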

Specifically, in an embodiment of the present invention, step S4 includes:

S41, forming the first features into a matrix of dimension B+C as the first feature matrix.

First, all the extracted first features are combined into a matrix according to a specific rule. For example, with 4 preliminary features and 3 recombined features, the features are arranged in sequence to form a 7-dimensional matrix.

F = Sigmoid(conv(fc(AP(X))))

In the formula, F is the feature screening function, Sigmoid represents the activation function, conv represents convolution processing, fc represents fully connected layer processing, AP represents adaptive pooling processing, and X is the first feature matrix.

In this step, the first feature matrix is processed with the feature screening function to remove redundant information and extract more representative features. The feature screening function processes the input feature matrix through adaptive pooling, a fully connected layer and convolution, and finally outputs the screened features through the Sigmoid activation function.
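The composition F = Sigmoid(conv(fc(AP(X)))) can be traced with a small pure-Python sketch. The layer shapes, the random (untrained) weights, the smoothing kernel and the rule of keeping features whose score is at least 0.5 are all assumptions made for illustration; the text does not specify them.

```python
import math
import random


def adaptive_avg_pool(matrix):
    """AP: average each feature column over all samples (one value per feature)."""
    return [sum(column) / len(column) for column in zip(*matrix)]


def fully_connected(vector, weights, bias):
    """fc: dense layer, one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def conv1d_same(vector, kernel):
    """conv: 1-D sliding-window filter with zero padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(vector) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(vector))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def screening_scores(matrix, weights, bias, kernel):
    """F = Sigmoid(conv(fc(AP(X)))): one score in (0, 1) per feature column."""
    pooled = adaptive_avg_pool(matrix)
    hidden = fully_connected(pooled, weights, bias)
    return [sigmoid(v) for v in conv1d_same(hidden, kernel)]


rng = random.Random(0)
num_features = 7  # B + C = 4 + 3 in the embodiment
# Hypothetical first feature matrix: 5 records x 7 features.
X = [[rng.random() for _ in range(num_features)] for _ in range(5)]
# Untrained, randomly initialised parameters (assumption).
W = [[rng.uniform(-1.0, 1.0) for _ in range(num_features)]
     for _ in range(num_features)]
bias = [0.0] * num_features
kernel = [0.25, 0.5, 0.25]

scores = screening_scores(X, W, bias, kernel)
# Assumed screening rule: keep the feature columns scoring at least 0.5.
kept = [i for i, s in enumerate(scores) if s >= 0.5]
second_features = [[row[i] for i in kept] for row in X]
```

In practice the weights would be learned rather than drawn at random, and the scores could equally be used as soft weights on the columns instead of a hard threshold.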

S43, forming all second features into a second feature set.

Specifically, in an embodiment of the present invention, step S5 includes:

In this embodiment, the communication tasks include a classification task, a clustering task and a recommendation task. Analysis and processing are driven by the task, and the specific implementation process is as follows:

(1) The communication task is a classification task

Determining a classification target:

First, the classification target of the communication task is determined, for example classifying communication types according to the characteristics of the communication data, such as speaker identification in voice communication or emotion classification in text communication.

Selecting classification characteristics:

According to the determined classification target, the features corresponding to the required data type are selected from the second feature set as the classification features. These features should be able to distinguish the classification target. They may be selected by a fuzzy selection algorithm, which can be combined with domain knowledge to avoid missing features of potential importance to the classification task. The fuzzy selection algorithm includes:

1) The importance of each feature is calculated using the concepts of fuzzy sets and membership functions from fuzzy set theory. A fuzzy set describes the degree to which a feature belongs to the classification target and thus measures its importance.

2) For each feature, its degree of membership to the classification target, that is, the degree of influence of the feature on the classification target, is determined.

3) A threshold for feature selection is set, which may be determined empirically or from domain knowledge, for screening features whose membership is higher than the threshold.

4) The features whose membership is higher than the set threshold are selected as the final classification features.
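Steps 1) to 4) can be sketched as below. The membership function is an assumption: the text does not define one, so the absolute Pearson correlation between a feature and the class label is used as a stand-in that already lies in [0, 1]. The feature names, data and threshold are hypothetical.

```python
import math


def membership(values, labels):
    """Assumed membership degree: |Pearson correlation| with the label, in [0, 1]."""
    n = len(values)
    mean_v = sum(values) / n
    mean_l = sum(labels) / n
    cov = sum((v - mean_v) * (l - mean_l) for v, l in zip(values, labels))
    norm_v = math.sqrt(sum((v - mean_v) ** 2 for v in values))
    norm_l = math.sqrt(sum((l - mean_l) ** 2 for l in labels))
    if norm_v == 0 or norm_l == 0:
        return 0.0  # a constant feature carries no information
    return abs(cov / (norm_v * norm_l))


def fuzzy_select(features, labels, threshold):
    """Keep the features whose membership degree exceeds the threshold."""
    degrees = {name: membership(values, labels)
               for name, values in features.items()}
    selected = [name for name, degree in degrees.items() if degree > threshold]
    return selected, degrees


# Hypothetical data: 1 marks a spam record, 0 a normal one.
labels = [0, 0, 1, 1, 0, 1]
features = {
    "msg_per_hour": [1, 2, 9, 8, 1, 10],
    "hour_of_day": [3, 15, 9, 21, 12, 6],
}
selected, degrees = fuzzy_select(features, labels, threshold=0.6)
```

Here "msg_per_hour" tracks the label closely and is kept, while "hour_of_day" falls below the threshold and is dropped.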

Training a plurality of SVM classifiers:

The prepared data set is divided into a training set and a test set, and training and evaluation are performed by cross-validation.

For the plurality of SVM classifiers, a one-vs-one or one-vs-rest scheme is adopted for multi-class classification.

Each SVM classifier is trained on the training set, and its parameters are adjusted to obtain the best performance.

The data in the test set is classified and predicted using the plurality of trained SVM classifiers.

The performance of the classification result is evaluated using the F1 value.

According to the performance evaluation result, the selection of classification features and the parameters of the SVM classifiers are adjusted and optimized to improve classification performance.
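The training and evaluation loop above can be sketched without external libraries. This is a simplified stand-in, not the method of the invention: it uses a linear SVM trained by sub-gradient descent on the hinge loss, a one-vs-rest scheme, 3-fold cross-validation and the macro-averaged F1 value. The hyperparameters, feature values and class names are hypothetical; a production system would more likely rely on an established SVM library.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_binary_svm(X, y, lam=0.01, lr=0.05, epochs=100):
    """Linear SVM (labels +1/-1) by sub-gradient descent on the hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * (dot(w, xi) + b) < 1
            # Sub-gradient of lam/2 * ||w||^2 plus the hinge term.
            w = [wj - lr * (lam * wj - (yi * xj if violated else 0.0))
                 for wj, xj in zip(w, xi)]
            if violated:
                b += lr * yi
    return w, b


def train_one_vs_rest(X, y):
    """One binary SVM per class: that class against all others."""
    return {c: train_binary_svm(X, [1 if label == c else -1 for label in y])
            for c in sorted(set(y))}


def predict(models, x):
    """Pick the class whose classifier gives the largest decision value."""
    return max(models, key=lambda c: dot(models[c][0], x) + models[c][1])


def macro_f1(y_true, y_pred):
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


def cross_validate(X, y, folds=3):
    """Every folds-th sample forms the test set of one fold."""
    results = []
    for k in range(folds):
        test_idx = set(range(k, len(X), folds))
        train = [(x, l) for i, (x, l) in enumerate(zip(X, y)) if i not in test_idx]
        test = [(x, l) for i, (x, l) in enumerate(zip(X, y)) if i in test_idx]
        models = train_one_vs_rest([x for x, _ in train], [l for _, l in train])
        predictions = [predict(models, x) for x, _ in test]
        results.append(macro_f1([l for _, l in test], predictions))
    return results


# Hypothetical 2-D classification features scaled to [0, 1], grouped by class.
X = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2), (0.15, 0.15), (0.05, 0.1),
     (0.9, 0.9), (0.8, 0.9), (0.9, 0.8), (0.8, 0.8), (0.85, 0.85), (0.95, 0.9),
     (0.1, 0.9), (0.2, 0.9), (0.1, 0.8), (0.2, 0.8), (0.15, 0.85), (0.05, 0.9)]
y = ["normal"] * 6 + ["spam"] * 6 + ["nuisance"] * 6
fold_scores = cross_validate(X, y)
mean_f1 = sum(fold_scores) / len(fold_scores)
```

The mean F1 over the folds is the figure that would guide the adjustment of features and parameters described in the last step.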

Classification prediction:

The selected classification features are taken as input data and input into the pre-trained SVM classifiers to obtain a classification result.

Specifically, the purpose of the classification task is to identify spam messages, nuisance calls and the like, so the labels of the classification result indicate whether a communication is a spam message, a nuisance call and so on.

(2) The communication task is a clustering task

Determining clustering targets and clustering quantity:

According to the task requirements and data characteristics, the number of clusters into which the data set is to be divided is determined, that is, the clustering target and the number of clusters are determined.

Selecting a clustering feature:

The features corresponding to the required data type are selected from the second feature set as the clustering features. These features are used as the input of the clustering algorithm.

k-means++ cluster analysis:

Cluster analysis is performed on the selected clustering features using the k-means++ clustering algorithm. k-means++ iteratively divides the samples into k clusters so that the sum of the squared distances from each sample point to the center of the cluster to which it belongs is minimized.

Obtaining the clustering result:

The k-means++ clustering algorithm yields the cluster label of each sample, and the clusters together with their cluster labels are taken as the clustering result.
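A compact sketch of k-means++ follows: the initial centers are drawn with probability proportional to the squared distance from the nearest center already chosen, after which the usual assignment and update steps are repeated until the centers stop changing. The data points, the seed and the iteration cap are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(points, k, rng):
    """k-means++ seeding; assumes at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance of every point to its nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans_pp(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = init_centers(points, k, rng)
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Keep the old center if a cluster happens to be empty.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers


# Hypothetical clustering features: two well-separated groups of contacts.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans_pp(points, k=2)
```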

Specifically, the purpose of the clustering task is to find similar groups, such as similar pictures, similar contacts, similar senders of messages, etc., so that the clustering result is a description of the similar groups and their similar content.

(3) The communication task is a recommendation task

Confirming the recommendation target:

According to the service requirements and the user characteristics, the recommendation target, such as recommending movies, commodities or music, is confirmed.

Selecting recommendation features:

The features corresponding to the required data type are selected from the second feature set as the recommendation features. These features are used as the input of the recommendation algorithm.

Recommendation algorithm:

Step one, setting the recommendation features as nodes of the recommendation algorithm, putting them into an open list, and evaluating the nodes in the open list according to the relevance index of each recommendation feature to obtain evaluation scores, wherein the relevance index of a recommendation feature is obtained by mining the association between the recommendation feature and its adjacent recommendation features according to association rules.

Specifically, the association rule mining employs the Apriori algorithm.

Step two, sorting the evaluation scores from large to small, selecting the node corresponding to the first evaluation score as the starting node of the recommendation algorithm, and placing the starting node into a closed list.

Step three, calculating, for each node in the open list, the weighted value of its evaluation score and the evaluation scores of all nodes in the closed list, sorting the weighted values from large to small, selecting the node in the open list corresponding to the first weighted value, putting that node into the closed list, and updating the weighted values of the open list.

Specifically, the weighted value calculation mode of the single node i in the open list is as follows:

where E_i is the weighted value of node i in the open list, y_j is the evaluation score of the j-th node in the closed list, x_i is the evaluation score of node i in the open list, z is the number of nodes in the closed list, and α_j and β_i are the weights.

Step four, repeating step three until the open list is empty.
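Steps one to four can be sketched as a greedy procedure over an open list and a closed list. The weighted-value formula is not reproduced in this text, so the sketch assumes E_i = β·x_i + (α/z)·Σ y_j with scalar weights α and β; that form, the function name and the evaluation scores (which would come from Apriori-based association mining) are assumptions.

```python
def recommend_order(scores, alpha=1.0, beta=1.0):
    """Return the nodes in the order they enter the closed list."""
    open_list = dict(scores)  # node -> evaluation score x_i
    # Step two: the node with the highest evaluation score starts the closed list.
    start = max(open_list, key=open_list.get)
    closed_list = [(start, open_list.pop(start))]
    # Steps three and four: move the best-weighted node until the open list is empty.
    while open_list:
        z = len(closed_list)
        closed_term = alpha * sum(y for _, y in closed_list) / z
        # Assumed weighted value: E_i = beta * x_i + (alpha / z) * sum of y_j.
        weighted = {node: beta * x + closed_term for node, x in open_list.items()}
        best = max(weighted, key=weighted.get)
        closed_list.append((best, open_list.pop(best)))
    return [node for node, _ in closed_list]


# Hypothetical evaluation scores for three recommendation features.
evaluation_scores = {"music": 0.4, "movies": 0.9, "goods": 0.7}
ranking = recommend_order(evaluation_scores)
```

With uniform weights the closed-list term is the same for every open node, so the resulting order reduces to descending evaluation score; per-node weights α_j and β_i, as named in the text, would change that.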

Obtaining a recommended result:

The recommendation algorithm yields the recommended content or products as the recommendation result. These results may be personalized according to the interests and needs of the user.

In this embodiment, the features corresponding to the required data type are selected from the second feature set according to the recommendation target, and the recommendation algorithm is applied to these recommendation features to obtain the recommendation result. The recommendation result may be used to provide personalized recommendation services to the user, improving user experience and satisfaction.

In addition, referring to Fig. 2, the present invention further provides a communication information processing apparatus based on big data analysis, where the apparatus is configured to perform any one of the methods described above, and the apparatus includes:

The data acquisition module has a built-in communication database and is used for acquiring initial communication data of a user and storing the initial communication data in the communication database according to data type.

The communication data may include various forms of communication records such as short messages, phone call records, social media messages, and the like.

The data processing module is used for preprocessing the initial communication data to obtain standard communication data, and storing the standard communication data into the communication database, wherein the preprocessing comprises repeated data deletion, missing data identification, missing value filling, abnormal data identification and repair. Through these preprocessing steps, the accuracy and integrity of the communication data can be ensured.

The feature storage module is used for designing a feature combination of the standard communication data, extracting features of the standard communication data according to the feature combination to obtain a first feature set, screening the first feature set according to screening rules to obtain a second feature set, and storing the first feature set and the second feature set.

The data analysis module is used for acquiring a communication task list, analyzing and processing the second feature set according to the communication tasks in the communication task list, obtaining analysis results corresponding to the communication tasks, and storing the analysis results. These analysis results may include patterns of communication behavior, trends, anomalies, and so forth.

The visual display module is used for visualizing the analysis results and displaying the visualized analysis results and the corresponding communication data to the user according to user requirements. The display forms include charts, statistics, trend analysis and the like, helping the user better understand the meaning and patterns of the communication data.

Specifically, the device has the functions of communication data collection, preprocessing, feature extraction and analysis and visual display, and provides a comprehensive communication data processing and analysis platform for users.
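The division into modules can be illustrated with a hypothetical skeleton in which each method stands for one module. The class name, method names and record layout are assumptions, and only duplicate deletion is shown for the preprocessing stage.

```python
from collections import defaultdict


class CommunicationPipeline:
    """Hypothetical skeleton mirroring the five modules of the apparatus."""

    def __init__(self):
        self.database = defaultdict(list)  # communication database, keyed by data type
        self.feature_sets = {}
        self.results = {}

    def acquire(self, records):
        """Data acquisition module: store records by data type."""
        for record in records:
            self.database[record["type"]].append(record)

    def preprocess(self):
        """Data processing module: only duplicate deletion is sketched here."""
        for data_type, records in self.database.items():
            seen, unique = set(), []
            for record in records:
                key = tuple(sorted(record.items()))
                if key not in seen:
                    seen.add(key)
                    unique.append(record)
            self.database[data_type] = unique

    def extract_features(self, extractor):
        """Feature storage module: apply a caller-supplied extractor and store the output."""
        self.feature_sets = {data_type: [extractor(r) for r in records]
                             for data_type, records in self.database.items()}

    def analyze(self, tasks):
        """Data analysis module: run each task in the communication task list."""
        for name, task in tasks.items():
            self.results[name] = task(self.feature_sets)

    def display(self):
        """Visual display module: plain-text stand-in for charts."""
        return [f"{name}: {result}" for name, result in self.results.items()]


pipeline = CommunicationPipeline()
pipeline.acquire([
    {"type": "sms", "text": "hi", "duration": 0},
    {"type": "sms", "text": "hi", "duration": 0},   # duplicate record
    {"type": "call", "text": "", "duration": 42},
])
pipeline.preprocess()
pipeline.extract_features(lambda r: [len(r["text"]), r["duration"]])
pipeline.analyze({"record_count": lambda fs: {t: len(v) for t, v in fs.items()}})
report = pipeline.display()
```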

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. The communication information processing method based on big data analysis is characterized by comprising the following steps:

2. The communication information processing method based on big data analysis according to claim 1, wherein step S2 includes:

if the number N₁ of missing characters of the first communication data satisfies δ₁ < N₁ ≤ δ₂, classifying the first communication data as first data to be repaired;

if the number N₁ of missing characters of the first communication data satisfies δ₂ < N₁ ≤ δ₃, classifying the first communication data as second data to be repaired;

if the number N₁ of missing characters of the first communication data satisfies δ₃ < N₁, classifying the first communication data as third data to be repaired;

3. The communication information processing method based on big data analysis according to claim 2, wherein step S24 includes:

deleting the first data to be repaired from the first communication data;

acquiring a timestamp of the second data to be repaired and taking it as the target timestamp; taking the target timestamp as the origin, searching forward in the first communication data for Y non-missing communication data and searching backward for Y non-missing communication data; calculating the weighted average of the forward and backward non-missing communication data; and filling the second data to be repaired with the weighted average to obtain second repair data, wherein the two groups carry unequal weights, the weight of one group being less than that of the other;

4. The communication information processing method based on big data analysis according to claim 2, wherein step S25 includes:

S252, traversing the relation set, calculating a second distance d₂ between each second communication data and its neighbors according to the first distance, and updating the second distance into the relation set, wherein the second distance is calculated as follows:

in the formula, the symbols denote, respectively: the second distance between the i-th neighbor in the neighbor set of the current second communication data and that second communication data; the first distance between the i-th neighbor in the neighbor set of the current second communication data and that second communication data; and the first distance between the i-th neighbor in the neighbor set of the current second communication data and the m-th second communication data in the distance sorting table of the current second communication data;

in the formula, the first symbol denotes the local density of the current second communication data, Σρ_d is the sum of the distance densities of all the second communication data, and A represents the number of second communication data;

5. The method for processing communication information based on big data analysis according to claim 4, wherein in step S255, the formula for anomaly repair is as follows:

where λ_k is calculated as follows:

6. The communication information processing method based on big data analysis according to claim 1, wherein step S3 includes:

7. The method for processing communication information based on big data analysis according to claim 6, wherein step S4 comprises:

F = Sigmoid(conv(fc(AP(X))))

S43, forming all second features into a second feature set.

8. The communication information processing method based on big data analysis according to claim 1, wherein step S5 includes:

9. The method for processing communication information based on big data analysis according to claim 8, wherein in step S54, the recommendation algorithm includes:

step four, repeating step three until the open list is empty.

10. A communication information processing apparatus based on big data analysis, characterized in that the apparatus is adapted to perform the method of any of claims 1-9, the apparatus comprising: