CN114510649B

CN114510649B - Social network and LSTM model accuracy calculating method based on deduplication sample

Info

Publication number: CN114510649B
Application number: CN202210180890.3A
Authority: CN
Inventors: 魏嵬; 李晓婉; 张贝贝
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2022-02-25
Filing date: 2022-02-25
Publication date: 2024-04-09
Anticipated expiration: 2042-02-25
Also published as: CN114510649A

Abstract

The invention discloses a method for calculating the accuracy of a social network and LSTM model based on deduplicated samples, comprising the following steps: step 1, set construction, in which sets are constructed from an original sequence data set according to given categories and serve as the original result sets; step 2, LSTM verification, in which the data are classified with an LSTM model and the correctly classified results are selected; a classified post whose type is main post is kept unchanged, while for a post whose type is comment the original sequence data set is searched for the main post corresponding to that comment; step 3, backtracking, in which, once all main posts have been obtained, a social network is constructed according to the original sequence data set; and step 4, linking, in which the original result sets constructed in step 1 are fused and linked with the result sets produced by the LSTM of step 2 and the social network of step 3.

Description

Social network and LSTM model accuracy calculating method based on deduplication sample

Technical Field

The invention belongs to the field of natural language processing text classification, and particularly relates to a social network and LSTM model accuracy computing method based on a deduplication sample.

Background

Accuracy is widely used as an evaluation measure; in general it reflects the quality of a method (for example, the higher the accuracy when answering questions, the higher the score). The same holds in the field of artificial intelligence, where accuracy is mainly used to evaluate objectively whether an algorithm achieves the desired effect.

Algorithm accuracy is obtained by comparing the results output by an algorithm with the true results: the higher the proportion of outputs that agree with the true results, the better the algorithm performs in that respect. A fuller evaluation of an algorithm combines accuracy with other indices such as recall. Traditionally, accuracy is calculated by comparing each output with its true value, which consumes considerable computing power when the data volume is very large. This is especially so for a composite model, such as a fused classification model combining an LSTM with a social network: the fused model must be considered as a whole when calculating the final accuracy, and the way the models are connected may add to the complexity of the calculation. The invention therefore provides an accuracy calculation method for a fused LSTM and social network classification model, applicable to samples with no repetition or a low repetition rate.
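For reference, the traditional calculation described above can be sketched as a direct element-by-element comparison. This is a minimal illustration only; the function name `accuracy` and the sample labels are hypothetical and do not come from the patent.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that agree with the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty sample")
    # Count positions where the output matches the true value.
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels for four posts.
predicted = ["C1", "C2", "C1", "C1"]
actual = ["C1", "C2", "C2", "C1"]
result = accuracy(predicted, actual)
```

Every sample is visited once here; the set-based method of the invention replaces this per-item comparison with set operations.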

Disclosure of Invention

The invention aims to provide a social network and LSTM model accuracy calculating method based on a deduplication sample, which is suitable for calculating the accuracy of samples with no repetition or lower repetition rate in a social network and LSTM fusion classification model.

The technical scheme adopted by the invention is that the social network and LSTM model accuracy calculating method based on the deduplication sample comprises the following steps:

step 1, set construction: sets are constructed from an original sequence data set according to given categories, and are called the original result sets;

step 2, LSTM verification: the data are classified with an LSTM model and the correctly classified results are selected; a classified post whose type is main post is kept unchanged, while for a post whose type is comment the original sequence data set is searched for the main post corresponding to that comment;

step 3, backtracking: once all main posts have been obtained, a social network is constructed according to the original sequence data set;

step 4, linking: the original result sets constructed in step 1 are fused and linked with the result sets produced by the LSTM of step 2 and the social network of step 3.

The present invention is also characterized in that,

The specific implementation mode of the step 1 is as follows: two original result sets are constructed, each element of which is a dictionary with the post content as key and the post type as value; the post type is either main post or comment, and the two original result sets are the C1 Sample Set and the C2 Sample Set.
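A minimal sketch of this set construction, assuming each labelled record is a hypothetical `(content, post_type, category)` tuple and that the type values are the strings `"main"` and `"comment"`; neither the record layout nor the function name `build_sample_sets` is specified by the patent.

```python
# Hypothetical labelled records: (content, post_type, category).
records = [
    ("main post about topic A", "main", "C1"),
    ("reply to topic A", "comment", "C1"),
    ("main post about topic B", "main", "C2"),
    ("reply to topic B", "comment", "C2"),
    ("reply to topic B", "comment", "C2"),  # identical text, collapses into one key
]


def build_sample_sets(records):
    """Build one dictionary per category, keyed by content, valued by post type."""
    c1_sample_set, c2_sample_set = {}, {}
    for content, post_type, category in records:
        target = c1_sample_set if category == "C1" else c2_sample_set
        target[content] = post_type
    return c1_sample_set, c2_sample_set


p, q = build_sample_sets(records)  # C1 Sample Set and C2 Sample Set
```

Because the content is the key, identical texts are deduplicated automatically, which is why the method is aimed at samples with little or no repetition.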

The specific implementation mode of the step 2 is as follows:

first, the posts are subjected to binary classification with an LSTM classifier, and the classification results are grouped by category into result sets denoted γ and θ, where γ holds the posts labelled C1 by the LSTM and θ holds the posts labelled C2;

then the misclassified data are set aside for the moment and the correctly classified data are processed: it is determined whether the current post is a main post or a comment; if it is a main post, it proceeds directly to the next stage, where the social network is constructed; if it is a comment, it enters the Search stage shown in FIG. 1, where it is compared against All posts Base to find the main post to which it actually belongs; once that main post has been found, it proceeds to the next stage to construct the social network.
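The grouping of the classifier output can be sketched as follows. The LSTM itself is not reproduced: `predictions` is a hypothetical stand-in for its output (content mapped to predicted category), and the helper name `split_lstm_output` is likewise an assumption.

```python
def split_lstm_output(predictions, true_labels):
    """Group predictions into gamma (labelled C1) and theta (labelled C2),
    and collect the correctly classified posts that go on to the network stage."""
    gamma, theta, correct = set(), set(), set()
    for content, predicted in predictions.items():
        (gamma if predicted == "C1" else theta).add(content)
        if true_labels.get(content) == predicted:
            correct.add(content)
    return gamma, theta, correct


# Stand-in classifier output and ground truth (hypothetical data).
predictions = {"main post A": "C1", "comment A1": "C1",
               "main post B": "C1", "comment B1": "C2"}
true_labels = {"main post A": "C1", "comment A1": "C1",
               "main post B": "C2", "comment B1": "C2"}

gamma, theta, correct = split_lstm_output(predictions, true_labels)
```

In this sketch the misclassified post ("main post B") stays in γ but is left out of `correct`, mirroring the way wrongly classified data are set aside before the social network is built.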

The specific implementation mode of the step 3 is as follows:

once all correctly classified main posts and the main posts corresponding to the correctly classified comments have been obtained, each main post is looked up again in All posts Base; this process is called the big query and is shown as Trace in FIG. 1; All posts Base is queried for the dependency relationships that exist between nodes according to the original sequence information, and a social network result set is then formed; all constructed network nodes are taken out and grouped by category into the C1 node set and the C2 node set, denoted a and b respectively, on which the following operations are performed:

α = a ∩ p (1)

β = b ∩ q (2)

α and β are the final social network result sets for categories C1 and C2; p and q in formulas (1) and (2) denote the C1 Sample Set and the C2 Sample Set, respectively.
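Formulas (1) and (2) map directly onto set intersection. The sketch below uses hypothetical post contents and treats each set as a Python set of content strings, which is an assumption about the representation.

```python
p = {"post 1", "comment 1a", "post 2"}       # C1 Sample Set (contents only)
q = {"post 3", "comment 3a"}                 # C2 Sample Set (contents only)

# Node sets from the social network; backtracking may pull in nodes
# ("comment 1b", "comment 3b") that were never part of the sample.
a = {"post 1", "comment 1a", "comment 1b"}   # C1 node set
b = {"post 3", "comment 3a", "comment 3b"}   # C2 node set

alpha = a & p   # formula (1)
beta = b & q    # formula (2)
```

The intersection discards any node that backtracking introduced from outside the sample.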

The specific implementation mode of the step 4 is as follows:

the link between the LSTM and the social network model is established as follows: given the social network result sets α and β and the LSTM result sets γ and θ, the following operations are performed:

s₁ = α ∪ γ (3)

s₂ = β ∪ θ (4)

where s₁ and s₂ are the sets for categories C1 and C2, respectively, computed by the fused LSTM and social network model; operations (5) and (6) are then performed:

L₁ = s₁ ∩ p (5)

L₂ = s₂ ∩ q (6)

where L₁ and L₂ denote the posts of the two categories that were judged correctly, and p and q denote the C1 Sample Set and the C2 Sample Set, respectively;

finally, the final accuracy is calculated using equation (7):

The beneficial effects of the invention are as follows: the method is robust and, unlike common accuracy calculation methods, is suited to calculating the accuracy of a model that fuses deep learning with a social network having social properties. Social network site data contain little repeated data and carry both temporal order and social properties; by exploiting these characteristics of social data sets, the method calculates the accuracy of the fused model more quickly.

Drawings

FIG. 1 is a flow chart of the social network and LSTM model accuracy calculation method based on the deduplication sample of the present invention.

Detailed Description

The invention will be described in detail below with reference to the drawings and the detailed description.

The invention provides a social network and LSTM model accuracy calculating method based on a deduplication sample, which is shown in fig. 1 and comprises the following steps:

step 1, set construction: sets are constructed from the original sequence data set according to given categories and are called the original result sets, with the data of each category gathered into one set;

the specific implementation mode of the step 1 is as follows: two original result sets are constructed, each using a dictionary with the post content as key and the post type as value; the post type is either main post or comment, and the two original result sets are the C1 Sample Set and the C2 Sample Set shown in FIG. 1.

The specific implementation mode of the step 2 is as follows:

the misclassified data are then set aside for the moment and the correctly classified data are processed: it is determined whether the current post is a main post or a comment; this is the Post type stage in FIG. 1, in which the post content is used as the key to look up the corresponding value (the post type) and so decide whether the post is a main post or a comment; this determination relies on querying and comparing against the original sequence data set All posts Base (only part of which is used in this experiment; All posts Base is the complete data set, whose sequence relationships are intact). The flow differs according to the post type found at this stage: if the post is a main post, it proceeds directly to the next stage to construct the social network; if it is a comment, it enters the Search stage in FIG. 1, where it is compared against All posts Base to find the main post to which it actually belongs; once that main post has been found, it proceeds to the next stage to construct the social network. This process is called the small query and is part of the social network backtracking. The small query effectively resolves the loss of sample ordering that results from data preprocessing.
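The small query can be sketched as below. The patent does not say how a comment is matched to its main post; this sketch assumes that All posts Base is an ordered list in which every comment follows the main post it belongs to, and that the type values are `"main"` and `"comment"`. The function name `small_query` is hypothetical; the field names `content` and `postType` are those mentioned in the embodiment.

```python
# Hypothetical ordered excerpt of All posts Base.
all_posts_base = [
    {"content": "main post A", "postType": "main"},
    {"content": "comment A1", "postType": "comment"},
    {"content": "comment A2", "postType": "comment"},
    {"content": "main post B", "postType": "main"},
    {"content": "comment B1", "postType": "comment"},
]


def small_query(content, all_posts_base):
    """Return the main post that `content` belongs to, or None if not found.

    Assumption: a comment belongs to the nearest preceding main post.
    A main post is returned unchanged.
    """
    current_main = None
    for post in all_posts_base:
        if post["postType"] == "main":
            current_main = post["content"]
        if post["content"] == content:
            return current_main
    return None


owner = small_query("comment A2", all_posts_base)
```

Because the lookup runs against the complete, ordered data set rather than the shuffled sample, the ordering lost in preprocessing does not affect the result.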

The specific implementation mode of the step 3 is as follows:

at this point all correctly classified main posts and the main posts corresponding to the correctly classified comments have been obtained; each main post is then looked up again in All posts Base, a process called the big query and shown as Trace in FIG. 1; All posts Base is queried for the dependency relationships that exist between nodes according to the original sequence information, and a social network result set is then formed. Because the data set used here is a random sample drawn from All posts Base, the backtracking operation is needed to avoid the loss of sequence information caused by random sampling. Backtracking, however, makes the data sets inconsistent; to solve this, a verification step is added: all constructed network nodes are taken out (All network nodes in FIG. 1) and grouped by category into the C1 node set and the C2 node set (the sets for categories C1 and C2 produced once the social network is complete), denoted a and b respectively, on which the following operations are performed:

α = a ∩ p (1)

β = b ∩ q (2)

α and β are the final social network result sets for categories C1 and C2; p and q in formulas (1) and (2) denote the C1 Sample Set and the C2 Sample Set, respectively. In this way all redundant data not used in the experiment are filtered out and only the data selected for the experiment are retained, giving the final result. This resolves the inconsistency between data sets and brings everything back onto the data set used in the experiment.
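The big query (Trace) and the verification step can be illustrated with a plain adjacency dictionary. As an assumption not stated in the patent, a dependency is taken to mean "comment follows its main post in the ordered All posts Base", and all nodes in the example are taken to belong to category C1. The names `big_query` and `all_network_nodes` are hypothetical.

```python
def big_query(main_posts, all_posts_base):
    """Map each requested main post to the comments that follow it in sequence."""
    network = {}
    current_main = None
    for post in all_posts_base:
        if post["postType"] == "main":
            wanted = post["content"] in main_posts
            current_main = post["content"] if wanted else None
            if wanted:
                network[current_main] = []
        elif current_main is not None:
            network[current_main].append(post["content"])
    return network


def all_network_nodes(network):
    """Flatten the network into one set of node contents."""
    nodes = set(network)
    for comments in network.values():
        nodes.update(comments)
    return nodes


base = [
    {"content": "main post A", "postType": "main"},
    {"content": "comment A1", "postType": "comment"},
    {"content": "comment A2", "postType": "comment"},
    {"content": "main post B", "postType": "main"},
    {"content": "comment B1", "postType": "comment"},
]
network = big_query({"main post A"}, base)
a = all_network_nodes(network)           # C1 node set (all nodes assumed C1)
p = {"main post A", "comment A1"}        # C1 Sample Set: "comment A2" was not sampled
alpha = a & p                            # verification step, formula (1)
```

Here the trace pulls "comment A2" into the network even though it was never sampled, and the intersection with `p` removes it again.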

The specific implementation mode of the step 4 is as follows:

the construction of the social network is now essentially complete. The link between the LSTM and the social network model is established as follows: given the social network result sets α and β and the LSTM result sets γ and θ, the following operations are performed:

s₁ = α ∪ γ (3)

s₂ = β ∪ θ (4)

L₁ = s₁ ∩ p (5)

L₂ = s₂ ∩ q (6)
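Operations (3) to (6) can be sketched with Python sets. The post contents are hypothetical, and the function name `fuse` is an assumption used only for illustration.

```python
def fuse(alpha, beta, gamma, theta, p, q):
    """Apply formulas (3)-(6): union the two models' results per category,
    then intersect with the original sample sets."""
    s1 = alpha | gamma   # (3)
    s2 = beta | theta    # (4)
    l1 = s1 & p          # (5)
    l2 = s2 & q          # (6)
    return s1, s2, l1, l2


p = {"post 1", "comment 1a", "post 2"}   # C1 Sample Set
q = {"post 3", "comment 3a"}             # C2 Sample Set
alpha = {"post 1", "comment 1a"}         # social network result, C1
beta = {"post 3", "comment 3a"}          # social network result, C2
gamma = {"post 1", "post 3"}             # LSTM labelled C1 ("post 3" is wrong)
theta = {"post 2", "comment 3a"}         # LSTM labelled C2 ("post 2" is wrong)

s1, s2, l1, l2 = fuse(alpha, beta, gamma, theta, p, q)
```

The wrongly labelled posts remain in s₁ and s₂ but drop out of L₁ and L₂, so only correctly judged posts are counted.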

finally, the final accuracy is calculated using equation (7):

The data are comment data from a social network site, in which identical comment texts are unlikely; weighing efficiency against this distinctiveness of the texts, computing the accuracy with sets yields the result quickly. Set construction does, however, deduplicate identical texts and thereby changes the number of texts. To handle this, the counts produced by the sets are taken as the reference: the total count and the count of qualifying items change by the same amount, so the effect on the accuracy is small; that is, reducing the numerator and the denominator by the same amount changes the value of the fraction only slightly. The accuracy is therefore computed throughout under this single unified criterion.
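The effect of deduplication on the ratio can be checked numerically. Equation (7) is not reproduced in this text, so the sketch below assumes, purely for illustration, that the accuracy is the number of correctly judged posts divided by the number of posts in the two sample sets, that is (|L₁| + |L₂|) / (|p| + |q|). The function name `set_accuracy` and all counts are hypothetical.

```python
def set_accuracy(l1, l2, p, q):
    """Assumed form of the final accuracy: correct posts over sampled posts."""
    total = len(p) + len(q)
    if total == 0:
        raise ValueError("sample sets are empty")
    return (len(l1) + len(l2)) / total


p = {"a", "b", "c"}
q = {"d", "e"}
l1 = {"a", "b"}
l2 = {"d", "e"}
acc = set_accuracy(l1, l2, p, q)

# Hypothetical counts: 90 of 100 posts correct before deduplication; one
# correctly judged duplicate text collapses, removing 1 from both counts.
with_duplicates = 90 / 100
deduplicated = 89 / 99
shift = abs(with_duplicates - deduplicated)
```

With these counts the ratio moves by about one thousandth, which is the sense in which the change to numerator and denominator has little influence when duplicates are rare.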

Examples

The data set used for the experiment in this embodiment is post information from a social network site, in JSON format. Each JSON unit contains more than ten fields, such as 'userId', 'userName', 'area', 'like' and 'content'; the experiment uses only the 'userId', 'content' and 'postType' fields, and the data set totals 16000 text items. The labelled data of the two categories are first shuffled and then split into a training set and a test set at a ratio of 8:2 for later training and testing.
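The field selection, shuffle, and 8:2 split might look like the sketch below. It assumes the JSON file is a top-level list of objects; the helper names, the fixed seed, and the sample records are hypothetical, and only the three field names come from the embodiment.

```python
import json
import random

FIELDS = ("userId", "content", "postType")


def load_posts(json_text):
    """Parse the JSON text and keep only the three fields used in the experiment."""
    return [{key: unit[key] for key in FIELDS} for unit in json.loads(json_text)]


def shuffle_and_split(records, seed=0):
    """Shuffle a copy of the records and split it 8:2 into train and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical units standing in for the 16000 real ones.
raw = json.dumps([
    {"userId": i, "userName": f"user{i}", "area": "x", "like": 0,
     "content": f"text {i}", "postType": "main" if i % 2 == 0 else "comment"}
    for i in range(10)
])
posts = load_posts(raw)
train, test = shuffle_and_split(posts)
```

Shuffling destroys the original post order, which is why the later steps look posts up again in the complete All posts Base.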

The following is the accuracy calculated using the accuracy calculation method described in the present invention.

A social network is not one of the algorithms that traditional machine learning and deep learning use for text classification; its purpose is to abstract people into nodes and construct a network among those nodes, so that the relationships between people are reflected intuitively. Traditional models rarely, if ever, apply it to improve the performance of a classification algorithm. A conventional accuracy calculation therefore struggles to reflect the performance of the fusion algorithm directly and accurately. In addition, the social network stage has to return to the original samples for searching and querying, and that search is large and time-consuming, so the accuracy of the fusion algorithm is calculated with sets and set operations. The experimental results show that the method calculates the accuracy quickly.

Claims

1. The social network and LSTM model accuracy calculating method based on the deduplication sample is characterized by comprising the following steps of:

step 1, set construction: sets are constructed from an original sequence data set according to given categories, and are called the original result sets;

the specific implementation mode of the step 2 is as follows:

the misclassified data are then set aside for the moment and the correctly classified data are processed: it is determined whether the current post is a main post or a comment; if it is a main post, it proceeds directly to the next stage to construct the social network; if it is a comment, it enters the Search stage in the figure, where it is compared against All posts Base to find the main post to which it actually belongs; once that main post has been found, it proceeds to the next stage to construct the social network;

the specific implementation mode of the step 3 is as follows:

at this point all correctly classified main posts and the main posts corresponding to the correctly classified comments have been obtained; each main post is then looked up again in All posts Base, a process called the big query and shown as Trace in the figure; All posts Base is queried for the dependency relationships that exist between nodes according to the original sequence information, and a social network result set is then formed; all constructed network nodes are taken out and grouped by category into the C1 node set and the C2 node set, denoted a and b respectively, on which the following operations are performed:

α = a ∩ p (1)

β = b ∩ q (2)

α and β are the final social network result sets for categories C1 and C2; p and q in formulas (1) and (2) denote the C1 Sample Set and the C2 Sample Set, respectively;

step 4, linking: the original result sets constructed in step 1 are fused and linked with the result sets produced by the LSTM of step 2 and the social network of step 3;

the specific implementation mode of the step 4 is as follows:

s₁ = α ∪ γ (3)

s₂ = β ∪ θ (4)

L₁ = s₁ ∩ p (5)

L₂ = s₂ ∩ q (6)

finally, the final accuracy is calculated using equation (7):

2. The method for calculating the accuracy of a social network and LSTM model based on deduplicated samples according to claim 1, wherein the specific implementation mode of step 1 is as follows: two original result sets are constructed, each element of which is a dictionary with the post content as key and the post type as value; the post type is either main post or comment, and the two original result sets are the C1 Sample Set and the C2 Sample Set.