CN114510649B - Social network and LSTM model accuracy calculating method based on deduplication sample - Google Patents

Info

Publication number
CN114510649B
CN114510649B CN202210180890.3A CN202210180890A CN114510649B CN 114510649 B CN114510649 B CN 114510649B CN 202210180890 A CN202210180890 A CN 202210180890A CN 114510649 B CN114510649 B CN 114510649B
Authority
CN
China
Prior art keywords
result
social network
lstm
post
main
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210180890.3A
Other languages
Chinese (zh)
Other versions
CN114510649A (en)
Inventor
魏嵬
李晓婉
张贝贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202210180890.3A
Publication of CN114510649A
Application granted
Publication of CN114510649B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for calculating the accuracy of a social network and LSTM model based on deduplicated samples, comprising the following steps: step 1, set construction, in which sets are built from an original sequence data set according to the given categories and are called the original result sets; step 2, LSTM verification, in which the data are classified with an LSTM model and the correctly classified results are retained; a classified post whose type is main post is kept unchanged, whereas for a post whose type is comment the original sequence data set is searched to find the main post corresponding to that comment; step 3, backtracking, in which, once all main posts have been obtained, a social network is constructed according to the original sequence data set; and step 4, linking, in which the original result sets constructed in step 1 are fused and linked with the result sets produced by fusing the LSTM of step 2 with the social network of step 3.

Description

Social network and LSTM model accuracy calculating method based on deduplication sample
Technical Field
The invention belongs to the field of text classification in natural language processing, and in particular relates to a method for calculating the accuracy of a social network and LSTM model based on deduplicated samples.
Background
Accuracy is widely used as an evaluation measure; in general it reflects the quality of a method (for example, the higher the proportion of questions answered correctly, the higher the score). The same holds in the field of artificial intelligence, where accuracy is mainly used to evaluate objectively whether an algorithm achieves the desired effect.
Algorithm accuracy is obtained by comparing the results output by the algorithm with the true results: the higher the proportion of outputs that agree with the true results, the better the algorithm performs in that respect. To validate an algorithm fully, other indexes such as recall are combined with accuracy for a comprehensive judgment. Conventional accuracy is calculated by comparing each output with its true value, and when the data volume is very large this consumes considerable computing power. In a composite model in particular, such as a fused classification model combining an LSTM and a social network, the whole fused model must be taken into account when calculating the final accuracy, and the complexity of the calculation may increase depending on how the models are connected. The invention therefore provides an accuracy calculation method for a fused LSTM and social network classification model, intended for samples with no repetition or a low repetition rate.
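For reference, the conventional element-by-element comparison described above can be sketched as follows. This is only an illustration of the baseline; the function name and the label values "C1" and "C2" used in the example are assumptions made here.

```python
from typing import Sequence


def conventional_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Share of predictions that agree with the true labels, compared position by position."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    # One comparison per sample: this is the per-item cost the background refers to.
    matches = sum(1 for p, t in zip(predicted, actual) if p == t)
    return matches / len(actual)


# Hypothetical labels for four posts.
baseline = conventional_accuracy(["C1", "C2", "C1", "C1"], ["C1", "C2", "C2", "C1"])
```

Every sample is visited once and paired with its own true label, which requires the predictions and the ground truth to stay aligned by position.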
Disclosure of Invention
The invention aims to provide a method for calculating the accuracy of a social network and LSTM model based on deduplicated samples, suitable for calculating the accuracy of a fused social network and LSTM classification model on samples with no repetition or a low repetition rate.
The technical scheme adopted by the invention is a social network and LSTM model accuracy calculating method based on deduplicated samples, comprising the following steps:
step 1, set construction: sets are built from the original sequence data set according to the given categories and are called the original result sets;
step 2, LSTM verification: the data are classified with an LSTM model and the correctly classified results are retained; a classified post whose type is main post is kept unchanged, whereas for a post whose type is comment the original sequence data set is searched to find the main post corresponding to that comment;
step 3, backtracking: once all main posts have been obtained, a social network is constructed according to the original sequence data set;
step 4, linking: the original result sets constructed in step 1 are fused and linked with the result sets produced by fusing the LSTM of step 2 with the social network of step 3.
The present invention is also characterized in that:
step 1 is implemented as follows: two original result sets are constructed, each element of which is a dictionary with the post content as key and the post type as value; the post type is either main post or comment, and the two original result sets are C1 Sample Set and C2 Sample Set.
The specific implementation mode of the step 2 is as follows:
first, an LSTM classifier performs binary classification on the posts, and the classification results are grouped by category into result sets denoted γ and θ, where γ contains the posts the LSTM assigns to category C1 and θ contains the posts it assigns to category C2;
then the misclassified data are set aside for the time being and the correctly classified data are processed: whether the current post is a main post or a comment is determined; if it is a main post, it goes directly to the next stage, construction of the social network; if it is a comment, it enters the Search stage in FIG. 1, where it is compared against All posts Base to find the main post to which it actually belongs; once the main post corresponding to the current comment has been found, it goes to the next stage, construction of the social network.
The specific implementation mode of the step 3 is as follows:
once all correctly classified main posts from the LSTM classification and the main posts corresponding to the correctly classified comments have been obtained, each main post is looked up in All posts Base; this process is called the big query and is shown as Trace in the figure; the social network is built by returning to All posts Base and querying, according to the original sequence information, the dependency relationships that exist between nodes, and a social network result set is then formed; all constructed network nodes are taken out and grouped by category into the C1 node set and the C2 node set, denoted a and b respectively, and the following operations are performed:
α = a ∩ p (1)
β = b ∩ q (2)
α and β are the final social network result sets for categories C1 and C2; p and q in formulas (1) and (2) refer to C1 Sample Set and C2 Sample Set, respectively.
The specific implementation mode of the step 4 is as follows:
the link between the LSTM and the social network model is established as follows: with the social network result sets α and β and the LSTM result sets γ and θ constructed, the following operations are performed:
s₁ = α ∪ γ (3)
s₂ = β ∪ θ (4)
where s₁ and s₂ are the sets for categories C1 and C2 computed by the fused LSTM and social network model; operations (5) and (6) are then performed:
L₁ = s₁ ∩ p (5)
L₂ = s₂ ∩ q (6)
where L₁ and L₂ are the posts of the two categories that were judged correctly, and p and q refer to C1 Sample Set and C2 Sample Set, respectively;
finally, the final accuracy is calculated using equation (7):
The beneficial effects of the invention are as follows: unlike common accuracy calculation methods, the method is robust and is suited to calculating the accuracy of a fusion model that combines deep learning with a social network having social properties. Data from social networking sites contain few repeated items and carry time-sequence and social properties, and the method exploits these characteristics of the social data set to calculate the accuracy of the fusion model faster.
Drawings
FIG. 1 is a flow chart of the social network and LSTM model accuracy calculation method based on the deduplication sample of the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention provides a social network and LSTM model accuracy calculating method based on a deduplication sample, which is shown in fig. 1 and comprises the following steps:
Step 1, set construction: sets are built from the original sequence data set according to the given categories and are called the original result sets; that is, the original sequence data set is grouped into one set per category;
step 1 is implemented as follows: two original result sets are constructed, each element of which is a dictionary with the post content as key and the post type as value; the post type is either main post or comment, and the two original result sets are C1 Sample Set and C2 Sample Set, as shown in FIG. 1.
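A minimal sketch of the set construction in step 1, under stated assumptions: the input is taken to be an iterable of (content, post type, category) tuples, and each sample set is held as a single dictionary mapping post content to post type. The record layout, the function name and the literal values "main", "comment", "C1" and "C2" are illustrative choices, not details given in the description.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (content, post_type, category).
Record = Tuple[str, str, str]


def build_sample_sets(records: Iterable[Record]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (C1 Sample Set, C2 Sample Set) as content -> post type dictionaries."""
    c1: Dict[str, str] = {}
    c2: Dict[str, str] = {}
    for content, post_type, category in records:
        target = c1 if category == "C1" else c2
        # Identical texts collapse onto one key, so each text is stored once.
        target[content] = post_type
    return c1, c2


records = [
    ("great phone", "main", "C1"),
    ("agreed", "comment", "C1"),
    ("bad service", "main", "C2"),
    ("agreed", "comment", "C1"),  # duplicate text, kept only once
]
c1_sample_set, c2_sample_set = build_sample_sets(records)
```

Keying on the post content is what makes the later post type lookup a single dictionary access, and it is also where duplicate texts are merged.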
Step 2, LSTM verification: the data are classified with an LSTM model and the correctly classified results are retained; a classified post whose type is main post is kept unchanged, whereas for a post whose type is comment the original sequence data set is searched to find the main post corresponding to that comment;
the specific implementation mode of the step 2 is as follows:
first, an LSTM classifier performs binary classification on the posts, and the classification results are grouped by category into result sets denoted γ and θ, where γ contains the posts the LSTM assigns to category C1 and θ contains the posts it assigns to category C2;
then the misclassified data are set aside for the time being and the correctly classified data are processed: whether the current post is a main post or a comment is determined. This is the Post type stage in the figure, which uses the post content as a key to query the corresponding value (the post type). The determination requires querying and comparing against the original sequence data set, All posts Base (the experiment uses only part of it; All posts Base refers to the complete data set, whose sequence relationships are intact). The flow then depends on the post type found at this stage: if the post is a main post, it goes directly to the next stage, construction of the social network; if it is a comment, it enters the Search stage in the figure, where it is compared against All posts Base to find the main post to which it actually belongs. Once the main post corresponding to the current comment has been found, it goes to the next stage, construction of the social network. This process is called the small query and is part of the backtracking of the social network. The small query effectively resolves the loss of sample ordering caused by data preprocessing.
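The small query can be sketched as below. The description does not state how a comment is matched to its main post, so this sketch assumes that All posts Base is an ordered list of (content, post type) pairs in which every comment follows the main post it belongs to; the function name and that ordering rule are assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical layout of All posts Base: (content, post_type) in original order.
Post = Tuple[str, str]


def small_query(content: str, all_posts_base: List[Post]) -> Optional[str]:
    """Return the main post for `content`: the post itself if it is a main post,
    otherwise the nearest preceding main post (assumed ordering rule)."""
    last_main: Optional[str] = None
    for text, post_type in all_posts_base:
        if post_type == "main":
            last_main = text
        if text == content:
            # For a main post, last_main was just set to the post itself.
            return last_main
    return None  # content does not occur in All posts Base


all_posts_base = [
    ("m1", "main"), ("c1", "comment"),
    ("m2", "main"), ("c2", "comment"),
]
main_of_comment = small_query("c2", all_posts_base)
main_of_main = small_query("m1", all_posts_base)
```

Because the lookup walks the unshuffled base, it still works when the experimental sample was randomly drawn and reordered during preprocessing.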
Step 3, backtracking: once all main posts have been obtained, a social network is constructed according to the original sequence data set;
step 3 is implemented as follows:
at this point all correctly classified main posts from the LSTM classification, and the main posts corresponding to the correctly classified comments, have been obtained. Each main post is then looked up in All posts Base; this process is called the big query and is shown as Trace in the figure. The social network is built by returning to All posts Base and querying, according to the original sequence information, the dependency relationships that exist between nodes, and a social network result set is then formed. Because the data set used here is a randomly extracted part of All posts Base, the backtracking operation is needed to avoid the loss of sequence information caused by random sampling. Backtracking, however, makes the data sets inconsistent. To solve this, a verification step is added: all constructed network nodes are taken out (All network nodes in the figure) and grouped by category into the C1 node set and the C2 node set (the sets for categories C1 and C2 produced once the social network is complete), denoted a and b respectively, and the following operations are performed:
α = a ∩ p (1)
β = b ∩ q (2)
α and β are the final social network result sets for categories C1 and C2; p and q in formulas (1) and (2) refer to C1 Sample Set and C2 Sample Set, respectively. In this way all redundant data not used in the experiment are filtered out and only the data selected for the experiment are retained, giving the final result. This removes the ambiguity between data sets and unifies the result on the data set used in the experiment.
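The big query and the filtering of formulas (1) and (2) can be sketched as follows. The trace rule used here (a main post is linked to the comments that follow it in All posts Base) is an assumption standing in for the unspecified dependency relationships, and the function names are hypothetical; the intersection itself follows formula (1) directly.

```python
from typing import Dict, Iterable, List, Set, Tuple

Post = Tuple[str, str]  # hypothetical (content, post_type) layout


def trace(main_posts: Iterable[str], all_posts_base: List[Post]) -> Set[str]:
    """Big query: collect each requested main post and the comments that follow it."""
    wanted = set(main_posts)
    nodes: Set[str] = set()
    collecting = False
    for text, post_type in all_posts_base:
        if post_type == "main":
            # A new main post starts a new thread; collect it only if requested.
            collecting = text in wanted
        if collecting:
            nodes.add(text)
    return nodes


def restrict_to_sample(node_set: Set[str], sample_set: Dict[str, str]) -> Set[str]:
    """Formula (1)/(2): keep only the nodes whose content is in the sample set."""
    return node_set & set(sample_set)


all_posts_base = [
    ("m1", "main"), ("c1", "comment"), ("c2", "comment"),
    ("m2", "main"), ("c3", "comment"),
]
p = {"m1": "main", "c1": "comment"}      # C1 Sample Set (c2 was not sampled)
a = trace(["m1"], all_posts_base)        # C1 node set from the full base
alpha = restrict_to_sample(a, p)         # alpha = a ∩ p
```

Here the trace pulls in "c2" from the full base even though it was never part of the experimental sample, and the intersection with p removes it again.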
Step 4, linking: the original result sets constructed in step 1 are fused and linked with the result sets produced by fusing the LSTM of step 2 with the social network of step 3.
Step 4 is implemented as follows:
the social network construction is now essentially complete. The link between the LSTM and the social network model is established as follows: with the social network result sets α and β and the LSTM result sets γ and θ constructed, the following operations are performed:
s₁ = α ∪ γ (3)
s₂ = β ∪ θ (4)
where s₁ and s₂ are the sets for categories C1 and C2 computed by the fused LSTM and social network model; operations (5) and (6) are then performed:
L₁ = s₁ ∩ p (5)
L₂ = s₂ ∩ q (6)
where L₁ and L₂ are the posts of the two categories that were judged correctly, and p and q refer to C1 Sample Set and C2 Sample Set, respectively;
finally, the final accuracy is calculated using equation (7):
The data are comment data from a social networking site, where two comment texts are unlikely to be identical; taking both efficiency and this property of the text into account, computing the accuracy with sets yields the result quickly. Building the sets does, however, inevitably deduplicate identical texts and so changes the text counts. To handle this, the counts produced by the sets are taken as the reference: the total count and the count of items satisfying the condition are reduced by the same amount, so the effect on the accuracy is small (reducing the numerator and the denominator by the same amount changes the value of the fraction only slightly). The accuracy is therefore calculated throughout under this single, unified criterion.
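Formulas (3) to (6) map directly onto set operations, as sketched below. Equation (7) is not reproduced in this text, so the final ratio in the sketch, correctly judged posts over the total size of the two sample sets, is an assumed form chosen to match the remark that the set-derived counts serve as the reference; it should not be read as the patented formula. The function name and sample values are likewise illustrative.

```python
from typing import Set


def fused_accuracy(alpha: Set[str], beta: Set[str],
                   gamma: Set[str], theta: Set[str],
                   p: Set[str], q: Set[str]) -> float:
    """Set-based accuracy of the fused LSTM and social network model."""
    s1 = alpha | gamma   # formula (3)
    s2 = beta | theta    # formula (4)
    l1 = s1 & p          # formula (5): C1 posts judged correctly
    l2 = s2 & q          # formula (6): C2 posts judged correctly
    total = len(p) + len(q)
    if total == 0:
        return 0.0
    # ASSUMPTION: this ratio stands in for equation (7), which is missing here.
    return (len(l1) + len(l2)) / total


p = {"a", "b", "c"}          # C1 Sample Set contents
q = {"d", "e"}               # C2 Sample Set contents
gamma = {"a", "d"}           # LSTM says C1 ("d" is wrong)
theta = {"e", "b"}           # LSTM says C2 ("b" is wrong)
alpha = {"b"}                # social network recovers "b" for C1
beta: Set[str] = set()
accuracy = fused_accuracy(alpha, beta, gamma, theta, p, q)
```

In this toy case the LSTM alone gets "a" and "e" right, the social network adds "b", and three of the five sampled posts end up judged correctly.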
Examples
The data set used in the experiment of this embodiment is post information from a social networking site, in JSON format. Each JSON unit contains more than ten fields, such as 'userId', 'userName', 'area', 'like' and 'content'; the experiment uses only the 'userId', 'content' and 'postType' fields. The data set totals 16000 pieces of text data. First, the two classes of labeled data are shuffled, and the data are then split into a training set and a test set in a ratio of 8:2 for later training and testing.
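The preparation described above can be sketched as follows, assuming one JSON object per line. The 'userId', 'content' and 'postType' fields come from the description; the 'label' field name, the function name, the fixed seed and the sample records are assumptions made for the example.

```python
import json
import random
from typing import Dict, Iterable, List, Tuple

# "label" is a hypothetical field name; the description does not name it.
FIELDS = ("userId", "content", "postType", "label")


def load_and_split(json_lines: Iterable[str], train_ratio: float = 0.8,
                   seed: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Keep the fields used in the experiment, shuffle, and split 8:2."""
    records = []
    for line in json_lines:
        unit = json.loads(line)
        records.append({name: unit[name] for name in FIELDS})
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(records)
    cut = int(len(records) * train_ratio)
    return records[:cut], records[cut:]


sample_lines = [
    json.dumps({
        "userId": i,
        "userName": f"user{i}",
        "content": f"text {i}",
        "postType": "main" if i % 2 == 0 else "comment",
        "label": "C1" if i < 5 else "C2",
    })
    for i in range(10)
]
train_set, test_set = load_and_split(sample_lines)
```

Shuffling breaks the original post order, which is why the later stages go back to All posts Base rather than relying on the order of the split data.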
The following is the accuracy calculated using the accuracy calculation method described in the present invention.
A social network is not one of the algorithms conventionally used for text classification in machine learning and deep learning; its purpose is to abstract people into nodes and build a network among those nodes so that the relationships between people are shown intuitively. In conventional models a social network is rarely, if ever, applied to improve the performance of a classification algorithm. Conventional accuracy calculation therefore has difficulty reflecting the performance of the fusion algorithm directly and accurately. In addition, the social network stage has to return to the original samples for searching and querying, and this search is large and time-consuming, so sets and set operations are used to calculate the accuracy of the fusion algorithm. Experimental results show that the method calculates the accuracy rapidly.

Claims (2)

1. The social network and LSTM model accuracy calculating method based on the deduplication sample is characterized by comprising the following steps of:
step 1, set construction: sets are built from the original sequence data set according to the given categories and are called the original result sets;
step 2, LSTM verification: the data are classified with an LSTM model and the correctly classified results are retained; a classified post whose type is main post is kept unchanged, whereas for a post whose type is comment the original sequence data set is searched to find the main post corresponding to that comment;
the specific implementation mode of the step 2 is as follows:
first, an LSTM classifier performs binary classification on the posts, and the classification results are grouped by category into result sets denoted γ and θ, where γ contains the posts the LSTM assigns to category C1 and θ contains the posts it assigns to category C2;
then the misclassified data are set aside for the time being and the correctly classified data are processed: whether the current post is a main post or a comment is determined; if it is a main post, it goes directly to the next stage, construction of the social network; if it is a comment, it enters the Search stage in the figure, where it is compared against All posts Base to find the main post to which it actually belongs; once the main post corresponding to the current comment has been found, it goes to the next stage, construction of the social network;
step 3, backtracking: once all main posts have been obtained, a social network is constructed according to the original sequence data set;
the specific implementation mode of the step 3 is as follows:
once all correctly classified main posts from the LSTM classification and the main posts corresponding to the correctly classified comments have been obtained, each main post is looked up in All posts Base; this process is called the big query and is shown as Trace in the figure; the social network is built by returning to All posts Base and querying, according to the original sequence information, the dependency relationships that exist between nodes, and a social network result set is then formed; all constructed network nodes are taken out and grouped by category into the C1 node set and the C2 node set, denoted a and b respectively, and the following operations are performed:
α = a ∩ p (1)
β = b ∩ q (2)
α and β are the final social network result sets for categories C1 and C2; p and q in formulas (1) and (2) refer to C1 Sample Set and C2 Sample Set, respectively;
step 4, linking: the original result sets constructed in step 1 are fused and linked with the result sets produced by fusing the LSTM of step 2 with the social network of step 3;
the specific implementation mode of the step 4 is as follows:
the link between the LSTM and the social network model is established as follows: with the social network result sets α and β and the LSTM result sets γ and θ constructed, the following operations are performed:
s₁ = α ∪ γ (3)
s₂ = β ∪ θ (4)
where s₁ and s₂ are the sets for categories C1 and C2 computed by the fused LSTM and social network model; operations (5) and (6) are then performed:
L₁ = s₁ ∩ p (5)
L₂ = s₂ ∩ q (6)
where L₁ and L₂ are the posts of the two categories that were judged correctly, and p and q refer to C1 Sample Set and C2 Sample Set, respectively;
finally, the final accuracy is calculated using equation (7):
2. The social network and LSTM model accuracy calculating method based on the deduplication sample according to claim 1, wherein step 1 is implemented as follows: two original result sets are constructed, each element of which is a dictionary with the post content as key and the post type as value; the post type is either main post or comment, and the two original result sets are C1 Sample Set and C2 Sample Set.
CN202210180890.3A 2022-02-25 2022-02-25 Social network and LSTM model accuracy calculating method based on deduplication sample Active CN114510649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210180890.3A CN114510649B (en) 2022-02-25 2022-02-25 Social network and LSTM model accuracy calculating method based on deduplication sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210180890.3A CN114510649B (en) 2022-02-25 2022-02-25 Social network and LSTM model accuracy calculating method based on deduplication sample

Publications (2)

Publication Number Publication Date
CN114510649A CN114510649A (en) 2022-05-17
CN114510649B CN114510649B (en) 2024-04-09

Family

ID=81554046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210180890.3A Active CN114510649B (en) 2022-02-25 2022-02-25 Social network and LSTM model accuracy calculating method based on deduplication sample

Country Status (1)

Country Link
CN (1) CN114510649B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390018A (en) * 2019-07-25 2019-10-29 哈尔滨工业大学 A kind of social networks comment generation method based on LSTM

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108282A1 (en) * 2017-10-09 2019-04-11 Facebook, Inc. Parsing and Classifying Search Queries on Online Social Networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
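Wu Mingqiang; Wu Jiaming; Xin Weibin. Optimization of a Word2Vec+LSTM multi-class sentiment classification algorithm. Computer Systems & Applications. 2020, (No. 01), full text. *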

Also Published As

Publication number Publication date
CN114510649A (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN111428054B (en) Construction and storage method of knowledge graph in network space security field
CN108804521B (en) Knowledge graph-based question-answering method and agricultural encyclopedia question-answering system
CN108573411B (en) Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments
CN110968699B (en) Logic map construction and early warning method and device based on fact recommendation
CN105045875B (en) Personalized search and device
CN109522420B (en) Method and system for acquiring learning demand
CN109472033A (en) Entity relation extraction method and system in text, storage medium, electronic equipment
CN103927297B (en) Evidence theory based Chinese microblog credibility evaluation method
CN105653706A (en) Multilayer quotation recommendation method based on literature content mapping knowledge domain
CN112579707A (en) Log data knowledge graph construction method
CN105095444A (en) Information acquisition method and device
CN104484380A (en) Personalized search method and personalized search device
CN111651566B (en) Multi-task small sample learning-based referee document dispute focus extraction method
CN113779272A (en) Data processing method, device and equipment based on knowledge graph and storage medium
CN104715063A (en) Search ranking method and search ranking device
CN112069327A (en) Knowledge graph construction method and system for teaching resources of online education classroom
CN114911893A (en) Method and system for automatically constructing knowledge base based on knowledge graph
CN112836067B (en) Intelligent searching method based on knowledge graph
CN111309930B (en) Medical knowledge graph entity alignment method based on representation learning
CN111984790B (en) Entity relation extraction method
CN111339258B (en) University computer basic exercise recommendation method based on knowledge graph
CN114510649B (en) Social network and LSTM model accuracy calculating method based on deduplication sample
CN116611447A (en) Information extraction and semantic matching system and method based on deep learning method
CN116431828A (en) Construction method of power grid center data asset knowledge graph database constructed based on neural network technology
CN115827890A (en) Hot event knowledge graph link estimation method based on network social platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant