CN109522414B - Document delivery object selection system - Google Patents

Info

Publication number
CN109522414B
CN109522414B CN201811415165.XA CN201811415165A CN109522414B CN 109522414 B CN109522414 B CN 109522414B CN 201811415165 A CN201811415165 A CN 201811415165A CN 109522414 B CN109522414 B CN 109522414B
Authority
CN
China
Prior art keywords
information
abstract
module
page
website
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811415165.XA
Other languages
Chinese (zh)
Other versions
CN109522414A (en)
Inventor
丰小月
梁艳春
王冬晖
许东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Application filed by Jilin University
Priority to CN201811415165.XA
Publication of CN109522414A
Application granted
Publication of CN109522414B
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a document delivery object selection system comprising an information extraction module, an information management module, an information analysis module and an information sorting feedback module. The system helps a user select a suitable journal for submitting a manuscript, and avoids rejection, delay, or low readership after publication caused by submitting to an unsuitable journal.

Description

Document delivery object selection system
Technical Field
The invention relates to the field of information recommendation, in particular to a document delivery object selection system.
Background
Because network information is growing rapidly and becoming more complex, it is difficult for users to find the information they want in massive data; for researchers in particular, keeping up with developments in their field in a timely manner is not always easy. Today, much of the world's new knowledge is represented in digital form and stored in digital library systems, so digital libraries are entering a golden age. Such digital libraries exist in scientific and technical fields, for example the ACM Library and the IEEE Library. However, as technology develops and information keeps increasing, information overload becomes unavoidable. For example, when researchers want to select an appropriate publication in which to publish a paper, they find that a large number of publications match their query but are largely unrelated to their actual needs, which leads to inappropriate choices. Researchers therefore urgently need a paper recommendation system to help them select appropriate publications.
Disclosure of Invention
In view of the above, the invention provides a document delivery object selection system, which is characterized by comprising an information extraction module, an information management module, an information analysis module and an information sorting feedback module;
the information extraction module comprises an information exchange path downloading unit, a query scheme unit, a page information extraction unit and an analysis storage unit, and each unit is independent and is sequentially executed;
the information exchange path downloading unit comprises an extracting device, a screening device, a standardizing device and a duplicate removal device;
the extraction device extracts the information exchange path of the code of the HTML page by using an ELFhash function;
the screening device directly deletes the unnecessary information exchange paths and stores the necessary information exchange paths;
the standardizing device converts all information exchange paths into absolute addresses;
the duplicate removal device performs website duplicate removal by creating a Hash table and a Hash function;
the information exchange path downloading unit is responsible for extracting and sorting all information exchange paths in the HTML page and used as the input of the query scheme unit;
the query scheme unit downloads the website extracted by the information exchange path extraction unit to obtain an information exchange path on a page of the website; when downloading the website, using a breadth-first scheme;
the page information extraction unit is responsible for extracting key information on a page of the website, including the title, abstract and author of the article;
the information classification unit is responsible for classifying and storing key information on the page of the website according to the category of the publication, and preprocessing the abstract to form an inverted index table;
the information extraction module stores the extracted information into the information management module;
the information management module is responsible for defining an information management standard, selecting a proper information storage mode and defining an information access channel according to the information management standard and the information storage mode;
a user accesses the information management module through the information access channel, performs processing flow definition through the graphical interface, and stores the generated processing flow definition in the information management module through the information access channel;
the information management module generates an information processing execution plan according to the processing flow definition;
the information analysis module is used for preprocessing the information in the information management module according to the information processing execution plan, wherein the preprocessing comprises four steps: case conversion, word segmentation, stop word filtering and stem extraction;
carrying out case conversion on the information, converting all letters in the information into lower case letters;
carrying out word segmentation on the information, taking spaces, punctuation marks and paragraph breaks as delimiters to segment the information into independent words;
extracting word stems from the information, so that words with the same root but different tenses, such as past-tense forms, are reduced to a common stem;
filtering stop words from the information, filtering out auxiliary verbs, prepositions, conjunctions and interjections in the information;
the information sorting module is used for vectorizing the information, and the information of the key information on the page of each website corresponds to a characteristic vector;
the information feedback module compares the information of the article abstract given by the user with the information of the abstract in the inverted index table, and calculates the similarity of the article abstract and the abstract through a formula I:
the formula I is as follows:
Figure RE-GDA0001919461210000031
wherein β represents the similarity between the information of the article abstract given by the user and the information of the abstract in the inverted index table; W_{1j} and W_{2j} respectively represent the feature vector corresponding to the information of the article abstract given by the user and the feature vector corresponding to the information of the abstract in the inverted index table; j and n are positive integers, and j ≤ n;
and arranging the abstracts in the inverted index table from high to low according to the similarity, and acquiring key information on the page of the website where the corresponding abstracts are located.
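The link-handling steps described above (extraction, screening, conversion to absolute addresses, de-duplication with a hash table, and breadth-first downloading) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the rule of keeping only http(s) links, and the use of the ELF hash as the hash-table key are assumptions made for illustration, and pages are supplied by a caller-provided `fetch` function instead of real network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


def elf_hash(text: str) -> int:
    """Classic ELF string hash, kept within 32 bits."""
    h = 0
    for byte in text.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        high = h & 0xF0000000
        if high:
            h ^= high >> 24
        h &= ~high & 0xFFFFFFFF
    return h


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str) -> list:
    """Extract, screen, normalize and de-duplicate the links of one page."""
    collector = LinkCollector()
    collector.feed(html)
    table = {}  # hash value -> bucket of URLs (the bucket resolves collisions)
    kept = []
    for href in collector.hrefs:
        # Normalization: resolve against the page URL and drop the fragment.
        absolute = urldefrag(urljoin(base_url, href)).url
        # Screening (assumed rule): keep only http(s) links.
        if not absolute.startswith(("http://", "https://")):
            continue
        bucket = table.setdefault(elf_hash(absolute), [])
        if absolute not in bucket:
            bucket.append(absolute)
            kept.append(absolute)
    return kept


def crawl_breadth_first(seed: str, fetch, max_pages: int = 100) -> list:
    """Visit pages level by level; fetch(url) must return the page HTML."""
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in extract_links(fetch(url), url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory site used instead of real downloads.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="b">B</a> '
                           '<a href="mailto:x@example.org">mail</a>',
    "http://example.org/a": '<a href="/c#top">C</a> <a href="/">home</a>',
    "http://example.org/b": '<a href="/c">C</a>',
    "http://example.org/c": "",
}
order = crawl_breadth_first("http://example.org/", lambda url: PAGES.get(url, ""))
```

A queue gives the breadth-first order: every link found on the seed page is visited before any link found one level deeper.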
The beneficial effects of the invention are as follows: the invention provides a document delivery object selection system that helps a user select a suitable journal for submitting a manuscript, avoids rejection, delay, or low readership after publication caused by submitting to an unsuitable journal, and has broad market prospects and application value.
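The preprocessing, inverted index and similarity ranking described in the disclosure can be sketched as follows. Formula I is available only as an image in this text, so the sketch assumes it is the usual cosine similarity between the two feature vectors; that is an assumption, not a statement of the patented formula. The stop word list, the suffix-stripping stemmer and the term-count vectors are likewise simplified stand-ins chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop list (assumption); a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "can", "will",
              "of", "in", "on", "for", "over", "and", "or", "but", "oh"}


def naive_stem(word: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list:
    """Case conversion, word segmentation, stop word filtering, stemming."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs: dict) -> dict:
    """Map each term to the sorted ids of the abstracts that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def cosine_similarity(w1: list, w2: list) -> float:
    """Assumed form of Formula I: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def rank_abstracts(query: str, abstracts: dict) -> list:
    """Return (similarity, id) pairs sorted from high to low similarity."""
    docs = {doc_id: preprocess(text) for doc_id, text in abstracts.items()}
    index = build_inverted_index(docs)
    vocabulary = sorted(index)

    def vectorize(tokens):
        counts = Counter(tokens)
        return [counts[term] for term in vocabulary]

    query_tokens = preprocess(query)
    # Use the inverted index to consider only abstracts sharing a query term.
    candidates = set()
    for term in set(query_tokens):
        candidates.update(index.get(term, ()))
    q = vectorize(query_tokens)
    scored = [(cosine_similarity(q, vectorize(docs[d])), d) for d in candidates]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical abstracts keyed by hypothetical publication ids.
ABSTRACTS = {
    "J1": "Indexing graphs for fast queries.",
    "J2": "Detecting edges in images.",
}
ranked = rank_abstracts("Queries over indexed graphs", ABSTRACTS)
```

In this toy run only J1 shares terms with the query, so the inverted index narrows the candidate set before any similarity is computed.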
Drawings
FIG. 1 shows how the accuracy on the training set and the test set varies with the number of selected features for the Top3 (recommend three classes) version under the MI feature selection model;
FIG. 2 shows how the accuracy on the training set and the test set varies with the number of selected features for the Top3 version under the IG feature selection model;
FIG. 3 shows how the accuracy on the training set and the test set varies with the number of selected features for the Top3 version under the CHI feature selection model;
FIG. 4 shows the macro-averaged ROC curves of the Top1 (recommend one class) version under the MI, IG and CHI feature selection models, respectively.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more apparent, the present invention is described in detail below with reference to the embodiments. It should be noted that the specific embodiments described herein are only for illustrating the present invention and are not to be construed as limiting the present invention, and products that can achieve the same functions are included in the scope of the present invention. The specific method comprises the following steps:
Example 1: The distribution of the number of abstracts in the training set and the test set is shown in Table 1. The data crawler of the invention collected information on 14,012 articles, each including the title, abstract and authors. Two-thirds of the abstracts are used as the training set and one-third as the test set. In the experiments, the data were taken from articles published in 2013 and 2014 in CCF Class A journals and conferences. For journals and conferences that published too few articles in 2013 and 2014, the invention also collected their articles from other years. To verify the correctness of the data set, twenty percent of the abstracts in each journal and conference were verified manually.
TABLE 1 distribution of training set and test set Abstract number
Figure RE-GDA0001919461210000041
The recommendation system of the invention provides two kinds of recommendation results: recommending one class (Top1) and recommending three classes (Top3). The Top1 version recommends only one journal or conference, and its results are evaluated very strictly. The Top3 version gives three candidate journals or conferences: the three categories with the highest classification scores are selected as the recommendation result. In other words, if any one of the three recommendations hits the correct journal or conference, the recommendation is considered successful. The Top3 version also gives users more choices, because an article published in one field sometimes influences other related fields; an article can be valuable not only to one field but to several. In addition, different conferences and journals often publish papers in similar areas, for example ICCV, CVPR and TIP, and many journal articles are extensions of conference articles.
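The Top1 and Top3 success criteria can be made concrete with a short sketch. The score dictionaries below are invented for illustration, and the label "OTHER" is a hypothetical placeholder; only the hit rule (the true venue must be among the k highest-scoring categories) comes from the text.

```python
def top_k_labels(scores: dict, k: int) -> list:
    """Return the k categories with the highest classification scores."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ordered[:k]]


def top_k_accuracy(all_scores: list, true_labels: list, k: int) -> float:
    """Fraction of samples whose true label is among the top k categories."""
    hits = sum(
        1
        for scores, truth in zip(all_scores, true_labels)
        if truth in top_k_labels(scores, k)
    )
    return hits / len(true_labels)


# Invented scores for two test abstracts (illustration only).
ALL_SCORES = [
    {"ICCV": 0.5, "CVPR": 0.4, "TIP": 0.1, "OTHER": 0.0},
    {"ICCV": 0.1, "CVPR": 0.6, "TIP": 0.2, "OTHER": 0.05},
]
TRUE_LABELS = ["CVPR", "OTHER"]

top1 = top_k_accuracy(ALL_SCORES, TRUE_LABELS, k=1)
top3 = top_k_accuracy(ALL_SCORES, TRUE_LABELS, k=3)
```

Here the first abstract is a miss under Top1 (ICCV is ranked first) but a hit under Top3, which is why Top3 accuracy can never be lower than Top1 accuracy.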
In order to generate a better feature space, the invention carried out many experiments on the number of features selected for each category. Three feature selection methods, MI, IG and CHI, are compared, and the top M words with the highest scores are used as the feature vector FV_i of the i-th category.
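As an illustration of per-category feature selection, the sketch below scores words with the standard chi-square statistic from text categorization and keeps the top M words for one category. The text does not give the scoring formulas, so the exact form used here, the document-level counting and the alphabetical tie-breaking are assumptions; the toy documents and labels are invented.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term for a category.

    a: documents in the category containing the term
    b: documents outside the category containing the term
    c: documents in the category without the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def top_m_features(docs: list, labels: list, category: str, m: int) -> list:
    """Return the m highest-scoring words for one category."""
    inside = [set(doc) for doc, label in zip(docs, labels) if label == category]
    outside = [set(doc) for doc, label in zip(docs, labels) if label != category]
    vocabulary = set().union(*inside, *outside)
    scores = {}
    for term in vocabulary:
        a = sum(term in doc for doc in inside)
        b = sum(term in doc for doc in outside)
        scores[term] = chi_square(a, b, len(inside) - a, len(outside) - b)
    # Highest score first; ties are broken alphabetically (assumption).
    return sorted(vocabulary, key=lambda term: (-scores[term], term))[:m]


# Invented, already preprocessed toy documents.
DOCS = [["image", "pixel"], ["image", "edge"], ["query", "index"], ["query", "join"]]
LABELS = ["CV", "CV", "DB", "DB"]
features_cv = top_m_features(DOCS, LABELS, "CV", m=2)
```

Note that the chi-square score is symmetric: "query" scores as high for the category CV as "image" does, because its absence from CV documents is just as informative as the presence of "image".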
The present invention uses accuracy (equation 1), F-measure (equation 2), and ROC curves to evaluate the effectiveness of the system. Because the system of the present invention uses a multi-classification model, the present invention uses the macro-averaged ROC (equations 3 and 4) curve.
Figure RE-GDA0001919461210000051
Figure RE-GDA0001919461210000052
Figure RE-GDA0001919461210000053
Figure RE-GDA0001919461210000054
where P_i is the set of test samples predicted to be of the i-th class, G_i is the set of test samples whose true class is i, and TP_i, FN_i, FP_i and TN_i are the numbers of true positives, false negatives, false positives and true negatives for the i-th class, respectively.
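Equations 1 to 4 are available only as images in this text, so the following sketch uses the common definitions of these quantities rather than the patent's exact formulas: accuracy as the fraction of correctly predicted samples, F-measure as the per-class harmonic mean of precision and recall averaged over classes, and macro-averaged true and false positive rates built from TP_i, FN_i, FP_i and TN_i. It computes a single operating point; a full ROC curve would repeat the rate computation over a range of score thresholds.

```python
def evaluate(predicted: list, actual: list) -> dict:
    """Accuracy, macro F-measure and macro-averaged TPR/FPR (assumed forms)."""
    classes = sorted(set(actual) | set(predicted))
    total = len(actual)
    correct = sum(p == g for p, g in zip(predicted, actual))
    f_scores, tprs, fprs = [], [], []
    for cls in classes:
        tp = sum(p == cls and g == cls for p, g in zip(predicted, actual))
        fp = sum(p == cls and g != cls for p, g in zip(predicted, actual))
        fn = sum(p != cls and g == cls for p, g in zip(predicted, actual))
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_scores.append(2 * precision * recall / (precision + recall))
        else:
            f_scores.append(0.0)
        tprs.append(recall)  # true positive rate equals recall
        fprs.append(fp / (fp + tn) if fp + tn else 0.0)
    count = len(classes)
    return {
        "accuracy": correct / total,
        "macro_f": sum(f_scores) / count,
        "macro_tpr": sum(tprs) / count,
        "macro_fpr": sum(fprs) / count,
    }


# Invented predictions for four test samples and two classes.
metrics = evaluate(predicted=["A", "A", "B", "B"], actual=["A", "B", "B", "B"])
```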
Figs. 1, 2 and 3 show how the accuracy on the training set and the test set varies after selecting different numbers (i.e., M) of features for each category under the MI, IG and CHI feature selection models, respectively. The results in Figs. 1, 2 and 3 were obtained with the Top3 version; in other words, if one of the three recommendations given by the system is correct, the recommendation is considered successful. Fig. 4 and Table 2 compare the three feature selection models.
As can be seen in fig. 1, 2, 3:
(1) With the IG and CHI feature selection models, once the number of features per class reaches 30, the accuracy increases as the number of features increases. With the MI model, however, the accuracy drops by 16.60% as the number of features goes from 30 to 70. The accuracy is lowest at 70 features per class and then begins to increase as the number of features per class grows. Even at 400 features, the accuracy of the MI-based model is only 55.7%, which is 7.9% and 9.3% lower than that of IG and CHI, respectively.
(2) The accuracy of the CHI-based feature selection model is generally better than that of both MI and IG. For example, at the beginning of the test accuracy curve, the accuracy of the CHI model is 78.7% and 1.2% higher than that of MI and IG, respectively; at the end of the curve, where the number of features per class is 400, the accuracy of the CHI model is 12.2% and 2.8% higher than that of MI and IG, respectively.
As can be seen from Fig. 4, the areas under the curves of the CHI-based and IG-based models are much larger than that of the MI-based model. The area under the curve is 0.9404 for CHI, 0.9415 for IG and 0.8273 for MI, so CHI and IG are about 13% higher than MI. The comparison of the areas under the curves shows that the MI model is not a good choice for feature selection in the paper recommendation system, whereas the CHI and IG models are suitable.
TABLE 2 accuracy and F-measure of the three feature selection models under Top1 and Top3
Figure RE-GDA0001919461210000061
From table 2, it can be seen that:
(1) The results for Top3 are much higher than for Top1. For example, under Top3 the classification accuracy of the CHI model reached 61.37%, which is 75.2% higher than under Top1, because Top3 gives a larger recommendation space. In practice, different conferences and journals often publish papers in similar areas, and in a sense the recommendation given by Top1 is too strict.
(2) The CHI model achieves the highest accuracy and F-measure. For example, the accuracy of the CHI model under Top3 reached 61.37%, which is 49.6% higher than MI and 1.4% higher than IG. For the F-measure, the CHI model reached 0.23, which is 27.7% higher than MI and 9.5% higher than IG. This is because the CHI model increases the correlation with other classes when selecting features in each class. The accuracy of IG is 47.5% higher than that of MI, and its F-measure is 16.7% higher than that of MI. Meanwhile, for the relatively strict Top1, the accuracy and F-measure of the CHI model are 35.03% and 0.18, respectively; the accuracy is 70.5% and 2.8% higher than that of MI and IG, and the F-measure is 38.2% and 13.1% higher than that of MI and IG. The comparison of experimental results also shows that the CHI and IG models are more suitable for feature selection in the paper recommendation system.
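The "higher than" percentages in the list above appear to be relative differences rather than percentage-point differences. This reading is an inference from the two CHI accuracies quoted in the text (61.37% under Top3 and 35.03% under Top1), which the short check below reproduces; the helper name is hypothetical.

```python
def relative_gain_percent(value: float, baseline: float) -> float:
    """Relative difference of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


top3_chi = 61.37  # CHI accuracy under Top3, as quoted in the text
top1_chi = 35.03  # CHI accuracy under Top1, as quoted in the text
gain = relative_gain_percent(top3_chi, top1_chi)
print(f"{gain:.1f}%")  # prints 75.2%
```

The result matches the stated 75.2%, whereas the percentage-point difference between the two accuracies would be 26.34.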
In practical use of the recommendation system, considering the balance between accuracy and efficiency, the value of M is set to M = 200 according to the experimental results shown in Figs. 1, 2 and 3. Then the feature vectors FV_i of all categories are combined, and repeated words are deleted to form the final feature vector space. For the CHI feature selection method, the final feature vector space dimension is N_{FV-CHI} = 11,521. For MI and IG, the invention also selects the first 200 words for each category, and the feature vector dimensions obtained with the same merging and de-duplication method are N_{FV-MI} = 12,696 and N_{FV-IG} = 6,101, respectively.
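The merging and de-duplication step can be sketched in a few lines. The category names and words below are invented; in the text each category contributes its top 200 words, and the size of the merged list corresponds to the reported feature space dimension.

```python
def merge_feature_vectors(per_category_features: dict) -> list:
    """Concatenate per-category word lists and drop repeated words.

    The first occurrence of each word is kept, so the output order is stable.
    """
    merged = []
    seen = set()
    for category in sorted(per_category_features):
        for word in per_category_features[category]:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


# Invented per-category features (illustration only).
FEATURES = {
    "A": ["graph", "query"],
    "B": ["query", "image"],
}
feature_space = merge_feature_vectors(FEATURES)
dimension = len(feature_space)
```

Because words shared between categories are counted once, the final dimension is at most the number of categories times M, and it is smaller the more the categories overlap.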

Claims (1)

1. A document delivery object selection system is characterized by comprising an information extraction module, an information management module, an information analysis module and an information sorting feedback module;
the information extraction module comprises an information exchange path downloading unit, a query scheme unit, a page information extraction unit and an analysis storage unit, and each unit is independent and is sequentially executed;
the information exchange path downloading unit comprises an extracting device, a screening device, a standardizing device and a duplicate removal device;
the extraction device extracts the information exchange path of the code of the HTML page by using an ELFhash function;
the screening device directly deletes the unnecessary information exchange paths and stores the necessary information exchange paths;
the standardizing device converts all the information exchange paths into absolute addresses;
the duplicate removal device carries out website duplicate removal by establishing a Hash table and a Hash function;
the information exchange path downloading unit is responsible for extracting and sorting all the information exchange paths in the HTML page and taking the information exchange paths as the input of the query scheme unit;
the query scheme unit downloads the website extracted by the information exchange path extraction unit to obtain the information exchange path on the page of the website; when the website is downloaded, a breadth-first scheme is used;
the page information extraction unit is responsible for extracting key information on the page of the website, including the title, abstract and author of an article;
the information classification unit is responsible for classifying and storing the key information on the page of the website according to the category of the publication, and preprocessing the abstract to form an inverted index table;
the information extraction module stores the extracted information of the inverted index table into the information management module;
the information management module is responsible for defining an information management standard, selecting a proper information storage mode and defining an information access channel according to the information management standard and the information storage mode;
a user accesses the information management module through the information access channel, carries out processing flow definition through a graphical interface, and stores the generated processing flow definition in the information management module through the information access channel;
the information management module generates an information processing execution plan according to the processing flow definition;
the information analysis module is used for preprocessing the information in the information management module according to the information processing execution plan, wherein the preprocessing comprises four steps of case conversion, word segmentation, stop word filtering and stem extraction;
performing the case conversion on the information, converting all letters in the information into lower case letters;
performing word segmentation on the information, taking spaces, punctuation marks and paragraph breaks as delimiters to segment the information into independent words;
extracting word stems from the information, so that words which have the same root but different tenses, such as past-tense forms, are reduced to a common stem;
filtering the stop words of the information to filter out auxiliary verbs, prepositions, conjunctions and interjections in the information;
the information sorting module is used for vectorizing the information, and the information of the key information on the page of each website corresponds to a feature vector;
the information feedback module compares the information of the article abstract given by the user with the information of the abstract in the inverted index table, and calculates the similarity of the article abstract and the abstract through a formula I:
the formula I is as follows:
Figure FDA0002967635570000021
wherein β represents the similarity between the information of the article abstract given by the user and the information of the abstract in the inverted index table; W_{1j} and W_{2j} respectively represent the feature vector corresponding to the information of the article abstract given by the user and the feature vector corresponding to the information of the abstract in the inverted index table; j and n are positive integers, and j ≤ n;
and arranging the abstracts in the inverted index table from high to low according to the similarity, and acquiring key information on the page corresponding to the website where the abstracts are located.
CN201811415165.XA 2018-11-26 2018-11-26 Document delivery object selection system Expired - Fee Related CN109522414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811415165.XA CN109522414B (en) 2018-11-26 2018-11-26 Document delivery object selection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811415165.XA CN109522414B (en) 2018-11-26 2018-11-26 Document delivery object selection system

Publications (2)

Publication Number Publication Date
CN109522414A (en) 2019-03-26
CN109522414B (en) 2021-06-04

Family

ID=65779102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811415165.XA Expired - Fee Related CN109522414B (en) 2018-11-26 2018-11-26 Document delivery object selection system

Country Status (1)

Country Link
CN (1) CN109522414B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101169780A (en) * 2006-10-25 2008-04-30 华为技术有限公司 Semantic ontology retrieval system and method
CN102411583A (en) * 2010-09-20 2012-04-11 阿里巴巴集团控股有限公司 Method and device for matching texts
CN101984435A (en) * 2010-11-17 2011-03-09 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN102073726A (en) * 2011-01-11 2011-05-25 百度在线网络技术(北京)有限公司 Search engine system and structured data import method for search engine system
WO2014038306A1 (en) * 2012-09-06 2014-03-13 日本電気株式会社 Full-text search system using non-volatile content addressable memory, and text string comparison method employing same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
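Design and Implementation of a Scientific Research Literature Information Extraction and Retrieval Analysis Subsystem; 杨铭; cnki; 20170601; full text *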

Also Published As

Publication number Publication date
CN109522414A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN108073568A (en) keyword extracting method and device
CN108846097B (en) User interest tag representation method, article recommendation device and equipment
CN107895303B (en) Personalized recommendation method based on OCEAN model
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
Kaur Incorporating sentimental analysis into development of a hybrid classification model: A comprehensive study
CN111680225A (en) WeChat financial message analysis method and system based on machine learning
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
CN110990670B (en) Growth incentive book recommendation method and recommendation system
CN116010552A (en) Engineering cost data analysis system and method based on keyword word library
CN111831810A (en) Intelligent question and answer method, device, equipment and storage medium
CN111612519A (en) Method, device and storage medium for identifying potential customers of financial product
CN116882414B (en) Automatic comment generation method and related device based on large-scale language model
CN109508557A (en) A kind of file path keyword recognition method of association user privacy
TW201243627A (en) Multi-label text categorization based on fuzzy similarity and k nearest neighbors
CN112163415A (en) User intention identification method and device for feedback content and electronic equipment
CN109522414B (en) Document delivery object selection system
Guadie et al. Amharic text summarization for news items posted on social media
CN111339303B (en) Text intention induction method and device based on clustering and automatic abstracting
CN114443961A (en) Content filtering scientific and technological achievement recommendation method, model and storage medium
Wang et al. Content-based weibo user interest recognition
CN117648635B (en) Sensitive information classification and classification method and system and electronic equipment
Anand et al. Integrating and querying similar tables from PDF documents using deep learning
CN112487302B (en) File resource accurate pushing method based on user behaviors
Guo et al. CNN-Based Model for Chinese Information Processing and Its Application in Large-Scale Book Purchasing
ElGhazaly Automatic text classification using neural network and statistical approaches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210604

Termination date: 20211126