CN109522414B

CN109522414B - Document delivery object selection system

Info

Publication number: CN109522414B
Application number: CN201811415165.XA
Authority: CN
Inventors: 丰小月; 梁艳春; 王冬晖; 许东
Original assignee: Jilin University
Current assignee: Jilin University
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2021-06-04
Anticipated expiration: 2038-11-26
Also published as: CN109522414A

Abstract

The invention relates to a document delivery object selection system, which comprises an information extraction module, an information management module, an information analysis module and an information sorting feedback module. The system helps a user select a suitable journal for submission and avoids manuscript rejection, delay, or low readership after publication caused by submitting to an unsuitable journal.

Description

Document delivery object selection system

Technical Field

The invention relates to the field of information recommendation, in particular to a document delivery object selection system.

Background

With the rapid growth and increasing complexity of online information, it is difficult for users to find exactly the information they want in massive amounts of data; for researchers in particular, keeping up with developments in their field in a timely manner is not always easy. Today, much of the world's new knowledge is represented in digital form and stored in digital library systems, so digital libraries are entering a golden age. Such libraries are available in scientific and technical fields, such as the ACM Library and the IEEE Library. However, as technology develops and the amount of information keeps increasing, these trends cause an unavoidable problem of information overload. For example, when researchers want to select an appropriate publication in which to publish a paper, they find that a large number of publications match their query but are largely unrelated to their actual needs, which leads to inappropriate choices. Researchers urgently need a paper recommendation system to help them select appropriate publications.

Disclosure of Invention

In view of the above, the invention provides a document delivery object selection system, which is characterized by comprising an information extraction module, an information management module, an information analysis module and an information sorting feedback module;

the information extraction module comprises an information exchange path downloading unit, a query scheme unit, a page information extraction unit and an analysis storage unit, and the units are independent of one another and are executed sequentially;

the information exchange path downloading unit comprises an extracting device, a screening device, a standardizing device and a duplicate removal device;

the extracting device extracts the information exchange paths from the source code of the HTML page by using an ELFhash function;

the screening device discards the unneeded information exchange paths and stores the needed ones;

the standardizing device converts all information exchange paths into absolute addresses;

the duplicate removal device removes duplicate website addresses by means of a hash table and a hash function;

the information exchange path downloading unit is responsible for extracting and sorting all information exchange paths in the HTML page, which serve as the input of the query scheme unit;

the query scheme unit downloads the website extracted by the information exchange path extraction unit to obtain an information exchange path on a page of the website; when downloading the website, using a breadth-first scheme;

the page information extraction unit is responsible for extracting key information on a page of the website, including the title, abstract and author of the article;

the information classification unit is responsible for classifying and storing key information on the page of the website according to the category of the publication, and preprocessing the abstract to form an inverted index table;

the information extraction module stores the extracted information into the information management module;
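The link-handling steps described above (extraction, screening, conversion to absolute addresses, duplicate removal and breadth-first downloading) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes that an "information exchange path" is a hyperlink (URL), it replaces real downloading with a hypothetical in-memory `SITE` dictionary, and it extracts `href` values with a simple regular expression. `elf_hash` follows the classic ELF hash; the text does not specify how the ELFhash function is applied, so here it is only used to bucket URLs in the duplicate-removal table.

```python
import re
from collections import deque
from urllib.parse import urljoin, urldefrag

# Hypothetical in-memory "web" (URL -> HTML); stands in for real downloads.
SITE = {
    "http://example.org/": '<a href="a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="b.html#top">B</a> <a href="/">home</a>',
    "http://example.org/b.html": '<a href="c.html">C</a> <a href="mailto:x@example.org">m</a>',
    "http://example.org/c.html": "",
}


def elf_hash(text: str) -> int:
    """Classic 32-bit ELF hash of a string."""
    h = 0
    for byte in text.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        high = h & 0xF0000000
        if high:
            h ^= high >> 24
        h &= ~high & 0xFFFFFFFF
    return h


class SeenTable:
    """Hash table of known URLs, bucketed by elf_hash to tolerate collisions."""

    def __init__(self, size: int = 1024):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def add(self, url: str) -> bool:
        """Insert url; return False if it was already present."""
        bucket = self.buckets[elf_hash(url) % self.size]
        if url in bucket:
            return False
        bucket.append(url)
        return True


def extract_links(page_url: str, html: str) -> list[str]:
    """Extract hrefs, convert them to absolute URLs, and screen out non-HTTP links."""
    links = []
    for href in re.findall(r'href="([^"]+)"', html):
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(("http://", "https://")):  # screening step
            links.append(absolute)
    return links


def crawl_breadth_first(start_url: str) -> list[str]:
    """Visit pages level by level and return the URLs in visiting order."""
    seen = SeenTable()
    seen.add(start_url)
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in extract_links(url, SITE.get(url, "")):
            if seen.add(link):  # duplicate removal
                queue.append(link)
    return order
```

Calling `crawl_breadth_first("http://example.org/")` on the toy site visits the start page first, then the two pages it links to, and finally the page reachable only from the second level; the `mailto:` link is screened out and the `#top` fragment does not create a duplicate entry.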

the information management module is responsible for defining an information management standard, selecting a proper information storage mode and defining an information access channel according to the information management standard and the information storage mode;

a user accesses the information management module through the information access channel, performs processing flow definition through the graphical interface, and stores the generated processing flow definition in the information management module through the information access channel;

the information management module generates an information processing execution plan according to the processing flow definition;

the information analysis module is used for preprocessing the information in the information management module according to the information processing execution plan, wherein the preprocessing comprises four steps: case conversion, word segmentation, stop word filtering and stem extraction;

carrying out case conversion on the information, converting all letters in the information into lower case letters;

carrying out word segmentation on the information, taking spaces, punctuation marks and paragraph breaks as delimiters to segment the information into independent words;

extracting word stems from the information, reducing words that share a root but differ in tense (such as past-tense forms) to a common stem;

filtering stop words from the information, filtering out auxiliary verbs, prepositions, conjunctions and interjections in the information;
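The four preprocessing steps, together with the inverted index table built from the preprocessed abstracts, can be sketched as follows. The sketch rests on assumptions that are not in the text: `STOP_WORDS` is a deliberately tiny hypothetical list, and `crude_stem` is a rough suffix stripper standing in for a real stemming algorithm, since neither a stop-word list nor a stemmer is named.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list (auxiliary verbs, prepositions,
# conjunctions, interjections, and a few function words).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "be", "of", "in",
              "on", "for", "to", "and", "or", "but", "oh", "we"}


def crude_stem(word: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Case conversion, word segmentation, stop-word filtering, stemming."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z0-9]+", lowered)  # split on spaces and punctuation
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in kept]


def build_inverted_index(abstracts: dict[str, str]) -> dict[str, set[str]]:
    """Map each preprocessed term to the set of abstract ids containing it."""
    index = defaultdict(set)
    for doc_id, text in abstracts.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return dict(index)


# Toy usage with two made-up abstracts.
toy_index = build_inverted_index({
    "d1": "Training deep models",
    "d2": "The model converged",
})
```

In this toy run, "models" and "model" collapse to the same term, so the index entry for `model` lists both abstracts.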

the information sorting module is used for vectorizing the information, so that the key information on the page of each website corresponds to a feature vector;

the information feedback module compares the information of the article abstract given by the user with the information of the abstract in the inverted index table, and calculates the similarity of the article abstract and the abstract through a formula I:

the formula I is as follows:

wherein β represents the similarity between the information of the article abstract given by the user and the information of the abstract in the inverted index table; W_1j and W_2j respectively represent the feature vector corresponding to the information of the article abstract given by the user and the feature vector corresponding to the information of the abstract in the inverted index table; j and n are positive integers, and j is less than or equal to n;

and arranging the abstracts in the inverted index table from high to low according to the similarity, and acquiring key information on the page of the website where the corresponding abstracts are located.
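Formula I itself is not reproduced in this text. The sketch below therefore assumes cosine similarity, a common choice that is consistent with the symbols β, W_1j, W_2j and n, but this is an assumption and not a statement of the patented formula. The term-frequency vectorization over a shared vocabulary and all names in the code are likewise hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(w1: list[float], w2: list[float]) -> float:
    """Assumed form: sum_j(w1[j] * w2[j]) / (||w1|| * ||w2||), for j = 1..n."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def vectorize(tokens: list[str], vocabulary: list[str]) -> list[float]:
    """Term-frequency vector of the tokens over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


def rank_abstracts(query_tokens, indexed_abstracts, vocabulary):
    """Return (abstract id, similarity) pairs sorted from high to low similarity."""
    query_vec = vectorize(query_tokens, vocabulary)
    scored = [
        (doc_id, cosine_similarity(query_vec, vectorize(tokens, vocabulary)))
        for doc_id, tokens in indexed_abstracts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy usage with a made-up vocabulary and two made-up indexed abstracts.
vocabulary = ["graph", "neural", "image"]
indexed = {"j1": ["graph", "neural"], "j2": ["image"]}
ranking = rank_abstracts(["graph", "graph", "neural"], indexed, vocabulary)
```

The highest-ranked entries identify the pages, and hence the journals or conferences, whose abstracts are closest to the user's abstract.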

The beneficial effects of the invention are as follows: the invention provides a document delivery object selection system that helps a user select a suitable journal for submission and avoids manuscript rejection, delay, or low readership after publication caused by submitting to an unsuitable journal; it has broad market prospects and application value.

Drawings

FIG. 1 shows how the accuracy on the training set and the test set varies with the number of features selected under the MI feature selection model, for the Top3 recommendation;

FIG. 2 shows how the accuracy on the training set and the test set varies with the number of features selected under the IG feature selection model, for the Top3 recommendation;

FIG. 3 shows how the accuracy on the training set and the test set varies with the number of features selected under the CHI feature selection model, for the Top3 recommendation;

FIG. 4 shows the macro-averaged ROC curves of the Top1 recommendation under the MI, IG and CHI feature selection models, respectively.

Detailed Description

In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more apparent, the present invention is described in detail below with reference to the embodiments. It should be noted that the specific embodiments described herein are only for illustrating the present invention and are not to be construed as limiting the present invention, and products that can achieve the same functions are included in the scope of the present invention. The specific method comprises the following steps:

Example 1: The distribution of the number of abstracts in the training set and the test set is shown in Table 1. The data crawler of the invention collected 14,012 article records, each containing the title, the abstract and the authors. Two-thirds of the abstracts are used as the training set and one-third as the test set. In the experiments, the data were selected from articles published in 2013 and 2014 in class A journals and conferences listed by the CCF. For journals and conferences that published too few articles in 2013 and 2014, the invention also collected their articles from other years. To verify the correctness of the data set, twenty percent of the abstracts in each journal and conference were checked manually.

TABLE 1 Distribution of the number of abstracts in the training set and the test set

The recommendation system of the invention provides two kinds of recommendation results: Top1 and Top3. The Top1 version recommends only one journal or conference, and its results are therefore evaluated very strictly. The Top3 version gives three candidate journals or conferences, namely the three categories with the highest classification scores. In other words, if any one of the three recommendations is the correct journal or conference, the recommendation is considered successful. The Top3 version also gives users more choices, because an article published in one field sometimes influences other related fields as well; an article may be valuable not only to one field but to several. In addition, different conferences and journals often publish papers in similar areas (for example ICCV, CVPR and TIP), and many journal articles are extensions of conference articles.
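The Top1 and Top3 success criteria described above can be sketched as follows. The classification scores are made-up illustrative values and the function names are hypothetical; only the rule itself (a recommendation counts as a hit when the true venue is among the k highest-scoring categories) comes from the text.

```python
def top_k_labels(scores: dict[str, float], k: int) -> list[str]:
    """Return the k categories with the highest classification scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def top_k_accuracy(all_scores: list[dict[str, float]], true_labels: list[str], k: int) -> float:
    """Fraction of samples whose true label is among the top-k recommendations."""
    hits = sum(
        1
        for scores, truth in zip(all_scores, true_labels)
        if truth in top_k_labels(scores, k)
    )
    return hits / len(true_labels)


# Made-up scores for two test abstracts over four venues ("OTHER" is a placeholder).
all_scores = [
    {"ICCV": 0.50, "CVPR": 0.30, "TIP": 0.15, "OTHER": 0.05},
    {"ICCV": 0.60, "CVPR": 0.25, "TIP": 0.10, "OTHER": 0.05},
]
true_labels = ["CVPR", "ICCV"]

top1 = top_k_accuracy(all_scores, true_labels, k=1)
top3 = top_k_accuracy(all_scores, true_labels, k=3)
```

In this toy case the first abstract is a miss under Top1 but a hit under Top3, which is why Top3 accuracy can never be lower than Top1 accuracy on the same predictions.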

To generate a better feature space, the invention carried out many experiments on the number of features selected for each category. Three feature selection methods, MI (mutual information), IG (information gain) and CHI (chi-square), are compared, and the top M words with the highest scores are used as the feature vector FV^i of the i-th category.
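Selecting the top M words per category can be sketched as follows for the CHI case. The text does not give the scoring formulas, so the code assumes the textbook chi-square statistic for a term and a category computed from a 2x2 document-count table; the function names and the toy documents are hypothetical.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Textbook chi-square for a term/category pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def select_features(docs: list[tuple[str, set[str]]], category: str, m: int) -> list[str]:
    """Return the m highest-scoring terms for one category."""
    in_cat = [terms for label, terms in docs if label == category]
    out_cat = [terms for label, terms in docs if label != category]
    vocabulary = set().union(*(terms for _, terms in docs))
    scores = {}
    for term in vocabulary:
        a = sum(term in terms for terms in in_cat)
        b = sum(term in terms for terms in out_cat)
        scores[term] = chi_square(a, b, len(in_cat) - a, len(out_cat) - b)
    # Highest score first; ties broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:m]


# Made-up labelled documents: (category, set of preprocessed terms).
docs = [
    ("CV", {"image", "pixel"}),
    ("CV", {"image", "network"}),
    ("DB", {"query", "index"}),
    ("DB", {"query", "network"}),
]
top_terms_cv = select_features(docs, "CV", 1)
```

Note that the plain statistic also scores highly those terms that are characteristically absent from a category (here "query" for "CV"); whether and how the patented system handles such terms is not stated.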

The present invention uses accuracy (equation 1), F-measure (equation 2), and ROC curves to evaluate the effectiveness of the system. Because the system of the present invention uses a multi-classification model, the present invention uses the macro-averaged ROC (equations 3 and 4) curve.

where P_i refers to the set of test samples predicted to be of the i-th class, and G_i refers to the set of test samples whose true category is i. TP_i, FN_i, FP_i and TN_i are the numbers of true positives, false negatives, false positives and true negatives of the i-th category, respectively.
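Equations 1 to 4 are not reproduced in this text. The sketch below therefore uses the standard definitions built from TP_i, FN_i, FP_i and TN_i (overall accuracy, per-class F-measure, and macro-averaged true positive and false positive rates), which may differ in detail from the patent's equations. It computes a single operating point; a macro-averaged ROC curve would repeat the rate computation over a range of score thresholds. All function names are hypothetical.

```python
def per_class_counts(y_true, y_pred, label):
    """Return (TP, FN, FP, TN) for one class, treating it as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    tn = len(pairs) - tp - fn - fp
    return tp, fn, fp, tn


def safe_div(numerator: float, denominator: float) -> float:
    return numerator / denominator if denominator else 0.0


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged TPR, FPR and F-measure over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    tprs, fprs, f_measures = [], [], []
    for label in labels:
        tp, fn, fp, tn = per_class_counts(y_true, y_pred, label)
        recall = safe_div(tp, tp + fn)      # true positive rate
        precision = safe_div(tp, tp + fp)
        tprs.append(recall)
        fprs.append(safe_div(fp, fp + tn))  # false positive rate
        f_measures.append(safe_div(2 * precision * recall, precision + recall))
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "macro_tpr": sum(tprs) / len(labels),
        "macro_fpr": sum(fprs) / len(labels),
        "macro_f": sum(f_measures) / len(labels),
    }


# Made-up labels for four test samples.
metrics = macro_metrics(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

Macro-averaging gives every category the same weight regardless of how many test abstracts it has, which matters when venues differ widely in size.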

Figs. 1, 2 and 3 show how the accuracy on the training set and the test set varies with the number of features (i.e., M) selected for each category under the MI, IG and CHI feature selection models, respectively. The results in Figs. 1, 2 and 3 are obtained with the Top3 version; in other words, a recommendation is considered successful if one of the three recommendations given by the system is correct. Fig. 4 and Table 2 compare the three feature selection models.

As can be seen in fig. 1, 2, 3:

(1) With the IG and CHI feature selection models, once the number of features per class reaches 30, the accuracy increases as the number of features increases. For the MI model, however, the accuracy drops by 16.60% as the number of features per class goes from 30 to 70. The accuracy is lowest at 70 features per class and then begins to increase as the number of features per class increases. Even at 400 features per class, the accuracy of the MI-based model is only 55.7%, which is 7.9% and 9.3% lower than that of IG and CHI, respectively.

(2) The accuracy of the CHI-based model is generally better than that of both MI and IG. For example, at the beginning of the test accuracy curve, the accuracy of the CHI model is 78.7% and 1.2% higher than that of MI and IG, respectively; at the end of the curve, where the number of features per class is 400, the accuracy of the CHI model is 12.2% and 2.8% higher than that of MI and IG, respectively.

From Fig. 4 it can be seen that the areas under the curves of the CHI-based and IG-based models are much higher than that of the MI-based model. The area under the curve is 0.9404 for CHI, 0.9415 for IG and 0.8273 for MI, so CHI and IG are about 13% higher than MI. This comparison shows that the MI model is not a good choice for feature selection in the paper recommendation system, whereas the CHI and IG models are suitable.

TABLE 2 accuracy and F-measure of the three feature selection models under Top1 and Top3

From table 2, it can be seen that:

(1) The results for Top3 are much higher than for Top1. For example, under Top3 the classification accuracy of the CHI-based model reaches 61.37%, which is 75.2% higher in relative terms than under Top1, because Top3 gives a larger recommendation space. In fact, different conferences and journals often cover similar publishing categories, so in a sense the recommendation given by Top1 is too strict.

(2) The CHI model achieves the highest accuracy and F-measure. For example, the accuracy of the CHI model under Top3 reaches 61.37%, which is 49.6% higher than MI and 1.4% higher than IG. For the F-measure, the CHI model reaches 0.23, which is 27.7% higher than MI and 9.5% higher than IG. This is because the CHI model increases the correlation with other classes when selecting features in each class. The accuracy of IG is 47.5% higher than that of MI, and its F-measure is 16.7% higher. Meanwhile, for the stricter Top1, the accuracy and F-measure of the CHI model are 35.03% and 0.18, respectively; the accuracy is 70.5% and 2.8% higher than MI and IG, and the F-measure is 38.2% and 13.1% higher than MI and IG, respectively. The experimental results again show that the CHI model and the IG model are more suitable for feature selection in the paper recommendation system.
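The percentages in this comparison appear to be relative gains rather than percentage-point differences. The one pair for which both values are stated, the CHI accuracy of 61.37% under Top3 and 35.03% under Top1, is consistent with that reading, as the short check below shows. The function name is hypothetical; the two accuracy values are taken from the text.

```python
def relative_gain_percent(value: float, baseline: float) -> float:
    """Relative improvement of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


chi_top3_accuracy = 61.37  # percent, as stated in the text
chi_top1_accuracy = 35.03  # percent, as stated in the text

gain = relative_gain_percent(chi_top3_accuracy, chi_top1_accuracy)
print(f"{gain:.1f}% relative gain")  # prints "75.2% relative gain"
```

The absolute difference between the two accuracies is 26.34 percentage points, so the stated 75.2% can only be a relative figure.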

In practical use of the recommendation system, considering the balance between accuracy and efficiency, the value of M is set to M = 200 according to the experimental results shown in Figs. 1, 2 and 3. Then all the per-category feature vectors are merged, and the repeated words are deleted to form the final feature vector space. For the CHI feature selection method, the dimension of the final feature vector space is N_FV-CHI = 11,521. For MI and IG, the invention also selects the top 200 words for each category, and the feature vector dimensions obtained with the same merging and de-duplication method are N_FV-MI = 12,696 and N_FV-IG = 6,101, respectively.
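The merging and de-duplication step can be sketched as follows. The per-category word lists are made-up examples with three words each instead of 200, and the function name is hypothetical.

```python
def merge_feature_vectors(per_category: dict[str, list[str]]) -> list[str]:
    """Concatenate the per-category top-M word lists and drop repeated words."""
    merged = []
    seen = set()
    for category in sorted(per_category):  # fixed order for reproducibility
        for word in per_category[category]:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


# Made-up per-category features (M = 3 here for brevity).
per_category = {
    "CVPR": ["image", "network", "pixel"],
    "ICCV": ["image", "camera", "network"],
}
feature_space = merge_feature_vectors(per_category)
dimension = len(feature_space)
```

The merged dimension is at most M times the number of categories, and every word shared between categories lowers it by one. With the same M and the same categories, a smaller final dimension (as reported for IG compared with CHI and MI) therefore means that more of the selected words are shared across categories.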

Claims

1. A document delivery object selection system is characterized by comprising an information extraction module, an information management module, an information analysis module and an information sorting feedback module;

the information extraction module comprises an information exchange path downloading unit, a query scheme unit, a page information extraction unit and an analysis storage unit, and each unit is independent and is sequentially executed;

the standardized device converts all the information exchange paths into absolute addresses;

the duplicate removal device carries out website duplicate removal by establishing a Hash table and a Hash function;

the information exchange path downloading unit is responsible for extracting and sorting all the information exchange paths in the HTML page and taking the information exchange paths as the input of the query scheme unit;

the query scheme unit downloads the website extracted by the information exchange path extraction unit to obtain the information exchange path on the page of the website; when the website is downloaded, a breadth-first scheme is used;

the page information extraction unit is responsible for extracting key information on the page of the website, including the title, abstract and author of an article;

the information classification unit is responsible for classifying and storing the key information on the page of the website according to the category of the publication, and preprocessing the abstract to form an inverted index table;

the information extraction module stores the extracted information of the inverted index table into the information management module;

a user accesses the information management module through the information access channel, carries out processing flow definition through a graphical interface, and stores the generated processing flow definition in the information management module through the information access channel;

the information analysis module is used for preprocessing the information in the information management module according to the information processing execution plan, wherein the preprocessing comprises four steps of case conversion, word segmentation, stop word filtering and stem extraction;

performing the case conversion processing on the information, and converting all letters in the information into lower case letters;

performing word segmentation on the information, and taking spaces, punctuation marks and paragraph breaks as delimiters to segment the information into independent words;

extracting word stems from the information, reducing words that share a root but differ in tense (such as past-tense forms) to a common stem;

filtering the stop words of the information to filter out auxiliary verbs, prepositions, conjunctions and interjections in the information;

the information sorting module is used for vectorizing the information, and the information of the key information on the page of each website corresponds to a feature vector;

the formula I is as follows:

wherein β represents the similarity between the information of the article abstract given by the user and the information of the abstract in the inverted index table; W_1j and W_2j respectively represent the feature vector corresponding to the information of the article abstract given by the user and the feature vector corresponding to the information of the abstract in the inverted index table; j and n are positive integers, and j is less than or equal to n;

and arranging the abstracts in the inverted index table from high to low according to the similarity, and acquiring key information on the page corresponding to the website where the abstracts are located.