CN113239181A - Scientific and technological literature citation recommendation method based on deep learning - Google Patents


Info

Publication number: CN113239181A (granted publication: CN113239181B)
Application number: CN202110525982.6A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: text, word, matching, vector, model
Inventors: 廖伟智 (Liao Weizhi), 左东舟 (Zuo Dongzhou)
Original assignee: Individual; current assignee: University of Electronic Science and Technology of China
Legal status: Active (granted)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/18Legal services
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a deep-learning-based citation recommendation method for scientific and technological literature. Text data is sequentially subjected to information extraction, noise removal, vocabulary index construction and vectorization. The texts to be matched are first given a semantic vector representation by Bi-LSTM; an Attention mechanism then interactively encodes the texts, and a multilayer CNN network extracts features from the interaction information to obtain the final matching-degree information of the texts. Citation recommendation proceeds in two stages: the first stage generates a relevant citation recommendation set from the vector-space similarity of texts, and the second stage applies a text inference matching method to perform language understanding on the candidate set, yielding an accurate relevance ranking list. The method addresses the problems of semantic understanding, content completeness and accuracy in existing text inference matching, and provides high-quality semantic features and a digitized text input form.

Description

Scientific and technological literature citation recommendation method based on deep learning
Technical Field
The invention belongs to the technical field of information retrieval and analysis, and particularly relates to a scientific and technological literature citation recommendation method based on deep learning.
Background
Technological innovation capability is a decisive factor in the development of a nation's science and technology, the core of national competitiveness, and an important foundation of national strength. Scientific and technological literature, as the condensed output of innovation activity, is an important carrier for spreading scientific and technological knowledge, a basic source and important support for further improving innovation capability, and one of a country's precious strategic resources.
In the process of technological innovation, knowledge of the development history and trends of a discipline, together with effective academic communication with peers, is essential to the innovating party, and reading the relevant literature of the field is the best way to achieve both. The premise, however, is that the collected domain literature is comprehensive and its content closely relevant; these are also the key factors influencing the innovation and its effect. Research on methods for acquiring scientific and technological literature in support of innovation therefore has practical significance and great demand.
The existing way for innovators to acquire scientific and technological literature is mainly active search with keyword-based search tools, but this approach is affected by human factors and by the tools themselves, so the required coverage and accuracy of the literature are often hard to meet. Scientific and technological literature comprises scientific journals, theses, patents, standards and networked scientific information resources existing in text form. It is large in quantity, varied, dispersed and isolated, dynamic and heterogeneous, diverse and complex, and strongly professional, academic and unstructured. The existing "active search" mode requires the searcher both to master a search tool skilfully and to set accurate "keywords". Keyword choice, however, is strongly influenced by the searcher's literacy, personal experience, professional background and accidental subjective factors, so the search process is time-consuming and labour-intensive, and the results are often filled with a large number of redundant and useless documents. Therefore, to resolve the excessive dependence of existing acquisition methods on human factors and the difficulty of guaranteeing efficiency and accuracy, citation recommendation methods based on the concept of on-demand active service have been proposed and have become a research hotspot.
Citation recommendation is needed broadly across industry and scientific research, because it helps release solidified scientific and technological literature resources and enhances their flow. For example, personalized citation recommendation for researchers can promote the transformation of scientific-document service from the past "passive retrieval" mode to an "active recommendation" mode, which both reduces researchers' workload when writing papers, technical reports and the like, and improves the efficiency and accuracy of document services; patent citation recommendation for enterprise product innovators can promote their innovation activity and support acquisition of information for the processes of confirming, managing and maintaining rights. Moreover, keyword-based search tools, lacking semantic recognition and matching-inference capability, cannot support semantic understanding and recognition matching of document content, and therefore cannot acquire comprehensive and accurate literature. To resolve the excessive dependence of the traditional "active search" mode on human factors and the lack of matching-inference capability of retrieval tools, text inference matching and citation recommendation methods supporting scientific and technological literature acquisition have become an urgent need and a hot research direction.
Disclosure of Invention
In view of the deficiencies of the prior art, the technical problem solved by the invention is to provide a deep-learning-based citation recommendation method for scientific and technological literature. The method uses a text-based vector space model to quickly screen the document set matching a target document, narrowing the recommendation range and reducing the computational load of the model, and uses a text-inference-matching algorithm to rank the candidate document set more accurately for a more personalized recommendation effect, thereby solving the problems of poor semantic understanding, incomplete content and inaccuracy in existing text inference matching.
Existing text-matching recommendation methods mainly match texts with keyword- and rule-based approaches and then classify and infer over the text content. Platform recommendation algorithms mostly push by keywords based on collaborative filtering. Such methods struggle to understand the deep semantic information of the text and cannot account for the professionalism of the scientific and technological field, so matching and recommendation deviations often arise. Rules must also be established manually during matching, so matching precision is poor and recommendation quality hard to guarantee. The main problems are:
(1) Lack of scientific-and-technological-domain speciality, making matching and recommendation effects hard to guarantee: most existing matching algorithms build machine learning models on word vectors trained over large general corpora, but general word vectors usually lack the professional vocabulary of the field, so the model can hardly obtain input information that fits scientific and technological resources, and matching and recommendation quality rarely meets expectations.
(2) Long-text modelling causes low processing efficiency and information loss: existing methods mainly segment the full text and then extract text features by probability-statistical methods; statistics-based semantic representation easily makes the semantic expression of the text inaccurate, which harms the matching and recommendation effect.
(3) Methods based on collaborative filtering and the citation-relationship network suffer cold start and capture semantics poorly: existing collaborative filtering algorithms resemble recommendation algorithms in the commercial domain, recommending to a user items similar to those the user, or similar users, interacted with. However, scientific and technological text contains a large amount of semantic information, and collaborative filtering alone can hardly capture new documents, or small bodies of documents, strongly correlated in textual content. Existing methods also commonly rely on the source metadata of a text while weakening its content information, inevitably affecting the recommendation effect. In summary, existing keyword- and vector-space-based text inference and matching algorithms lack semantic understanding capability; the collaborative-filtering and metadata-based recommendation used in literature recommendation is often incomplete and lacks semantic expression for professional scientific and technological content; long-text modelling easily loses information; and the recommendation process suffers cold start. A text inference matching method based on scientific and technological text content, and a citation recommendation algorithm for scientific and technological literature, are therefore an urgent need of great significance.
The technical scheme adopted by the invention is as follows: the scientific and technological literature citation recommendation method based on deep learning comprises the following steps:
(1) a text processing module: sequentially carrying out information extraction, noise removal, word list index construction and vectorization processing on the text data;
(2) the text reasoning matching module: firstly, performing semantic vectorization representation on a text to be matched through Bi-LSTM, then performing interactive coding on the text by adopting an Attention mechanism, and finally performing feature extraction on interactive information of the text through a multilayer CNN network so as to obtain final matching degree information of the text;
(3) citation recommendation module: and adopting two-stage citation recommendation, wherein a related citation recommendation set is generated in the first stage through vector space similarity of texts, and a text reasoning matching method is used in the second stage to carry out language understanding on the candidate set so as to obtain an accurate relevance ranking list.
The scheme mainly employs a text processing module and a citation recommendation module. The text processing module preprocesses the text, performing short-text conversion, denoising, word segmentation, stop-word removal and the like on the scientific and technological text; the text is then vectorized with a distributed word-vector technique. The citation recommendation module comprises a recommendation candidate module and a candidate ranking module. The candidate module preliminarily filters a relevant candidate set with a vector space model; in the ranking module, a text inference-and-matching algorithm serves as the basis of content recommendation, adopting an LSTM + CNN deep matching algorithm. Its advantage is that the deep-learning encoder captures the semantic information of professional scientific and technological text to obtain a better semantic representation, while the CNN extracts finer similarity information and so improves matching accuracy. Fusing the two yields more comprehensive matching information and thus a better citation recommendation effect.
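The two-stage pipeline above can be sketched as follows (a minimal illustration with made-up toy vectors; the trained second-stage inference-matching model is stubbed by passing any scoring function as `rerank_score` — here plain cosine similarity stands in for it):

```python
import math

def cosine(u, v):
    # Vector-space cosine similarity used for the stage-1 coarse filter.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def recommend(query_vec, docs, rerank_score, k=3):
    # Stage 1: keep the k documents most similar to the query in vector space.
    candidates = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]),
                        reverse=True)[:k]
    # Stage 2: re-rank only the candidate set with the finer matching score.
    return sorted(candidates,
                  key=lambda d: rerank_score(query_vec, d["vec"]),
                  reverse=True)
```

The split keeps the expensive matcher off the full corpus: it only ever scores the k survivors of the cheap vector-space filter.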
Further, in step (1), information extraction judges the type of the scientific and technological document from the source metadata of the input text, extracts the keyword description of the title and abstract, and then extracts the corresponding text content; the denoising of step (1) uses the re regularization (regular expression) module built into the Python language.
Further, the vectorization processing in the step (1) adopts Word2vec as a distributed Word vector technology.
Further, the model structure of Word2vec takes the predicted word w_t as the centre and selects K consecutive words before and after it as predicted words, so that when w_t is the input word, the surrounding context words are the prediction targets. Assuming K = 2, the context words are w_{t-2}, w_{t-1} and w_{t+1}, w_{t+2}. A word vector matrix E ∈ R^{V×m} is initialized, where V is the number of words in the vocabulary and m is the dimensionality of a word vector. The vector of the input word is looked up by its index into the matrix, each row of which is the word vector e of the word with that row number. The calculation process is shown in formulas (3-4) to (3-6):
h = E(w_t)   (3-4)
e = W h   (3-5)
p = softmax(e)   (3-6)
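The lookup-and-score computation of formulas (3-4) to (3-6) can be sketched in a few lines (toy vocabulary; the matrices E and W are arbitrary illustrative values, and formula (3-6), an image in the original, is assumed to be the usual softmax over the output scores):

```python
import math

V, m = 5, 3  # toy vocabulary size V and word-vector dimension m
E = [[0.1 * (i + j + 1) for j in range(m)] for i in range(V)]  # V x m input embeddings
W = [[0.05 * (i - j) for j in range(m)] for i in range(V)]     # V x m output weights

def skipgram_forward(wt):
    h = E[wt]                                                   # (3-4): embedding lookup
    e = [sum(W[v][j] * h[j] for j in range(m)) for v in range(V)]  # (3-5): score per word
    mx = max(e)
    z = [math.exp(x - mx) for x in e]
    s = sum(z)
    return [x / s for x in z]                                   # (3-6): softmax probabilities

probs = skipgram_forward(2)
```

In training, the probabilities of the K context words around w_t would be maximized; only the forward pass is shown here.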
Furthermore, the text inference matching module adopts an interactive inference-matching model architecture consisting of an input layer, an interactive coding layer and a matching output layer. The input layer performs word segmentation on the two texts and represents the segmented texts as vectors, using pre-trained word vectors so the words carry semantic expression ability. In the interactive coding layer, the two texts are each encoded with a bidirectional LSTM (Bi-LSTM) to obtain an overall semantic representation of the text; at the same time, an Attention operation over the LSTM hidden-vector sequences lets the local texts perform interactive similarity matching, and the local importance information of the texts can be read from the Attention scores. After the interacted text vector sequences are obtained, the two sequences are matched by dot product to obtain a matching tensor containing all local matching-information features of the two texts. Finally, exploiting CNN's strength in layer-by-layer feature extraction, features are extracted from the matching tensor in a feature-extraction layer whose CNN feature maps contain the high-order matching information of the two texts. The output layer emits the matching result of the two short texts through a fully connected feed-forward neural network.
Further, in the input layer, p = (p_1, p_2, ..., p_{l_p}) and q = (q_1, q_2, ..., q_{l_q}) represent two natural-language sentences, p the premise and q the hypothesis, where p_i, q_i ∈ R^d are pre-trained word vectors, d is the dimension of a word vector, l_p the length of sentence p and l_q the length of sentence q;
in the interactive coding layer, the hidden (output) state generated by the Bi-LSTM at time i for the input sequence p is written p̄_i, and that for the input sequence q is written q̄_i, computed by:
p̄_i = BiLSTM(p, i)
q̄_i = BiLSTM(q, i)
For inference and matching between the two sentences, two attention mechanisms enrich the sentence representations. One is a self-attention mechanism computing the attention weight of each word with the other words of its own sentence. The other is an interactive attention mechanism: first the attention weights of each word in sentence p with all words in sentence q are computed, and likewise the attention weights of each word in q with all words in p; the similarity between words is computed by dot product, yielding a two-dimensional similarity matrix. The calculation formulas are defined as follows:
e_ij = p̄_i · q̄_j
M_pq = p̄ q̄ᵀ   (4-15)
M_qp = q̄ p̄ᵀ   (4-16)
where M_pq is the attention matrix of p over q and M_qp the attention matrix of q over p. The attention weights are normalized and each word vector is expressed by weighting as follows:
p̃_i = Σ_j softmax(M_pq)_ij · q̄_j   (4-17)
q̃_j = Σ_i softmax(M_qp)_ji · p̄_i   (4-18)
where p̃_i is the weighted sum of the q̄_j, and q̃_j is the weighted sum of the p̄_i. Element-wise differences between the attention vectors and the hidden vectors are then computed, and finally the vectors are stitched together as the final expression of the word vector, as follows:
m_p = [p̄; p̃; p̄ − p̃; p̄ ⊙ p̃]   (4-19)
m_q = [q̄; q̃; q̄ − q̃; q̄ ⊙ q̃]   (4-20)
where ⊙ denotes the element-wise product of two vectors (each element multiplied by its counterpart). After these steps, the computed word vectors contain both the semantic information of each word in its own sentence and the shared and differing information of the two sentences;
the matching output layer matches the word sequences of the two sentences one-to-one in every dimension, with the matching calculation formula:
e_ij = t_p ⊙ t_q   (4-21)
The elements e_ij form a 4-dimensional similarity tensor representing the matching features of the two sentences in the word-vector dimensions. After the matching layer has produced the high-order matching features, they are flattened into a final vector representation; an MLP feed-forward neural network classifies the matching vector into a 3-class output vector, and finally a softmax over the prediction vector takes the maximum as the final prediction result.
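The interaction-and-matching computation described above can be sketched in pure Python (toy dimensions; the Bi-LSTM encoding is replaced by fixed example vectors and all MLP weights are made-up illustrative values, so this shows only the data flow, not the trained model):

```python
import math

def softmax(xs):
    mx = max(xs)
    z = [math.exp(x - mx) for x in xs]
    s = sum(z)
    return [x / s for x in z]

def interact(p, q):
    # Interactive attention: re-express each word of p as an attention-weighted
    # sum over q, then concatenate [p_i ; p~_i ; p_i - p~_i ; p_i * p~_i].
    d = len(p[0])
    out = []
    for pi in p:
        sims = [sum(a * b for a, b in zip(pi, qj)) for qj in q]  # one row of M_pq
        w = softmax(sims)                                        # normalised weights
        ti = [sum(wj * qj[k] for wj, qj in zip(w, q)) for k in range(d)]
        out.append(pi + ti
                   + [a - b for a, b in zip(pi, ti)]
                   + [a * b for a, b in zip(pi, ti)])
    return out

def match_and_classify(tp, tq, w_hidden, w_out):
    # Element-wise matching (eq. 4-21) followed by a small MLP and softmax.
    e = [a * b for a, b in zip(tp, tq)]                          # matching features
    h = [max(0.0, sum(w * x for w, x in zip(row, e))) for row in w_hidden]
    logits = [sum(w * x for w, x in zip(row, h)) for row in w_out]
    probs = softmax(logits)
    return probs.index(max(probs)), probs                        # predicted class
```

In the actual model a multilayer CNN extracts features from the full matching tensor before classification; here a single hidden layer stands in for that stage to keep the sketch short.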
Further, the citation recommendation module includes two modules in two phases: a citation candidate module of the first stage and a candidate sorting module of the second stage.
Further, the citation candidate module uses a document embedding model to compute a document vector for the query document d_q; the vector corresponds to the document's semantic position in the vector-space coordinates. The K most similar documents in the vector space are selected as candidates, with the cosine similarity between the document embedding vectors of d_q and d_i used as the document matching score, and the outward citations of the K nearest-neighbour documents are added as candidates as well, wherein:
(1) Definition of the model: a supervised neural network model expresses the embedding vectors of the words of the text content; the text is segmented, then the embedding vector of the whole text is computed from the word-vector features, with F_d[title] denoting the feature vector of the text, by the formula:
F_d[title] = Σ_t w_t^mag · (w_t^dir / ||w_t^dir||)   (5-1)
where w_t^dir is the distributed word vector of word t and w_t^mag the weight value of the word in the text; the embedding vector of each word is normalized, and the final text expression vector is obtained by the weighted calculation over the word embeddings. The title and abstract information of document d are used as the target text for contextualization, by the formula:
e_d = α · F_d[title] + β · F_d[abstract]   (5-2)
where α and β, the weight values of the title and abstract respectively, are scalar parameters;
(2) Training of the model: the data set takes the form of triplets T = (d_q, d^+, d^−), where d_q is a query document, d^+ a document cited by d_q, and d^− a document not cited by d_q. The loss function of the model is computed as:
loss = max(λ + s(d_q, d^−) − s(d_q, d^+), 0)   (5-3)
where s(d_i, d_j) is defined as the cosine similarity between the text embeddings e_{d_i} and e_{d_j}; the hyper-parameter λ of the model is set and adjusted before training;
(3) Selection of negative samples: for each (d_q, d^+) pair in the data set with d^+ cited by d_q, negative samples are drawn by a random method or by the nearest-neighbour negatives method.
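The candidate model of formulas (5-1) to (5-3) can be sketched as follows (pure Python; the word vectors, weights and α, β values are illustrative, and s(·,·) is cosine similarity as stated above):

```python
import math

def _norm(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def embed_text(word_vecs, word_weights):
    # F_d (eq. 5-1): weighted sum of length-normalised word vectors.
    acc = [0.0] * len(word_vecs[0])
    for v, w in zip(word_vecs, word_weights):
        nv = _norm(v)
        acc = [a + w * x for a, x in zip(acc, nv)]
    return acc

def embed_document(title, abstract, alpha=1.0, beta=0.5):
    # e_d (eq. 5-2): weighted combination of title and abstract embeddings.
    # `title` and `abstract` are (word_vectors, word_weights) pairs.
    t = embed_text(*title)
    a = embed_text(*abstract)
    return [alpha * x + beta * y for x, y in zip(t, a)]

def cosine(u, v):
    return sum(a * b for a, b in zip(_norm(u), _norm(v)))

def triplet_loss(dq, d_pos, d_neg, lam=0.1):
    # Margin loss of eq. 5-3: push cited documents above non-cited ones.
    return max(lam + cosine(dq, d_neg) - cosine(dq, d_pos), 0.0)
```

A correctly ordered triplet (positive closer than negative by more than the margin λ) yields zero loss; a reversed one yields a positive penalty, which is what the training step would backpropagate.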
Further, the candidate ranking module comprises the following steps:
(1) Model structure: the output layer of the model is defined as:
s(d_i, d_j) = MLP_Forward(h)   (5-4)
where h is the joint feature vector of the document pair given by formula (5-5), and MLP_Forward is a three-layer feed-forward neural network followed by a sigmoid layer;
(2) Model training: the parameters of the model include w*_mag, w*_dir, w* and the parameters of the neural network layers; the loss function of the model is the same as that of the citation candidate module. In the test stage, the model is used to predict the highest-scoring documents of the candidate set.
The beneficial effects of the invention are:
(1) For the deficiencies of existing text inference matching methods in semantic understanding, a deep-learning-based text inference matching method is provided. For the deficiency that existing citation recommendation methods rest mostly on metadata and graph-relation networks and rarely use the content information of the text, a two-stage citation recommendation method is provided that exploits the large volume of professional documents on a resource platform; the overall design of the technical scheme is realised, solving the problems of semantic understanding, content completeness and accuracy in existing text inference and matching methods.
(2) According to the characteristics of scientific and technological texts — heavy noise, many professional terms, unstructured form and the like — the texts are preprocessed by denoising, word segmentation, stop-word removal and so on. To represent the semantic features of scientific and technological text better, the word2vec technique is used to train a distributed word-vector expression of the texts, providing high-quality semantic features and a digitized text input form for the subsequent text inference matching and citation recommendation.
(3) For the low accuracy of existing scientific-and-technological text inference and matching methods, their need for manually extracted features, and the difficulty of existing statistics-based methods in understanding text semantics, a text inference and matching method based on LSTM and CNN is provided. The texts to be matched are first given a semantic vector representation by BiLSTM, an Attention mechanism then interactively encodes the texts, and finally a multilayer CNN network extracts features from the interaction information to obtain the final matching-degree information of the texts. The effect of the method is verified by experiments.
(4) For the lack of textual-content semantic information caused by existing citation recommendation methods being based on metadata and citation-relation networks, a two-stage citation recommendation method based on text content is provided: the first stage generates a relevant citation recommendation set by the vector-space similarity of texts, and the second stage performs language understanding on the candidate set with the text inference matching method to obtain a more accurate relevance ranking list. The effect of the method is verified by experiments.
Drawings
FIG. 1 is a schematic diagram of a text processing module;
FIG. 2 is a representation of space vector cosine similarity;
FIG. 3 is a schematic diagram of a citation recall module;
FIG. 4 is a schematic diagram of a citation sorting module;
FIG. 5 is a block diagram of similarity calculation between a query document and candidate documents.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The scientific and technological literature citation recommendation method based on deep learning comprises the following steps:
(1) a text processing module: as shown in fig. 1, information extraction, noise removal, vocabulary index construction and vectorization processing are sequentially performed on text data;
text preprocessing is a prerequisite and indispensable task in natural language processing, and also one of the most time- and labor-consuming. The quality of data processing directly influences subsequent algorithm and model research. Compared with ordinary text, scientific and technological text is highly domain-specific, long and redundant, contains special characters, and carries more noise. In order not to harm the semantic representation of the text and to provide good input for the model, the text data must undergo a series of preprocessing steps such as denoising, word segmentation and vectorization, tailored to the characteristics of scientific and technological text, before semantic matching and citation recommendation. The preprocessing method mainly comprises key information extraction, noise removal, vocabulary index construction and vectorization.
Scientific and technological text resources usually have a relatively fixed format: scientific papers contain titles, abstracts, keywords and an introduction, while patents contain titles, abstracts, claims, etc. Generally, the core content of a document is mainly carried by the abstract, and the title and abstract express the research content and innovations of the author; the main research direction and results can basically be determined from the two. Therefore, for scientific and technological documents with title and abstract attributes, such as journal and conference papers and patents, the scheme mainly extracts the title and abstract information of the scientific and technological resource. The extraction process is as follows: the type of the scientific and technological document is judged from the source data information of the input text, keyword descriptors such as titles and abstracts are identified, and the corresponding text content is extracted. Special characters, link information and garbled text mixed into the normal text easily distort the expression of information and cause unnecessary interference to subsequent text vectorization; cleaning this noise data yields pure text content and ensures the integrity of the input at the source. Regularization (regular expression matching) is a classical and mature method in the field of text processing, and existing programming languages provide systematic support for regular expression rules. The scheme uses the built-in re regular expression module of the Python language to denoise the text.
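As a minimal sketch of the regex-based denoising step, the following uses Python's built-in re module; the concrete patterns (URLs, HTML remnants, special characters) are illustrative assumptions, since the scheme does not list its exact rules:

```python
import re

def denoise(text: str) -> str:
    """Remove link information and special characters from raw scientific text.

    Illustrative patterns only; the scheme's actual cleaning rules are not
    specified, so these regexes are assumptions.
    """
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # strip link information
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML remnants
    text = re.sub(r"[^\w\s\u4e00-\u9fff]", " ", text)    # drop special characters, keep CJK
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace
```

For example, denoise("See <b>results</b> at https://example.com !!") yields the pure text "See results at".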
Natural language models generally accept only data in numerical form, so converting text data into numerical information is one of the important tasks of natural language processing. Word segmentation is the basis of this, and its accuracy directly affects subsequent text processing and analysis tasks. Word segmentation divides a continuous character sequence into independent words or phrases, facilitating subsequent text digitization. The scientific and technological texts on the scientific and technological resource platform are mostly Chinese or English. English words carry independent semantic information, so converting English text into a word sequence only requires splitting on spaces; for example, "I love China" becomes "I/love/China". Segmenting Chinese text is more complicated: Chinese text is a continuous character sequence with no separators between words, a single Chinese character rarely expresses complete semantic information, and words and phrases form the Chinese semantic unit. Segmenting out independent word units therefore usually requires dedicated word segmentation techniques. Chinese word segmentation methods generally fall into three categories: dictionary-based methods, probability-statistics methods and deep learning methods. The dictionary-based method requires manually constructing a dictionary; during segmentation, the Chinese text is matched entry by entry against the dictionary according to certain rules to obtain the segmentation result.
Common dictionary-based word segmentation methods include the forward maximum matching method, the reverse maximum matching method, the bidirectional matching method and the shortest path method. Such algorithms can be customized to the characteristics of the language grammar and are easy to understand and maintain, but manually constructing the dictionary is a huge undertaking, execution depends heavily on the dictionary, and the algorithms struggle with ambiguity problems such as synonymy. Word segmentation algorithms based on probability statistics usually apply probability theory, training on a large amount of labeled text data with machine learning methods. Commonly used probability-based algorithms include Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs); these require deeper knowledge of probability and statistics and a large amount of labeled data. Deep learning methods usually model the structural information of human language without constructing complex grammatical relations and output segmentation results in an end-to-end fashion: for example, sequence models based on LSTM and CRF take word vectors of single characters as input, automatically learn language rules and syntactic information through the neural network, and produce the segmentation at the output. Although deep learning methods do not depend on a manual dictionary, they require a large amount of labeled corpus, and model training is slower than statistical methods. Word segmentation technology is now mature, and many organizations and researchers have packaged Chinese word segmentation toolkits that are simple and efficient to use.
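The forward maximum matching method mentioned above can be sketched in a few lines of Python; the toy dictionary and maximum word length are illustrative assumptions:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Dictionary-based forward maximum matching: at each position, take the
    longest dictionary entry that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in dictionary:   # single characters always accepted
                words.append(cand)
                i += size
                break
    return words
```

With a toy dictionary {"自然", "语言", "自然语言", "处理"}, the text "自然语言处理" is segmented as ["自然语言", "处理"], illustrating why the method prefers the longest match at each step.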
To rapidly segment the scientific and technological text data, the scheme uses existing mature word segmentation tools. Hanlp and jieba are mature Chinese word segmentation libraries. The Hanlp segmenter mainly adopts the shortest path method and provides Chinese word segmentation, part-of-speech tagging, new word discovery, named entity recognition, text summarization and other functions. The jieba segmenter mainly determines the segmentation result through the combination with the highest word frequency probability; its functionality is consistent with Hanlp, and its segmentation algorithm is mainly based on the hidden Markov model, the Viterbi algorithm and the like. In addition, jieba combines dictionary-based and statistics-based segmentation, is open source, and is efficient and easy to use, so the scheme adopts the jieba segmenter as the word segmentation tool for scientific and technological text. The segmentation process is as follows: the type of the text is judged from the input characters; when the system identifies the text as English, words are split directly on spaces and punctuation; if Chinese characters appear, the jieba tool is invoked and its word segmentation API performs the Chinese segmentation.
Scientific and technological text contains a large number of stop words that easily disturb the calculation of the language model. Stop words are auxiliary function words in natural language with no actual meaning of their own. In English text, 'a', 'the', 'of' and 'is' are common stop words, and Chinese text has its own common stop words. To simplify the text representation, reduce the semantic interference of stop words, lower the computational complexity of the model and save storage space, the scheme preferably uses the jieba tool to remove stop words: a stop word list for the scientific and technological field is constructed through jieba's custom dictionary function, and stop words in the scientific and technological text are then removed with the stop word removal function.
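A minimal sketch of the stop word removal step in plain Python; the stop word list here is a small illustrative sample, not the scheme's full scientific-field list:

```python
# Illustrative stop word list; the scheme builds a much larger
# scientific-field list via jieba's custom dictionary function.
STOP_WORDS = {"a", "the", "of", "is"}

def remove_stop_words(tokens):
    """Drop tokens appearing in the stop word list (case-insensitive for English)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```

For example, the token list ["The", "model", "is", "fast"] is reduced to ["model", "fast"].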
Since computers can only process digital information, text processing models also operate on text data through mathematical operations, so a numerical conversion is required before text data is input into the model algorithm. There are two main ways to convert text into numbers: first, converting the text into one-hot vectors according to the numbering rule of a statistical dictionary; second, converting the text into low-dimensional word vectors based on distributed word embedding technology. Since professional scientific and technological text differs from text in general fields such as news, encyclopedia, entertainment and sports, a separate vectorized representation of the professional text corpus is required. Distributed word vector technology, a popular text feature representation technique in recent years, is widely applied for its accuracy, efficiency and good interpretability, so the distributed word vector method is adopted to vectorize the scientific and technological text. The idea of one-hot coding is simple and intuitive: first a vocabulary is constructed from the text corpus, then the words in the vocabulary are numbered, with the vocabulary length being the maximum coding sequence number. Assuming the number of words in the vocabulary is d and the sequence number of the n-th word is n, the vector of this word is (0, 0, ..., 1, ..., 0), where the 1 is at the n-th position and the rest are 0; that is, the value x_i at the i-th position is set to 1 if the word is at the current position and 0 otherwise. Although the method is simple and intuitive, the vectors are too sparse: if the vocabulary is too large, the word vectors contain a large number of 0s, and such sparse vectors waste computing resources.
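The one-hot scheme just described can be sketched as follows (0-based indexing is an implementation choice here):

```python
def one_hot(index, vocab_size):
    """One-hot vector for the word at the given vocabulary index (0-based):
    a vector of vocab_size entries with a single 1 at that index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec
```

For a vocabulary of 5 words, one_hot(2, 5) gives [0, 0, 1, 0, 0]; the sparsity is evident, since only one of the d entries is nonzero.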
In addition, the one-hot encoding method does not consider the semantic information of words at all, which is the main reason limiting its application. Distributed word vector technology trains a language model through a neural network and uses the generated intermediate vector information as the representation of words, achieving good effect. Currently, the most representative distributed word vector technology is Word2vec. Compared with the one-hot method, distributed word vectors greatly reduce the dimensionality of the word vector, and the dimension can be customized according to the size of the corpus. Word2vec has two different model structures: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. The scheme selects the Skip-Gram model according to the semantic characteristics of scientific and technological text.
(2) The text reasoning matching module: firstly, performing semantic vectorization representation on a text to be matched through Bi-LSTM, then performing interactive coding on the text by adopting an Attention mechanism, and finally performing feature extraction on interactive information of the text through a multilayer CNN network so as to obtain final matching degree information of the text;
(3) citation recommendation module: and adopting two-stage citation recommendation, wherein a related citation recommendation set is generated in the first stage through vector space similarity of texts, and a text reasoning matching method is used in the second stage to carry out language understanding on the candidate set so as to obtain an accurate relevance ranking list.
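The first recall stage can be sketched with numpy as a cosine-similarity ranking over document vectors; the vectors and the cutoff k are illustrative:

```python
import numpy as np

def recall_top_k(query_vec, candidate_vecs, k=2):
    """First-stage recall: rank candidate document vectors by cosine
    similarity to the query document vector and keep the top k."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                        # cosine similarity of each candidate
    order = np.argsort(-sims)[:k]       # indices of the k most similar documents
    return order, sims[order]

# toy example: one query vector, three candidate document vectors
idx, sims = recall_top_k(np.array([1.0, 0.0]),
                         np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]), k=2)
```

The returned indices form the candidate set handed to the second-stage reasoning matcher for reranking.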
Further, the model structure of Word2vec takes the predicted word w_t as the center and selects K consecutive words forward and backward as context. When w_t is the input word, the surrounding context words are the predicted words; assuming K = 2, the context words are w_{t-2}, w_{t-1} and w_{t+1}, w_{t+2}. A word vector matrix

W ∈ R^{V×m}

is initialized, where V is the number of words in the vocabulary and m is the dimension of the word vector, generally determined by experience and the size of the training set. The vector of the input word is looked up by the index of the vector matrix; each row of the matrix corresponds to the word vector e of the word with that sequence number. The calculation process is shown in formulas (3-4) to (3-6):
h = E(w_t) (3-4)
e=Wh (3-5)
ŷ = softmax(W′e) (3-6)
when training is completed, the word vector matrix W ∈ R^{V×m} represents all the word vector information of the current corpus; as described above, each row of the matrix is the vector of one word, as shown in formula (3-7):

W = (e_1; e_2; ...; e_V) (3-7)

where e_i = (e_{i1}, e_{i2}, ..., e_{im}) is the vector of the i-th word in the vocabulary.
Based on the trained word vectors, the segmented text sequence is looked up in the vocabulary, converted into the final word vector list, and input into the model. For example, if the text sequence is L = (3, 4, 5), the queried word vector list is (e_3, e_4, e_5).
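The vocabulary lookup described above amounts to row indexing into the word vector matrix; a numpy sketch with toy sizes (the random matrix stands in for trained vectors):

```python
import numpy as np

V, m = 6, 4                      # vocabulary size and word vector dimension (toy values)
rng = np.random.default_rng(0)
E = rng.normal(size=(V, m))      # word vector matrix; row i is the vector of word i
                                 # (random stand-in for a trained matrix)

token_ids = [3, 4, 5]            # segmented text mapped to vocabulary indices
L = E[token_ids]                 # word vector list (e_3, e_4, e_5) fed to the model
```

Each token id simply selects its row, so the sequence of ids becomes an (length × m) matrix of word vectors.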
In conclusion, the distributed word vector method greatly reduces the vector dimension, learns the semantic features of the text through language model training, and the Skip-Gram Word2vec model predicts well even for words that do not appear in the vocabulary. Therefore, to ensure that the word information of scientific and technological text is fully expressed, word vectors are preferentially trained with the Skip-Gram model. For simplicity and efficiency, the scheme selects the classical and mature gensim tool to vectorize the scientific and technological text. The specific processing flow is as follows: Word2vec word vector training is carried out on the scientific and technological text corpus; when training is finished, the model obtains the word vector matrix W ∈ R^{V×m}, and the text is converted into numerical features according to this vector matrix when entering the downstream application model.
According to the scheme, a data preprocessing method is designed for the characteristics of the matching and citation recommendation data of scientific and technological text. The main work comprises information extraction for the matching and citation recommendation data sets, text denoising, word segmentation with the jieba tool and stop word removal; training text word vectors on the data sets with the gensim open source tool; and converting the segmented text sequences into word vector representations with the trained word vector matrix. This work provides the data guarantee and text digitization support for the subsequent reasoning matching and citation recommendation research on scientific and technological text resources.
The text reasoning and matching algorithm for scientific and technological resources is a necessary and key technology in the scientific and technological service process and an important factor in service quality: a good text reasoning and matching algorithm improves the precision of reasoning and matching over scientific and technological text and provides technical support for search, recommendation and other applications in scientific and technological services.
Textual inference matching may be described as follows: given a premise p = (p_1, p_2, ..., p_{l_p}) and a hypothesis q = (q_1, q_2, ..., q_{l_q}), the two text segments may stand in an entailment, neutral or contradiction relation, and a model is constructed to judge the relation between premise and hypothesis. Finding and determining the logical relation between texts is the key of text reasoning matching. Text reasoning matching is a basic core technology in natural language processing; it is widely used in search systems, recommendation systems, question answering systems and other fields to enhance machine understanding of natural language, and has developed rapidly.
Traditional methods rely on expert-defined rules and known databases rather than the semantic information of the text itself, so results effective on one task are difficult to migrate to other tasks. How to better construct text features and capture text semantics is therefore the key research point of the scheme. Among deep learning methods, solving text reasoning matching through the twin network architecture has been a research focus in recent years. The basic idea is to treat the texts to be matched as sequences, encode them with deep neural networks that automatically extract the implied semantic information, and compute the matching degree of the encoded texts to obtain the final inference matching result. In addition, deep text reasoning matching models use word vector technology, which solves the problem of word feature expression well by encoding sparse word information in a dense vector space. However, existing text reasoning matching algorithms only encode text with an RNN or CNN and give little consideration to the matching features of the texts during matching, so the matching effect is poor. The scheme exploits the advantage of the twin-network interactive model over other deep learning methods in matching feature representation, studies and overcomes the insufficient consideration of text matching features in existing methods, and further improves the effect of text reasoning matching.
The text sequence is encoded and represented to capture semantic information. Each time step represents the information of one state; the output of the previous time step serves as input to the next, iterating over the whole sequence to complete sequence modeling. Let x_0, x_1, ..., x_t denote the input sequence over t time steps, x_t the input vector at time t, and h_t the output vector of the network at time t. The output h_{t-1} at time t-1 serves as the initial state of the hidden layer at time t; the hidden layer h_t is obtained by two basic feedforward neural network calculations, as shown in equation (4-1):

h_t = f(U x_t + W h_{t-1} + b) (4-1)

where U is the weight matrix of the input x_t, W is the weight matrix of h_{t-1}, b is their bias vector, and f is the activation function of the neural network, generally a non-linear activation function such as tanh;
the hidden unit ht-1 is often used as an additional output result of the model, implies the correlation information between the output results, and keeps the memory of each cycle time on the cycle recursive path. However, the neural network generally updates the gradient and the parameter through back propagation, which relates to the derivation process of the activation functions, when the sequence is too long, the derivation process will generate multiplication-by-multiplication calculation of a plurality of activation functions, and the derivative values of the activation functions such as tanh are less than 1, which finally leads to that the result of the back derivation tends to 0, the gradient disappears, and the neural network parameter cannot be updated, so that the model training fails and the longer text semantic information cannot be captured.
To solve the vanishing gradient and long-distance dependence problems of the recurrent neural network, the RNN is improved to avoid these phenomena. The LSTM structure, modified from the RNN, comprises 3 gate structures and a cell state C_t. C_t is the memory control unit that holds the information of each state; the gating structure is divided into an input gate i_t, an output gate o_t and a forget gate f_t. The input gate i_t controls and screens the input information at each input moment; the output gate o_t controls the output by combining the state of the previous time and the input information; the memory cell C_t stores the state information of each step during the calculation of the whole sequence. The calculation process is shown in formulas (4-2) to (4-7):
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (4-2)
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (4-3)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C) (4-4)
C_t = f_t * C_{t-1} + i_t * C̃_t (4-5)
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (4-6)
h_t = o_t * tanh(C_t) (4-7)
where σ is the sigmoid activation function, obtained by formula (4-8):

σ(x) = 1 / (1 + e^{-x}) (4-8)
through this improvement of the RNN structure, the LSTM can effectively solve the long-distance dependence and vanishing gradient problems of text. More and more natural language processing tasks, such as text inference matching, dialog systems, recommendation systems and question answering systems, use this structure. The scientific and technological text studied by the scheme is long and domain-specific, and the data size meets the training requirements of the LSTM structure, so the LSTM-based structure is adopted as the encoded representation of the text.
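The LSTM computations in formulas (4-2) to (4-8) can be sketched directly in numpy; the gate weights here are random stand-ins with toy dimensions:

```python
import numpy as np

def sigmoid(z):                      # equation (4-8)
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following equations (4-2) to (4-7). Each gate weight
    matrix acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate       (4-2)
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate        (4-3)
    C_tilde = np.tanh(W["C"] @ z + b["C"])    # candidate state   (4-4)
    C_t = f_t * C_prev + i_t * C_tilde        # cell state update (4-5)
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate       (4-6)
    h_t = o_t * np.tanh(C_t)                  # hidden output     (4-7)
    return h_t, C_t

# toy sizes: input 3, hidden 2; random stand-in parameters
rng = np.random.default_rng(2)
W = {k: rng.normal(size=(2, 5)) for k in "fiCo"}
b = {k: np.zeros(2) for k in "fiCo"}

h, C = np.zeros(2), np.zeros(2)
for x_t in rng.normal(size=(4, 3)):
    h, C = lstm_step(x_t, h, C, W, b)
```

The additive cell-state update in (4-5) is what lets gradients flow over long distances instead of vanishing through repeated tanh products.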
The deep learning model needs to extract semantic features of the scientific and technological text and fuse the extracted feature information into a final text representation. A common approach is to use vector addition or vector averaging as the final text representation. However, such methods treat the feature weight of every word as equal by default, while words differ in importance within a text and their feature vectors carry different weight information; uniform weighting or summing therefore loses semantic information. Moreover, due to the diversity of language expression, the text inference matching process needs to consider not only the text's own semantic information but also the interaction information with the matched document; expressing the final text merely by weighting or summing ignores the local interaction information between the two texts. Therefore, the Attention Mechanism is studied: it focuses on the differentiating information of features to avoid information loss and obtain higher-quality text vector representations. Attention mechanisms were first proposed in the field of computer vision, first applied to text translation within natural language processing, and then widely applied to other natural language processing tasks. Attention was inspired by the human attention mechanism: human vision rapidly scans the global image for important information, the focus of attention, and pays less attention to other unimportant information.
In the text field, researchers are inspired, in a sequence model, important words in a text are given with larger weight through an attention mechanism, other irrelevant words are given with smaller weight, and finally semantic information with higher quality is obtained through weighted calculation of a plurality of words with different weights and used for subsequent calculation.
The calculation of the attention mechanism is divided into three steps:
(1) similarity score calculation
Define the similarity score between the query vector (Query) and the queried vector (Key) as Score, as shown in formula (4-9):
Score_i = Similarity(Q, K_i) (4-9)
where Q represents the query vector, K_i represents the i-th key vector (queried vector), and Score_i represents the similarity score of the query vector and the i-th key vector. The similarity can be calculated in a number of ways, as shown in the following table:
Vector similarity score calculation function table:

Dot product: Score(Q, K_i) = Q · K_i
Multilayer neural network: Score(Q, K_i) = MLP([Q; K_i])
Scaled dot product: Score(Q, K_i) = (Q · K_i) / √d
In the table, the dot product method directly calculates the dot product of the query vector and the key vector using the mathematical vector dot product operation; the multilayer neural network method feeds the spliced query vector and key vector through a feedforward neural network; the scaled dot product method uses the self-attention score calculation of the Transformer structure to obtain the similarity between the query vector and the key vector.
(2) Weight calculation
After obtaining the attention scores, the score distribution is normalized with a softmax function to obtain the similarity probability distribution of the query vector and the key vectors, as shown in equation (4-10):
a_i = exp(Score_i) / Σ_{j=1}^{l} exp(Score_j) (4-10)
wherein l represents the number of words of the text sequence;
(3) weighted summation
Using the obtained normalized weight distribution, each value vector (Value) is weighted to obtain the final attention result; the weighting process is shown in equation (4-11):
Attention = Σ_{j=1}^{l} a_j · Value_j (4-11)
where a_j represents the weight of each key and Value_j represents the j-th value vector.
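The three attention steps above, scoring (4-9), softmax weighting (4-10) and weighted summation (4-11), can be sketched in numpy; the scaled dot product is used here as one of the scoring choices from the table:

```python
import numpy as np

def attention(Q, K, V):
    """Attention in three steps: score each key against the query (4-9),
    normalize the scores with softmax (4-10), and take the weighted sum
    of the value vectors (4-11). Uses the scaled-dot-product score."""
    d = Q.shape[-1]
    scores = K @ Q / np.sqrt(d)                  # step (1): similarity scores
    a = np.exp(scores) / np.exp(scores).sum()    # step (2): softmax weights
    return a @ V, a                              # step (3): weighted sum

# toy example: one query, six key/value pairs (random stand-ins)
rng = np.random.default_rng(3)
Q = rng.normal(size=4)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
out, weights = attention(Q, K, V)
```

The weights are a probability distribution over the six words, so the output is a convex combination of the value vectors, exactly the "focus" behavior described above.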
The inference matching model is a common model in natural language processing; because its input is two text segments, the model usually encodes the two segments separately. A twin network structure designed for the text reasoning matching task achieved good results, and much later research has built deeper text understanding tasks on this model structure. The model intuitively captures text reasoning matching, has a general framework structure, and is easy to extend with different functions. Therefore, the scheme carries out further research on the basis of the twin network.
The twin network model mainly comprises a text encoding layer, a vector aggregation layer and a matching layer. The text encoding layer takes p = (p_1, p_2, ..., p_{l_p}) and q = (q_1, q_2, ..., q_{l_q}) as input, where p and q are the word vector sequences of the input texts; the text encoder finally converts the input vectors into sentence vectors u and v. The vector aggregation layer performs interactive calculation on the vectors of the encoding layer, such as vector addition, dot multiplication and absolute difference, and splices the interaction information into the final matching vector. The matching layer predicts on the matching vector with a neural network structure to obtain the final matching result of the two text segments.
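A sketch of the vector aggregation layer: splicing the sentence vectors u and v with their interaction features into the matching vector. The exact set of interaction features is one common choice, assumed here:

```python
import numpy as np

def matching_vector(u, v):
    """Vector aggregation layer of the twin network: splice the two sentence
    vectors with interaction features (absolute difference and element-wise
    product) into one matching vector for the matching layer."""
    return np.concatenate([u, v, np.abs(u - v), u * v])

u = np.array([0.5, -1.0, 2.0])   # sentence vector of p (toy values)
v = np.array([0.5,  1.0, 0.0])   # sentence vector of q (toy values)
m = matching_vector(u, v)         # fed to the matching layer for prediction
```

With 3-dimensional sentence vectors the matching vector has 12 entries; the difference and product slices carry the local interaction information the matching layer predicts from.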
To obtain higher-quality text vector representation, an Attention mechanism is also introduced into the twin network framework, usually added to the LSTM-based sequence encoder to compute attention between the words of the two text segments. Taking text p as an example, the attention weight of each word in p with each word in q is calculated, and the weighted attention vector is spliced with the encoded word vector as the interacted word vector; text q is handled similarly. Within the twin network framework, adding an attention mechanism to the text encoding process captures the interaction information between texts. For professional texts such as scientific and technological literature, an attention-based matching method can mine richer text information.
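A numpy sketch of the interactive attention just described: a pairwise attention matrix by dot product, softmax-normalized weights, and attention-weighted vectors of q aligned to each word of p (dimensions are toy values):

```python
import numpy as np

def co_attention(p, q):
    """Interactive attention between two encoded texts: pairwise attention
    matrix by dot product, row-softmax to weights, then attention-weighted
    q vectors aligned to each word of p (q is handled symmetrically)."""
    M = p @ q.T                                           # pairwise word similarity
    A = np.exp(M) / np.exp(M).sum(axis=1, keepdims=True)  # softmax over q per word of p
    p_att = A @ q                                         # weighted q representation
    return p_att, A

# toy encoded word vectors (random stand-ins for BiLSTM outputs)
rng = np.random.default_rng(4)
p = rng.normal(size=(5, 3))   # text p: 5 words, dimension 3
q = rng.normal(size=(7, 3))   # text q: 7 words, dimension 3
p_att, A = co_attention(p, q)
```

Each row of p_att is the q-side context most relevant to that word of p; splicing it with the encoded word vector gives the interacted word vector mentioned above.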
To solve the problem that existing methods do not fully consider text matching feature information while accounting for both local and global semantic information, the scheme proposes a hybrid model based on LSTM and CNN. The model builds on the Siamese network structure and adopts an interactive reasoning matching architecture: an Attention mechanism produces interactive text representations over the LSTM sequence, and a CNN network performs thorough extraction of text matching features to improve the matching effect. The text reasoning matching module consists of an input layer, an interactive encoding layer and a matching output layer. The input layer segments the two text sections and represents the segmented texts as vectors, using pre-trained word vectors so that the words carry semantic expression ability. In the interactive encoding layer, the two texts are each encoded with bidirectional LSTM (Bi-LSTM) to obtain their overall semantic representation, while an Attention operation on the LSTM hidden vector sequence performs interactive similarity matching between local text segments; the local importance information of the text can be read off the Attention scores. After the interacted text vector sequences are obtained, the two sequences are matched by dot multiplication to obtain a matching tensor containing all the local matching information of the two texts. Then, exploiting the layer-by-layer feature extraction of the CNN, the feature extraction layer extracts features from the matching tensor, so that the extracted CNN feature layers contain the high-order matching information of the two texts. Finally, the output layer outputs the matching result of the two texts through a fully connected feedforward neural network.
Further, the input layer uses p = (p_1, p_2, ..., p_{l_p}) and q = (q_1, q_2, ..., q_{l_q}) to represent two natural language sentences, p the premise and q the hypothesis, where p_i, q_i ∈ R^d are pre-trained word vectors, d denotes the dimension of the word vector, l_p the length of sentence p and l_q the length of sentence q. As distributed word vector technology is widely applied in natural language processing, the similarity between words can be measured in vector space through Euclidean or Manhattan distance, giving strong word representation ability. Therefore, the pre-trained word vectors GloVe, themselves distributed word vectors obtained from large corpora, are directly adopted to map the original text sequence into the vector space, so that the model can fully use them in subsequent stages.
Because each word is represented by a pre-trained word vector, the input sequence already has a certain expressive power, but at this point each word is still isolated and cannot accurately convey the complete semantics of a sentence. A Bi-LSTM is therefore adopted to encode the input sequence: it has sequence-modeling capacity, encodes a sentence from both directions, and thus captures context in a way consistent with human language habits. In the interactive encoding layer, write the hidden (output) state that the Bi-LSTM generates for input sequence $p$ at time $i$ as $\bar{p}_i$, and the hidden state for input sequence $q$ at time $i$ as $\bar{q}_i$, computed by the following formulas:

$\bar{p}_i = \mathrm{BiLSTM}(p, i) \quad (4\text{-}12)$

$\bar{q}_i = \mathrm{BiLSTM}(q, i) \quad (4\text{-}13)$
For inference matching between the two sentences, two attention mechanisms enrich the sentence representations. The first is self-attention, which computes the attention weight of each word against the other words of its own sentence. The second is interactive attention: the attention weight of each word in sentence $p$ against all words of sentence $q$ is computed first, and the attention weight of each word in $q$ against all words of $p$ is computed symmetrically. Word-to-word similarity is computed by dot product, giving a two-dimensional similarity matrix, defined as:

$e_{ij} = \bar{p}_i^{\top} \bar{q}_j \quad (4\text{-}14)$

$M_{pq} = pq \quad (4\text{-}15)$

$M_{qp} = qp \quad (4\text{-}16)$

where $M_{pq}$ denotes the attention matrix of $p$ over $q$ and $M_{qp}$ the attention matrix of $q$ over $p$. The attention weights are normalized and each word vector is re-expressed as a weighted sum:

$\tilde{p}_i = \sum_{j=1}^{l_q} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_q} \exp(e_{ik})}\, \bar{q}_j \quad (4\text{-}17)$

$\tilde{q}_j = \sum_{i=1}^{l_p} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_p} \exp(e_{kj})}\, \bar{p}_i \quad (4\text{-}18)$
where $\tilde{p}_i$ is the result of the weighted summation of $\{\bar{q}_j\}$, and $\tilde{q}_j$ is the result of the weighted summation of $\{\bar{p}_i\}$.
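The soft-alignment computation above — dot-product similarities, softmax-normalised weights, weighted sums — can be sketched as follows, with small random matrices standing in for the Bi-LSTM outputs $\bar{p}$ and $\bar{q}$ (sentence lengths and hidden size are arbitrary toy values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
p_bar = rng.standard_normal((4, 6))   # sentence p: 4 words, hidden size 6
q_bar = rng.standard_normal((5, 6))   # sentence q: 5 words

e = p_bar @ q_bar.T                    # e_ij = p̄_i · q̄_j  (dot-product similarity)
p_tilde = softmax(e, axis=1) @ q_bar   # each p̃_i: weighted sum over the q̄_j
q_tilde = softmax(e.T, axis=1) @ p_bar # each q̃_j: weighted sum over the p̄_i
print(p_tilde.shape, q_tilde.shape)    # (4, 6) (5, 6)
```

Each aligned vector keeps the hidden dimensionality of the original encoding, so the two sequences stay directly comparable.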
Encoding the sentences with both kinds of attention fuses into each sentence representation both its internal semantic information and the information exchanged between the two sentences. The element-wise difference between the attention-based and hidden vectors is then computed, and finally the vectors are concatenated together as the final word-level representation:

$t_{p,i} = [\bar{p}_i;\ \tilde{p}_i;\ \bar{p}_i - \tilde{p}_i;\ \bar{p}_i \odot \tilde{p}_i] \quad (4\text{-}19)$

$t_{q,j} = [\bar{q}_j;\ \tilde{q}_j;\ \bar{q}_j - \tilde{q}_j;\ \bar{q}_j \odot \tilde{q}_j] \quad (4\text{-}20)$

where $\odot$ denotes the element-wise product of two vectors (each pair of corresponding elements multiplied). After these steps, each computed word vector carries the semantics of the word within its sentence as well as the differing and shared information between the two sentences.
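A minimal sketch of this four-way concatenation (encoding, alignment, difference, element-wise product) on toy constant arrays, where the resulting feature width is four times the hidden size:

```python
import numpy as np

def enhance(a_bar, a_tilde):
    """Concatenate encoding, alignment, their difference, and element-wise product."""
    return np.concatenate([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], axis=-1)

a_bar = np.ones((4, 6))           # toy Bi-LSTM states
a_tilde = np.full((4, 6), 0.5)    # toy aligned states
t_p = enhance(a_bar, a_tilde)
print(t_p.shape)  # (4, 24) — hidden size grows 4x
```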
The CNN has strong feature-extraction capability, and the main task of the matching output layer is to match the semantic information of the two sentences. Each dimension of the word sequences of the two sentences is matched one-to-one, with the matching computed as:

$e_{ij} = t_{p,i} \odot t_{q,j} \quad (4\text{-}21)$

The elements $e_{ij}$ form a 4-dimensional similarity tensor that represents the matching features of the two sentences in the word-vector dimensions. Feature extraction then proceeds as in image processing, using a CNN-based DenseNet. DenseNet, inspired by residual networks, was proposed to prevent the vanishing-gradient phenomenon when a deep CNN has too many layers; the network preserves both low-level and high-level feature information, reducing the loss of low-level detail. Experiments show that DenseNet works well with 8 blocks, each containing 3 convolutional layers. After the convolutional network extracts the matching information, this layer holds higher-order matching features. The high-order features from the matching layer are flattened into a final vector representation, an MLP feedforward network classifies the matching vector into 3 class outputs, softmax is applied to the prediction vector, and the largest result is taken as the final prediction.
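The text specifies 8 blocks of 3 convolutional layers. The core DenseNet idea — every layer receives the concatenation of all earlier feature maps, so low-level information survives to the top — can be sketched with plain linear layers standing in for 3×3 convolutions (a toy illustration under that simplification, not the scheme's implementation):

```python
import numpy as np

def dense_block(x, n_layers=3, growth=4, rng=np.random.default_rng(0)):
    """DenseNet-style block: each layer sees the concat of all earlier feature maps."""
    feats = [x]
    for _ in range(n_layers):
        inp = np.concatenate(feats, axis=-1)              # dense connectivity
        w = rng.standard_normal((inp.shape[-1], growth)) * 0.1
        feats.append(np.maximum(inp @ w, 0.0))            # linear + ReLU stands in for conv
    return np.concatenate(feats, axis=-1)

x = np.ones((8, 8))   # toy 2-D feature-map slice instead of the 4-D matching tensor
out = dense_block(x)
print(out.shape)      # (8, 20): the 8 input channels plus 3 layers × growth 4
```

The output width grows by `growth` per layer while the original input channels are preserved, which is exactly the "keep low-level and high-level information" property the text describes.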
The text inference-matching module is verified experimentally on the SNLI (Stanford Natural Language Inference) dataset, which contains 570K manually annotated sentence pairs and is widely used in the field of natural language inference. Given a sentence pair (a, b), the task is to judge the relationship between a and b, which takes one of three forms: a entails b, a contradicts b, or a is independent of b.
The experimental model was implemented in TensorFlow 1.13.1 with Python 3.6.5 on Ubuntu 18.04. The hardware configuration of the experimental platform is shown in the following table.
Processor: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Graphics card: NVIDIA GeForce RTX 2070 8G
Memory: 32G 2133MHz
Hard disk: 250G SSD + 1T HDD
Experimental hardware configuration
Unlabeled data (marked "-") are removed from the dataset, and the development and test splits serve respectively as the validation set and test set during model training. The non-binarized (plain) form of the segmented sentences is used. Each batch contains 32 examples. During training, every sentence is fixed to length 50: sentences shorter than 50 are padded with 0, and the part of an over-long sentence beyond 50 is truncated. To distribute the training data more reasonably, the whole training set is shuffled with a randomization strategy. The word-vector part uses 300-dimensional GloVe. Out-of-vocabulary words are initialized randomly from a uniform distribution on [0, 1]. All word vectors are optimized together with the parameters during training. The other parameters are initialized from a Gaussian distribution with mean 0 and variance 0.01. Optimization uses the Adam optimizer with an initial learning rate of 0.0004, gradually decayed over training. A dropout strategy with dropout 0.5 is applied during training; the final classification layer uses no dropout. The DenseNet part uses 8 blocks, each containing 3 neural-network layers. Training terminates when the development-set accuracy changes by less than 0.04 over every 10 batches of data. Finally, the best model is selected and evaluated on the test set as the final accuracy of the model. The parameters are listed in the following table.
Vocabulary size: 40K
Word vector dimension: 300
Text sequence length: 50
Batch_size: 32
Number of LSTM hidden units: 300
Number of LSTM layers: 1
Convolution kernel size: 3*3
Number of blocks: 8
Layers per block: 3
Learning rate: 4e-4
Dropout: 0.5
Experimental parameter settings
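The fixed-length-50 preprocessing described above (pad short sentences with 0, truncate long ones) can be sketched in a few lines; the token ids here are arbitrary toy values:

```python
def pad_or_truncate(ids, max_len=50, pad_id=0):
    """Fix a sentence to exactly max_len token ids: pad with pad_id, cut over-long tails."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

print(len(pad_or_truncate(list(range(10)))))   # 50
print(len(pad_or_truncate(list(range(80)))))   # 50
```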
Three sets of experiments are carried out: a tree-CNN-based model, a Bi-LSTM-based model, and the model of this scheme; the results are shown in the table below. The tree-CNN model is a single-representation model: a tree-structured CNN encodes each sentence, a hierarchical structure expresses high-level semantics, the bottom layers express n-gram semantics, and the upper layers express the meaning of phrases and sentences. This model is evidently deficient in matching interactivity. After the model of this scheme fuses the Bi-LSTM-based interactive representation and uses the CNN as the feature-extraction layer over the matching information, it improves by 5 percentage points over the model relying only on the tree-CNN structure. The Cheng and Parikh models both adopt attention-based LSTM structures and belong to the interactive family; compared with the tree-CNN single-representation model they have richer sentence expressiveness. The model of this scheme extends further on that basis: a CNN-based DenseNet at the sentence-matching layer obtains high-level matching features and further strengthens the matching effect, and experiments show the model improves by 1 percentage point on the SNLI dataset over interactive models based only on Bi-LSTM.
Method | Training set accuracy | Test set accuracy
300D tree-based CNN encoders | 83.3 | 82.1
450D LSTM with deep attention fusion | 88.5 | 86.3
Intra-sentence | 90.5 | 86.8
Our model | 92.4 | 87.3
Results on the SNLI dataset
To further probe how each component of the model and certain experimental settings affect the results, this scheme verifies model performance against each structural feature: a part of the structure is removed, the change in the final prediction is observed, and the degree to which each part influences performance is thereby explored. The results are shown in the following table.
Model | Accuracy
Our model | 87.3
Model - dynamic embedding | 87.0
Model - CNN | 86.8
Model - LSTM | 86.5
Model - Inter-attention | 86.1
Effect of different structures on model performance
The "-" in the table represents the removal of a certain part of the structure in the model.
First, the dynamic word vectors are removed, i.e., the word-vector values are frozen during training; the experiment shows this has little effect on model accuracy. The reason is that the adopted GloVe vectors are trained on massive corpora containing a large amount of cross-domain knowledge, so the word representations are already semantically rich, and tuning only the neural-network parameters during training is enough for a good result. Next, the CNN feature-extraction layer is removed and a fully connected layer is attached directly after the LSTM: the output of the interactive encoding layer is summarized by average pooling and max pooling, which fixes the length of the final vector and eliminates the influence of sentence length on the semantic encoding, and the pooled vector goes straight to the MLP layer for classification. The result shows that removing the CNN feature-extraction layer has a measurable effect compared with the original model: accuracy drops by 0.2. As hypothesized earlier, the CNN boosts the representation of matching features; adding the feature-extraction layer lets the model further mine local and global matching features after encoding and finally classify with high-level matching features. For the analysis of the Bi-LSTM, two strategies are employed: removing the whole LSTM encoding layer, and removing only the interactive representation part. When the whole Bi-LSTM layer is removed, performance drops markedly, by 0.8, showing that the Bi-LSTM encoding layer has strong sentence-representation ability; this verifies the scheme's hypothesis that the Bi-LSTM captures sentence semantics. When only the interactive-attention and self-attention parts built on the Bi-LSTM encoding layer are removed, the effect is even more pronounced: performance drops more than when the whole Bi-LSTM is removed, showing that interactive attention established on a Bi-LSTM base strengthens the semantic encoding and fuses well with the sequence model. From this analysis, the CNN plus the interactive-attention Bi-LSTM can mine semantic information better, and promotes inference matching more, than an architecture relying on either model alone.
Citation recommendation can be defined as a ranking problem: given a manuscript document $d_q$ awaiting citation recommendation and a large document database (or open document index library), the documents most relevant to $d_q$ must be recommended from the database, ranked by degree of relevance, with the recommended set of relevant documents sorted in descending order. Therefore, how to quickly screen highly relevant citations out of a large literature base is the first problem citation recommendation must solve.
The citation recommendation module comprises two modules in two stages: a citation candidate module in the first stage and a candidate ranking module in the second stage. In the citation candidate module, a document-embedding model computes the document vector of query document $d_q$; the vector corresponds to the document's semantic position in the vector-space coordinates. The K most similar documents in the vector space are selected as candidate objects, and the similarity of the embedding vectors of $d_q$ and $d_i$ serves as their matching-degree score. As shown in fig. 2, dist(A, B) denotes Euclidean distance and cos θ cosine similarity; this scheme uses cosine similarity to express the matching degree in the vector space.
In order to make full use of the reference information of the candidate documents themselves, the out-going citations of the K nearest-neighbour documents are added as candidates; as shown in fig. 3, when K = 5, document d7 is an out-going citation of candidate document d3.
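This candidate-pool expansion can be sketched directly, using the fig. 3 example (K = 5 neighbours; d3 cites d7, so d7 joins the pool); the document ids and citation map are toy stand-ins:

```python
def expand_candidates(knn_docs, out_citations):
    """Add the out-going citations of the K nearest-neighbour documents to the candidates."""
    candidates = set(knn_docs)
    for d in knn_docs:
        candidates |= set(out_citations.get(d, []))
    return candidates

knn = ["d1", "d2", "d3", "d4", "d5"]       # K = 5 nearest neighbours of the query
cites = {"d3": ["d7"], "d5": ["d1"]}       # toy out-citation lists
print(sorted(expand_candidates(knn, cites)))  # ['d1', 'd2', 'd3', 'd4', 'd5', 'd7']
```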
(1) Definition of the model: a supervised neural-network model produces embedding vectors for the words of the text content; the text is segmented, and the embedding vector of the whole text is then computed from the word-vector features. The feature vector of a text field is written $F_d[\text{field}]$, with the formula:

$F_d[\text{field}] = \sum_{t} w_t^{mag} \frac{w_t^{dir}}{\lVert w_t^{dir} \rVert} \quad (5\text{-}1)$

where $w_t^{dir}$ is the distributed word vector of word $t$, $w_t^{mag}$ represents the weight of the word in the text, and $\frac{w_t^{dir}}{\lVert w_t^{dir} \rVert}$ normalizes each word's embedding vector; the weighted sum of the word embeddings gives the final text representation. The title and abstract of document $d$ serve as the target text content, with the formula:

$e_d = \alpha \cdot F_d[\text{title}] + \beta \cdot F_d[\text{abstract}] \quad (5\text{-}2)$
where α and β are the weights of the title and the abstract respectively, both scalar parameters;
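A numeric sketch of equations (5-1) and (5-2) on toy 2-d word vectors; the word weights and the values α = 0.7, β = 0.3 are illustrative assumptions, not values given by the scheme:

```python
import numpy as np

def field_embedding(words, vec, weight):
    """F_d[field]: weighted sum of L2-normalised word vectors (eq. 5-1)."""
    return sum(weight[w] * vec[w] / np.linalg.norm(vec[w]) for w in words)

vec = {"graph": np.array([3.0, 4.0]), "network": np.array([0.0, 2.0])}
wgt = {"graph": 2.0, "network": 1.0}

title = field_embedding(["graph"], vec, wgt)            # 2 * [0.6, 0.8]
abstract = field_embedding(["graph", "network"], vec, wgt)
alpha, beta = 0.7, 0.3                                  # assumed field weights
e_d = alpha * title + beta * abstract                   # eq. (5-2)
print(np.round(e_d, 3))  # [1.2 1.9]
```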
(2) Training of the model: the dataset takes the form of triplets $T = (d_q, d_+, d_-)$, where $d_q$ is a query document, $d_+$ denotes a document cited by $d_q$, and $d_-$ is a document not cited by $d_q$. The loss function of the model is calculated as follows:

$loss = \max(\lambda - s(d_q, d_+) + s(d_q, d_-),\ 0) \quad (5\text{-}3)$

where $s(d_i, d_j)$ is defined as the cosine similarity between the text embeddings $e_{d_i}$ and $e_{d_j}$, and λ is a hyper-parameter of the model, adjusted before training;
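The margin-based hinge loss of eq. (5-3) is zero once the positive document outscores the negative by at least the margin; a minimal sketch (the margin value 0.1 is an assumption):

```python
def triplet_loss(s_pos, s_neg, margin=0.1):
    """Hinge loss: push s(d_q, d+) above s(d_q, d-) by at least `margin` (eq. 5-3)."""
    return max(margin + s_neg - s_pos, 0.0)

print(triplet_loss(0.9, 0.3))   # 0.0  — positive already ranked far above negative
print(triplet_loss(0.5, 0.45))  # 0.05 — margin violated, loss is positive
```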
Negative-sample selection: for the $(d_q, d_+)$ pairs, it suffices to take the documents $d_+$ cited by $d_q$ from the dataset. The model of this scheme must maximize the similarity of positive samples while weakening that of negative samples, so the selection of negative samples is crucial to model performance. This scheme samples negatives with a random method and a negative-nearest-neighbour method.
1) Random method: a document that $d_q$ does not cite is chosen at random and paired with $d_q$ as a negative sample pair.
2) Negative nearest neighbour: the document closest to $d_q$ in the vector space but not cited by it is selected.
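Both sampling strategies can be sketched side by side; the corpus, cited set, and similarity scores are hypothetical toy data:

```python
import random

def random_negative(corpus, cited, rng=random.Random(0)):
    """Random strategy: any document the query does not cite."""
    pool = [d for d in corpus if d not in cited]
    return rng.choice(pool)

def nearest_negative(corpus, cited, sim):
    """Negative-nearest-neighbour: the most similar un-cited document (hardest negative)."""
    pool = [d for d in corpus if d not in cited]
    return max(pool, key=sim)

corpus = ["d1", "d2", "d3", "d4"]
cited = {"d1"}                                  # d1 is the positive, so it is excluded
sim = {"d2": 0.9, "d3": 0.2, "d4": 0.5}.get     # toy similarity to the query
print(nearest_negative(corpus, cited, sim))      # d2
```

The nearest-negative strategy yields harder training pairs, which is why the text ties negative selection so closely to model performance.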
In conclusion, the citation-candidate recall module mainly filters the recommended documents, narrowing the recommendation range, reducing the amount of computation, and serving the subsequent precise recommendation.
In the candidate ranking stage, another model is trained whose input is $(d_q, d_i)$ and whose output is the probability that $d_q$ cites $d_i$. The citation-recommendation task is triggered while the user is still forming the idea of, or writing, the research document, which should mainly contain important information such as the title, abstract, and main innovative method. A manuscript at this stage often lacks source-data information such as keywords, year, and collaborators. Important information such as the title or abstract expresses the author's thinking and plays a major role in document citation recommendation. This scheme therefore mainly discusses citation recommendation relying only on the title and abstract contents; to evaluate the influence of source data, the scheme also considers adding source data for citation recommendation.
The model architecture of this stage is shown in fig. 4.
In fig. 5, for each text field and each piece of source-data information, the similarity between the embedding of $d_q$ and the embedding of $d_i$ is computed separately. For more accurate semantic matching, the inference-matching method proposed in chapter IV is used; the good effect of this method on short-text inference matching was verified earlier, so it is applied to similarity computation over the titles and abstracts of scientific texts. To adapt it to the document-recommendation task, the output layer of the text inference-matching model is reduced to a single node and its activation function is changed to sigmoid. To strengthen the association information between documents, this section extends the similarity computation; fig. 5 shows the similarity-computation framework between a query document and a candidate document. The words appearing in both the query document and the candidate document are counted, and the sum of the weights of these common words is computed; for the title field this can be written $\sum_{w \in d_q \cap d_i} w^{mag}$. At the same time, the number of times the candidate document is referenced by the query document is counted and expressed as $\log(d_i[\text{in-citations}])$.
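The two hand-crafted features — the weighted common-word sum and the log-scaled in-citation count — can be sketched as follows; the word weights are toy values, and the `+ 1` inside the log is an assumption of this sketch (to avoid log 0), not stated by the scheme:

```python
import math

def common_word_score(query_words, cand_words, weight):
    """Sum of weights of words appearing in both the query and the candidate field."""
    return sum(weight.get(w, 0.0) for w in set(query_words) & set(cand_words))

def in_citation_feature(times_cited):
    """log of the candidate's in-citation count; +1 smoothing is assumed here."""
    return math.log(times_cited + 1)

w = {"citation": 1.5, "graph": 1.0}
print(common_word_score(["citation", "graph", "deep"], ["graph", "model"], w))  # 1.0
```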
(1) Model structure: the output layer of the model is defined as:

$s(d_i, d_j) = \mathrm{MLP\_Forward}(h) \quad (5\text{-}4)$

where $h$ (eq. 5-5) is the concatenation of the similarity features described above, and MLP_Forward is a three-layer feedforward neural network followed by a sigmoid layer;

(2) Model training: the parameters of the model include $w_*^{mag}$, $w_*^{dir}$, $w_*$ and the parameters of the neural-network layers; the loss function of the model is the same as that of the citation candidate module, and in the test stage the model predicts the highest-scoring documents of the candidate set.
The citation recommendation module is verified experimentally on the DBLP and PubMed datasets. The DBLP dataset contains 5 million computer-science articles, each cited at least 5 times. PubMed is a dataset of over 4.5 million scientific documents in the medical field. The documents of both datasets contain the article title, abstract, venue, authors, citations, and keywords. Papers cited fewer than 10 times are filtered out, and the dataset is divided into standard training, development, and test sets.
The experiments are coded in Python 3.7, with models developed in the TensorFlow + Keras framework, version 1.13.1, on Ubuntu 20.04. Because the experiments demand substantial hardware, a high-performance GPU is used: an RTX 2070 with 8 GB of video memory.
BM25 and ClusCite are used for comparison. BM25, common in the field of information retrieval, is an algorithm that scores the relevance between search terms and documents, based on a probabilistic retrieval model. ClusCite is a graph-based citation-recommendation algorithm that builds source data such as authors, publication venues, and years into a network graph and recommends the documents most likely to be cited. The experiments compare the method of this scheme with these baseline models on the two datasets; the ClusCite results are taken directly from the conclusions given in the original paper.
In the first stage, the model selects the documents closest to the query document as the candidate set, using the open-source Annoy nearest-neighbour search algorithm. Annoy builds binary trees, so with the O(log N) complexity of binary search trees it can quickly find the citation set near a query document; hyperplanes are chosen at random to partition the points of the high-dimensional vector space, and the nearest-neighbour index uses 100 trees. The hyperopt library optimizes the parameters of the model, including the weight parameters, regularization parameters, and the learning rates of the neural-network hidden layers; hyperopt is run over multiple experiments and the hyper-parameters of the best model are selected. The first-stage model is optimized for Recall@20 on the validation set and the second-stage model's hyper-parameters for F1@20; the parameters of the matching module's similarity computation follow the matching-model experiments of chapter IV. The hyper-parameter settings of the model are as follows:
Hyper-parameter | Candidate module | Ranking module
Learning rate | 1e-2 | 1e-2
L2 regularization parameter | 0 | 1e-3
L1 regularization parameter | 1e-7 | 1e-4
dropout | 0.6 | 0.5
Word vector dimension | 300 | 175
Number of nearest neighbours | 10 | -
Batch_size | 256 | 256
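Annoy approximates the nearest-neighbour search sketched below; this sketch computes exact cosine K-NN by brute force so it stays self-contained (Annoy would trade exactness for speed via its forest of random-projection trees). All vectors are synthetic:

```python
import numpy as np

def top_k_neighbours(query, doc_vecs, k=3):
    """Exact cosine K-NN; Annoy approximates this with random-projection trees."""
    norm = lambda m: m / np.linalg.norm(m, axis=-1, keepdims=True)
    sims = norm(doc_vecs) @ norm(query[None, :]).ravel()   # cosine similarity to every doc
    return np.argsort(-sims)[:k]                           # indices of the k best matches

rng = np.random.default_rng(1)
docs = rng.standard_normal((100, 8))            # 100 toy document embeddings
q = docs[42] + 0.01 * rng.standard_normal(8)    # a query almost identical to doc 42
print(top_k_neighbours(q, docs)[0])             # 42
```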
Following the evaluation indexes of the retrieval field, this scheme uses mean reciprocal rank (MRR) and F1@20 as the evaluation metrics of this experiment. MRR, defined in equation (5-6), is the mean of the reciprocal ranks over multiple queries, where |Q| is the number of query instances and $rank_i$ is the rank of the first correct answer of the i-th query:

$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i} \quad (5\text{-}6)$

Let R(q) be the set of cited documents of query document q in the training set and T(q) the citation-recommendation list generated when recommending for q. With P denoting precision and R recall, the formulas are given in (5-7) to (5-9):

$P = \frac{|R(q) \cap T(q)|}{|T(q)|} \quad (5\text{-}7)$

$R = \frac{|R(q) \cap T(q)|}{|R(q)|} \quad (5\text{-}8)$

$F1 = \frac{2PR}{P + R} \quad (5\text{-}9)$
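The two metrics can be computed in a few lines; the ranks and document ids below are toy values:

```python
def mrr(ranks):
    """Mean reciprocal rank over |Q| queries; rank_i is the 1-based first-hit position."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def f1_at_k(recommended, relevant, k=20):
    """F1 over the top-k recommendations versus the true citation set R(q)."""
    hits = len(set(recommended[:k]) & set(relevant))
    if hits == 0:
        return 0.0
    p = hits / len(recommended[:k])   # precision over the returned list T(q)
    r = hits / len(relevant)          # recall over the true citations R(q)
    return 2 * p * r / (p + r)

print(round(mrr([1, 2, 4]), 3))                       # (1 + 0.5 + 0.25) / 3 = 0.583
print(f1_at_k(["a", "b", "c"], ["a", "c", "d"], k=3))
```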
BM25 and ClusCite are chosen as baseline models and compared with the model of this scheme on the two indexes F1@20 and MRR; the results are shown in the table below. The "candidate model" in the table uses only the first stage; "+ranking model" is the two-stage model, which further re-ranks the candidate set generated in the first stage by the content of the candidates; "+source data" is citation recommendation with the added influence of source data such as author, venue, and keywords. From the experimental results, the candidate model's citation recommendation on both datasets already beats the baseline models, showing that the word-vector-based vector-space model achieves better results than the traditional probabilistic model and the graph-network model. The "+ranking model" results improve further, 2 percentage points above the model using only one stage. The best results come after "+source data": about 20% and 25% improvement over the F1 and MRR values of the baseline model respectively.
[Table: F1@20 and MRR results of the baseline and proposed models on the DBLP and PubMed datasets]
From the above experimental results it can also be found that adding information such as document source data does not obviously improve the model. This supports the preceding analysis: the content carries the main gist of a document, the contribution of source data to document semantics is not significant, and in practice an original document such as a manuscript is likely to lack such data, which would affect the robustness of the model. To gauge the influence of each kind of interaction-feature information on the ranking result, this scheme runs exclusion experiments on the three similarity-computation signals, removing in turn the citation-count information, the neural-network matching model, and the common-word statistics. The results on the PubMed dataset are shown in the table below.
[Table: ranking-model ablation results on the PubMed dataset]
The experimental results in the table show that the citation-count information between documents and the neural matching model contribute most to the ranking model: adding the documents' citation-count information raises the F1 and MRR indexes by 0.033 and 0.118 respectively, and replacing the cosine-similarity model with the neural-network matching model raises them by 0.016 and 0.007.
Experiments show that, compared with the platform's existing methods, the deep-learning citation-recommendation model proposed by this scheme improves greatly on both indexes. The main reasons are:
(1) The platform's recommendation methods do not make use of the content information of the documents and are usually based on keywords or title data, so topic and keyword drift easily occurs.
(2) The platform's recommendation methods use traditional retrieval-model strategies, ignore the semantic information of the document content, and struggle to mine the inherent relevance of text content when facing professional scientific-literature resources.
The invention provides an improved scientific-resource text inference-matching algorithm aimed at the characteristics of scientific-literature resources: strong domain specialization, heavy noise, and long texts. On this basis, to support active recommendation services in professional fields with accurate citation-recommendation results, a content-based deep-learning citation-recommendation method is proposed on top of the deep text inference-matching algorithm. The method first uses a text-based vector-space model to rapidly screen the document set matching the target document, narrowing the recommendation range and reducing the computational load of the model, and then uses the text-inference-matching algorithm to rank the candidate document set more accurately, achieving a more personalized recommendation effect.
The above-mentioned embodiments only express specific embodiments of the present invention, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention.

Claims (10)

1. The scientific and technological literature citation recommendation method based on deep learning is characterized by comprising the following steps:
(1) a text processing module: sequentially carrying out information extraction, noise removal, word list index construction and vectorization processing on the text data;
(2) the text reasoning matching module: firstly, performing semantic vectorization representation on a text to be matched through Bi-LSTM, then performing interactive coding on the text by adopting an Attention mechanism, and finally performing feature extraction on interactive information of the text through a multilayer CNN network so as to obtain final matching degree information of the text;
(3) citation recommendation module: and adopting two-stage citation recommendation, wherein a related citation recommendation set is generated in the first stage through vector space similarity of texts, and a text reasoning matching method is used in the second stage to carry out language understanding on the candidate set so as to obtain an accurate relevance ranking list.
2. The method for recommending scientific and technical literature citations based on deep learning as claimed in claim 1, wherein the information extraction in step (1) is to determine the type of the scientific literature according to the source-data information of the input text, extract the title, abstract, and keyword descriptions, and then extract the corresponding text content.
3. The deep learning-based scientific literature citation recommendation method according to claim 1, wherein the noise removal in step (1) is to denoise the text by using a re regularization module built in a python language.
4. The deep learning-based scientific and technical literature citation recommendation method according to claim 1, wherein the vectorization processing in the step (1) is Word2vec as a distributed Word vector technology.
5. The scientific and technical literature citation recommendation method based on deep learning as claimed in claim 4, wherein the model structure of Word2vec takes the predicted word $w_t$ as the center and selects K consecutive words forward and backward as predicted words; $w_t$ is then the input word and the surrounding context words are the predicted words. Assuming K = 2, the context words are $w_{t-2}$, $w_{t-1}$ and $w_{t+1}$, $w_{t+2}$. A word-vector matrix $E \in \mathbb{R}^{V \times m}$ is initialized, V being the number of words in the vocabulary and m the dimensionality of the word vectors; the vector of the input word is looked up according to the index of the vector matrix, each row of which corresponds to the word vector e of the word with the current sequence number. The computation process is shown in formulas (3-4) to (3-6):

$h = E(w_t) \quad (3\text{-}4)$

$e = Wh \quad (3\text{-}5)$

$\hat{y} = \mathrm{softmax}(e) \quad (3\text{-}6)$
6. the scientific and technical literature citation recommendation method based on deep learning of claim 1, characterized in that the text inference matching module adopts an interactive inference matching model architecture, and is composed of an input layer, an interactive coding layer and a matching output layer, wherein the input layer carries out word segmentation processing on two sections of texts respectively, the segmented texts are represented by vectors, and words have semantic expression ability by using pre-trained word vectors; on an interactive coding layer, two sections of texts are coded and expressed by using bidirectional LSTM (Bi-LSTM) respectively to obtain integral semantic expression of the texts, and meanwhile, Attention operation is carried out on an LSTM hidden vector sequence to enable interactive similarity matching to be carried out between local texts, and local importance information of the texts can be obtained according to Attention scores; after the interactive text vector sequence is obtained, matching operation is carried out on the two text sequences in a point multiplication mode to obtain a matching tensor which contains all local matching information characteristics of the two texts; finally, performing feature extraction work on the matching tensor at a feature extraction layer by utilizing the advantage of the step-by-step feature extraction of the CNN, wherein the extracted CNN feature layer contains high-order matching information of two sections of texts; the output layer outputs the matching result of the two short texts through a full-connection feedforward neural network.
7. The deep learning-based scientific and technical literature citation recommendation method according to claim 6, wherein in the input layer p = (p_1, p_2, ..., p_lp) and q = (q_1, q_2, ..., q_lq) respectively denote the two natural-language sentences, p being the premise and q the hypothesis, where p_i, q_i ∈ R^d are pre-trained word vectors, d denotes the dimension of the word vectors, l_p denotes the length of sentence p, and l_q the length of sentence q;
in the interactive coding layer, the hidden (output) state generated by the Bi-LSTM at time i for the input sequence p is written p̄_i, and the hidden (output) state generated at time j for the input sequence q is written q̄_j, computed by the following formulas:
p̄_i = BiLSTM(p, i)
q̄_j = BiLSTM(q, j)
for inference and matching between the two sentences, two attention mechanisms are adopted to enrich the sentence representations: the first is a self-attention mechanism, which computes the attention weight of each word with respect to the other words of its own sentence; the second is an interactive attention mechanism, which first computes the attention weights of each word in sentence p against all words in sentence q and, symmetrically, the attention weights of each word in sentence q against all words in sentence p; the similarity between words is computed by dot product, yielding a two-dimensional similarity matrix, with the calculation formulas defined as follows:
[self-attention formula, rendered only as an image in the original]
M_pq = p̄ q̄^T (4-15)
M_qp = q̄ p̄^T (4-16)
where M_pq denotes the attention matrix of p over q and M_qp the attention matrix of q over p; the attention weights are normalized, and each word vector is re-expressed as a weighted sum:
p̃_i = Σ_j softmax(M_pq)_ij · q̄_j (4-17)
q̃_j = Σ_i softmax(M_qp)_ji · p̄_i (4-18)
where p̃_i is the result of the weighted summation of the states q̄_j, and q̃_j is the result of the weighted summation of the states p̄_i;
then the element-wise differences between the attention-based and hidden vectors are computed, and finally the vectors are concatenated together as the final representation of each word, as follows:
t_p = [p̄; p̃; p̄ − p̃; p̄ ⊙ p̃] (4-19)
t_q = [q̄; q̃; q̄ − q̃; q̄ ⊙ q̃] (4-20)
where ⊙ denotes the element-wise product of two vectors, i.e. the product of their corresponding elements; after these steps, the computed word vectors contain both the semantic information of each word within its own sentence and the information about what differs and what agrees between the two sentences;
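The interactive attention and concatenation steps above can be sketched in NumPy as follows; the sentence lengths, hidden size, and random matrices stand in for real Bi-LSTM outputs:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
lp, lq, d = 3, 4, 5               # sentence lengths and hidden size (illustrative)
p_bar = rng.normal(size=(lp, d))  # stand-in for Bi-LSTM states of sentence p
q_bar = rng.normal(size=(lq, d))  # stand-in for Bi-LSTM states of sentence q

M_pq = p_bar @ q_bar.T            # dot-product similarity matrix (lp x lq)
p_tilde = softmax(M_pq, axis=1) @ q_bar    # each word of p as a weighted sum over q
q_tilde = softmax(M_pq.T, axis=1) @ p_bar  # each word of q as a weighted sum over p

# final word representation: [state; attended; difference; element-wise product]
t_p = np.concatenate([p_bar, p_tilde, p_bar - p_tilde, p_bar * p_tilde], axis=1)
t_q = np.concatenate([q_bar, q_tilde, q_bar - q_tilde, q_bar * q_tilde], axis=1)
assert t_p.shape == (lp, 4 * d)
```

Each row of the softmax output sums to one, so p̃_i and q̃_j are convex combinations of the opposite sentence's hidden states.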
the matching output layer matches the word sequences of the two sentences one-to-one in every dimension, with the matching calculation formula:
e_ij = t_p,i ⊙ t_q,j (4-21)
the elements e_ij form a 4-dimensional similarity tensor that represents the matching features of the two sentences along the word-vector dimensions; after the high-order matching features are obtained in the matching layer, they are flattened into a final vector representation, the matching vector is classified by an MLP feedforward neural network whose output is a 3-class vector, and finally a softmax operation over the prediction vector takes the maximum as the final prediction result.
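A minimal sketch of the dimension-wise matching and 3-class prediction described in this claim; the MLP is collapsed to a single linear layer for brevity, and all sizes and weights are illustrative:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    x = x - x.max()
    ex = np.exp(x)
    return ex / ex.sum()

rng = np.random.default_rng(2)
lp, lq, d = 3, 4, 6
t_p = rng.normal(size=(lp, d))    # final word representations of sentence p
t_q = rng.normal(size=(lq, d))    # final word representations of sentence q

# e_ij = t_p[i] ⊙ t_q[j]: the element-wise products form an (lp, lq, d) matching tensor
match = t_p[:, None, :] * t_q[None, :, :]
flat = match.reshape(-1)          # flatten the matching features into one vector

W_out = rng.normal(size=(3, flat.size)) * 0.01  # single linear layer standing in for the MLP
probs = softmax(W_out @ flat)
prediction = int(np.argmax(probs))              # index of the predicted class (3 classes)
```

With a batch axis in front, the matching tensor becomes 4-dimensional, matching the claim's description.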
8. The deep learning-based scientific and technical literature citation recommendation method according to claim 7, wherein the citation recommendation module comprises two modules in two stages: a citation candidate module of the first stage and a candidate sorting module of the second stage.
9. The scientific and technical literature citation recommendation method based on deep learning according to claim 8, wherein the citation candidate module uses the document embedding model to calculate the document vector of the query document d_q, which corresponds to the document's position in the semantic vector space; the K most similar documents in the vector space are selected as candidates, the cosine similarity between the embedding vectors of d_q and d_i is used as the document matching score, and the out-citations of the K nearest-neighbour documents are added as candidates as well, wherein
(1) Definition of the model: a supervised neural network model is adopted to represent the embedding vectors of the words of the text content; the text is segmented into words, the embedding vector of the whole text is then computed from the word-vector features, and the feature vector of a text field is denoted F_d[title], with the formula:
F_d[field] = Σ_t w_t^mag · (w_t^dir / ‖w_t^dir‖) (5-1)
where w_t^dir is the distributed word vector of word t, w_t^mag represents the weight value of the word in the text, and w_t^dir / ‖w_t^dir‖ normalizes the embedding vector of each word; the final text representation vector is obtained by the weighted summation of the normalized word embeddings; the title and abstract information of document d is used as the target text for contextualization, with the formula:
e_d = α · F_d[title] + β · F_d[abstract] (5-2)
where α and β are the weight values of the title and the abstract respectively, and are scalar parameters;
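The weighted word-embedding scheme of this claim can be sketched as follows; the normalization of each w_t^dir and the α, β combination follow the text, while the word counts, weights, and α, β values are illustrative:

```python
import numpy as np

def embed_text(word_dirs, word_mags):
    """Weighted sum of L2-normalised word vectors: sum_t w_mag * (w_dir / ||w_dir||)."""
    normed = word_dirs / np.linalg.norm(word_dirs, axis=1, keepdims=True)
    return (word_mags[:, None] * normed).sum(axis=0)

rng = np.random.default_rng(4)
dim = 6
title_vecs = rng.normal(size=(4, dim))          # word vectors of the title words
title_w = rng.uniform(0.5, 2.0, size=4)         # per-word magnitude weights
abs_vecs = rng.normal(size=(10, dim))           # word vectors of the abstract words
abs_w = rng.uniform(0.5, 2.0, size=10)

alpha, beta = 0.6, 0.4                          # scalar title/abstract weights (illustrative)
e_d = alpha * embed_text(title_vecs, title_w) + beta * embed_text(abs_vecs, abs_w)
```

In the trained model, the magnitudes, word vectors, and the scalars α and β are all learned parameters rather than fixed constants.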
(2) Training of the model: the data set takes the form of triplets T = (d_q, d^+, d^-), where d_q is a query document, d^+ denotes a document cited in d_q, and d^- denotes a document not cited in d_q; the loss function of the model is computed as follows:
loss = max(λ − s(d_q, d^+) + s(d_q, d^-), 0) (5-3)
where s(d_i, d_j) is defined as the cosine similarity between the text embeddings e_d_i and e_d_j, and λ is a hyper-parameter of the model, set and tuned before training;
(3) Selection of negative examples: for each pair (d_q, d^+), with d^+ a document cited by d_q in the data set, negative samples are drawn by a random method or by the nearest-neighbour negative method.
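The cosine scoring and triplet objective described above can be sketched as follows; the margin value and the random embeddings are illustrative, and the sign convention (cited documents should score higher than non-cited ones by at least the margin) follows the standard triplet hinge loss:

```python
import numpy as np

def cosine(a, b):
    """s(d_i, d_j): cosine similarity between two document embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(e_q, e_pos, e_neg, margin=0.1):
    """loss = max(margin - s(d_q, d+) + s(d_q, d-), 0): zero once the cited
    document outscores the non-cited one by at least `margin`."""
    return max(margin - cosine(e_q, e_pos) + cosine(e_q, e_neg), 0.0)

rng = np.random.default_rng(5)
e_q, e_pos, e_neg = rng.normal(size=(3, 8))   # query, cited, and non-cited embeddings
loss = triplet_loss(e_q, e_pos, e_neg)
```

Negative sampling then chooses e_neg either uniformly at random or from the query's nearest non-cited neighbours, the latter giving harder training examples.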
10. The deep learning-based scientific and technical literature citation recommendation method according to claim 9, wherein the candidate ranking module comprises the steps of:
(1) Model structure: the output layer of the model is defined as:
s(d_i, d_j) = MLP_Forward(h) (5-4)
[formula (5-5), defining the pairwise feature vector h, rendered only as an image in the original]
where MLP_Forward is a three-layer feedforward neural network followed by a sigmoid layer;
(2) Model training: the parameters of the model include w^mag, w^dir, and the parameters of the neural network layers; the loss function of the model is the same as that of the citation candidate module; in the testing stage, the model is used to predict the candidate documents with the highest scores.
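The scoring function s(d_i, d_j) = MLP_Forward(h) can be sketched as follows; since the definition of the feature vector h appears only as an image in the original, h is treated here as an opaque pairwise feature vector, and the layer sizes and ReLU activations are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(h, weights):
    """Three feedforward layers followed by a sigmoid scoring layer."""
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)     # ReLU hidden layers (assumed activation)
    W, b = weights[-1]
    return float(sigmoid((W @ h + b)[0]))  # scalar relevance score in (0, 1)

rng = np.random.default_rng(6)
dims = [16, 8, 8, 1]                       # illustrative layer sizes
weights = [(rng.normal(size=(dims[i + 1], dims[i])) * 0.1, np.zeros(dims[i + 1]))
           for i in range(3)]

h = rng.normal(size=16)                    # stand-in pairwise feature vector of (d_i, d_j)
score = mlp_forward(h, weights)
```

At test time, the candidates recalled in the first stage are scored this way and returned in descending order of score.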
CN202110525982.6A 2021-05-14 2021-05-14 Scientific and technological literature citation recommendation method based on deep learning Active CN113239181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110525982.6A CN113239181B (en) 2021-05-14 2021-05-14 Scientific and technological literature citation recommendation method based on deep learning

Publications (2)

Publication Number Publication Date
CN113239181A true CN113239181A (en) 2021-08-10
CN113239181B CN113239181B (en) 2023-04-18

Family

ID=77134156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110525982.6A Active CN113239181B (en) 2021-05-14 2021-05-14 Scientific and technological literature citation recommendation method based on deep learning

Country Status (1)

Country Link
CN (1) CN113239181B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893585A (en) * 2016-04-05 2016-08-24 电子科技大学 Label data-based bipartite graph model academic paper recommendation method
CN108763354A (en) * 2018-05-16 2018-11-06 浙江工业大学 A kind of academic documents recommendation method of personalization
US20180373787A1 (en) * 2017-06-21 2018-12-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Method for recommending text content based on concern, and computer device
CN109145190A (en) * 2018-08-27 2019-01-04 安徽大学 A kind of local quotation recommended method and system based on neural machine translation mothod
US20190325068A1 (en) * 2018-04-19 2019-10-24 Adobe Inc. Generating and utilizing classification and query-specific models to generate digital responses to queries from client device
CN111582443A (en) * 2020-04-22 2020-08-25 成都信息工程大学 Recommendation method based on Mask mechanism and level attention mechanism
CN111581401A (en) * 2020-05-06 2020-08-25 西安交通大学 Local citation recommendation system and method based on depth correlation matching


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JIWEI TAN et al.: "A Neural Network Approach to Quote Recommendation in Writings", CIKM '16: Proceedings of the 25th ACM International Conference on Information and Knowledge Management *
K. P. PUN et al.: "A Cloud-Based Fuzzy Multi-Criteria Decision Support System for Procurement Process in Facility Management", 2018 Portland International Conference on Management of Engineering and Technology (PICMET) *
ZUO DONGZHOU: "Research on a deep-learning-based citation recommendation method for scientific and technical literature", China Masters' Theses Full-text Database, Information Science and Technology Series *
LU YONGHE et al.: "A deep-learning-based classification model for citation relations of scientific papers", Modern Information *
CHEN ZHITAO: "Research on deep-learning-based personalized citation search and recommendation algorithms", China Masters' Theses Full-text Database, Information Science and Technology Series *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392191A (en) * 2021-08-18 2021-09-14 中关村科学城城市大脑股份有限公司 Text matching method and device based on multi-dimensional semantic joint learning
CN113392191B (en) * 2021-08-18 2022-01-21 中关村科学城城市大脑股份有限公司 Text matching method and device based on multi-dimensional semantic joint learning
CN113705242A (en) * 2021-08-27 2021-11-26 齐鲁工业大学 Intelligent semantic matching method and device for education consultation service
CN113705241A (en) * 2021-08-27 2021-11-26 齐鲁工业大学 Intelligent semantic matching method and device based on multi-view attention and oriented to college entrance examination consultation
CN113705241B (en) * 2021-08-27 2023-08-15 齐鲁工业大学 Intelligent semantic matching method and device based on multi-view attention for college entrance examination consultation
CN113705242B (en) * 2021-08-27 2023-08-15 齐鲁工业大学 Intelligent semantic matching method and device for education consultation service
CN113779996B (en) * 2021-08-31 2023-10-10 中国中医科学院中医药信息研究所 Standard entity text determining method and device based on BiLSTM model and storage medium
CN113779996A (en) * 2021-08-31 2021-12-10 中国中医科学院中医药信息研究所 Standard entity text determination method and device based on BilSTM model and storage medium
CN113836892A (en) * 2021-09-08 2021-12-24 灵犀量子(北京)医疗科技有限公司 Sample size data extraction method and device, electronic equipment and storage medium
CN113836892B (en) * 2021-09-08 2023-08-08 灵犀量子(北京)医疗科技有限公司 Sample size data extraction method and device, electronic equipment and storage medium
CN113836884A (en) * 2021-09-22 2021-12-24 福建新大陆软件工程有限公司 Official document template recommendation method and system
CN113935329A (en) * 2021-10-13 2022-01-14 昆明理工大学 Asymmetric text matching method based on adaptive feature recognition and denoising
CN114048285A (en) * 2021-10-22 2022-02-15 盐城金堤科技有限公司 Fuzzy retrieval method, device, terminal and storage medium
CN114282646B (en) * 2021-11-29 2023-08-25 淮阴工学院 Optical power prediction method and system based on two-stage feature extraction and BiLSTM improvement
CN114282646A (en) * 2021-11-29 2022-04-05 淮阴工学院 Light power prediction method and system based on two-stage feature extraction and improved BilSTM
CN114429129A (en) * 2021-12-22 2022-05-03 南京信息工程大学 Literature mining and material property prediction method
CN114492451B (en) * 2021-12-22 2023-10-24 马上消费金融股份有限公司 Text matching method, device, electronic equipment and computer readable storage medium
CN114492451A (en) * 2021-12-22 2022-05-13 马上消费金融股份有限公司 Text matching method and device, electronic equipment and computer readable storage medium
CN114492450A (en) * 2021-12-22 2022-05-13 马上消费金融股份有限公司 Text matching method and device
CN114817501A (en) * 2022-04-27 2022-07-29 马上消费金融股份有限公司 Data processing method, data processing device, electronic equipment and storage medium
CN114912033A (en) * 2022-05-16 2022-08-16 重庆大学 Knowledge graph-based recommendation popularity deviation adaptive buffering method
CN114818660A (en) * 2022-06-30 2022-07-29 北京邮电大学 Cross-media scientific research resource feature extraction model training and feature extraction method and device
CN115659047A (en) * 2022-11-11 2023-01-31 南京汇宁桀信息科技有限公司 Medical literature retrieval method based on hybrid algorithm
CN115659047B (en) * 2022-11-11 2023-07-28 南京汇宁桀信息科技有限公司 Medical document retrieval method based on hybrid algorithm
CN116957140A (en) * 2023-06-29 2023-10-27 易方达基金管理有限公司 Stock prediction method and system based on NLP (non-linear point) factors
CN116611452A (en) * 2023-07-19 2023-08-18 青岛大学 Method for recommending API (application program interface) according to natural language description
CN116611452B (en) * 2023-07-19 2023-10-24 青岛大学 Method for recommending API (application program interface) according to natural language description
CN116909991A (en) * 2023-09-12 2023-10-20 中国人民解放军总医院第六医学中心 NLP-based scientific research archive management method and system
CN116909991B (en) * 2023-09-12 2023-12-12 中国人民解放军总医院第六医学中心 NLP-based scientific research archive management method and system
CN116991979A (en) * 2023-09-27 2023-11-03 中国科学院文献情报中心 Matching method and device based on explicit semantic content
CN116991979B (en) * 2023-09-27 2023-12-01 中国科学院文献情报中心 Matching method and device based on explicit semantic content
CN117455518A (en) * 2023-12-25 2024-01-26 连连银通电子支付有限公司 Fraudulent transaction detection method and device
CN117455518B (en) * 2023-12-25 2024-04-19 连连银通电子支付有限公司 Fraudulent transaction detection method and device

Also Published As

Publication number Publication date
CN113239181B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN113239181B (en) Scientific and technological literature citation recommendation method based on deep learning
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
Kokab et al. Transformer-based deep learning models for the sentiment analysis of social media data
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
Onan SRL-ACO: A text augmentation framework based on semantic role labeling and ant colony optimization
Chen et al. A novel feature extraction methodology for sentiment analysis of product reviews
Zhang et al. Aspect-based sentiment analysis for user reviews
Shanmugavadivel et al. An analysis of machine learning models for sentiment analysis of Tamil code-mixed data
Subramanian et al. A survey on sentiment analysis
Dastgheib et al. The application of deep learning in persian documents sentiment analysis
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
Xiang et al. A survey of implicit discourse relation recognition
Zhao et al. Sentimental prediction model of personality based on CNN-LSTM in a social media environment
Samih et al. Enhanced sentiment analysis based on improved word embeddings and XGboost.
Yildirim A novel grid-based many-objective swarm intelligence approach for sentiment analysis in social media
Başarslan et al. Sentiment analysis with ensemble and machine learning methods in multi-domain datasets
Shirzad et al. Deep Learning approach for text, image, and GIF multimodal sentiment analysis
Suresh Kumar et al. Sentiment lexicon for cross-domain adaptation with multi-domain dataset in Indian languages enhanced with BERT classification model
Abd et al. Categorization of Arabic posts using Artificial Neural Network and hash features
Dien et al. Novel approaches for searching and recommending learning resources
Yan et al. Sentiment analysis for microblog related to finance based on rules and classification
CN115169429A (en) Lightweight aspect-level text emotion analysis method
Salloum et al. Analysis and classification of customer reviews in arabic using machine learning and deep learning
Lokman et al. A conceptual IR chatbot framework with automated keywords-based vector representation generation
Li et al. RSCOEWR: Radical-Based Sentiment Classification of Online Education Website Reviews

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230329

Address after: 610000 No. 2006 West Yuan Road, Chengdu high tech Zone (West District), Sichuan

Applicant after: University of Electronic Science and Technology of China

Address before: Room 1906, 19 / F, building 37, east yard, No. 5, Section 2, Jianshe North Road, Chenghua District, Chengdu, Sichuan 610051

Applicant before: Liao Weizhi

Applicant before: Zuo Dongzhou

GR01 Patent grant