CN109145190B - Local citation recommendation method and system based on neural machine translation technology - Google Patents
Local citation recommendation method and system based on neural machine translation technology
- Publication number
- CN109145190B (application CN201810994562.0A / CN201810994562A)
- Authority
- CN
- China
- Prior art keywords
- word
- quotation
- list
- context
- citation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a local citation recommendation method and system based on neural machine translation technology. The method comprises: performing citation extraction, lemmatization and word-frequency-based data cleaning on an original data set to obtain a parallel corpus of citation contexts and cited-article titles, and constructing an initial candidate-article list library (the list of articles available to be cited); embedding the words appearing in the citation contexts and cited-article titles into a low-dimensional semantic space with the skip-gram model of a word-vector model combined with negative sampling to obtain word vectors; constructing an encoder of bidirectional gated recurrent units with an attention mechanism and a decoder of gated recurrent units, converting the citation contexts of the parallel corpus into word vectors through the word-vector model as model input, and training the model with the cited-article titles as output; computing the cosine similarity, one by one, between the seed title output by the encoder-decoder framework and every article title in the candidate-article list library; and selecting, filtered by publication year, the articles that meet the requirements as the recommendation list.
Description
Technical Field
The invention relates to the technical field of information retrieval, and in particular to a local citation recommendation method and system based on neural machine translation.
Background
With the rapid development of internet technology, a large number of new scientific research articles are published every year, and quickly finding the needed documents among this mass of literature has become a major difficulty. Given a passage of context, local citation recommendation can quickly build an intelligent model over its semantics and content, helping researchers quickly find citable documents related to their field from the mass of literature, or directly recommending citable documents, thereby saving a great amount of the time spent searching for related literature in scientific research. Local citation recommendation therefore plays a non-negligible role in scientific work.
Many researchers have studied this problem in recent years. The work falls into two categories: global citation recommendation, which recommends citations for an article as a whole, and local citation recommendation, which recommends citations for a passage of context within an article. The research methods used are generally based on text similarity, topic models, translation models, collaborative filtering, deep learning, and some other approaches.
Neural machine translation is an encoder-decoder framework proposed by Google in 2014 that has made great progress on the machine translation problem.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a local citation recommendation method based on neural machine translation technology, so as to improve the accuracy of the translation from citation contexts to cited-article titles.
A local citation recommendation method based on a neural machine translation technology comprises the following steps:
S1, performing citation extraction, lemmatization and word-frequency-based data cleaning on the original data set to obtain a parallel corpus of citation contexts and cited-article titles, and constructing an initial candidate-article list library;
S2, embedding the words appearing in the citation contexts and cited-article titles into a low-dimensional semantic space to obtain word vectors, using the skip-gram model of a word-vector model combined with negative sampling, so that semantically similar words lie closer together in the embedding space;
S3, based on neural machine translation technology, constructing an encoder of bidirectional gated recurrent units with an attention mechanism and a decoder of gated recurrent units, converting the citation contexts of the parallel corpus into word vectors through the word-vector model as model input, and training the model with the cited-article titles as output;
S4, computing the cosine similarity, one by one, between the seed title output by the encoder-decoder framework and every article title in the candidate-article list library;
S5, removing, according to publication year, the articles published after the year of the article containing the citation context, and selecting the articles whose similarity meets the requirement as the recommendation list.
As a preferable solution of the above technical solution, step S1 specifically includes:
extracting all English citation contexts, removing invalid symbols, keeping the citation contexts whose word count lies within a set range, and lemmatizing the words; counting word frequencies and keeping the words ranked above a set threshold, replacing the remaining words with "<UNK>" and padding with "<PAD>" up to the set length; and extracting the cited-article titles according to the citation contexts and performing a similar cleaning operation.
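The cleaning operation above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name is invented, lemmatization is omitted for brevity, and the 10–28-word range and 10,000-word vocabulary are the defaults used in the embodiment described later.

```python
from collections import Counter

UNK, PAD = "<UNK>", "<PAD>"

def clean_contexts(raw_contexts, min_len=10, max_len=28, vocab_size=10000):
    """Keep contexts whose word count lies in [min_len, max_len], build a
    frequency-ranked vocabulary, replace out-of-vocabulary words with <UNK>,
    and right-pad short contexts to max_len with <PAD>."""
    kept = [c.split() for c in raw_contexts
            if min_len <= len(c.split()) <= max_len]
    freq = Counter(w for ctx in kept for w in ctx)
    vocab = {w for w, _ in freq.most_common(vocab_size)}
    cleaned = []
    for ctx in kept:
        toks = [w if w in vocab else UNK for w in ctx]
        toks += [PAD] * (max_len - len(toks))   # right-pad to a fixed length
        cleaned.append(toks)
    return cleaned, vocab
```

The same routine, run over the cited-article titles, would produce the title side of the parallel corpus.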
As a preferable solution of the above technical solution, step S2 specifically includes:
S21, splitting the sentence into a number of (input word, output word) pairs according to the word-window size;
S22, converting every word into a 0-1 (one-hot) vector whose size equals that of the vocabulary;
S23, constructing a neural network comprising an input layer, a hidden layer and an output layer;
and S24, training the skip-gram model with negative sampling to back-propagate the error; the weight matrix at the word-embedding layer is the final word-vector representation.
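Steps S21 and S24 can be illustrated with a small sketch. The function names here are invented for illustration, and a real implementation would draw negative samples from the smoothed unigram distribution rather than uniformly:

```python
import random

def skipgram_pairs(tokens, window=2):
    """Step S21: pair each word with every neighbour inside the
    context window, yielding (input word, output word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def negative_samples(vocab, positive, k=5, seed=0):
    """Step S24 (simplified): draw k negative words that differ from the
    true output word; real word2vec samples from unigram^0.75."""
    rng = random.Random(seed)
    pool = [w for w in vocab if w != positive]
    return rng.sample(pool, min(k, len(pool)))
```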
As a preferable solution of the above technical solution, the step S3 specifically includes:
constructing an encoder of bidirectional gated recurrent units with an attention mechanism and a decoder of gated recurrent units to learn the semantic representation of the citation context, and decoding the seed title from the candidate word list on the basis of the understood semantics, forming a seed-title construction model linked by semantic content;
The specific framework of the encoder of bidirectional gated recurrent units with the attention mechanism and the decoder of gated recurrent units is as follows:
the encoder is formed by a network of bidirectional gated recurrent units; at each time step t it receives the vector representation of the t-th word of the input sequence and derives a hidden-layer state h<t>; through the action of the attention mechanism and the hidden-layer states of the output layer, a translation weight is obtained for each input word, from which the final context vector is computed and sent to the decoder to decode words;
the formula of the encoder GRU unit is expressed as follows:
Gu = sigmoid(Wu[h<t-1>, x<t>] + bu)
Gr = sigmoid(Wr[h<t-1>, x<t>] + br)
c̃<t> = tanh(Wc[Gr * h<t-1>, x<t>] + bc)
c<t> = Gu * c̃<t> + (1 − Gu) * h<t-1>, with h<t> = c<t>
wherein Gu is the update gate, Gr is the reset gate, c̃<t> is the candidate hidden-layer variable, c<t> is the hidden-layer variable flowing to the next time step, h<t> is the hidden-layer variable at time t, x<t> is the input at time t, bu, br and bc are biases, sigmoid and tanh are activation functions, and Wu, Wr and Wc are weight parameters.
The attention mechanism decoding part process is as follows:
when the decoder decodes the t-th word, it needs the decoder hidden-layer state s<t> at time t, the word y<t-1> decoded at time t−1, and the context vector c<t> coming from the encoder at time t, wherein the decoder hidden-layer state s<t> at time t can be obtained by the following formula:
s<t> = g(y<t-1>, s<t-1>, c<t>)
wherein the context vector c<t> introduced at time t is determined by the encoder hidden-layer variables h<t'> and the translation attention between each encoded word and the decoded word, the formula being as follows:
c<t> = Σ_t' α<t,t'> h<t'>
wherein α<t'> = (α<1,t'>, …, α<T,t'>) is the vector-type attention representing the translation attention of the t'-th encoder word to all the decoder words; its component α<t,t'>, the scalar-type attention representing the translation attention of the t'-th encoder word to the t-th decoder word, can be obtained by the following formula:
α<t,t'> = exp(e<t,t'>) / Σ_t'' exp(e<t,t''>)
wherein the alignment score e<t,t'> can be obtained by the following formula:
e<t,t'> = v^T tanh(Ws s<t-1> + Wh h<t'>)
wherein v^T and Ws, Wh are the parameter weights;
this process is repeated until all words have been decoded; the decoded sequence is the seed title.
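One concrete form of this attention step might look as follows. This is a NumPy sketch under assumptions: the additive tanh scoring with parameters v, Ws and Wh (the parameter names the text mentions) is one standard choice, and the function name and matrix shapes are invented for illustration.

```python
import numpy as np

def attention_context(s_prev, enc_states, Ws, Wh, v):
    """Score each encoder hidden state against the previous decoder state,
    softmax-normalize the scores into attention weights alpha, and return
    the weighted context vector sent to the decoder."""
    # score for each encoder state h: v^T tanh(Ws s_prev + Wh h)
    scores = np.array([v @ np.tanh(Ws @ s_prev + Wh @ h) for h in enc_states])
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()                    # attention weights sum to 1
    context = sum(a * h for a, h in zip(alpha, enc_states))
    return alpha, context
```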
As a preferable solution of the above technical solution, step S4 specifically includes:
S41, calculating the pairwise similarity of the words in the decoding candidate word list and establishing a lexicon-similarity lookup dictionary;
S42, segmenting the seed title and the article titles in the candidate-article list library into words, and computing the similarity between the seed title and each candidate-article title word by word according to the lexicon-similarity lookup dictionary;
S43, accumulating the results of step S42 as the similarity between the seed title and the article;
and S44, sorting the similarity results obtained in step S43 to form the document recommendation list.
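Steps S41–S44 can be sketched as follows. The dictionary-based lookup and the accumulation are illustrative; in practice each entry of the similarity dictionary would hold the cosine similarity of two word vectors, and the function names are invented.

```python
def title_similarity(seed_title, candidate_title, sim_dict):
    """Steps S42-S43: accumulate word-by-word similarities between a decoded
    seed title and one candidate title, looked up in a precomputed
    pairwise-similarity dictionary; unknown word pairs contribute 0."""
    return sum(sim_dict.get((a, b), sim_dict.get((b, a), 0.0))
               for a in seed_title.split() for b in candidate_title.split())

def rank_candidates(seed_title, titles, sim_dict, top_k=20):
    """Step S44: sort candidate titles by accumulated similarity."""
    scored = [(title_similarity(seed_title, t, sim_dict), t) for t in titles]
    return [t for _, t in sorted(scored, reverse=True)[:top_k]]
```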
The invention also provides a local citation recommendation system based on the neural machine translation technology, which is applied to the method and comprises the following steps:
a citation cleaning module for processing the input citation context into the standard input corpus form required by the encoder-decoder framework;
an article expansion module for dynamically expanding the candidate-article list library on the basis of the existing article list library, crawling the latest open-access articles of the relevant document-retrieval platforms in time with web-crawler technology, so that the candidate-article list library for the citation context is more complete and comprehensive;
a candidate-word updating module for recounting word frequencies after the candidate-article list library is updated and dynamically updating the candidate word list used when the decoder decodes the seed title;
and a citation recommendation module for computing the recommended-article list given a citation context.
As a preferred scheme of the above technical solution, the citation cleaning module is specifically configured to:
removing invalid symbols from the citation context, replacing words that do not appear in the vocabulary with "<UNK>", padding with "<PAD>" when the word count falls below the preset range and truncating when it exceeds the range, lemmatizing all words, and then converting all words into word vectors with the pre-trained word-vector model.
As a preferred scheme of the above technical solution, in the article expansion module, the latest open-access articles of the relevant retrieval platforms are crawled with web-crawler technology; data cleaning operations such as citation extraction, lemmatization and word-frequency statistics are performed on the original data set to obtain the parallel corpus of citation contexts and cited-article titles and to construct the initial candidate-article list library, which is then dynamically expanded and maintained.
As a preferred scheme of the above technical solution, in the candidate-word updating module, after the candidate-article list library is updated, the latest global candidate-article list library is segmented into words and the word frequencies are recounted; the candidate word list used when the decoder decodes the seed title is then dynamically updated, so that the candidate-article list and the decoder's candidate word list are kept synchronized.
As a preferred scheme of the above technical solution, in the citation recommendation module, local citation recommendation and neural machine translation are combined: local citation recommendation is expressed as a machine translation problem from a source language (the citation context) to a target language (the cited-article title), and the citation recommendation module computes the recommended-article list given a citation context.
The invention provides a novel local citation recommendation method. Compared with traditional citation recommendation methods, it combines citation recommendation with neural machine translation by constructing parallel corpus pairs of citation contexts and cited-article titles, treating local citation recommendation as a machine translation problem from the citation context (source language) to the cited-article title (target language), so that the citation context and the article title have stronger semantic consistency; finally, cosine similarity is computed between the translated seed title and the articles in the candidate-article list library to obtain the list of articles to cite.
The invention embeds the words of the citation contexts and cited-article titles into a low-dimensional semantic space, so that semantically similar words lie closer together; it constructs an encoder of bidirectional gated recurrent units with an attention mechanism and a decoder of gated recurrent units, and greatly improves the accuracy of the translation from citation contexts to cited-article titles by computing the influence weights (attention) between encoded and decoded words one by one; furthermore, cosine similarity is computed one by one between the decoded seed title and the article titles in the candidate-article list library, and the articles that meet the requirements are selected as the recommended-article list, which greatly reduces the coupling between the citation-to-title translation and the article recommendation, allowing the two tasks to proceed independently.
Drawings
FIG. 1 is a schematic diagram of the steps of a local citation recommendation method based on neural machine translation technology;
FIG. 2 is a functional diagram of a local citation recommendation system based on neural machine translation technology;
FIG. 3 is a logic diagram of step S2 in a local citation recommendation method based on neural machine translation technology;
FIG. 4 is a logic block diagram of step S3 in the local citation recommendation method based on neural machine translation technology.
Detailed Description
As shown in fig. 1 and fig. 2, fig. 1 and fig. 2 are a method and a system for local citation recommendation based on neural machine translation according to the present invention.
Referring to fig. 1, the present invention provides a local citation recommendation method based on neural machine translation, including the following steps:
S1, performing data cleaning operations such as citation extraction, lemmatization and word-frequency statistics on the original data set to obtain a parallel corpus of citation contexts and cited-article titles, and constructing an initial candidate-article list library;
in this embodiment, all English citation contexts are extracted, invalid symbols are removed, the citation contexts with 10 to 28 words are kept, and the words are lemmatized; word frequencies are counted, the top 10,000 words are kept, other words are replaced with "<UNK>", and contexts shorter than 28 words are padded with "<PAD>"; the cited-article titles are extracted according to the citation contexts and given a similar cleaning operation.
In the actual operation process, step S1 specifically includes the following steps:
S11, extracting, according to the citation positions in the citation contexts, the initial article-title data corresponding to each citation context from the original data with a dictionary-matching algorithm, and building the correspondence between the original citation contexts and the cited-article titles with a context-title knowledge-base linking algorithm;
S12, keeping all English citation contexts with the help of a built stop-symbol knowledge base, and removing all invalid characters, including escape symbols, punctuation, formula symbols, special symbols and the like, to form a citation-context representation linked by words;
S13, segmenting the citation contexts with a word-segmentation library, counting word frequencies, keeping the 10,000 most frequent words, and building the encoding vocabulary;
S14, traversing all citation contexts, replacing words not in the encoding vocabulary with "<UNK>", truncating citation contexts longer than 28 words and padding those shorter than 28 words with "<PAD>", generating citation contexts in the standard format;
S15, performing operations similar to S12, S13 and S14 on the cited-article titles, and generating the parallel corpus of citation contexts and cited-article titles according to the correspondence extracted in S11;
S2, embedding the words appearing in the citation contexts and cited-article titles into a low-dimensional semantic space to obtain word vectors, using the skip-gram model of a word-vector model combined with negative sampling, so that semantically similar words lie closer together in the embedding space;
in the actual operation process, step S2 specifically includes the following steps:
S21, splitting the sentence into a number of (input word)–(output word) pairs according to the word-window size;
S22, converting every word into a 0-1 (one-hot) vector whose size equals that of the vocabulary (10,000 words);
S23, constructing a neural network comprising an input layer (accepting 10,000-dimensional 0-1 vectors), a hidden layer of 100 neurons (the word-vector dimension) and an output layer of 10,000 neurons, whose structure is shown in FIG. 3;
and S24, training the skip-gram model with negative sampling to back-propagate the error; the 10,000 × 100 weight matrix at the word-embedding layer is the final word-vector representation.
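The role of the embedding matrix can be made concrete with a short sketch, using the 10,000-word vocabulary and the 100-neuron hidden layer (the word-vector dimension) of step S23. The random weights stand in for trained ones, and the names are invented for illustration:

```python
import numpy as np

V, D = 10000, 100   # vocabulary size and word-vector dimension of this embodiment

rng = np.random.default_rng(42)
W_embed = rng.standard_normal((V, D)) * 0.01   # input-to-hidden weight matrix

def one_hot(idx, size=V):
    """0-1 vector of vocabulary size with a single 1 at position idx."""
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def embed(word_idx):
    """Hidden-layer activation for one word: multiplying a one-hot input by
    the weight matrix selects a single row, which is that word's vector."""
    return one_hot(word_idx) @ W_embed
```

This is why, after training, the rows of the input-to-hidden weight matrix serve directly as the word vectors.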
S3, constructing an encoder with a bidirectional gate control circulation unit with an attention mechanism and a decoder framework with the gate control circulation unit based on a neural machine translation technology, converting the quotation context in the parallel corpus into word vectors through a word vector model and then using the word vectors as the input of the model, and training the model by using the quotation chapter titles as the output;
in the present embodiment, the neural machine translation technology and the local citation recommendation are combined, and the local citation recommendation is expressed as a machine translation problem from a source language (citation context) to a target language (cited chapter heading). And constructing a coder with an attention mechanism bidirectional gating cycle unit and a decoder framework of the gating cycle unit to learn semantic representation of the quotation context, and mining and decoding the seed titles from the candidate word list on the basis of semantic understanding to form a seed title construction model taking semantic content as linkage.
In step S3, a citation context refers to a plurality of articles, which are processed in the form of a plurality of parallel corpora.
In the actual operation process, step S3 specifically includes:
constructing an encoder of bidirectional gated recurrent units with an attention mechanism and a decoder of gated recurrent units, as shown in FIG. 4:
the encoder is formed by a network of bidirectional gated recurrent units; at each time step t it receives the vector representation of the t-th word of the input sequence and derives a hidden-layer state h<t>; through the action of the attention mechanism and the hidden-layer states of the output layer, a translation weight is obtained for each input word, from which the final context vector is computed and sent to the decoder to decode words.
The formula of the encoder GRU unit is expressed as follows:
Gu = sigmoid(Wu[h<t-1>, x<t>] + bu)   (update gate)
Gr = sigmoid(Wr[h<t-1>, x<t>] + br)   (reset gate)
c̃<t> = tanh(Wc[Gr * h<t-1>, x<t>] + bc)   (candidate hidden-layer variable)
c<t> = Gu * c̃<t> + (1 − Gu) * h<t-1>, with h<t> = c<t>
wherein h<t> is the hidden-layer variable at time t, x<t> is the input at time t, bu, br and bc are biases, sigmoid and tanh are activation functions, and Wu, Wr and Wc are weight parameters.
The attention mechanism decoding part process is as follows:
when the decoder decodes the t-th word, it needs the decoder hidden-layer state s<t> at time t, the word y<t-1> decoded at time t−1, and the context vector c<t> coming from the encoder at time t, wherein the decoder hidden-layer state s<t> at time t can be obtained by the following formula:
s<t> = g(y<t-1>, s<t-1>, c<t>)
wherein the context vector c<t> introduced at time t is determined by the encoder hidden-layer variables h<t'> and the translation attention between each encoded word and the decoded word, the formula being as follows:
c<t> = Σ_t' α<t,t'> h<t'>
wherein α<t'> = (α<1,t'>, …, α<T,t'>) is the vector-type attention representing the translation attention of the t'-th encoder word to all the decoder words; its component α<t,t'>, the scalar-type attention representing the translation attention of the t'-th encoder word to the t-th decoder word, can be obtained by the following formula:
α<t,t'> = exp(e<t,t'>) / Σ_t'' exp(e<t,t''>)
wherein the alignment score e<t,t'> can be obtained by the following formula:
e<t,t'> = v^T tanh(Ws s<t-1> + Wh h<t'>)
wherein v^T and Ws, Wh are the parameter weights.
This process is repeated until all words have been decoded; the decoded sequence is the seed title.
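A single step of the gated recurrent unit just described can be sketched in NumPy. This is an illustrative sketch of the standard GRU, not the patent's implementation: the function name, the concatenation layout, and the blend of the candidate state with the previous hidden state are assumptions where the text is silent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, Wu, Wr, Wc, bu, br, bc):
    """One GRU step: update gate, reset gate, candidate state, and the
    gated blend of the candidate with the previous hidden state."""
    concat = np.concatenate([h_prev, x_t])
    g_u = sigmoid(Wu @ concat + bu)                       # update gate
    g_r = sigmoid(Wr @ concat + br)                       # reset gate
    c_tilde = np.tanh(Wc @ np.concatenate([g_r * h_prev, x_t]) + bc)
    h_t = g_u * c_tilde + (1.0 - g_u) * h_prev            # new hidden state
    return h_t
```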
S4, cosine similarity calculation is carried out on the seed titles output by the encoder-decoder framework and all article titles in the to-be-quoted article list one by one;
in this embodiment, the seed header is a sequence decoded from a group of candidate words and is used as a similarity comparison of all the titles of the chapter to be referred.
In the actual operation process, step S4 specifically includes the following steps:
S41, calculating the pairwise similarity of the words in the decoding candidate word list and establishing a lexicon-similarity lookup dictionary;
S42, segmenting the seed title and the article titles in the candidate-article list library into words, and computing the similarity between the seed title and each candidate-article title word by word according to the lexicon-similarity lookup dictionary;
S43, accumulating the results of step S42 as the similarity between the seed title and the article;
S44, sorting the similarity results obtained in step S43 to form the document recommendation list;
S5, removing, according to publication year, the articles published after the year of the article containing the citation context, and selecting the top 20 by similarity as the recommendation list.
In this embodiment, a top-20 list is the default; the number of recommended articles can be adjusted for a specific scenario.
In the actual operation process, step S5 specifically includes the following steps:
for articles or citation contexts without year information, the top 20 articles of the document recommendation list obtained in step S4 are recommended directly; for those with year information, the articles published after the year of the citing article are removed from the document recommendation list obtained in step S4, and the top 20 of the remaining list complete the local citation recommendation.
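Step S5 might be sketched as follows. The (title, year) representation, the function name, and the handling of missing years are assumptions made for illustration:

```python
def recommend(ranked_articles, context_year=None, top_k=20):
    """Step S5: drop candidates published after the citing article's year
    (when that year is known), then return the top_k most similar articles.
    Each candidate is a (title, year) pair, already sorted by similarity;
    year may be None when unknown."""
    if context_year is not None:
        ranked_articles = [(t, y) for t, y in ranked_articles
                           if y is None or y <= context_year]
    return ranked_articles[:top_k]
```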
Referring to fig. 2, the present invention provides a local citation recommendation system based on neural machine translation, including:
a citation cleaning module for processing the input citation context into the standard input corpus form required by the encoder-decoder framework;
in this embodiment, the citation cleaning module is specifically configured to:
removing invalid symbols from the citation context, replacing words that do not appear in the vocabulary with "<UNK>", padding with "<PAD>" when there are fewer than 28 words, truncating when there are more than 28 words, lemmatizing all words, and then converting all words into word vectors with the pre-trained word-vector model;
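The normalization performed by this module can be sketched as follows; the 28-word length follows this embodiment, lemmatization is omitted, and the function name is invented:

```python
def normalize_context(tokens, vocab, length=28):
    """Citation-cleaning sketch: map out-of-vocabulary words to <UNK>,
    then truncate to `length` tokens or right-pad with <PAD>."""
    toks = [t if t in vocab else "<UNK>" for t in tokens[:length]]
    return toks + ["<PAD>"] * (length - len(toks))
```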
an article expansion module for dynamically expanding the candidate-article list library on the basis of the existing article list library, crawling the latest open-access articles of the relevant document-retrieval platforms in time with web-crawler technology, so that the candidate-article list library for the citation context is more complete and comprehensive;
in this embodiment, the article expansion module is specifically configured to:
crawling the latest open-access articles of the relevant retrieval platforms with web-crawler technology, cleaning them as in step S1, persisting them into the candidate-article list library, and dynamically expanding and maintaining that library;
a candidate-word updating module for recounting word frequencies after the candidate-article list library is updated and dynamically updating the candidate word list used when the decoder decodes the seed title;
in this embodiment, the candidate word updating module is specifically configured to:
after the candidate-article list library is updated, the titles of the latest global candidate-article list library are segmented into words and the word frequencies are recounted; the candidate word list used when the decoder decodes the seed title is then dynamically updated, so that the candidate-article list and the decoder's candidate word list are kept synchronized;
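The recount described here can be sketched with Python's `Counter`; the top-10,000 cutoff follows the embodiment's vocabulary size, and the function name is invented:

```python
from collections import Counter

def rebuild_candidate_list(titles, vocab_size=10000):
    """Candidate-word-updating sketch: after the candidate-article library
    grows, re-tokenize every title, recount word frequencies, and keep the
    top vocab_size words as the decoder's candidate word list."""
    freq = Counter(w for title in titles for w in title.lower().split())
    return [w for w, _ in freq.most_common(vocab_size)]
```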
and a citation recommendation module for computing the recommended-article list, given a citation context, through the core algorithms of S2, S3 and S4;
in this embodiment, the citation recommendation module is specifically configured to:
local citation recommendation and neural machine translation are combined: local citation recommendation is expressed as a machine translation problem from a source language (the citation context) to a target language (the cited-article title), and the recommended-article list given a citation context is computed through the core algorithms of S2, S3 and S4.
The invention provides a novel local citation recommendation method. Compared with traditional citation recommendation methods, it combines citation recommendation with neural machine translation by constructing parallel corpus pairs of citation contexts and cited-article titles, treating local citation recommendation as a machine translation problem from the citation context (source language) to the cited-article title (target language), so that the citation context and the article title have stronger semantic consistency. The invention embeds the words of the citation contexts and cited-article titles into a low-dimensional semantic space, so that semantically similar words lie closer together; it constructs an encoder of bidirectional gated recurrent units with an attention mechanism and a decoder of gated recurrent units, and greatly improves the accuracy of the translation from citation contexts to cited-article titles by computing the influence weights (attention) between encoded and decoded words one by one; furthermore, cosine similarity is computed one by one between the decoded seed title and the article titles in the candidate-article list library, and the top 20 articles are selected as the recommended-article list, which greatly reduces the coupling between the citation-to-title translation and the article recommendation, allowing the two tasks to proceed independently. The above is only a preferred embodiment of the present invention; the scope of the invention is not limited thereto, and equivalents and modifications made according to the technical solutions and inventive concept of the present invention by any person skilled in the art fall within the scope of the invention.
Claims (9)
1. A local citation recommendation method based on a neural machine translation technology is characterized by comprising the following steps:
s1, performing operations of quotation extraction, morphology reduction and word frequency statistical data cleaning on the original data set to obtain a parallel corpus of a quotation context and a quotation seal title and construct an initial to-be-quotation seal list library;
s2, embedding words appearing in the quotation context and the quotation chapter titles into a low-dimensional semantic space to obtain word vectors by combining a word skipping model in a word vector model with a negative sampling method, and enabling semantically similar words to be closer to each other in the space through an embedding space;
s3, constructing an encoder with a bidirectional gate control circulation unit with an attention mechanism and a decoder framework with the gate control circulation unit based on a neural machine translation technology, converting the quotation context in the parallel corpus into word vectors through a word vector model and then using the word vectors as the input of the model, and training the model by using the quotation chapter titles as the output; constructing a coder of a bidirectional gating circulation unit with an attention mechanism and a decoder framework of the gating circulation unit to learn semantic representation of a citation context, mining and decoding a seed title from a candidate word list on the basis of understanding semantics, and forming a seed title construction model taking semantic content as connection;
the specific framework of the encoder of bidirectional gated recurrent units with the attention mechanism and the decoder of gated recurrent units is as follows:
the encoder is a network of bidirectional gated recurrent units; at each time t it receives the vector representation of the t-th word of the input sequence and derives a hidden-layer state h<t>; the translation weight of each input word is obtained through the attention mechanism acting together with the hidden-layer state of the output layer, yielding the final context vector, which is sent to the decoder to decode words;
the formulas of the encoder GRU unit are expressed as follows:
Gu = sigmoid(Wu[h<t-1>, x<t>] + bu)
Gr = sigmoid(Wr[h<t-1>, x<t>] + br)
c~<t> = tanh(Wc[Gr ∘ h<t-1>, x<t>] + bc)
h<t> = Gu ∘ c~<t> + (1 - Gu) ∘ h<t-1>
wherein Gu is the update gate, Gr is the reset gate, c~<t> is the candidate hidden-layer variable, h<t> is the hidden-layer variable at time t that flows to the next instant, x<t> is the input at time t, bu, br and bc are biases, sigmoid and tanh are activation functions, ∘ denotes element-wise multiplication, and Wu, Wr, Wc are weight parameters;
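The GRU equations above can be sketched in executable form (a minimal scalar-state illustration; the function name `gru_step` and the toy zero weights are assumptions for demonstration, not part of the claim):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(h_prev, x_t, W_u, W_r, W_c, b_u=0.0, b_r=0.0, b_c=0.0):
    """One scalar GRU step. Each W_* is a pair (weight on h<t-1>, weight on x<t>),
    mirroring W[h<t-1>, x<t>] in the claim's equations."""
    gamma_u = sigmoid(W_u[0] * h_prev + W_u[1] * x_t + b_u)                # update gate Gu
    gamma_r = sigmoid(W_r[0] * h_prev + W_r[1] * x_t + b_r)                # reset gate Gr
    c_tilde = math.tanh(W_c[0] * (gamma_r * h_prev) + W_c[1] * x_t + b_c)  # candidate c~<t>
    h_t = gamma_u * c_tilde + (1.0 - gamma_u) * h_prev                     # new state h<t>
    return h_t

# With all weights zero both gates equal 0.5 and the candidate is 0,
# so the hidden state halves at every step.
h = gru_step(1.0, 0.0, (0.0, 0.0), (0.0, 0.0), (0.0, 0.0))
print(h)  # 0.5
```

The update gate Gu interpolates between keeping the previous state and adopting the new candidate, which is what lets the unit carry long-range context through the citation sentence.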
the decoding process of the attention mechanism is as follows:
when the decoder decodes the t-th word, it requires the decoder hidden-layer state s<t> at time t, the word y<t-1> decoded at time t-1, and the context vector c<t> coming from the encoder at time t, where the decoder hidden-layer state s<t> at time t is obtained by the following formula:
s<t> = g(y<t-1>, s<t-1>, c<t>)
where the context variable c<t> introduced at time t is determined by the encoder hidden-layer variables h<t'> and the translation attention between each encoded word and the decoded word, with the formula as follows:
c<t> = Σ_{t'=1..Tx} α<t,t'> · h<t'>
wherein α<t> is the attention vector collecting, for every encoder word, its translation attention with respect to the decoded words; its scalar component α<t,t'>, the translation attention of the t'-th encoder word with respect to the t-th decoder word, is obtained by the following formulas:
α<t,t'> = exp(e<t,t'>) / Σ_{t''=1..Tx} exp(e<t,t''>)
e<t,t'> = v^T · tanh(Ws·s<t-1> + Wh·h<t'>)
wherein v^T and Ws, Wh are the parameter weights;
the above process is repeated until all words are decoded; the decoded words constitute the seed title;
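The attention computation described above can be sketched as follows (scalar hidden states for clarity; `attention_context` and its toy inputs are illustrative assumptions following the additive-attention form of the equations):

```python
import math

def attention_context(s_prev, h_enc, v, W_s, W_h):
    """Additive attention: e<t,t'> = v * tanh(W_s*s<t-1> + W_h*h<t'>),
    alpha = softmax(e), c<t> = sum over t' of alpha<t,t'> * h<t'>."""
    scores = [v * math.tanh(W_s * s_prev + W_h * h) for h in h_enc]  # alignment e<t,t'>
    m = max(scores)                                                  # stabilised softmax
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]                   # translation attention weights
    context = sum(a * h for a, h in zip(alphas, h_enc))  # context vector c<t>
    return alphas, context

alphas, c = attention_context(0.5, [0.1, 0.9, -0.3], v=1.0, W_s=1.0, W_h=1.0)
print(round(sum(alphas), 6))  # 1.0
```

The softmax guarantees that the attention weights over the encoder words sum to one, so the context vector is a convex combination of the encoder hidden states.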
s4, computing the cosine similarity one by one between the seed title output by the encoder-decoder framework and all article titles in the library of candidate cited articles;
s5, according to article year, removing articles whose publication date is later than the year of the citation context, and selecting the articles whose similarity meets the requirement as the recommendation list.
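Steps S4-S5 can be sketched as a rank-and-filter routine (a minimal illustration assuming a bag-of-words cosine similarity over title strings; the function names and the toy article library are hypothetical):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two titles as bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(seed_title, library, context_year, top_n=20):
    """Rank candidate articles by similarity to the decoded seed title,
    first dropping anything published after the citing paper's year (S5)."""
    eligible = [(t, y) for t, y in library if y <= context_year]
    ranked = sorted(eligible, key=lambda ty: cosine(seed_title, ty[0]), reverse=True)
    return [t for t, _ in ranked[:top_n]]

library = [("neural machine translation by jointly learning", 2015),
           ("citation recommendation with neural networks", 2017),
           ("future work on quantum citation graphs", 2020)]
print(recommend("neural citation recommendation", library, context_year=2018, top_n=2))
# ['citation recommendation with neural networks', 'neural machine translation by jointly learning']
```

The year filter runs before ranking, which is why an article published after the citing context can never enter the recommendation list regardless of its similarity.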
2. The local citation recommendation method based on neural machine translation technology according to claim 1, wherein step S1 specifically includes:
extracting all English citation contexts, removing invalid symbols, retaining citation contexts whose word count lies within a set range, and lemmatizing the words; counting word frequency, retaining words ranked above a set threshold and replacing all other words with < UNK >; padding with < PAD > when the word count falls short of the set range; and extracting the cited article titles according to the citation contexts and performing similar cleaning operations.
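The cleaning described in this step can be sketched as follows (a minimal illustration; `build_vocab`, `clean`, the toy corpus and the cutoff values are assumptions, and lemmatization is omitted for brevity):

```python
from collections import Counter

def build_vocab(corpus, top_k):
    """Keep the top_k most frequent words; everything else maps to <UNK>."""
    freq = Counter(w for sent in corpus for w in sent)
    return {w for w, _ in freq.most_common(top_k)}

def clean(sentence, vocab, max_len):
    """Replace out-of-vocabulary words with <UNK>, truncate long sequences,
    and pad short ones with <PAD> to a fixed context length."""
    toks = [w if w in vocab else "<UNK>" for w in sentence][:max_len]
    return toks + ["<PAD>"] * (max_len - len(toks))

corpus = [["we", "cite", "prior", "work"], ["we", "extend", "prior", "results"]]
vocab = build_vocab(corpus, top_k=3)
print(clean(["we", "cite", "novel", "work"], vocab, max_len=6))
# e.g. ['we', 'cite', '<UNK>', '<UNK>', '<PAD>', '<PAD>']
```

Fixing both the vocabulary and the sequence length is what gives the encoder a uniform input shape across all citation contexts.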
3. The local citation recommendation method based on neural machine translation technology according to claim 1, wherein step S2 specifically includes:
s21, dividing the sentence into pairs of input words and output words according to the size of the word window;
s22, converting all words into one-hot (0-1) vectors whose size equals the vocabulary;
s23, constructing a neural network comprising an input layer, a hidden layer and an output layer;
and s24, adding negative sampling to the skip-gram model for back-propagating errors, wherein the weight matrix at the word embedding layer is the final word vector representation.
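Steps S21 and S24 can be sketched as follows (a minimal illustration of window-based pair generation and uniform negative sampling; the names are assumptions, and real implementations typically sample negatives from a smoothed unigram distribution rather than uniformly):

```python
import random

def skipgram_pairs(tokens, window):
    """Enumerate (input word, output word) training pairs inside a word window (S21)."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def negative_samples(vocab, positive, k, rng):
    """Draw k negative words that differ from the true context word (S24)."""
    candidates = [w for w in vocab if w != positive]
    return [rng.choice(candidates) for _ in range(k)]

toks = ["citation", "context", "title"]
print(skipgram_pairs(toks, window=1))
# [('citation', 'context'), ('context', 'citation'), ('context', 'title'), ('title', 'context')]
```

Each positive pair plus its k sampled negatives turns the expensive full-softmax objective into k+1 cheap binary classifications, which is the point of negative sampling.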
4. The local citation recommendation method based on neural machine translation technology according to claim 1, wherein step S4 specifically includes:
s41, computing the pairwise similarity of the words in the decoding candidate vocabulary, and building a word-similarity lookup dictionary;
s42, segmenting the seed title and the article titles in the library of candidate cited articles into words, and computing the similarity between the seed title and each candidate article title word by word according to the word-similarity lookup dictionary;
s43, accumulating the results of step S42 as the similarity between the seed title and the article;
and s44, sorting the similarity results of step S43 to form the document recommendation list.
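Steps S42-S43 can be sketched as follows (a minimal illustration; `title_similarity` and the toy similarity dictionary are assumptions, with exact word matches defaulting to similarity 1.0 and unknown pairs to 0.0):

```python
def title_similarity(seed_title, candidate_title, sim_dict):
    """Accumulate word-by-word similarities between the decoded seed title and a
    candidate article title, using a precomputed similarity lookup (S42-S43)."""
    total = 0.0
    for a in seed_title.split():
        for b in candidate_title.split():
            total += sim_dict.get((a, b), 1.0 if a == b else 0.0)
    return total

sim_dict = {("citation", "reference"): 0.8}
print(title_similarity("citation recommendation", "reference recommendation", sim_dict))  # 1.8
```

Precomputing the word-pair similarities once (S41) means each title comparison is just dictionary lookups, which keeps ranking the whole candidate library cheap.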
5. A local citation recommendation system based on neural machine translation technology, applied to the method of any one of claims 1 to 4, comprising:
a citation cleaning module for processing the input citation context into the standard input corpus form required by the encoder-decoder framework;
an article expansion module for dynamically expanding the library of candidate cited articles on the basis of the existing article list library, promptly crawling the latest open-access articles from relevant document retrieval platforms with web crawler technology, so that the library of candidate cited articles for the citation context is more complete and comprehensive;
a candidate word updating module for recalculating word frequency after the library of candidate cited articles is updated, and dynamically updating the candidate word list used when the decoder decodes the seed title;
and a citation recommendation module for computing the recommended article list given a citation context.
6. The local citation recommendation system based on neural machine translation technology according to claim 5, wherein the citation cleaning module is specifically configured to:
remove invalid symbols in the citation context, replace words that do not appear in the vocabulary with < UNK >, pad with < PAD > when the word count falls short of the preset range, truncate when it exceeds the preset range, lemmatize all words, and then convert all words into word vectors using the pre-trained word vector model.
7. The local citation recommendation system based on neural machine translation technology according to claim 5, wherein, in the article expansion module, the latest published articles of relevant retrieval platforms are crawled using web crawler technology; citation extraction, lemmatization and word-frequency-based data cleaning are performed on the original data set to obtain the parallel corpus of citation contexts and cited article titles; an initial library of candidate cited articles is constructed; and the library of candidate cited articles is dynamically expanded and maintained.
8. The local citation recommendation system based on neural machine translation technology according to claim 5, wherein, in the candidate word updating module, after the library of candidate cited articles is updated, the titles in the latest global library are segmented into words and the word frequency is recalculated; the candidate word list used when the decoder decodes the seed title is then dynamically updated, so that the library of candidate cited articles and the decoder candidate word list remain synchronized.
9. The local citation recommendation system based on neural machine translation technology according to claim 5, wherein the citation recommendation module combines local citation recommendation with neural machine translation, formulates local citation recommendation as a machine translation problem from a source language to a target language, and computes the recommended article list given the citation context.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810994562.0A CN109145190B (en) | 2018-08-27 | 2018-08-27 | Local citation recommendation method and system based on neural machine translation technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109145190A CN109145190A (en) | 2019-01-04 |
CN109145190B true CN109145190B (en) | 2021-07-30 |
Family
ID=64828908
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810994562.0A Active CN109145190B (en) | 2018-08-27 | 2018-08-27 | Local citation recommendation method and system based on neural machine translation technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109145190B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109740164B (en) * | 2019-01-09 | 2023-08-15 | 国网浙江省电力有限公司舟山供电公司 | Electric power defect grade identification method based on depth semantic matching |
CN109740168B (en) * | 2019-01-09 | 2020-10-13 | 北京邮电大学 | Traditional Chinese medicine classical book and ancient sentence translation method based on traditional Chinese medicine knowledge graph and attention mechanism |
CN109753567A (en) * | 2019-01-31 | 2019-05-14 | 安徽大学 | A kind of file classification method of combination title and text attention mechanism |
CN110276082B (en) * | 2019-06-06 | 2023-06-30 | 百度在线网络技术(北京)有限公司 | Translation processing method and device based on dynamic window |
CN110472727B (en) * | 2019-07-25 | 2021-05-11 | 昆明理工大学 | Neural machine translation method based on re-reading and feedback mechanism |
CN111061935B (en) * | 2019-12-16 | 2022-04-12 | 北京理工大学 | Science and technology writing recommendation method based on self-attention mechanism |
CN111581401B (en) * | 2020-05-06 | 2023-04-07 | 西安交通大学 | Local citation recommendation system and method based on depth correlation matching |
CN112035607B (en) * | 2020-08-19 | 2022-05-20 | 中南大学 | Method, device and storage medium for matching citation difference based on MG-LSTM |
CN112395892B (en) * | 2020-12-03 | 2022-03-18 | 内蒙古工业大学 | Mongolian Chinese machine translation method for realizing placeholder disambiguation based on pointer generation network |
CN112765342B (en) * | 2021-03-22 | 2022-10-14 | 中国电子科技集团公司第二十八研究所 | Article recommendation method based on time and semantics |
CN113268951B (en) * | 2021-04-30 | 2023-05-30 | 南京邮电大学 | Deep learning-based quotation recommendation method |
CN113239181B (en) * | 2021-05-14 | 2023-04-18 | 电子科技大学 | Scientific and technological literature citation recommendation method based on deep learning |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105589948A (en) * | 2015-12-18 | 2016-05-18 | 重庆邮电大学 | Document citation network visualization and document recommendation method and system |
US9607058B1 (en) * | 2016-05-20 | 2017-03-28 | BlackBox IP Corporation | Systems and methods for managing documents associated with one or more patent applications |
CN106682172A (en) * | 2016-12-28 | 2017-05-17 | 江苏大学 | Keyword-based document research hotspot recommending method |
CN106844368A (en) * | 2015-12-03 | 2017-06-13 | 华为技术有限公司 | For interactive method, nerve network system and user equipment |
CN107341199A (en) * | 2017-06-21 | 2017-11-10 | 北京林业大学 | A kind of recommendation method based on documentation & info general model |
GB2556664A (en) * | 2016-11-07 | 2018-06-06 | Google Llc | Third party application configuration for issuing notifications |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10331658B2 (en) * | 2011-06-03 | 2019-06-25 | Gdial Inc. | Systems and methods for atomizing and individuating data as data quanta |
US9218344B2 (en) * | 2012-06-29 | 2015-12-22 | Thomson Reuters Global Resources | Systems, methods, and software for processing, presenting, and recommending citations |
US10769865B2 (en) * | 2016-07-15 | 2020-09-08 | Charlena L. Thorpe | Licensing and ticketing system for traffic violation |
US10817814B2 (en) * | 2016-08-26 | 2020-10-27 | Conduent Business Services, Llc | System and method for coordinating parking enforcement officer patrol in real time with the aid of a digital computer |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109145190B (en) | Local citation recommendation method and system based on neural machine translation technology | |
CN110210037B (en) | Syndrome-oriented medical field category detection method | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN110275936B (en) | Similar legal case retrieval method based on self-coding neural network | |
CN110688854B (en) | Named entity recognition method, device and computer readable storage medium | |
CN111291188B (en) | Intelligent information extraction method and system | |
CN111241816A (en) | Automatic news headline generation method | |
CN111881677A (en) | Address matching algorithm based on deep learning model | |
CN110688862A (en) | Mongolian-Chinese inter-translation method based on transfer learning | |
CN111243699A (en) | Chinese electronic medical record entity extraction method based on word information fusion | |
CN111950283B (en) | Chinese word segmentation and named entity recognition system for large-scale medical text mining | |
CN109918477B (en) | Distributed retrieval resource library selection method based on variational self-encoder | |
CN114153971B (en) | Error correction recognition and classification equipment for Chinese text containing errors | |
CN112905736B (en) | Quantum theory-based unsupervised text emotion analysis method | |
CN111967267B (en) | XLNET-based news text region extraction method and system | |
CN115019906B (en) | Drug entity and interaction combined extraction method for multi-task sequence labeling | |
CN114298010A (en) | Text generation method integrating dual-language model and sentence detection | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN111523320A (en) | Chinese medical record word segmentation method based on deep learning | |
CN115687567A (en) | Method for searching similar long text by short text without marking data | |
CN115098673A (en) | Business document information extraction method based on variant attention and hierarchical structure | |
CN112287641B (en) | Synonym sentence generating method, system, terminal and storage medium | |
CN109117471A (en) | A kind of calculation method and terminal of the word degree of correlation | |
CN109960782A (en) | A kind of Tibetan language segmenting method and device based on deep neural network | |
CN116595970A (en) | Sentence synonymous rewriting method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||