CN109145190B - Local citation recommendation method and system based on neural machine translation technology - Google Patents

Publication number
CN109145190B
Authority
CN
China
Prior art keywords
word
quotation
list
context
citation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810994562.0A
Other languages
Chinese (zh)
Other versions
CN109145190A (en)
Inventor
赵姝
王鑫
刘洋
陈洁
段震
张燕平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN201810994562.0A
Publication of CN109145190A
Application granted
Publication of CN109145190B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a local citation recommendation method and system based on neural machine translation technology. The method comprises: performing data cleaning operations (citation extraction, lemmatization and word frequency statistics) on an original data set to obtain a parallel corpus of citation contexts and cited article titles and to construct an initial to-be-cited article list library; embedding the words appearing in the citation contexts and cited article titles into a low-dimensional semantic space to obtain word vectors, using the skip-gram model of the word vector model combined with negative sampling; constructing an encoder-decoder framework consisting of a bidirectional gated recurrent unit (GRU) encoder with an attention mechanism and a GRU decoder, converting the citation contexts in the parallel corpus into word vectors through the word vector model as the model input, and training the model with the cited article titles as the output; computing the cosine similarity between the seed title output by the encoder-decoder framework and every article title in the to-be-cited article list, one by one; and selecting the articles that meet the requirements as a recommendation list according to article year.

Description

Local citation recommendation method and system based on neural machine translation technology
Technical Field
The invention relates to the technical field of information retrieval, in particular to a local citation recommendation method and system based on neural machine translation.
Background
With the rapid development of Internet technology, a large number of new scientific research articles are published every year, and quickly finding the documents one needs among them has become a major difficulty. Given a passage of context, local citation recommendation builds a model of its semantics and content and either helps the researcher quickly find citable documents related to the research field among massive document collections or recommends citable documents directly, thereby saving a great deal of the time spent searching for related literature in scientific research work. Local citation recommendation therefore plays a non-negligible role in scientific research.
Many researchers have studied this problem in recent years. The work falls into two categories: global citation recommendation, which recommends citations for an article as a whole, and local citation recommendation, which recommends a citation for a passage of context text within an article. The research methods used are generally text similarity-based methods, topic model-based methods, translation model-based methods, collaborative filtering-based methods, deep learning-based methods, and some other methods.
Neural machine translation is an encoder-decoder framework proposed by Google in 2014 that has brought great progress on the machine translation problem.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a local citation recommendation method based on neural machine translation technology, so as to improve the accuracy of translating a citation context into a cited article title.
A local citation recommendation method based on neural machine translation technology comprises the following steps:
S1, performing data cleaning operations (citation extraction, lemmatization and word frequency statistics) on the original data set to obtain a parallel corpus of citation contexts and cited article titles and to construct an initial to-be-cited article list library;
S2, embedding the words appearing in the citation contexts and cited article titles into a low-dimensional semantic space to obtain word vectors, using the skip-gram model of the word vector model combined with negative sampling, so that semantically similar words lie closer together in the embedding space;
S3, based on neural machine translation technology, constructing an encoder-decoder framework consisting of a bidirectional gated recurrent unit (GRU) encoder with an attention mechanism and a GRU decoder, converting the citation contexts in the parallel corpus into word vectors through the word vector model as the input of the model, and training the model with the cited article titles as the output;
S4, computing the cosine similarity between the seed title output by the encoder-decoder framework and every article title in the to-be-cited article list, one by one;
S5, according to article year, removing the articles published after the year of the article containing the citation context, and selecting the articles whose similarity meets the requirement as a recommendation list.
As a preferable solution of the above technical solution, step S1 specifically includes:
extracting all English citation contexts, removing invalid symbols, retaining the citation contexts whose word count lies within a set range, and lemmatizing the words; counting word frequencies, retaining the words ranked above a set rank and replacing all other words with <UNK>; padding contexts shorter than the set length with <PAD>; and extracting the cited article titles according to the citation contexts and cleaning them in a similar way.
As a preferable solution of the above technical solution, step S2 specifically includes:
S21, splitting each sentence into a number of (input word, output word) pairs according to the size of the word window;
S22, converting all words into one-hot (0-1) vectors whose size equals that of the vocabulary;
S23, constructing a neural network comprising an input layer, a hidden layer and an output layer;
S24, adding negative sampling to the skip-gram model when back-propagating errors, wherein the weight matrix at the word embedding layer is the finally obtained word vector representation.
As a preferable solution of the above technical solution, the step S3 specifically includes:
constructing an encoder-decoder framework consisting of a bidirectional gated recurrent unit (GRU) encoder with an attention mechanism and a GRU decoder to learn the semantic representation of the citation context, and, on the basis of the understood semantics, mining and decoding a seed title from a candidate word list, thereby forming a seed title construction model linked by semantic content;
the specific framework of the bidirectional GRU encoder with the attention mechanism and the GRU decoder is as follows:
the encoder consists of a network of bidirectional gated recurrent units; at each time t it receives the vector representation of the t-th word of the input sequence and produces a hidden layer state $h^{<t>}$; the attention mechanism, acting together with the hidden layer state of the output layer, yields the translation weight of each input word, from which a final context vector is obtained and sent to the decoder to decode the word;
the formula of the encoder GRU unit is expressed as follows:
$G_u = \mathrm{sigmoid}(W_u[h^{<t-1>}, x^{<t>}] + b_u)$
$G_r = \mathrm{sigmoid}(W_r[h^{<t-1>}, x^{<t>}] + b_r)$
[formula image BDA0001778004440000031: candidate hidden layer variable]
[formula image BDA0001778004440000032: hidden layer variable $C^{<t>}$]
wherein $G_u$ is the update gate, $G_r$ is the reset gate, [symbol image BDA0001778004440000033] is the candidate variable used to update the hidden layer variable, $C^{<t>}$ is the hidden layer variable passed to the next time step, $h^{<t>}$ denotes the hidden layer variable at time t, $x^{<t>}$ denotes the input at time t, $b_u$, $b_r$ and $b_c$ denote biases, sigmoid and tanh are activation functions, and $W_u$, $W_r$ and $W_c$ are weight parameters.
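The candidate and state-update formulas are available here only as image references. For orientation, a textbook GRU written with the same symbols would read as below. This is a sketch under the assumption that the patent follows the standard GRU; it is not a transcription of the missing images, and the symbol $\tilde{C}^{<t>}$, the element-wise product $\odot$ and the identification $h^{<t>} = C^{<t>}$ are assumptions.

```latex
\begin{aligned}
G_u &= \sigma\left(W_u\,[h^{<t-1>}, x^{<t>}] + b_u\right) \\
G_r &= \sigma\left(W_r\,[h^{<t-1>}, x^{<t>}] + b_r\right) \\
\tilde{C}^{<t>} &= \tanh\left(W_c\,[G_r \odot h^{<t-1>}, x^{<t>}] + b_c\right) \\
C^{<t>} &= G_u \odot \tilde{C}^{<t>} + (1 - G_u) \odot h^{<t-1>} \\
h^{<t>} &= C^{<t>}
\end{aligned}
```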
The decoding process with the attention mechanism is as follows:
when the decoder decodes the t-th word, it needs the decoder hidden layer state $s^{<t>}$ at time t, the word $y^{<t-1>}$ decoded at time t-1, and the context vector $c^{<t>}$ passed from the encoder at time t, wherein the decoder hidden layer state $s^{<t>}$ at time t is obtained by the following formula:
$s^{<t>} = g(y^{<t-1>}, s^{<t-1>}, c^{<t>})$
wherein the context vector $c^{<t>}$ at time t is determined by the encoder hidden layer variables $h^{<t>}$ and the translation attention between each encoded word and the decoded word, according to the following formula:
[formula image BDA0001778004440000041: context vector $c^{<t>}$]
wherein [symbol image BDA0001778004440000042] is a vector-type attention representing the translation attention of the encoder word with index [symbol image BDA0001778004440000043] to all words of the decoder; [symbol image BDA0001778004440000044] is obtained by the following formula:
[formula image BDA0001778004440000045]
wherein [symbol image BDA0001778004440000046] is a scalar-type attention representing the translation attention of the encoder word with index [symbol image BDA0001778004440000047] to the t-th word of the decoder; [symbol image BDA0001778004440000048] is obtained by the following formula:
[formula image BDA0001778004440000049]
wherein $v^{T}$, $W_s$ and $W_h$ are parameter weights;
the above process is repeated until all words have been decoded; the decoded word sequence is the seed title.
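The attention formulas are likewise available here only as image references. The parameters named in the text ($v^{T}$, $W_s$, $W_h$) match the widely used additive attention, which would read as below. This is a hedged sketch, not a transcription of the patent's images: the symbols $e^{<t,t'>}$, $\alpha^{<t,t'>}$, the encoder index $t'$, the input length $T_x$ and the use of the previous decoder state $s^{<t-1>}$ in the score are all assumptions.

```latex
\begin{aligned}
e^{<t,t'>} &= v^{T} \tanh\left(W_s\, s^{<t-1>} + W_h\, h^{<t'>}\right) \\
\alpha^{<t,t'>} &= \frac{\exp\left(e^{<t,t'>}\right)}{\sum_{k=1}^{T_x} \exp\left(e^{<t,k>}\right)} \\
c^{<t>} &= \sum_{t'=1}^{T_x} \alpha^{<t,t'>}\, h^{<t'>}
\end{aligned}
```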
As a preferable solution of the above technical solution, step S4 specifically includes:
S41, computing the pairwise similarity of all words in the decoding candidate vocabulary and building a word similarity lookup dictionary;
S42, segmenting the seed title and the article titles in the to-be-cited article list library into words, and computing the similarity between the seed title and each to-be-cited article title word by word, using the similarities in the word similarity lookup dictionary;
S43, accumulating the results of step S42 as the similarity between the seed title and the article;
S44, sorting the similarity results obtained in step S43 to form a document recommendation list.
The invention also provides a local citation recommendation system based on neural machine translation technology, which applies the above method and comprises:
a citation cleaning module for processing an input citation context into the standard input corpus form required by the encoder-decoder framework;
an article expansion module for dynamically expanding the to-be-cited article list library on the basis of the existing article list library, using web crawler technology to crawl the latest openly available articles from relevant document retrieval platforms in a timely manner, so that the to-be-cited article list library for citation contexts is more complete and comprehensive;
a candidate word updating module for recalculating word frequencies after the to-be-cited article list library is updated, and dynamically updating the candidate word list used when the decoder decodes the seed title;
and a citation recommendation module for computing the recommended article list for a given citation context.
As a preferred scheme of the above technical solution, the citation cleaning module is specifically configured to:
removing invalid symbols in the citation context, replacing words that do not appear in the vocabulary with <UNK>, padding with <PAD> when the context is shorter than the preset length, truncating it when it is longer, lemmatizing all words, and then converting all words into word vectors using the pre-trained word vector model.
As a preferred scheme of the above technical solution, the article expansion module crawls the latest openly available articles from relevant retrieval platforms using web crawler technology, performs data cleaning operations such as citation extraction, lemmatization and word frequency statistics on the original data set to obtain the parallel corpus of citation contexts and cited article titles and to construct the initial to-be-cited article list library, and dynamically expands and maintains that library.
As a preferred scheme of the above technical solution, in the candidate word updating module, after the to-be-cited article list library is updated, the titles in the latest global to-be-cited article list library are segmented into words and the word frequencies are recalculated; the candidate word list used when the decoder decodes the seed title is then dynamically updated, so that the to-be-cited article list and the decoder's candidate word list are kept synchronized.
As a preferred scheme of the above technical solution, the citation recommendation module combines local citation recommendation with neural machine translation, expresses local citation recommendation as a machine translation problem from a source language to a target language, and computes the recommended article list for a given citation context.
The invention provides a novel local citation recommendation method. Unlike traditional citation recommendation methods, it combines citation recommendation with neural machine translation by constructing parallel corpus pairs of citation contexts and cited article titles, and treats local citation recommendation as a machine translation problem from the citation context (source language) to the cited article title (target language), so that the citation context and the article title have stronger semantic consistency. Finally, the cosine similarity between the translated seed title and the articles in the to-be-cited article list library is computed to obtain the list of articles to be cited.
The invention embeds the words in the citation contexts and cited article titles into a low-dimensional semantic space, so that semantically similar words lie closer together in that space. It constructs an encoder-decoder framework consisting of a bidirectional gated recurrent unit (GRU) encoder with an attention mechanism and a GRU decoder, and greatly improves the accuracy of translating a citation context into a cited article title by computing, one by one, the influence weights (attention) between encoded and decoded words. Furthermore, the cosine similarity between the decoded cited article title and each article title in the to-be-cited article list library is computed one by one, and the articles that meet the requirements are selected as the recommended article list, which greatly reduces the coupling between citation-to-title translation and article recommendation, so that the two tasks can be carried out independently.
Drawings
FIG. 1 is a schematic diagram of the steps of a local citation recommendation method based on neural machine translation technology;
FIG. 2 is a functional diagram of a local citation recommendation system based on neural machine translation technology;
FIG. 3 is a logic diagram of step S2 in a local citation recommendation method based on neural machine translation technology;
FIG. 4 is a logic block diagram of step S3 in a local citation recommendation method based on neural machine translation technology.
Detailed Description
FIG. 1 and FIG. 2 show the local citation recommendation method and system based on neural machine translation according to the present invention.
Referring to fig. 1, the present invention provides a local citation recommendation method based on neural machine translation, including the following steps:
S1, performing data cleaning operations such as citation extraction, lemmatization and word frequency statistics on the original data set to obtain a parallel corpus of citation contexts and cited article titles and to construct an initial to-be-cited article list library;
in this embodiment, all English citation contexts are extracted, invalid symbols are removed, the citation contexts containing between 10 and 28 words are retained, and the words are lemmatized; word frequencies are counted, the 10000 most frequent words are retained and all other words are replaced with "<UNK>"; contexts shorter than 28 words are padded with "<PAD>"; the cited article titles are extracted according to the citation contexts and cleaned in a similar way.
In the actual operation process, step S1 specifically includes the following steps:
S11, according to the citation position in the citation context, extracting the initial article title data corresponding to the citation context from the original data with a dictionary matching algorithm, and constructing the correspondence between the original citation contexts and the cited article titles with a context-title knowledge base linking algorithm;
S12, using the constructed stop-symbol knowledge base, retaining all English citation contexts and removing all invalid characters, including escape characters, punctuation marks, formula symbols and special symbols, to form a set of citation context expressions in which words are the linking units;
S13, segmenting the citation contexts into words with a balance word segmentation library, counting word frequencies, taking the 10000 most frequent words, and constructing an encoding vocabulary;
S14, traversing all citation contexts, replacing words that are not in the encoding vocabulary with "<UNK>", truncating citation contexts longer than 28 words and padding citation contexts shorter than 28 words with "<PAD>", thereby generating citation contexts in a standard format;
S15, performing operations similar to S12, S13 and S14 on the cited article titles, and generating the parallel corpus of citation contexts and cited article titles according to the correspondence extracted in S11;
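The cleaning steps S12 to S14 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the regular-expression tokenizer stands in for the stop-symbol knowledge base and the word segmentation library, lemmatization and the 10-to-28-word length filter are omitted, and all function names are hypothetical. The toy call uses a vocabulary of 3 words and a length of 8 instead of 10000 and 28.

```python
import re
from collections import Counter

UNK, PAD = "<UNK>", "<PAD>"

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only (drops punctuation, digits, symbols)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(token_lists, max_size=10000):
    """Keep the max_size most frequent words as the encoding vocabulary (S13)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word for word, _ in counts.most_common(max_size)}

def standardize(tokens, vocab, length=28):
    """Truncate to `length`, replace out-of-vocabulary words with <UNK>, pad with <PAD> (S14)."""
    mapped = [tok if tok in vocab else UNK for tok in tokens[:length]]
    return mapped + [PAD] * (length - len(mapped))

contexts = [
    "Attention-based models improve neural machine translation [12].",
    "Neural machine translation uses an encoder and a decoder.",
]
tokenized = [tokenize(c) for c in contexts]
vocab = build_vocab(tokenized, max_size=3)
rows = [standardize(t, vocab, length=8) for t in tokenized]
for row in rows:
    print(" ".join(row))
```

The same three functions could be applied to the cited article titles to produce the target side of the parallel corpus (S15).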
S2, embedding the words appearing in the citation contexts and cited article titles into a low-dimensional semantic space to obtain word vectors, using the skip-gram model of the word vector model combined with negative sampling, so that semantically similar words lie closer together in the embedding space;
in the actual operation process, step S2 specifically includes the following steps:
S21, splitting each sentence into a number of (input word, output word) pairs according to the size of the word window;
S22, converting all words into one-hot (0-1) vectors whose size equals that of the vocabulary (10000 words);
S23, constructing a neural network comprising an input layer (accepting 10000-dimensional one-hot vectors), a hidden layer of 100 neurons (the word vector dimension) and an output layer of 10000 neurons, the structure of which is shown in FIG. 3;
S24, adding negative sampling to the skip-gram model when back-propagating errors, wherein the 10000 x 300 weight matrix at the word embedding layer is the finally obtained word vector representation.
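The data preparation behind steps S21, S22 and S24 can be illustrated as follows. This is a sketch under simplifying assumptions: negatives are drawn uniformly from the vocabulary (word2vec implementations typically sample from a smoothed frequency distribution), the network training itself is not shown, and all names are hypothetical.

```python
import random

def skipgram_pairs(tokens, window=2):
    """S21: return (input word, output word) pairs within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def one_hot(index, size):
    """S22: 0-1 vector of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def negative_samples(vocab, positive, k, rng):
    """S24: draw k distinct words other than the true output word as negative examples."""
    candidates = [w for w in vocab if w != positive]
    return rng.sample(candidates, k)

tokens = ["citation", "context", "recommends", "article"]
pairs = skipgram_pairs(tokens, window=1)
vocab = sorted(set(tokens))
rng = random.Random(0)
negatives = negative_samples(vocab, positive="context", k=2, rng=rng)
print(pairs)
print(one_hot(vocab.index("context"), len(vocab)), negatives)
```

With negative sampling, each training step updates the output weights only for the true output word and the k sampled negatives rather than for all 10000 output neurons.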
S3, based on neural machine translation technology, constructing an encoder-decoder framework consisting of a bidirectional gated recurrent unit (GRU) encoder with an attention mechanism and a GRU decoder, converting the citation contexts in the parallel corpus into word vectors through the word vector model as the input of the model, and training the model with the cited article titles as the output;
in the present embodiment, neural machine translation technology and local citation recommendation are combined, and local citation recommendation is expressed as a machine translation problem from a source language (the citation context) to a target language (the cited article title). An encoder-decoder framework consisting of a bidirectional GRU encoder with an attention mechanism and a GRU decoder is constructed to learn the semantic representation of the citation context, and, on the basis of the understood semantics, a seed title is mined and decoded from a candidate word list, forming a seed title construction model linked by semantic content.
In step S3, a citation context that cites several articles is processed as several parallel corpus pairs.
In the actual operation process, step S3 specifically includes:
constructing the bidirectional GRU encoder with the attention mechanism and the GRU decoder, as shown in FIG. 4:
the encoder consists of a network of bidirectional gated recurrent units; at each time t it receives the vector representation of the t-th word of the input sequence and produces a hidden layer state $h^{<t>}$; the attention mechanism, acting together with the hidden layer state of the output layer, yields the translation weight of each input word, from which a final context vector is obtained and sent to the decoder to decode the word.
The formula of the encoder GRU unit is expressed as follows:
$G_u = \mathrm{sigmoid}(W_u[h^{<t-1>}, x^{<t>}] + b_u)$ (update gate)
$G_r = \mathrm{sigmoid}(W_r[h^{<t-1>}, x^{<t>}] + b_r)$ (reset gate)
[formula image BDA0001778004440000101] (candidate variable used to update the hidden layer variable)
[formula image BDA0001778004440000102] (hidden layer variable passed to the next time step)
wherein $h^{<t>}$ denotes the hidden layer variable at time t, $x^{<t>}$ denotes the input at time t, $b_u$, $b_r$ and $b_c$ denote biases, sigmoid and tanh are activation functions, and $W_u$, $W_r$ and $W_c$ are weight parameters.
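A single GRU step and a bidirectional pass over a sequence might look like the sketch below. It assumes the textbook GRU update for the two formulas that are only available as images (candidate state from the reset-gated previous state, then a blend controlled by the update gate), so it should be read as an illustration rather than the patent's exact formulation. The parameter dictionary layout and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W @ v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_step(h_prev, x, p):
    """One GRU time step; p holds W_u, W_r, W_c and b_u, b_r, b_c."""
    hx = h_prev + x  # concatenation [h<t-1>, x<t>]
    g_u = [sigmoid(z) for z in affine(p["W_u"], hx, p["b_u"])]  # update gate
    g_r = [sigmoid(z) for z in affine(p["W_r"], hx, p["b_r"])]  # reset gate
    gated = [r * h for r, h in zip(g_r, h_prev)] + x
    c_tilde = [math.tanh(z) for z in affine(p["W_c"], gated, p["b_c"])]  # candidate
    # Assumed standard update: blend the candidate with the previous state.
    return [u * c + (1.0 - u) * h for u, c, h in zip(g_u, c_tilde, h_prev)]

def bi_gru(xs, fwd, bwd, hidden):
    """Run one GRU left-to-right and one right-to-left, then concatenate states per position."""
    h, forward = [0.0] * hidden, []
    for x in xs:
        h = gru_step(h, x, fwd)
        forward.append(h)
    h, backward = [0.0] * hidden, []
    for x in reversed(xs):
        h = gru_step(h, x, bwd)
        backward.append(h)
    backward.reverse()
    return [f + b for f, b in zip(forward, backward)]

# Toy parameters: hidden size 2, input size 1, so each weight matrix is 2 x 3.
params = {
    "W_u": [[0.1, 0.0, 0.2], [0.0, 0.1, -0.2]], "b_u": [0.0, 0.0],
    "W_r": [[0.3, 0.0, 0.1], [0.0, 0.3, 0.1]], "b_r": [0.0, 0.0],
    "W_c": [[0.5, 0.0, 1.0], [0.0, 0.5, -1.0]], "b_c": [0.0, 0.0],
}
states = bi_gru([[1.0], [0.5], [-1.0]], params, params, hidden=2)
print(len(states), len(states[0]))
```

Each encoder position ends up with a state of twice the hidden size, which is what the attention mechanism would then weight.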
The decoding process with the attention mechanism is as follows:
when the decoder decodes the t-th word, it needs the decoder hidden layer state $s^{<t>}$ at time t, the word $y^{<t-1>}$ decoded at time t-1, and the context vector $c^{<t>}$ passed from the encoder at time t, wherein the decoder hidden layer state $s^{<t>}$ at time t is obtained by the following formula:
$s^{<t>} = g(y^{<t-1>}, s^{<t-1>}, c^{<t>})$
wherein the context vector $c^{<t>}$ at time t is determined by the encoder hidden layer variables $h^{<t>}$ and the translation attention between each encoded word and the decoded word, according to the following formula:
[formula image BDA0001778004440000103: context vector $c^{<t>}$]
wherein [symbol image BDA0001778004440000104] is a vector-type attention representing the translation attention of the encoder word with index [symbol image BDA0001778004440000105] to all words of the decoder; [symbol image BDA0001778004440000106] is obtained by the following formula:
[formula image BDA0001778004440000111]
wherein [symbol image BDA0001778004440000112] is a scalar-type attention representing the translation attention of the encoder word with index [symbol image BDA0001778004440000113] to the t-th word of the decoder; [symbol image BDA0001778004440000114] is obtained by the following formula:
[formula image BDA0001778004440000115]
wherein $v^{T}$, $W_s$ and $W_h$ are parameter weights.
The above process is repeated until all words have been decoded; the decoded word sequence is the seed title.
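How attention weights turn encoder states into a context vector can be sketched as follows. Because the patent's attention formulas are only available as images, the additive score v^T tanh(W_s s + W_h h) and the use of the previous decoder state are assumptions based on the parameters the text names; the function names are hypothetical.

```python
import math

def matvec(W, v):
    """Return W @ v for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def attention_context(s_prev, enc_states, W_s, W_h, v):
    """Return (weights, context) for one decoding step using an additive score."""
    proj_s = matvec(W_s, s_prev)
    scores = []
    for h in enc_states:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Softmax over encoder positions (shifted by the maximum for numerical stability).
    top = max(scores)
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder hidden states.
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context

enc_states = [[1.0, 0.0], [3.0, 2.0], [0.0, -1.0]]
W_s = [[0.5], [0.1]]            # maps a 1-dimensional decoder state to 2 dimensions
W_h = [[0.2, 0.0], [0.0, 0.2]]  # maps a 2-dimensional encoder state to 2 dimensions
v = [1.0, -1.0]
weights, context = attention_context([0.4], enc_states, W_s, W_h, v)
print(weights, context)
```

The weights sum to 1 over the input words, so the context vector is a convex combination of the encoder states for the word currently being decoded.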
S4, computing the cosine similarity between the seed title output by the encoder-decoder framework and every article title in the to-be-cited article list, one by one;
in this embodiment, the seed title is a sequence decoded from a set of candidate words and serves as the reference against which all to-be-cited article titles are compared for similarity.
In the actual operation process, step S4 specifically includes the following steps:
S41, computing the pairwise similarity of all words in the decoding candidate vocabulary and building a word similarity lookup dictionary;
S42, segmenting the seed title and the article titles in the to-be-cited article list library into words, and computing the similarity between the seed title and each to-be-cited article title word by word, using the similarities in the word similarity lookup dictionary;
S43, accumulating the results of step S42 as the similarity between the seed title and the article;
S44, sorting the similarity results obtained in step S43 to form a document recommendation list;
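Steps S41 to S44 can be sketched as below. The text does not say exactly how the word-by-word similarities are paired, so this illustration assumes that every seed-title word is compared with every candidate-title word and that pairs missing from the dictionary contribute 0. The toy word vectors and all function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_similarity_table(vectors):
    """S41: cosine similarity for every pair of vocabulary words, stored symmetrically."""
    table = {}
    for a, b in combinations_with_replacement(sorted(vectors), 2):
        table[(a, b)] = table[(b, a)] = cosine(vectors[a], vectors[b])
    return table

def title_similarity(seed, title, table):
    """S42-S43: accumulate the looked-up similarity over all (seed word, title word) pairs."""
    return sum(table.get((s, t), 0.0) for s in seed.split() for t in title.split())

def rank_titles(seed, titles, table):
    """S44: sort candidate titles by accumulated similarity, highest first."""
    return sorted(titles, key=lambda t: title_similarity(seed, t, table), reverse=True)

vectors = {"neural": [1.0, 0.0], "translation": [0.8, 0.6], "protein": [0.0, 1.0]}
table = build_similarity_table(vectors)
ranked = rank_titles("neural translation", ["protein", "translation", "neural translation"], table)
print(ranked)
```

Precomputing the table means ranking only needs dictionary lookups at recommendation time. An accumulated sum tends to favor longer titles, so a length normalization could be considered; the text does not mention one.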
S5, according to article year, removing the articles published after the year of the article containing the citation context, and selecting the 20 articles with the highest similarity as the recommendation list.
In this embodiment, the list length of 20 is a default value, and the number of recommended articles can be adjusted for a specific scenario.
In the actual operation process, step S5 specifically includes the following steps:
for articles or citation contexts without year information, the top 20 articles in the document recommendation list obtained in step S4 are recommended directly; for articles and citation contexts with year information, the articles published after the year of the article containing the citation context are removed from the document recommendation list obtained in step S4, and the top 20 articles of the remaining list are recommended, which completes the local citation recommendation.
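The year filter and top-20 cut of step S5 can be sketched in a few lines. The representation of the ranked list as (title, year) tuples and the function name are hypothetical, and keeping candidates whose year is unknown is an assumption.

```python
def recommend(ranked, context_year=None, top_k=20):
    """Filter a similarity-ranked list of (title, year) by year, then keep the top_k titles.

    ranked must already be sorted by similarity, highest first; year may be None.
    """
    if context_year is not None:
        # Drop articles published after the citing article; unknown years are kept (assumption).
        ranked = [(t, y) for t, y in ranked if y is None or y <= context_year]
    return [t for t, _ in ranked[:top_k]]

ranked = [("A", 2019), ("B", 2015), ("C", None), ("D", 2017)]
print(recommend(ranked, context_year=2017, top_k=2))
print(recommend(ranked, context_year=None, top_k=2))
```

Filtering before truncating keeps the list at up to top_k entries even when several highly ranked candidates were published too late to have been cited.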
Referring to fig. 2, the present invention provides a local citation recommendation system based on neural machine translation, including:
a citation cleaning module for processing an input citation context into the standard input corpus form required by the encoder-decoder framework;
in this embodiment, the citation cleaning module is specifically configured to:
remove invalid symbols in the citation context, replace words that do not appear in the vocabulary with "<UNK>", pad with "<PAD>" when the context is shorter than 28 words, truncate it when it is longer than 28 words, lemmatize all words, and then convert all words into word vectors using the pre-trained word vector model;
an article expansion module for dynamically expanding the to-be-cited article list library on the basis of the existing article list library, using web crawler technology to crawl the latest openly available articles from relevant document retrieval platforms in a timely manner, so that the to-be-cited article list library for citation contexts is more complete and comprehensive;
in this embodiment, the article expansion module is specifically configured to:
crawl the latest openly available articles from relevant retrieval platforms using web crawler technology, clean them in a way similar to step S1, persist them into the to-be-cited article list library, and dynamically expand and maintain that library;
a candidate word updating module for recalculating word frequencies after the to-be-cited article list library is updated, and dynamically updating the candidate word list used when the decoder decodes the seed title;
in this embodiment, the candidate word updating module is specifically configured to:
after the to-be-cited article list library is updated, segment the titles in the latest global to-be-cited article list library into words and recalculate word frequencies, then dynamically update the candidate word list used when the decoder decodes the seed title, so that the to-be-cited article list and the decoder's candidate word list are kept synchronized;
a citation recommendation module for computing the recommended article list for a given citation context through the core algorithms in S2, S3 and S4;
in this embodiment, the citation recommendation module is specifically configured to:
combine local citation recommendation with neural machine translation, express local citation recommendation as a machine translation problem from a source language (the citation context) to a target language (the cited article title), and compute the recommended article list for a given citation context through the core algorithms in S2, S3 and S4.
The invention provides a novel local citation recommendation method. Unlike traditional citation recommendation methods, it combines citation recommendation with neural machine translation by constructing parallel corpus pairs of citation contexts and cited article titles, and treats local citation recommendation as a machine translation problem from the citation context (source language) to the cited article title (target language), so that the citation context and the article title have stronger semantic consistency. The invention embeds the words in the citation contexts and cited article titles into a low-dimensional semantic space, so that semantically similar words lie closer together in that space. It constructs an encoder-decoder framework consisting of a bidirectional gated recurrent unit (GRU) encoder with an attention mechanism and a GRU decoder, and greatly improves the accuracy of translating a citation context into a cited article title by computing, one by one, the influence weights (attention) between encoded and decoded words. Furthermore, the cosine similarity between the decoded cited article title and each article title in the to-be-cited article list library is computed one by one, and the top 20 articles are selected as the recommended article list, which greatly reduces the coupling between citation-to-title translation and article recommendation, so that the two tasks can be carried out independently.
The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (9)

1. A local citation recommendation method based on neural machine translation technology, characterized by comprising the following steps:
S1, performing citation extraction, lemmatization and word-frequency-statistics data cleaning on the original data set to obtain a parallel corpus of citation contexts and cited article titles, and constructing an initial to-be-cited article list library;
S2, embedding the words appearing in the citation contexts and cited article titles into a low-dimensional semantic space to obtain word vectors, using the skip-gram model of the word vector model combined with negative sampling, so that semantically similar words lie closer together in the embedding space;
S3, constructing, based on neural machine translation technology, an encoder-decoder framework consisting of a bidirectional gated recurrent unit (GRU) encoder with an attention mechanism and a GRU decoder; converting the citation contexts in the parallel corpus into word vectors through the word vector model and using them as the model input, and training the model with the cited article titles as the output; the framework learns the semantic representation of the citation context and, on the basis of understanding the semantics, selects and decodes a seed title from the candidate word list, forming a seed-title construction model linked by semantic content;
the specific framework of the bidirectional GRU encoder with the attention mechanism and the GRU decoder is as follows:
the encoder is formed by a bidirectional GRU network; at each time step t it receives the vector representation of the t-th word of the input sequence and produces a hidden-layer state $h^{<t>}$; the translation weight of each input word is obtained from the attention mechanism acting together with the hidden-layer state of the output layer, which yields the final context vector that is sent to the decoder to decode the word;
the GRU unit of the encoder is expressed by the following formulas:
$G_u = \mathrm{sigmoid}(W_u[h^{<t-1>}, x^{<t>}] + b_u)$
$G_r = \mathrm{sigmoid}(W_r[h^{<t-1>}, x^{<t>}] + b_r)$
[Figure FDA00031101701400000210]
[Figure FDA00031101701400000211]
where $G_u$ is the update gate, $G_r$ is the reset gate, [Figure FDA00031101701400000212] is the candidate value used to update the hidden-layer variable, $C^{<t>}$ is the hidden-layer variable passed on to the next time step, $h^{<t>}$ is the hidden-layer variable at time t, $x^{<t>}$ is the input at time t, $b_u$, $b_r$ and $b_c$ are biases, sigmoid and tanh are activation functions, and $W_u$, $W_r$ and $W_c$ are weight parameters;
the decoding process with the attention mechanism is as follows:
when the decoder decodes the t-th word, the following are required: the decoder hidden-layer state $s^{<t>}$ at time t, the word $y^{<t-1>}$ decoded at time t-1, and the context vector $c^{<t>}$ passed from the encoder at time t; the decoder hidden-layer state $s^{<t>}$ at time t is obtained by the following formula:
$s^{<t>} = g(y^{<t-1>}, s^{<t-1>}, c^{<t>})$
where the context vector $c^{<t>}$ introduced at time t is determined by the encoder hidden-layer variables $h^{<t>}$ and the translation attention between each encoded word and the decoded word, according to the following formula:
[Figure FDA0003110170140000021]
where [Figure FDA0003110170140000022] is vector-type attention representing the translation attention of the [Figure FDA0003110170140000023]-th word of the encoder to all the words of the decoder; [Figure FDA0003110170140000024] is obtained by the following formula:
[Figure FDA0003110170140000025]
where [Figure FDA0003110170140000026] is scalar-type attention representing the translation attention of the [Figure FDA0003110170140000027]-th word of the encoder to the t-th word of the decoder; [Figure FDA0003110170140000028] is obtained by the following formula:
[Figure FDA0003110170140000029]
where $v^T$, $W_s$ and $W_h$ are weight parameters;
the above process is repeated until all words have been decoded, and the decoded word sequence is the seed title;
S4, calculating, one by one, the cosine similarity between the seed title output by the encoder-decoder framework and every article title in the to-be-cited article list;
S5, removing, according to publication year, the articles published after the year of the article containing the citation context, and selecting the articles whose similarity meets the requirement as the recommendation list.
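The encoder step and the attention computation in claim 1 can be illustrated with a small pure-Python sketch. Several of the claim's formulas exist only as images in this text, so the code below assumes the standard GRU candidate-state and update equations and the common additive attention score $v^T \tanh(W_s s + W_h h)$ followed by a softmax; these forms, the use of the previous decoder state in the score, and all function names are assumptions rather than the patent's confirmed formulas. A bidirectional encoder would run `gru_step` over the sequence in both directions and concatenate the two states, which is not shown.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(h_prev: Vector, x: Vector,
             W_u: Matrix, W_r: Matrix, W_c: Matrix,
             b_u: Vector, b_r: Vector, b_c: Vector) -> Vector:
    """One GRU step; each weight matrix acts on the concatenation [h, x]."""
    hx = h_prev + x
    g_u = [sigmoid(a + b) for a, b in zip(matvec(W_u, hx), b_u)]  # update gate
    g_r = [sigmoid(a + b) for a, b in zip(matvec(W_r, hx), b_r)]  # reset gate
    # Assumed standard form: the reset gate scales the previous state
    # before the candidate state is computed.
    gated = [r * h for r, h in zip(g_r, h_prev)] + x
    c_tilde = [math.tanh(a + b) for a, b in zip(matvec(W_c, gated), b_c)]
    # Assumed standard form: interpolate between candidate and previous state.
    return [u * c + (1.0 - u) * h for u, c, h in zip(g_u, c_tilde, h_prev)]


def attention_context(s_prev: Vector, enc_states: List[Vector],
                      W_s: Matrix, W_h: Matrix,
                      v: Vector) -> Tuple[Vector, Vector]:
    """Additive attention: scores, softmax weights, weighted sum of states."""
    ws = matvec(W_s, s_prev)
    scores = []
    for h in enc_states:
        wh = matvec(W_h, h)
        scores.append(sum(vi * math.tanh(a + b)
                          for vi, a, b in zip(v, ws, wh)))
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(enc_states[0])
    context = [sum(a * h[i] for a, h in zip(alphas, enc_states))
               for i in range(dim)]
    return context, alphas
```

With all weights set to zero, both gates equal 0.5 and the attention weights are uniform, which gives a quick sanity check of the shapes involved.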
2. The local citation recommendation method based on neural machine translation technology according to claim 1, wherein step S1 specifically includes:
extracting all English citation contexts, removing invalid symbols, keeping the citation contexts whose word count falls within a set range, and lemmatizing the words; counting word frequencies, keeping the words ranked within a set top rank and replacing all other words with <UNK>, and padding the retained contexts with <PAD>; and extracting the cited article title corresponding to each citation context and applying the same cleaning operations.
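The cleaning steps of claim 2 can be sketched as below. This is only an illustration under assumptions: tokens are taken to be lower-cased alphabetic runs, lemmatization is left as a hypothetical `lemmatize` hook that defaults to the identity (a real system would plug in an actual lemmatizer), and the length limits and vocabulary size are free parameters that the claim does not fix.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Set

UNK, PAD = "<UNK>", "<PAD>"


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep alphabetic tokens, dropping symbols."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(token_lists: Iterable[List[str]], max_size: int) -> Set[str]:
    """Keep the max_size most frequent words."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word for word, _ in counts.most_common(max_size)}


def encode(tokens: List[str], vocab: Set[str], length: int) -> List[str]:
    """Replace out-of-vocabulary words with <UNK>, then pad to a fixed length."""
    mapped = [tok if tok in vocab else UNK for tok in tokens][:length]
    return mapped + [PAD] * (length - len(mapped))


def clean_corpus(contexts: List[str], min_len: int, max_len: int,
                 vocab_size: int,
                 lemmatize: Callable[[str], str] = lambda w: w
                 ) -> List[List[str]]:
    """Tokenize, filter by length, build the vocabulary, and encode."""
    token_lists = [[lemmatize(t) for t in tokenize(c)] for c in contexts]
    kept = [t for t in token_lists if min_len <= len(t) <= max_len]
    vocab = build_vocab(kept, vocab_size)
    return [encode(t, vocab, max_len) for t in kept]
```

The same functions could be applied to the cited article titles to produce the target side of the parallel corpus.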
3. The local citation recommendation method based on neural machine translation technology according to claim 1, wherein step S2 specifically includes:
S21, splitting each sentence into a number of (input word, output word) pairs according to the size of the word window;
S22, converting all words into one-hot (0-1) vectors whose size equals that of the vocabulary;
S23, constructing a neural network comprising an input layer, a hidden layer and an output layer;
S24, adding negative sampling to the skip-gram model and back-propagating the errors, the weight matrix of the word embedding layer being the finally obtained word vector representation.
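The data-preparation part of steps S21, S22 and S24 can be sketched as follows. The training loop itself is omitted. As a labeled simplification, negatives are drawn uniformly without replacement, whereas word2vec-style implementations usually sample from a smoothed unigram distribution; the function names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """S21: pair each center word with every word inside its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def one_hot(word: str, index: Dict[str, int]) -> List[int]:
    """S22: 0-1 vector whose length equals the vocabulary size."""
    vec = [0] * len(index)
    vec[index[word]] = 1
    return vec


def negative_samples(positive: str, vocab: List[str], k: int,
                     rng: random.Random) -> List[str]:
    """S24 (simplified): draw k words other than the true output word."""
    candidates = [w for w in vocab if w != positive]
    return rng.sample(candidates, min(k, len(candidates)))
```

Each (input, output) pair would then be trained against its sampled negatives, and the input-side weight matrix read off as the word vectors.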
4. The local citation recommendation method based on neural machine translation technology according to claim 1, wherein step S4 specifically includes:
S41, calculating the pairwise similarity of all words in the decoding candidate vocabulary and building a word-similarity lookup dictionary;
S42, segmenting the seed title and the article titles in the to-be-cited article list library into words, and calculating, word by word, the similarity between the seed title and each to-be-cited article title using the similarities in the word-similarity lookup dictionary;
S43, accumulating the results of step S42 as the similarity between the seed title and the article;
S44, sorting the similarity results obtained in step S43 to form the article recommendation list.
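Steps S41 to S44 can be sketched as a precomputed lookup table followed by a word-by-word accumulation. The claim does not say how words of the two titles are paired, so this sketch assumes that each seed-title word contributes its best match in the candidate title; that pairing rule, the toy vectors in the usage, and the function names are all assumptions.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_similarity_dict(vectors: Dict[str, List[float]]
                          ) -> Dict[Tuple[str, str], float]:
    """S41: pairwise similarity of all candidate words, computed once."""
    return {(a, b): cosine(vectors[a], vectors[b])
            for a in vectors for b in vectors}


def title_similarity(seed: str, title: str,
                     table: Dict[Tuple[str, str], float]) -> float:
    """S42 and S43: look up word-level similarities and accumulate them."""
    total = 0.0
    for s in seed.split():
        # Assumed pairing rule: best match of this seed word in the title.
        total += max((table.get((s, t), 0.0) for t in title.split()),
                     default=0.0)
    return total


def rank_titles(seed: str, titles: List[str],
                table: Dict[Tuple[str, str], float]) -> List[str]:
    """S44: sort candidate titles by accumulated similarity, best first."""
    return sorted(titles, key=lambda t: title_similarity(seed, t, table),
                  reverse=True)
```

Precomputing the table trades memory for speed: scoring a title then needs only dictionary lookups rather than repeated vector arithmetic.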
5. A local citation recommendation system based on neural machine translation technology, applied to the method of any one of claims 1 to 4, comprising:
a citation cleaning module for processing the input citation context into the standard input corpus form required by the encoder-decoder framework;
an article expansion module for dynamically expanding the to-be-cited article list library on the basis of the existing article list library, crawling the latest openly published articles of the relevant literature retrieval platforms in a timely manner using web crawler technology, so that the to-be-cited article list library for citation contexts is more complete and comprehensive;
a candidate word updating module for recalculating word frequencies after the to-be-cited article list library is updated, and dynamically updating the candidate word list used when the decoder decodes the seed title;
and a citation recommendation module for calculating the recommended article list for a given citation context.
6. The local citation recommendation system based on neural machine translation technology according to claim 5, wherein the citation cleaning module is specifically configured to:
remove invalid symbols from the citation context, replace words of the citation context that do not appear in the vocabulary with <UNK>, pad with <PAD> when the number of words is below the preset range, truncate when the number of words exceeds the preset range, lemmatize all words, and then convert all words into word vectors using the pre-trained word vector model.
7. The local citation recommendation system based on neural machine translation technology according to claim 5, wherein the article expansion module crawls the latest published articles of the relevant retrieval platforms using web crawler technology, performs citation extraction, lemmatization and word-frequency-statistics data cleaning on the original data set to obtain the parallel corpus of citation contexts and cited article titles, constructs the initial to-be-cited article list library, and dynamically expands and maintains the to-be-cited article list library.
8. The local citation recommendation system based on neural machine translation technology according to claim 5, wherein, in the candidate word updating module, after the to-be-cited article list library is updated, the titles of the latest complete to-be-cited article list library are segmented into words and the word frequencies are recalculated, and the candidate word list used when the decoder decodes the seed title is then dynamically updated, so that the to-be-cited article list and the decoder candidate word list remain synchronized.
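The candidate word update of claim 8 amounts to recomputing word frequencies over all titles whenever the library changes. A minimal sketch follows, assuming whitespace tokenization of lower-cased titles; the `min_count` and `max_size` thresholds and the function name are hypothetical parameters that the claim does not specify.

```python
from collections import Counter
from typing import Iterable, List, Optional


def update_candidate_words(titles: Iterable[str], min_count: int = 1,
                           max_size: Optional[int] = None) -> List[str]:
    """Recount word frequencies over all titles and return the candidate
    words for the decoder, most frequent first."""
    counts = Counter(word for title in titles
                     for word in title.lower().split())
    return [word for word, count in counts.most_common(max_size)
            if count >= min_count]
```

Calling this function after every library expansion keeps the decoder's candidate list in step with the titles that can actually be recommended.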
9. The local citation recommendation system based on neural machine translation technology according to claim 5, wherein the citation recommendation module combines local citation recommendation with neural machine translation, expresses local citation recommendation as a machine translation problem from a source language to a target language, and calculates the recommended article list for a given citation context.
CN201810994562.0A 2018-08-27 2018-08-27 Local citation recommendation method and system based on neural machine translation technology Active CN109145190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810994562.0A CN109145190B (en) 2018-08-27 2018-08-27 Local citation recommendation method and system based on neural machine translation technology

Publications (2)

Publication Number Publication Date
CN109145190A CN109145190A (en) 2019-01-04
CN109145190B true CN109145190B (en) 2021-07-30

Family

ID=64828908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810994562.0A Active CN109145190B (en) 2018-08-27 2018-08-27 Local citation recommendation method and system based on neural machine translation technology

Country Status (1)

Country Link
CN (1) CN109145190B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740164B (en) * 2019-01-09 2023-08-15 国网浙江省电力有限公司舟山供电公司 Electric power defect grade identification method based on depth semantic matching
CN109740168B (en) * 2019-01-09 2020-10-13 北京邮电大学 Traditional Chinese medicine classical book and ancient sentence translation method based on traditional Chinese medicine knowledge graph and attention mechanism
CN109753567A (en) * 2019-01-31 2019-05-14 安徽大学 A kind of file classification method of combination title and text attention mechanism
CN110276082B (en) * 2019-06-06 2023-06-30 百度在线网络技术(北京)有限公司 Translation processing method and device based on dynamic window
CN110472727B (en) * 2019-07-25 2021-05-11 昆明理工大学 Neural machine translation method based on re-reading and feedback mechanism
CN111061935B (en) * 2019-12-16 2022-04-12 北京理工大学 Science and technology writing recommendation method based on self-attention mechanism
CN111581401B (en) * 2020-05-06 2023-04-07 西安交通大学 Local citation recommendation system and method based on depth correlation matching
CN112035607B (en) * 2020-08-19 2022-05-20 中南大学 Method, device and storage medium for matching citation difference based on MG-LSTM
CN112395892B (en) * 2020-12-03 2022-03-18 内蒙古工业大学 Mongolian Chinese machine translation method for realizing placeholder disambiguation based on pointer generation network
CN112765342B (en) * 2021-03-22 2022-10-14 中国电子科技集团公司第二十八研究所 Article recommendation method based on time and semantics
CN113268951B (en) * 2021-04-30 2023-05-30 南京邮电大学 Deep learning-based quotation recommendation method
CN113239181B (en) * 2021-05-14 2023-04-18 电子科技大学 Scientific and technological literature citation recommendation method based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105589948A (en) * 2015-12-18 2016-05-18 重庆邮电大学 Document citation network visualization and document recommendation method and system
US9607058B1 (en) * 2016-05-20 2017-03-28 BlackBox IP Corporation Systems and methods for managing documents associated with one or more patent applications
CN106682172A (en) * 2016-12-28 2017-05-17 江苏大学 Keyword-based document research hotspot recommending method
CN106844368A (en) * 2015-12-03 2017-06-13 华为技术有限公司 For interactive method, nerve network system and user equipment
CN107341199A (en) * 2017-06-21 2017-11-10 北京林业大学 A kind of recommendation method based on documentation & info general model
GB2556664A (en) * 2016-11-07 2018-06-06 Google Llc Third party application configuration for issuing notifications

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10331658B2 (en) * 2011-06-03 2019-06-25 Gdial Inc. Systems and methods for atomizing and individuating data as data quanta
US9218344B2 (en) * 2012-06-29 2015-12-22 Thomson Reuters Global Resources Systems, methods, and software for processing, presenting, and recommending citations
US10769865B2 (en) * 2016-07-15 2020-09-08 Charlena L. Thorpe Licensing and ticketing system for traffic violation
US10817814B2 (en) * 2016-08-26 2020-10-27 Conduent Business Services, Llc System and method for coordinating parking enforcement officer patrol in real time with the aid of a digital computer

Also Published As

Publication number Publication date
CN109145190A (en) 2019-01-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant