CN112800757A - Keyword generation method, device, equipment and medium - Google Patents

Keyword generation method, device, equipment and medium Download PDF

Info

Publication number
CN112800757A
CN112800757A (application CN202110365391.7A)
Authority
CN
China
Prior art keywords
document
domain knowledge
feature information
subject
keyword generation
Prior art date
Legal status
Granted
Application number
CN202110365391.7A
Other languages
Chinese (zh)
Other versions
CN112800757B (en)
Inventor
嵇望
安毫亿
梁青
朱鹏飞
王伟凯
钱艳
Current Assignee
Hangzhou Yuanchuan Xinye Technology Co ltd
Original Assignee
Hangzhou Yuanchuan New Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Yuanchuan New Technology Co ltd filed Critical Hangzhou Yuanchuan New Technology Co ltd
Priority to CN202110365391.7A priority Critical patent/CN112800757B/en
Publication of CN112800757A publication Critical patent/CN112800757A/en
Application granted granted Critical
Publication of CN112800757B publication Critical patent/CN112800757B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology


Abstract

The invention discloses a keyword generation method, a keyword generation apparatus, an electronic device, and a computer storage medium, relates to the technical field of natural language processing, and aims to generate high-quality keywords. The method comprises the following steps: acquiring training data, wherein the training data comprises a document, the topic words and domain knowledge graph entities contained in the document, and annotation data of the document; performing feature extraction on the document, the topic words, and the domain knowledge graph entities to obtain document feature information, topic feature information, and domain feature information, respectively; fusing the document feature information, the topic feature information, and the domain feature information to obtain fused features; performing end-to-end model training with the annotation data of the documents and the fused features to obtain a trained keyword generation model; and receiving a document to be predicted, and outputting the keywords of the document to be predicted through the keyword generation model.

Description

Keyword generation method, device, equipment and medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a keyword generation method, apparatus, device, and medium.
Background
Keyword extraction is an important research direction in the field of natural language processing, and in the big data era, text keyword extraction technology has wide application in various fields.
Text keyword extraction technology extracts words carrying important information from the original text as keywords, improving the ability to quickly acquire the most valuable information in the text. Existing approaches fall into two categories. The first is the extraction approach, which selects a number of words from the original text as keywords; the keywords necessarily occur in the original text. The second is the generation approach, which produces keywords through end-to-end model training; the resulting keywords may not occur in the original text. However, existing keyword extraction methods generally select words by means of their statistical information, neglect the influence of topics, and focus only on the individual optimization of each keyword while ignoring the overall quality of the keyword set.
No effective solution has yet been proposed for the problem of low-quality extracted keywords.
Disclosure of Invention
Embodiments of the invention provide a keyword generation method, apparatus, device, and medium, intended at least to solve the problem of low-quality keywords.
In a first aspect, an embodiment of the present invention provides a keyword generation method, including the following steps:
acquiring training data, wherein the training data comprises a document, the topic words and domain knowledge graph entities contained in the document, and annotation data of the document;
performing feature extraction on the document, the topic words, and the domain knowledge graph entities to obtain document feature information, topic feature information, and domain feature information, respectively;
fusing the document feature information, the topic feature information, and the domain feature information to obtain a fused feature;
performing end-to-end model training with the annotation data of the document and the fused feature to obtain a trained keyword generation model;
and receiving a document to be predicted, and outputting the keywords of the document to be predicted through the keyword generation model.
In some embodiments, before the training data is acquired, the method includes:
acquiring the document, and training on the document through a GSDMM model to obtain the topic words contained in the document;
and acquiring a corresponding domain knowledge graph, and determining the domain knowledge graph entities contained in the document according to the domain knowledge graph.
In some embodiments, the performing feature extraction on the document, the topic words, and the domain knowledge graph entities respectively includes:
converting the document, the topic words, and the domain knowledge graph entities into a word-level vector representation of the document, a word-level vector representation of the topic words, and a word-level vector representation of the domain knowledge graph entities, respectively;
and performing feature extraction, through an encoder, on the word-level vector representation of the document, the word-level vector representation of the topic words, and the word-level vector representation of the domain knowledge graph entities.
In some embodiments, the converting the document into the word-level vector representation of the document includes:
performing word segmentation on the document to obtain a segmentation result;
and encoding the segmentation result with word vectors to obtain the word-level vector representation of the document.
In some embodiments, before the domain knowledge graph entity is converted into the word-level vector representation of the domain knowledge graph entity, the method includes:
acquiring a corresponding domain knowledge graph;
and training TransE vectors of the domain knowledge graph according to the domain knowledge graph.
In some embodiments, the converting the domain knowledge graph entity into the word-level vector representation of the domain knowledge graph entity includes:
encoding the domain knowledge graph entity with the TransE vectors to obtain the word-level vector representation of the domain knowledge graph entity.
In some embodiments, the fusing the document feature information, the topic feature information, and the domain feature information to obtain a fused feature includes:
obtaining, through an attention mechanism, an attention-based representation of the document feature information, an attention-based representation of the topic feature information, and an attention-based representation of the domain feature information, respectively;
and fusing the attention-based representation of the document feature information, the attention-based representation of the topic feature information, and the attention-based representation of the domain feature information to obtain the fused feature.
In a second aspect, an embodiment of the present invention provides a keyword generation apparatus, including:
a training data acquisition module, configured to acquire training data, wherein the training data comprises a document, the topic words and domain knowledge graph entities contained in the document, and annotation data of the document;
an encoding module, configured to perform feature extraction on the document, the topic words, and the domain knowledge graph entities to obtain document feature information, topic feature information, and domain feature information, respectively;
a feature fusion module, configured to fuse the document feature information, the topic feature information, and the domain feature information to obtain fused features;
a model training module, configured to perform end-to-end model training with the annotation data of the document and the fused features to obtain a trained keyword generation model;
and a keyword prediction module, configured to receive a document to be predicted and output the keywords of the document to be predicted through the keyword generation model.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the keyword generation method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the keyword generation method according to the first aspect.
Compared with the prior art, the embodiments of the present invention provide a keyword generation method, apparatus, device, and medium in which the training of an end-to-end model is completed by fusing topic information and domain knowledge graph information, so that the trained model can generate high-quality keywords.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a keyword generation method of the present invention;
FIG. 2 is a flowchart illustrating a keyword generation method according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the structure of a keyword generation apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the purpose and technical solution of the present invention more apparent, the present invention will be described and illustrated with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments provided by the present invention, belong to the protection scope of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Chinese documents generally contain a large amount of information about topics (or interests). Fusing topic features during model training allows the topic coverage of the document and the readability and distinctiveness of the keywords to be considered jointly, which improves the overall quality of keyword generation. Introducing a domain-specific knowledge graph into keyword generation as auxiliary information can effectively alleviate the sparsity of knowledge information in text and further strengthen the ability of the end-to-end model's encoder to encode background or common-sense knowledge; in many vertical fields in particular, professional knowledge can effectively improve the quality of the generated keywords.
Therefore, the keyword generation method provided by the present invention, which fuses topic information and domain knowledge graph information, can reduce redundancy among candidate words to a certain extent, enrich the background knowledge information of the text, and improve the topic coverage and accuracy of the keywords.
Example 1
Based on the above principle, the present embodiment provides a method for generating keywords, and fig. 1 is a flowchart of the method for generating keywords according to the present invention.
As shown in fig. 1, the keyword generation method includes the following steps:
s101, obtaining training data, wherein the training data comprises a document, subject words and domain knowledge graph entities contained in the document, and labeling data of the document.
The embodiment acquires a large amount of Chinese keyword extraction sample data, the sample data comprises a plurality of documents and keywords corresponding to the documents, and the corresponding documents are labeled through the keywords to obtain labeled data of the documents. Meanwhile, subject words and domain knowledge map entities contained in the documents are acquired, and the subject words and the domain knowledge map entities contained in the documents and the labeled data of the documents are used as training data to perform supervised training on subsequent models.
Preferably, before the model training is performed, the parameters of model training are preset. The parameters may include the dimension of the word vectors, the batch size, the window size, the initial learning rate, the word vector matrix, the auxiliary vector matrix, the Chinese vocabulary, the number of keywords to generate, and the like. In an actual application scenario, the parameters of model training may be set according to the actual situation and are not specifically limited here.
In addition, when the domain knowledge graph entities contained in the document are determined, the document may first be segmented into words, and the segments may then be matched against the entities in the knowledge graph of the corresponding domain. Of course, in other embodiments, other methods of determining the domain knowledge graph entities contained in the document may be freely chosen.
It should be noted that, for documents belonging to different fields, knowledge graphs of the corresponding fields are selected for entity determination and model training, and the training data may include document data from multiple fields together with the topic words and entities contained in each document, so as to improve the generalization ability of the model.
S102, performing feature extraction on the document, the topic words, and the domain knowledge graph entities to obtain document feature information, topic feature information, and domain feature information, respectively.
The end-to-end (Seq2Seq) model uses two recurrent neural networks, one for analyzing the input sequence (the encoding process) and the other for generating the output sequence (the decoding process), respectively called the encoder and the decoder.
In the encoding process, the document, the topic words, and the domain knowledge graph entities are each encoded by the encoder so as to extract the features of the document information, the topic information, and the domain knowledge information. The principle and the encoding process of the encoder are conventional technical means in the art and are not described again here.
S103, fusing the document feature information, the topic feature information, and the domain feature information to obtain fused features.
In the decoding stage, the document feature information, the topic feature information, and the domain feature information form fused features, so that the document information, the topic information, and the domain knowledge information are fused. During model training, the text information, topic information, and domain knowledge information (background knowledge information) contained in the document can therefore be combined, so that the trained model can generate high-quality keywords.
S104, performing end-to-end model training with the annotation data of the document and the fused features to obtain a trained keyword generation model.
The end-to-end model obtains a prediction result directly from the input end to the output end. The prediction result is compared with the true result to obtain an error, the error is back-propagated to each layer of the network, and the weights and parameters of the network are adjusted until the model converges or reaches the expected effect. All intermediate operations are contained in the neural network, which is not split into multiple models for separate processing.
In the training process, the error is calculated with a negative log-likelihood function as the loss function and back-propagated to each layer of the network, and the weights of the network are adjusted until the optimal parameters are determined; once the optimal parameters are fixed, the trained keyword generation model is obtained.
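The negative log-likelihood loss described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name and the representation of per-step probabilities as plain lists are assumptions made for the example.

```python
import math

def nll_loss(step_probs, target_ids):
    """Negative log-likelihood of a target keyword sequence.

    step_probs: for each decoding step, the model's probability
                distribution over the vocabulary (a list of floats).
    target_ids: the gold token index at each step.
    The loss sums -log p(gold token) over all steps; during training
    this error is back-propagated to adjust the network weights.
    """
    return sum(-math.log(p[t]) for p, t in zip(step_probs, target_ids))
```

For a two-step sequence where the model puts probability 0.5 and then 0.25 on the gold tokens, the loss is -ln 0.5 - ln 0.25 = ln 8.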
And S105, receiving the document to be predicted, and outputting the keywords of the document to be predicted through the keyword generation model.
A document to be predicted is received and used as the input of the keyword generation model, and the keywords are output through the keyword generation model. The output keywords fuse the topic information and the domain knowledge graph information and are therefore of higher quality.
When the keyword generation method of this embodiment is applied, the topic words and domain knowledge graph entities contained in the document are used as training data, and in the model training stage the topic information and the domain knowledge graph information are fused for model training. The different kinds of information contained in the document can thus be considered jointly, the topic coverage and accuracy of the keywords generated by the model are improved, and the quality of the generated keywords is improved.
Preferably, before the training data is acquired, the method includes:
acquiring a document, and training on the document through a GSDMM model to obtain the topic words contained in the document;
and acquiring a corresponding domain knowledge graph, and determining the domain knowledge graph entities contained in the document according to the domain knowledge graph.
The GSDMM model is a collapsed Gibbs sampling algorithm based on the Dirichlet multinomial mixture model; a large number of texts can be clustered according to the similarity calculated through the GSDMM model.
Training a document in the training data through the GSDMM model to obtain the topic words contained in the document specifically includes the following steps:
Step 1: initialize the GSDMM model parameters $K$, $\alpha$, $\beta$, $I$, and initialize the count variables $m_k$ (the number of documents assigned to topic $k$), $n_k$ (the number of words assigned to topic $k$), and $n_k^w$ (the number of occurrences of word $w$ in topic $k$) to zero.
Step 2: obtain the document set $D$ and initialize all documents $d$ in $D$. Each document $d$ in the document set is assigned a random topic $z_d \in \{1, \dots, K\}$, and the variables are updated: $m_{z_d} \leftarrow m_{z_d} + 1$, $n_{z_d} \leftarrow n_{z_d} + N_d$; and the variable of each word $w$ contained in the document is updated: $n_{z_d}^w \leftarrow n_{z_d}^w + N_d^w$, where $N_d$ is the number of words in $d$ and $N_d^w$ is the number of occurrences of $w$ in $d$. A set of $K$ different topics is obtained, and each document belongs to only one topic.
Step 3: perform Gibbs sampling. For each document $d$, record the topic $z_d$ to which $d$ currently belongs and remove the information of $d$ from that topic: $m_{z_d} \leftarrow m_{z_d} - 1$, $n_{z_d} \leftarrow n_{z_d} - N_d$; and update the variable of each word $w$ contained in the document: $n_{z_d}^w \leftarrow n_{z_d}^w - N_d^w$. Then reassign the topic for document $d$ according to the conditional distribution

$$p(z_d = k \mid \vec{z}_{\neg d}, D) \propto \frac{m_{k,\neg d} + \alpha}{|D| - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{k,\neg d}^w + \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( n_{k,\neg d} + V\beta + i - 1 \right)}$$

where $V$ is the vocabulary size and $\neg d$ denotes the counts with document $d$ removed, and update the variables by adding the counts of $d$ back to its newly assigned topic.
Step 4: repeat Step 3 until the maximum number of iterations $I$ is reached, and the topic words contained in each document are obtained.
The corresponding domain knowledge graph may be an open-source domain-specific knowledge graph or a self-constructed domain-specific knowledge graph, and the method for constructing the knowledge graph is common knowledge in the field and will not be described here.
Preferably, the performing feature extraction on the document, the topic words, and the domain knowledge graph entities respectively includes:
converting the document, the topic words, and the domain knowledge graph entities into a word-level vector representation of the document, a word-level vector representation of the topic words, and a word-level vector representation of the domain knowledge graph entities, respectively;
and performing feature extraction, through an encoder, on the word-level vector representation of the document, the word-level vector representation of the topic words, and the word-level vector representation of the domain knowledge graph entities.
First, the document, the topic words contained in the document, and the domain knowledge graph entities are encoded to obtain the word-level vector representation of the document, the word-level vector representation of the topic words, and the word-level vector representation of the domain knowledge graph entities.
Preferably, converting the document into the word-level vector representation of the document includes:
performing word segmentation on the document to obtain a segmentation result;
and encoding the segmentation result with word vectors to obtain the word-level vector representation of the document.
Specifically, word segmentation may be performed on the document with open-source segmentation tools such as Jieba, THULAC, and LTP, or with segmentation algorithms such as N-gram or hidden Markov models, to obtain the segmentation result. Word segmentation of text is a conventional technical means in the field; besides the tools and algorithms mentioned above, a person skilled in the art may also select other segmentation methods to complete the word segmentation of the document.
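To illustrate the kind of dictionary-based segmentation these tools perform, here is a toy forward maximum-matching segmenter. It is a minimal sketch, not the Jieba/THULAC/LTP implementation, and the vocabulary is a made-up example.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word that starts there; single characters
    fall through unchanged. A toy stand-in for real segmenters."""
    out, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + L]
            if L == 1 or piece in vocab:   # length-1 pieces always match
                out.append(piece)
                i += L
                break
    return out
```

With the toy vocabulary {"自然语言", "处理"}, the string "自然语言处理" is split into those two words.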
Preferably, converting the topic words contained in the document into word-level vector representations includes:
encoding the topic words with word vectors to obtain the word-level vector representation of each topic word.
The word vectors are based on a specific vertical domain. A large number of customer question-and-answer corpora and domain-specific product documents are acquired to obtain a training corpus; after the training corpus is cleaned and segmented, word vectors are pre-trained with a BERT model to generate word vectors based on the specific vertical domain. The segmentation result of the document and the topic words are encoded with the generated word vectors to obtain the vector representation of the segmentation result and the vector representation of the topic words, respectively. The BERT (Bidirectional Encoder Representations from Transformers) model is a pre-trained model and a typical bidirectional encoding model well known in the field; it runs a self-supervised learning method on massive corpora so as to learn good feature representations for words.
Preferably, before the domain knowledge graph entities are converted into the word-level vector representations of the domain knowledge graph entities, the method includes:
acquiring a corresponding domain knowledge graph;
and training TransE vectors of the domain knowledge graph according to the domain knowledge graph.
Converting the domain knowledge graph entities into the word-level vector representations of the domain knowledge graph entities includes:
encoding the domain knowledge graph entities with the TransE vectors to obtain the word-level vector representations of the domain knowledge graph entities.
The domain-specific knowledge graph TransE vectors are trained through the TransE algorithm. TransE, a knowledge representation learning algorithm well known in the art, describes the triples in a knowledge graph with distributed representations.
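The translation assumption behind TransE can be sketched as a minimal scoring function. This shows only the scoring idea, not the full training procedure (in training, embeddings are adjusted with a margin-based loss so that true triples score lower than corrupted ones); the function name is illustrative.

```python
def transe_score(h, r, t):
    """TransE models a knowledge-graph triple (head, relation, tail)
    as a translation h + r ~ t in embedding space; the plausibility
    score is the distance ||h + r - t|| (lower = more plausible)."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5
```

A triple whose embeddings satisfy h + r = t exactly scores 0, the best possible value.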
The vector representation is then encoded by an encoder to achieve feature extraction. In this embodiment, a BiGRU model is selected as an encoder to complete feature extraction, and the BiGRU model is a known neural network model, and the principle and algorithm thereof are not described herein again.
Word segmentation is performed on the document to obtain the segmentation result $(x_1, x_2, \dots, x_N)$, and the segmentation result is encoded to obtain the word-level vector representation of the document; the vectors are then encoded by the BiGRU model, as shown in the following formula:

$$h_i^x = \mathrm{BiGRU}(x_i, h_{i-1}^x)$$

where $x_i$ represents the $i$-th segmented word of the document.
The topic words $(k_1, k_2, \dots, k_M)$ contained in the document are encoded with the word vectors to obtain the word-level vector representation of the topic words, and the topic words are then encoded by the BiGRU model, as shown in the following formula:

$$h_j^k = \mathrm{BiGRU}(k_j, h_{j-1}^k)$$

where $k_j$ represents the $j$-th topic word.
The domain knowledge graph entities contained in the document are encoded with the TransE vectors to obtain the word-level vector representation of the entities, and the vectors are then encoded by the BiGRU model, as shown in the following formula:

$$h_l^e = \mathrm{BiGRU}(e_l, h_{l-1}^e)$$

where $e_l$ represents the word-level vector representation of the $l$-th knowledge graph entity.
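To show the gating a (Bi)GRU applies at each token, here is a one-dimensional GRU step with toy scalar weights. This is purely illustrative (names and dimensions are assumptions, not from the patent); a BiGRU runs one such recurrence left-to-right and another right-to-left and concatenates the two hidden states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h, W):
    """One GRU step for scalar input x and hidden state h.
    W holds the six scalar weights of the update gate (wz, uz),
    reset gate (wr, ur), and candidate state (wh, uh)."""
    z = sigmoid(W["wz"] * x + W["uz"] * h)            # update gate
    r = sigmoid(W["wr"] * x + W["ur"] * h)            # reset gate
    h_tilde = math.tanh(W["wh"] * x + W["uh"] * (r * h))  # candidate
    return (1 - z) * h + z * h_tilde                  # interpolated new state
```

With all weights zero, both gates sit at 0.5 and the candidate is 0, so the new state is simply half the old one.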
Preferably, the fusing the document feature information, the topic feature information, and the domain feature information to obtain fused features includes:
obtaining, through an attention mechanism, the attention-based representation of the document feature information, the attention-based representation of the topic feature information, and the attention-based representation of the domain feature information, respectively;
and fusing the attention-based representation of the document feature information, the attention-based representation of the topic feature information, and the attention-based representation of the domain feature information to obtain the fused features.
In the decoding stage, the document, the topic words, and the domain knowledge graph entities are fused using an attention mechanism to complete the sequential prediction of the keywords. When the $i$-th word is decoded, the text information (document), the topic information (topic words), and the background knowledge information (domain knowledge graph entities) are each given an attention-based representation.
The attention-based representation of the document is $c_i^x$, specifically:

$$c_i^x = \sum_{j} \alpha_{ij}^x h_j^x, \qquad \alpha_{ij}^x = \frac{\exp\big(f(s_{i-1}, h_j^x)\big)}{\sum_{j'} \exp\big(f(s_{i-1}, h_{j'}^x)\big)}$$

The attention-based representation of the topic words is $c_i^k$:

$$c_i^k = \sum_{j} \alpha_{ij}^k h_j^k, \qquad \alpha_{ij}^k = \frac{\exp\big(f(s_{i-1}, h_j^k)\big)}{\sum_{j'} \exp\big(f(s_{i-1}, h_{j'}^k)\big)}$$

The attention-based representation of the domain knowledge graph entities is $c_i^e$:

$$c_i^e = \sum_{j} \alpha_{ij}^e h_j^e, \qquad \alpha_{ij}^e = \frac{\exp\big(f(s_{i-1}, h_j^e)\big)}{\sum_{j'} \exp\big(f(s_{i-1}, h_{j'}^e)\big)}$$

where $f$ is an MLP, which is usually multilayered and whose activation function is tanh, and $s_{i-1}$ is the $(i-1)$-th output state in decoding.
The three kinds of feature information are fused, and the representation fusing all the features is $c_i$:

$$c_i = \big[c_i^x;\; c_i^k;\; c_i^e\big]$$
At decoding time, the original text information (the annotation data of the document) and the fused feature $c_i$ are taken as the inputs of the GRU, specifically:

$$s_i = \mathrm{GRU}(y_{i-1}, c_i, s_{i-1})$$

where $y_{<i}$ is the keyword sequence predicted before step $i$.
Finally, use
Figure 989962DEST_PATH_IMAGE033
To complete the prediction of the keyword distribution, and a negative log-likelihood function is used as a loss function in the training process.
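Putting the pieces together, one decoding step (fuse the three context vectors, run a GRU cell, take the softmax and the NLL loss) might be sketched as follows; the dimensions, the `gru_cell` helper and the random weights are illustrative assumptions, not the patented model:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def gru_cell(x, h, params):
    """One GRU step: update gate z, reset gate r, candidate state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = 1.0 / (1.0 + np.exp(-(Wz @ x + Uz @ h)))
    r = 1.0 / (1.0 + np.exp(-(Wr @ x + Ur @ h)))
    h_new = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_new

rng = np.random.default_rng(1)
d_h, d_emb, vocab = 8, 4, 10
c_doc, c_topic, c_kg = rng.standard_normal((3, d_h))
c_i = np.concatenate([c_doc, c_topic, c_kg])   # fused feature c_i
y_prev = rng.standard_normal(d_emb)            # previous keyword embedding
x = np.concatenate([y_prev, c_i])              # GRU input
s_prev = np.zeros(d_h)                         # decoder state s_{i-1}

params = [rng.standard_normal((d_h, x.size)) * 0.1 if k % 2 == 0
          else rng.standard_normal((d_h, d_h)) * 0.1 for k in range(6)]
s_i = gru_cell(x, s_prev, params)              # new decoder state s_i

W_out = rng.standard_normal((vocab, d_h)) * 0.1
p = softmax(W_out @ s_i)                       # keyword distribution
nll = -np.log(p[3])                            # NLL loss for gold index 3
print(s_i.shape)                               # (8,)
```

As in the attention sketch, all weight matrices here would be learned jointly end to end rather than sampled.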
FIG. 2 is a flowchart illustrating the keyword generation method according to an embodiment of the present invention. As shown in FIG. 2, the document, the subject words and the domain knowledge graph entities contained in the document are acquired to obtain the text information, topic information and background knowledge of the document; the document, the subject words and the domain knowledge graph entities are each given a vector representation and subjected to feature extraction; feature fusion is then performed through the attention mechanism (in the figure, $c_i^{d}$, $c_i^{t}$ and $c_i^{e}$ denote the attention-based representations of the document, the subject words and the domain knowledge graph entities, respectively). The fused features support the training and application of an end-to-end model, and the trained model predicts and outputs high-quality keywords. Note that during model training, the parameters and weights of the model are adjusted by back-propagating errors, so as to seek optimal parameters and construct an optimal model.
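The back-propagation step mentioned above can be illustrated on the decoder's softmax output layer alone: a toy gradient-descent loop on the negative log-likelihood, with made-up dimensions and a single fixed decoder state rather than the full end-to-end model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, vocab = 8, 10
W = rng.standard_normal((vocab, d_h)) * 0.1   # output-layer weights
s = rng.standard_normal(d_h)                  # a fixed decoder state
target = 3                                    # gold keyword index

def nll_and_grad(W, s, target):
    """Loss and gradient of softmax + negative log-likelihood w.r.t. W."""
    logits = W @ s
    p = np.exp(logits - logits.max()); p /= p.sum()
    loss = -np.log(p[target])
    dlogits = p.copy(); dlogits[target] -= 1.0   # classic softmax-NLL gradient
    return loss, np.outer(dlogits, s)

losses = []
for _ in range(50):              # back-propagate the error, adjust weights
    loss, gW = nll_and_grad(W, s, target)
    W -= 0.1 * gW                # gradient-descent update
    losses.append(loss)
print(losses[-1] < losses[0])    # True: the loss decreases during training
```

The real model back-propagates through the GRU, the attention layers and the encoders as well; this sketch only shows the principle of adjusting weights against the NLL loss.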
Example 2
The present embodiment provides a keyword generation apparatus, which is used to implement the foregoing embodiment and its preferred implementations; what has already been described will not be repeated. As used below, the terms "module", "unit", "subunit" and the like may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of a keyword generation apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus includes:
a training data acquisition module 31, configured to acquire training data, where the training data includes a document, the subject words and the domain knowledge graph entities contained in the document, and the annotation data of the document;
an encoding module 32, configured to perform feature extraction on the document, the subject words and the domain knowledge graph entities, respectively, to obtain document feature information, subject feature information and domain feature information;
a feature fusion module 33, configured to fuse the document feature information, the subject feature information and the domain feature information to obtain a fused feature;
a model training module 34, configured to perform end-to-end model training through the annotation data of the document and the fused feature to obtain a trained keyword generation model;
and a keyword prediction module 35, configured to receive a document to be predicted and output the keywords of the document to be predicted through the keyword generation model.
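As a sketch only, the five modules could be mirrored by a plain Python class; every method body below is a placeholder standing in for the real logic (the class and method names are illustrative, not the patented implementation):

```python
class KeywordGenerationApparatus:
    """Skeleton mirroring modules 31-35; all bodies are placeholders."""

    def acquire_training_data(self, corpus):
        # module 31: document, subject words, KG entities, annotations
        return [{"doc": d, "topics": [], "entities": [], "labels": []}
                for d in corpus]

    def encode(self, sample):
        # module 32: per-input feature extraction (placeholder: lengths)
        return {key: len(str(value)) for key, value in sample.items()}

    def fuse(self, features):
        # module 33: attention-based feature fusion (placeholder: sum)
        return sum(features.values())

    def train(self, fused, labels):
        # module 34: end-to-end training, returning a trained "model"
        return {"fused": fused, "labels": labels}

    def predict(self, model, document):
        # module 35: keyword output (placeholder: first three tokens)
        return document.split()[:3]

app = KeywordGenerationApparatus()
data = app.acquire_training_data(["keyword generation with knowledge graphs"])
model = app.train(app.fuse(app.encode(data[0])), data[0]["labels"])
print(app.predict(model, "keyword generation with knowledge graphs"))
```

Each method would be swapped for the corresponding encoder, attention-fusion and GRU-decoder components described in embodiment 1; the modules may equally be realized in hardware, as the text notes.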
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
Example 3
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 4, an electronic device is provided, which may be a server. The electronic device comprises a processor, a memory, an input device and an output device; the number of processors in the electronic device may be one or more, and one processor is taken as an example in fig. 4. The processor, memory, input devices and output devices in the electronic device may be connected by a bus or in other ways; fig. 4 illustrates connection by a bus as an example.
The memory, as a computer-readable storage medium, may include high-speed random access memory, non-volatile memory and the like, and may be used to store an operating system, software programs, computer-executable programs and a database, such as the program instructions/modules corresponding to the keyword generation method of embodiment 1 of the present invention; the memory may also provide a running environment for the operating system and the computer programs. In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the electronic device through a network.
The processor, which provides computing and control capabilities, may include a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The processor executes the various functional applications and data processing of the electronic device by running the computer-executable programs, software programs, instructions and modules stored in the memory, that is, implements the keyword generation method of embodiment 1.
The output device of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
The electronic device may further include a network interface/communication interface, the network interface of the electronic device being for communicating with an external terminal through a network connection. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with the present application, and does not constitute a limitation on the electronic device to which the present application is applied, and a particular electronic device may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the keyword generation method of embodiment 1 may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the above method embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM) or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Direct Rambus RAM (RDRAM) and Direct Rambus Dynamic RAM (DRDRAM).
Example 4
An embodiment of the present invention provides a storage medium containing computer-executable instructions which, when executed by a computer processor, implement a keyword generation method comprising:
acquiring training data, wherein the training data comprises a document, the subject words and the domain knowledge graph entities contained in the document, and the annotation data of the document;
respectively extracting features of the document, the subject words and the domain knowledge graph entities to obtain document feature information, subject feature information and domain feature information;
fusing the document feature information, the subject feature information and the domain feature information to obtain a fused feature;
performing end-to-end model training through the annotation data of the document and the fused feature to obtain a trained keyword generation model;
and receiving a document to be predicted, and outputting the keywords of the document to be predicted through the keyword generation model.
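The step of determining which domain knowledge graph entities a document contains can be illustrated with a simple greedy longest-match scan over the entity names; the function name and the toy entity set are illustrative assumptions, not the patent's matching procedure:

```python
def match_entities(document, kg_entities):
    """Find which knowledge-graph entity names occur in the document,
    preferring longer matches (greedy longest-match scan)."""
    found, i = [], 0
    names = sorted(kg_entities, key=len, reverse=True)  # longest names first
    while i < len(document):
        for name in names:
            if document.startswith(name, i):            # match at position i
                found.append(name)
                i += len(name)                          # skip past the match
                break
        else:
            i += 1                                      # no match: advance one char
    return found

kg = {"knowledge graph", "graph", "keyword"}
doc = "keyword generation guided by a knowledge graph"
print(match_entities(doc, kg))   # ['keyword', 'knowledge graph']
```

For Chinese text the scan would run over segmented words rather than characters, and a trie would replace the linear search for large entity vocabularies.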
Of course, the computer-executable instructions contained in the storage medium provided by the embodiment of the present invention are not limited to the operations of the keyword generation method described above, and may also perform related operations of the keyword generation method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, where the computer software product may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, and includes several instructions to enable an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the keyword generation method according to the embodiments of the present invention.
It should be noted that, in the embodiment of the keyword generation method, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Unless defined otherwise, technical and scientific terms used herein have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. References to "a", "an", "the" and similar words in this application do not limit the number and may refer to the singular or the plural. The terms "including", "comprising", "having" and any variations thereof in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, article or apparatus that comprises a list of steps or modules (units) is not limited to the listed steps or units, but may include other steps or units not expressly listed or inherent to such process, method, article or apparatus. References to "connected", "coupled" and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association relationship between associated objects and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects. The terms "first", "second", "third" and the like herein merely distinguish similar objects and do not denote a particular ordering.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (10)

1. A keyword generation method is characterized by comprising the following steps:
acquiring training data, wherein the training data comprises a document, subject words and domain knowledge graph entities contained in the document, and annotation data of the document;
respectively extracting the characteristics of the document, the subject term and the domain knowledge map entity to respectively obtain document characteristic information, subject characteristic information and domain characteristic information;
fusing the document feature information, the subject feature information and the field feature information to obtain a fusion feature;
performing end-to-end model training through the annotation data of the document and the fusion feature to obtain a trained keyword generation model;
and receiving a document to be predicted, and outputting the keywords of the document to be predicted through the keyword generation model.
2. The keyword generation method according to claim 1, wherein the obtaining of the training data comprises:
obtaining the document, and training the document through a GSDMM model to obtain subject terms contained in the document;
and acquiring a corresponding domain knowledge graph, and determining a domain knowledge graph entity contained in the document according to the domain knowledge graph.
3. The method of generating keywords according to claim 1, wherein the performing feature extraction on the document, the subject term, and the domain knowledge graph entity respectively comprises:
converting the document, the subject term, and the domain knowledge graph entity into a term level vector representation of the document, a term level vector representation of the subject term, and a term level vector representation of the domain knowledge graph entity, respectively;
feature extraction is performed by an encoder on the term-level vector representation of the document, the term-level vector representation of the subject term, and the term-level vector representation of the domain knowledge graph entity.
4. The keyword generation method of claim 3, wherein said converting the document into a term-level vector representation of the document comprises:
performing word segmentation on the document to obtain a word segmentation result;
and coding the word segmentation result through a word vector to obtain word level vector representation of the document.
5. The keyword generation method of claim 3, wherein prior to converting the domain knowledge-graph entity to a word-level vector representation of a domain knowledge-graph entity, comprising:
acquiring a corresponding domain knowledge graph;
and training a TransE vector of the domain knowledge graph according to the domain knowledge graph.
6. The keyword generation method of claim 5, wherein the converting the domain knowledge-graph entity to a word-level vector representation of the domain knowledge-graph entity comprises:
and coding the domain knowledge graph entity through the TransE vector to obtain word level vector representation of the domain knowledge graph entity.
7. The method for generating keywords according to claim 1, wherein the fusing the document feature information, the topic feature information and the domain feature information to obtain a fused feature comprises:
respectively representing the document feature information, the theme feature information and the field feature information by an attention mechanism to obtain an attention mechanism representation of the document feature information, an attention mechanism representation of the theme feature information and an attention mechanism representation of the field feature information;
and fusing the attention mechanism representation of the document feature information, the attention mechanism representation of the subject feature information and the attention mechanism representation of the field feature information to obtain a fusion feature.
8. A keyword generation apparatus, comprising:
the training data acquisition module is used for acquiring training data, wherein the training data comprises a document, subject words and domain knowledge graph entities contained in the document, and annotation data of the document;
the encoding module is used for respectively extracting features of the document, the subject words and the domain knowledge graph entities to respectively obtain document feature information, subject feature information and domain feature information;
the feature fusion module is used for fusing the document feature information, the theme feature information and the field feature information to obtain fusion features;
the model training module is used for performing end-to-end model training through the annotation data of the document and the fusion features to obtain a trained keyword generation model;
and the keyword prediction module is used for receiving the document to be predicted and outputting the keywords of the document to be predicted through the keyword generation model.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the keyword generation method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the keyword generation method according to any one of claims 1 to 7.
CN202110365391.7A 2021-04-06 2021-04-06 Keyword generation method, device, equipment and medium Active CN112800757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110365391.7A CN112800757B (en) 2021-04-06 2021-04-06 Keyword generation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN112800757A true CN112800757A (en) 2021-05-14
CN112800757B CN112800757B (en) 2021-07-09

Family

ID=75816312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110365391.7A Active CN112800757B (en) 2021-04-06 2021-04-06 Keyword generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112800757B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376131A (en) * 2018-03-14 2018-08-07 中山大学 Keyword abstraction method based on seq2seq deep neural network models
CN109543017A (en) * 2018-11-21 2019-03-29 广州语义科技有限公司 Legal issue keyword generation method and its system
CN110119765A (en) * 2019-04-18 2019-08-13 浙江工业大学 A kind of keyword extracting method based on Seq2seq frame
CN110188344A (en) * 2019-04-23 2019-08-30 浙江工业大学 A kind of keyword extracting method of multiple features fusion
CN111814770A (en) * 2020-09-04 2020-10-23 中山大学深圳研究院 Content keyword extraction method of news video, terminal device and medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442733A (en) * 2019-08-08 2019-11-12 恒生电子股份有限公司 A kind of subject generating method, device and equipment and medium
CN113761167A (en) * 2021-09-09 2021-12-07 上海明略人工智能(集团)有限公司 Session information extraction method, system, electronic device and storage medium
CN113761167B (en) * 2021-09-09 2023-10-20 上海明略人工智能(集团)有限公司 Session information extraction method, system, electronic equipment and storage medium
CN113886606A (en) * 2021-12-08 2022-01-04 北京海致星图科技有限公司 Data annotation method, device, medium and equipment based on knowledge graph
CN115169367A (en) * 2022-09-06 2022-10-11 杭州远传新业科技股份有限公司 Dialogue generating method and device, and storage medium
CN115169367B (en) * 2022-09-06 2022-12-09 杭州远传新业科技股份有限公司 Dialogue generating method and device, and storage medium
CN116501875A (en) * 2023-04-28 2023-07-28 中电科大数据研究院有限公司 Document processing method and system based on natural language and knowledge graph
CN116501875B (en) * 2023-04-28 2024-04-26 中电科大数据研究院有限公司 Document processing method and system based on natural language and knowledge graph

Also Published As

Publication number Publication date
CN112800757B (en) 2021-07-09


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 23011, Yuejiang commercial center, No. 857, Xincheng Road, Puyan street, Binjiang District, Hangzhou, Zhejiang 311611

Patentee after: Hangzhou Yuanchuan Xinye Technology Co.,Ltd.

Address before: 23 / F, World Trade Center, 857 Xincheng Road, Binjiang District, Hangzhou City, Zhejiang Province, 310051

Patentee before: Hangzhou Yuanchuan New Technology Co.,Ltd.

CP03 Change of name, title or address
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Keyword generation methods, devices, equipment, and media

Effective date of registration: 20230509

Granted publication date: 20210709

Pledgee: China Everbright Bank Limited by Share Ltd. Hangzhou branch

Pledgor: Hangzhou Yuanchuan Xinye Technology Co.,Ltd.

Registration number: Y2023980040155

PE01 Entry into force of the registration of the contract for pledge of patent right