Disclosure of Invention
Embodiments of the invention provide a keyword generation method, apparatus, device and medium, which at least solve the problem of low keyword quality.
In a first aspect, an embodiment of the present invention provides a keyword generation method, including the following steps:
acquiring training data, wherein the training data comprises a document, subject words and domain knowledge graph entities contained in the document, and labeled data of the document;
respectively performing feature extraction on the document, the subject words and the domain knowledge graph entities to respectively obtain document feature information, subject feature information and domain feature information;
fusing the document feature information, the subject feature information and the domain feature information to obtain a fused feature;
performing end-to-end model training with the labeled data of the document and the fused feature to obtain a trained keyword generation model;
and receiving a document to be predicted, and outputting the keywords of the document to be predicted through the keyword generation model.
In some embodiments, before the acquiring of the training data, the method comprises:
obtaining the document, and training a GSDMM model on the document to obtain the subject words contained in the document;
and acquiring a corresponding domain knowledge graph, and determining the domain knowledge graph entities contained in the document according to the domain knowledge graph.
In some embodiments, performing feature extraction on the document, the subject words, and the domain knowledge graph entities respectively includes:
respectively converting the document, the subject words, and the domain knowledge graph entities into a word-level vector representation of the document, a word-level vector representation of the subject words, and a word-level vector representation of the domain knowledge graph entities;
performing feature extraction, by an encoder, on the word-level vector representation of the document, the word-level vector representation of the subject words, and the word-level vector representation of the domain knowledge graph entities.
In some embodiments, converting the document into the word-level vector representation of the document comprises:
performing word segmentation on the document to obtain a word segmentation result;
and encoding the word segmentation result through word vectors to obtain the word-level vector representation of the document.
In some embodiments, before converting the domain knowledge graph entities into the word-level vector representation of the domain knowledge graph entities, the method comprises:
acquiring a corresponding domain knowledge graph;
and training TransE vectors of the domain knowledge graph according to the domain knowledge graph.
In some embodiments, converting the domain knowledge graph entities into the word-level vector representation of the domain knowledge graph entities comprises:
encoding the domain knowledge graph entities through the TransE vectors to obtain the word-level vector representation of the domain knowledge graph entities.
In some embodiments, fusing the document feature information, the subject feature information, and the domain feature information to obtain a fused feature includes:
respectively representing the document feature information, the subject feature information and the domain feature information through an attention mechanism to obtain an attention-mechanism representation of the document feature information, an attention-mechanism representation of the subject feature information and an attention-mechanism representation of the domain feature information;
and fusing the attention-mechanism representation of the document feature information, the attention-mechanism representation of the subject feature information and the attention-mechanism representation of the domain feature information to obtain a fused feature.
In a second aspect, an embodiment of the present invention provides a keyword generation apparatus, including:
the training data acquisition module is used for acquiring training data, wherein the training data comprises a document, subject words and domain knowledge graph entities contained in the document, and labeled data of the document;
the encoding module is used for respectively performing feature extraction on the document, the subject words and the domain knowledge graph entities to respectively obtain document feature information, subject feature information and domain feature information;
the feature fusion module is used for fusing the document feature information, the subject feature information and the domain feature information to obtain a fused feature;
the model training module is used for performing end-to-end model training with the labeled data of the document and the fused feature to obtain a trained keyword generation model;
and the keyword prediction module is used for receiving a document to be predicted and outputting the keywords of the document to be predicted through the keyword generation model.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the keyword generation method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the keyword generation method according to the first aspect.
Compared with the prior art, the embodiments of the invention provide a keyword generation method, apparatus, device and medium in which the training of an end-to-end model is completed by fusing subject information and domain knowledge graph information, so that the trained model can generate high-quality keywords.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the invention.
Detailed Description
In order to make the purpose and technical solution of the present invention more apparent, the present invention will be described and illustrated with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments provided by the present invention, belong to the protection scope of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Chinese documents generally contain a large amount of information about topics (or interests). By fusing topic features during model training, the comprehensiveness of the document topics and the readability and distinctiveness of the keywords can be considered together, improving the overall quality of keyword generation. Introducing a domain-specific knowledge graph into keyword generation as auxiliary information effectively alleviates the problem of sparse textual knowledge, and further strengthens the ability of the end-to-end encoder to encode background or common-sense knowledge; in many vertical domains in particular, professional knowledge can markedly improve the quality of the generated keywords.
Therefore, the keyword generation method provided by the invention, which fuses topic information and domain knowledge graph information, can reduce redundancy among candidate words to a certain extent, enrich the background knowledge of the text, and improve the topic coverage and accuracy of the keywords.
Example 1
Based on the above principles, this embodiment provides a keyword generation method. FIG. 1 is a flowchart of the keyword generation method according to the present invention.
As shown in FIG. 1, the keyword generation method includes the following steps:
S101, obtaining training data, wherein the training data comprises a document, subject words and domain knowledge graph entities contained in the document, and labeled data of the document.
In this embodiment, a large amount of Chinese keyword extraction sample data is acquired. The sample data comprises a plurality of documents and the keywords corresponding to each document, and each document is labeled with its keywords to obtain the labeled data of the document. Meanwhile, the subject words and domain knowledge graph entities contained in the documents are acquired, and the documents, the subject words and domain knowledge graph entities they contain, and the labeled data are used as training data for supervised training of the subsequent model.
Preferably, before model training, the training parameters are preset; these may include the word vector dimension, the batch size, the window size, the initial learning rate, the word vector matrix, the auxiliary vector matrix, the Chinese vocabulary, the number of generated keywords, and the like. In an actual application scenario, the training parameters can be set as needed and are not specifically limited here.
In addition, when determining the domain knowledge graph entities contained in a document, the document can be segmented into words and the results matched against the entities in the knowledge graph of the corresponding domain. Of course, in other embodiments, other methods of determining the domain knowledge graph entities contained in the document may be freely selected.
It should be noted that, for documents belonging to different domains, knowledge graphs of the corresponding domains are selected for entity determination and model training; the training data may include documents from multiple domains, together with the subject words and entities contained in each document, so as to improve the generalization ability of the model.
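As a concrete illustration of the segment-and-match approach described above, the following is a minimal sketch (not the embodiment's exact procedure): segmented tokens are matched against a set of knowledge-graph entity names with greedy longest-first matching. The entity names and tokens are hypothetical examples.

```python
# Minimal sketch: match n-gram spans of a segmented document against a
# set of knowledge-graph entity names, longest span first. The entity
# set and tokens below are illustrative, not from a real knowledge graph.
def match_kg_entities(tokens, entity_names, max_len=4):
    """Return KG entities found in a token sequence by longest-first matching."""
    found, i = [], 0
    while i < len(tokens):
        matched = False
        # try the longest candidate span starting at position i first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = "".join(tokens[i:i + n])
            if span in entity_names:
                found.append(span)
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return found

entities = {"知识图谱", "关键词", "神经网络"}
tokens = ["知识", "图谱", "辅助", "关键词", "生成"]
# "知识" + "图谱" joins into a known entity; "关键词" matches directly
print(match_kg_entities(tokens, entities))  # → ['知识图谱', '关键词']
```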
S102, respectively performing feature extraction on the document, the subject words and the domain knowledge graph entities to respectively obtain document feature information, subject feature information and domain feature information.
The end-to-end model (Seq2Seq) uses two recurrent neural networks, one for analyzing the input sequence (the encoding process) and the other for generating the output sequence (the decoding process), respectively called the encoder and the decoder.
In the encoding process, the document, the subject words and the domain knowledge graph entities are encoded by the encoder respectively, so as to extract the features of the document information, the subject information and the domain knowledge information. The principle and procedure of the encoder are conventional in the art and are not described here again.
S103, fusing the document feature information, the subject feature information and the domain feature information to obtain a fused feature.
In the decoding stage, a fused feature is formed from the document feature information, the subject feature information and the domain feature information, so that the document information, the subject information and the domain knowledge information are fused. During model training, the text information, subject information and domain knowledge information (background knowledge) contained in the document can therefore be combined, enabling the trained model to generate high-quality keywords.
S104, performing end-to-end model training with the labeled data of the document and the fused feature to obtain a trained keyword generation model.
An end-to-end model produces a prediction directly from the input. The prediction is compared with the ground truth to obtain an error, the error is back-propagated to each layer of the network, and the weights and parameters of the network are adjusted until the model converges or reaches the expected performance. All intermediate operations are contained within one neural network, rather than being split across several separately trained models.
During training, the error is calculated with a negative log-likelihood function as the loss function and back-propagated to each layer of the network; the network weights are adjusted until the optimal parameters are determined, and the trained keyword generation model is obtained after the optimal parameters are fixed.
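As a small illustration of the loss used during training, the sketch below computes the negative log-likelihood of target keyword ids under per-step predicted distributions; the vocabulary size and probabilities are made-up toy values.

```python
# Illustrative only: mean negative log-likelihood over decoding steps,
# the loss function named in the training description above.
import math

def nll_loss(pred_dists, target_ids):
    """Mean negative log-likelihood of the target ids under the
    predicted per-step probability distributions."""
    total = 0.0
    for dist, t in zip(pred_dists, target_ids):
        total += -math.log(dist[t])
    return total / len(target_ids)

# two decoding steps over a toy vocabulary of 4 symbols
dists = [[0.7, 0.1, 0.1, 0.1],
         [0.25, 0.25, 0.25, 0.25]]
targets = [0, 2]
print(round(nll_loss(dists, targets), 4))  # → 0.8715
```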
S105, receiving a document to be predicted, and outputting the keywords of the document to be predicted through the keyword generation model.
A document to be predicted is received and used as the input of the keyword generation model, which then outputs the keywords. Since the output keywords fuse subject information and domain knowledge graph information, they are of higher quality.
When the keyword generation method of this embodiment is applied, the subject words and domain knowledge graph entities contained in the document are used as training data, and subject information and domain knowledge graph information are fused during the model training stage, so that the different kinds of information contained in the document can be considered together, improving the topic coverage and accuracy of the keywords generated by the model, and thus the quality of the generated keywords.
Preferably, before the training data is acquired, the method comprises:
obtaining a document, and training a GSDMM model on the document to obtain the subject words contained in the document;
and acquiring a corresponding domain knowledge graph, and determining the domain knowledge graph entities contained in the document according to the domain knowledge graph.
The GSDMM model is a collapsed Gibbs sampling algorithm based on the Dirichlet Multinomial Mixture model; a large number of texts can be clustered according to similarities calculated by the GSDMM model.
Training a document in the training data through the GSDMM model to obtain the subject words contained in the document specifically comprises the following steps:
Step 1: initialize the GSDMM model parameters: the number of clusters $K$, the Dirichlet priors $\alpha$ and $\beta$, and the maximum number of iterations $I$; initialize the count variables $m_z$ (the number of documents in topic $z$), $n_z$ (the number of words in topic $z$) and $n_z^w$ (the number of occurrences of word $w$ in topic $z$).
Step 2: obtain the document set $D$ and initialize all documents $d$ in $D$: assign each document $d$ a random topic $z_d \sim \mathrm{Uniform}(1, \dots, K)$; update the variables $m_{z_d} \leftarrow m_{z_d} + 1$ and $n_{z_d} \leftarrow n_{z_d} + N_d$; and, for each word $w$ contained in the document, update the variable $n_{z_d}^w \leftarrow n_{z_d}^w + N_d^w$. A set of $K$ different topics is obtained, and each document belongs to only one topic.
Step 3: perform Gibbs sampling: for each document $d$, record the topic $z$ to which $d$ currently belongs, and remove the information of $d$ from that topic: $m_z \leftarrow m_z - 1$, $n_z \leftarrow n_z - N_d$, updating the variable $n_z^w \leftarrow n_z^w - N_d^w$ for each word $w$ contained in the document. Reassign the topic for document $d$ according to the conditional distribution
$$p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z, \neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left(n_{z, \neg d}^w + \beta + j - 1\right)}{\prod_{i=1}^{N_d} \left(n_{z, \neg d} + V\beta + i - 1\right)}$$
and update the variables accordingly, where $N_d$ is the number of words in document $d$, $N_d^w$ the number of occurrences of word $w$ in $d$, $V$ the vocabulary size, and $\neg d$ denotes counts with document $d$ removed.
Step 4: repeat Step 3 until the maximum number of iterations $I$ is reached, obtaining the subject words contained in each document.
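The GSDMM steps above can be sketched as a compact collapsed Gibbs sampler; the hyperparameters and the toy corpus below are illustrative only.

```python
# Compact sketch of the GSDMM collapsed Gibbs sampler (Dirichlet
# Multinomial Mixture). Hyperparameters and corpus are toy values.
import random
from collections import defaultdict

def gsdmm(docs, K=4, alpha=0.1, beta=0.1, iters=15, seed=0):
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V, D = len(vocab), len(docs)
    m_z = [0] * K                                # documents per topic
    n_z = [0] * K                                # words per topic
    n_zw = [defaultdict(int) for _ in range(K)]  # word counts per topic
    z_d = []
    for d in docs:                               # Step 2: random initialization
        z = rng.randrange(K)
        z_d.append(z)
        m_z[z] += 1
        n_z[z] += len(d)
        for w in d:
            n_zw[z][w] += 1
    for _ in range(iters):                       # Steps 3-4: Gibbs sampling
        for i, d in enumerate(docs):
            z = z_d[i]                           # remove doc i from its topic
            m_z[z] -= 1
            n_z[z] -= len(d)
            for w in d:
                n_zw[z][w] -= 1
            weights = []
            for k in range(K):                   # conditional p(z_d = k | ...)
                p = (m_z[k] + alpha) / (D - 1 + K * alpha)
                num = den = 1.0
                for j, w in enumerate(d):
                    num *= n_zw[k][w] + beta + d[:j].count(w)
                    den *= n_z[k] + V * beta + j
                weights.append(p * num / den)
            r, z_new = rng.random() * sum(weights), K - 1
            for k, wgt in enumerate(weights):    # sample the new topic
                if r < wgt:
                    z_new = k
                    break
                r -= wgt
            z_d[i] = z_new                       # reassign and restore counts
            m_z[z_new] += 1
            n_z[z_new] += len(d)
            for w in d:
                n_zw[z_new][w] += 1
    return z_d

docs = [["nlp", "keyword"], ["nlp", "model"], ["bank", "loan"], ["loan", "credit"]]
labels = gsdmm(docs)
print(labels)
```

Each document receives exactly one topic label, matching the one-topic-per-document assumption of the mixture model.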
The corresponding domain knowledge graph may be an open-source domain-specific knowledge graph or a self-constructed domain-specific knowledge graph, and the method for constructing the knowledge graph is common knowledge in the field and will not be described here.
Preferably, performing feature extraction on the document, the subject words and the domain knowledge graph entities respectively includes:
respectively converting the document, the subject words and the domain knowledge graph entities into a word-level vector representation of the document, a word-level vector representation of the subject words and a word-level vector representation of the domain knowledge graph entities;
performing feature extraction, by an encoder, on the word-level vector representation of the document, the word-level vector representation of the subject words and the word-level vector representation of the domain knowledge graph entities.
Firstly, the document and the subject words contained in it are encoded through word vectors, and the domain knowledge graph entities are encoded through TransE vectors, to obtain the word-level vector representation of the document, of the subject words and of the domain knowledge graph entities.
Preferably, converting the document into the word-level vector representation of the document comprises:
performing word segmentation on the document to obtain a word segmentation result;
and encoding the word segmentation result through word vectors to obtain the word-level vector representation of the document.
Specifically, open-source word segmentation tools such as Jieba, THULAC and LTP, or word segmentation algorithms such as the N-gram and hidden Markov models, can be used to segment the document and obtain the word segmentation result. Word segmentation is a conventional technique in the art; besides the tools and algorithms mentioned above, a person skilled in the art may select other segmentation methods to segment the document.
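Since the embodiment leaves the choice of segmenter open, the following is a minimal forward maximum-matching segmenter, a simple stand-in for tools such as Jieba or THULAC; the dictionary is illustrative.

```python
# Minimal forward maximum-matching (FMM) segmenter: take the longest
# dictionary word starting at each position, falling back to a single
# character. The dictionary below is a toy example.
def fmm_segment(text, dictionary, max_word_len=4):
    """Greedy left-to-right dictionary-based word segmentation."""
    result, i = [], 0
    while i < len(text):
        for n in range(min(max_word_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if n == 1 or word in dictionary:
                result.append(word)
                i += n
                break
    return result

dic = {"关键词", "生成", "模型"}
print(fmm_segment("关键词生成模型", dic))  # → ['关键词', '生成', '模型']
```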
Preferably, converting the subject words contained in the document into a word-level vector representation comprises:
encoding the subject words through the word vectors to obtain the word-level vector representation of each subject word.
The word vectors are based on a specific vertical domain. A large number of customer question-answer corpora and domain-specific product documents are collected to form the training corpus; after cleaning and word segmentation, word vectors are pre-trained with a BERT model to generate word vectors for the specific vertical domain. The word segmentation result of the document and the subject words are then encoded with the generated word vectors to obtain their respective vector representations. BERT (Bidirectional Encoder Representations from Transformers) is a well-known pre-trained bidirectional encoding model; through self-supervised learning on massive corpora, it learns good feature representations for words.
Preferably, before converting the domain knowledge graph entities into the word-level vector representation of the domain knowledge graph entities, the method comprises:
acquiring a corresponding domain knowledge graph;
and training TransE vectors of the domain knowledge graph according to the domain knowledge graph.
Converting the domain knowledge graph entities into the word-level vector representation of the domain knowledge graph entities comprises:
encoding the domain knowledge graph entities through the TransE vectors to obtain the word-level vector representation of the domain knowledge graph entities.
The domain-specific knowledge graph TransE vectors are trained through the TransE algorithm; TransE, a knowledge representation learning algorithm known in the art, describes the triples in a knowledge graph with distributed representations.
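To illustrate the TransE idea named above: entities and relations share one vector space, and a triple (h, r, t) is plausible when h + r ≈ t. The toy embeddings below are hand-picked for the example, not trained values.

```python
# Toy TransE scoring: a triple (head, relation, tail) scores higher
# (closer to 0) when head + relation ≈ tail. Embeddings are illustrative.
import math

def transe_score(h, r, t):
    """Negative L2 distance ||h + r - t||; higher means more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

bank   = [0.1, 0.4, -0.2]
offers = [0.3, -0.1, 0.5]
loan   = [bank[i] + offers[i] for i in range(3)]  # consistent triple: h + r = t

good = transe_score(bank, offers, loan)   # 0.0: perfectly plausible
bad  = transe_score(loan, offers, bank)   # clearly negative
print(good > bad)  # → True
```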
The vector representations are then encoded by an encoder to perform feature extraction. In this embodiment, a BiGRU model is selected as the encoder; the BiGRU is a well-known neural network model, and its principle and algorithm are not described here again.
The document is segmented to obtain the word segmentation result $(x_1, x_2, \dots, x_N)$, and the segmentation result is encoded through the word vectors to obtain the word-level vector representation of the document; the vectors are then encoded through the BiGRU model, as shown in the following formula:
$$h_i = \mathrm{BiGRU}\left(x_i, h_{i-1}\right)$$
wherein $x_i$ represents the $i$-th word of the document.
The subject words $(k_1, k_2, \dots, k_M)$ contained in the document are encoded through the word vectors to obtain the word-level vector representation of the subject words, and the subject words are then encoded through the BiGRU model, as shown in the following formula:
$$u_j = \mathrm{BiGRU}\left(k_j, u_{j-1}\right)$$
wherein $k_j$ represents the $j$-th subject word.
The domain knowledge graph entities contained in the document are encoded through the TransE vectors to obtain the word-level vector representation of the entities, and the vectors are then encoded through the BiGRU model, as shown in the following formula:
$$v_l = \mathrm{BiGRU}\left(e_l, v_{l-1}\right)$$
wherein $e_l$ represents the word-level vector representation of the $l$-th knowledge graph entity.
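The recurrent update that a BiGRU encoder applies in each direction can be sketched with a scalar GRU step; the weights below are arbitrary toy values, not trained parameters.

```python
# One scalar GRU update (update gate z, reset gate r, candidate state),
# run forward and backward over a toy sequence to mimic the two
# directions of a BiGRU. All weights are illustrative scalars.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w):
    """Single GRU update: gates, candidate state, interpolation."""
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev)              # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev)              # reset gate
    h_tilde = math.tanh(w["wh"] * x + w["uh"] * r * h_prev)  # candidate
    return (1 - z) * h_prev + z * h_tilde

w = {"wz": 0.5, "uz": 0.1, "wr": 0.4, "ur": 0.2, "wh": 0.9, "uh": 0.3}
seq = [0.2, -0.5, 0.8]
h_fwd = h_bwd = 0.0
for x in seq:                    # forward direction
    h_fwd = gru_step(x, h_fwd, w)
for x in reversed(seq):          # backward direction
    h_bwd = gru_step(x, h_bwd, w)
print([round(h_fwd, 3), round(h_bwd, 3)])  # the two halves of the BiGRU state
```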
Preferably, fusing the document feature information, the subject feature information and the domain feature information to obtain a fused feature includes:
respectively representing the document feature information, the subject feature information and the domain feature information through an attention mechanism to obtain an attention-mechanism representation of the document feature information, an attention-mechanism representation of the subject feature information and an attention-mechanism representation of the domain feature information;
and fusing the attention-mechanism representation of the document feature information, the attention-mechanism representation of the subject feature information and the attention-mechanism representation of the domain feature information to obtain a fused feature.
In the decoding stage, the document, the subject words and the domain knowledge graph entities are fused using an attention mechanism to complete the sequential prediction of the keywords. When decoding the $i$-th word, the text information (document), the subject information (subject words) and the background knowledge information (domain knowledge graph entities) are each represented through the attention mechanism:
The document, based on the attention mechanism, is represented as
$$c_i^{d} = \sum_{t=1}^{N} \alpha_{it} h_t, \qquad \alpha_{it} = \frac{\exp\left(\mathrm{MLP}\left(s_{i-1}, h_t\right)\right)}{\sum_{t'=1}^{N} \exp\left(\mathrm{MLP}\left(s_{i-1}, h_{t'}\right)\right)}$$
The subject words, based on the attention mechanism, are represented as
$$c_i^{k} = \sum_{j=1}^{M} \beta_{ij} u_j, \qquad \beta_{ij} = \frac{\exp\left(\mathrm{MLP}\left(s_{i-1}, u_j\right)\right)}{\sum_{j'=1}^{M} \exp\left(\mathrm{MLP}\left(s_{i-1}, u_{j'}\right)\right)}$$
The domain knowledge graph entities, based on the attention mechanism, are represented as
$$c_i^{e} = \sum_{l=1}^{L} \gamma_{il} v_l, \qquad \gamma_{il} = \frac{\exp\left(\mathrm{MLP}\left(s_{i-1}, v_l\right)\right)}{\sum_{l'=1}^{L} \exp\left(\mathrm{MLP}\left(s_{i-1}, v_{l'}\right)\right)}$$
wherein MLP is a (usually multilayer) perceptron whose activation function is tanh, and $s_{i-1}$ is the $(i-1)$-th output in decoding.
The three kinds of feature information are fused, and the information fusing all the features is represented as
$$c_i = \left[c_i^{d}; c_i^{k}; c_i^{e}\right]$$
At decoding time, the original text information (the labeled data of the document) and the fused feature $c_i$ are combined as inputs of the GRU, specifically:
$$s_i = \mathrm{GRU}\left(s_{i-1}, y_{i-1}, c_i\right)$$
wherein $y_{<i} = (y_1, \dots, y_{i-1})$ is the keyword sequence predicted before step $i$. Finally, $s_i$ is used to complete the prediction of the keyword distribution, and a negative log-likelihood function is used as the loss function in the training process.
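The attention-based fusion described above can be sketched as follows: softmax weights computed from the decoder state score each encoder state, and the three resulting context vectors (document, subject words, entities) are concatenated. All vectors below are toy values; the scoring function is a plain dot product standing in for the MLP.

```python
# Sketch of attention-based fusion: per-source softmax attention over
# encoder states, then concatenation of the three context vectors.
# Decoder/encoder states are illustrative 2-dim toy vectors.
import math

def attention_context(dec_state, enc_states):
    """Weighted sum of encoder states using softmax attention weights."""
    scores = [sum(a * b for a, b in zip(dec_state, h)) for h in enc_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]     # numerically stable softmax
    weights = [e / sum(exps) for e in exps]
    return [sum(w * h[d] for w, h in zip(weights, enc_states))
            for d in range(len(dec_state))]

s = [0.5, -0.2]                        # decoder state at step i-1
doc_h = [[0.1, 0.3], [0.9, -0.4]]      # document encoder states
top_h = [[0.2, 0.2]]                   # subject-word encoder states
ent_h = [[-0.3, 0.6], [0.4, 0.1]]      # entity encoder states

fused = (attention_context(s, doc_h)
         + attention_context(s, top_h)
         + attention_context(s, ent_h))  # concatenated fused feature
print(len(fused))  # → 6, three 2-dim context vectors concatenated
```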
FIG. 2 is a flowchart illustrating the keyword generation method according to an embodiment of the present invention. As shown in FIG. 2, the document, and the subject words and domain knowledge graph entities contained in it, are obtained as the text information, the subject information and the background knowledge information; each is converted into a vector representation and subjected to feature extraction, and feature fusion is performed through an attention mechanism (in the figure, the three attention-based representations correspond to the document, the subject words and the domain knowledge graph entities, respectively). Training and application of the end-to-end model are realized through the fused feature, and keyword prediction is realized through the trained model, which outputs high-quality keywords. It should be noted that, during model training, the parameters and weights of the model are adjusted by back-propagating the error, so as to find the optimal parameters and construct the optimal model.
Example 2
This embodiment provides a keyword generation apparatus, which is used to implement the foregoing embodiments and preferred implementations; what has already been described will not be repeated. The terms "module", "unit", "subunit" and the like used below may denote a combination of software and/or hardware that realizes a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of a keyword generation apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus includes:
a training data acquisition module 31, configured to acquire training data, wherein the training data comprises a document, subject words and domain knowledge graph entities contained in the document, and labeled data of the document;
an encoding module 32, configured to respectively perform feature extraction on the document, the subject words and the domain knowledge graph entities to respectively obtain document feature information, subject feature information and domain feature information;
a feature fusion module 33, configured to fuse the document feature information, the subject feature information and the domain feature information to obtain a fused feature;
a model training module 34, configured to perform end-to-end model training with the labeled data of the document and the fused feature to obtain a trained keyword generation model;
and a keyword prediction module 35, configured to receive a document to be predicted and output the keywords of the document to be predicted through the keyword generation model.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
Example 3
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 4, an electronic device is provided, where the electronic device may be a server, and its internal structural diagram may be as shown in fig. 4. The electronic device comprises a processor, a memory, an input device and an output device; wherein the number of processors in the electronic device may be one or more, and one processor is taken as an example in fig. 4; the processor, memory, input devices and output devices in the electronic apparatus may be connected by a bus or other means, and fig. 4 illustrates the connection by a bus as an example.
The memory, which is a computer-readable storage medium, may include a high-speed random access memory, a non-volatile memory, and the like, and may be used to store an operating system, a software program, a computer-executable program, and a database, such as program instructions/modules corresponding to the keyword generation method in embodiment 1 of the present invention, and may further include a memory, which may be used to provide a running environment for the operating system and the computer program. In some examples, the memory may further include memory located remotely from the processor, and these remote memories may be connected to the electronic device through a network.
The processor, which is used to provide computing and control capabilities, may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing embodiments of the present application. The processor executes the various functional applications and data processing of the electronic device by running the computer-executable programs, software programs, instructions and modules stored in the memory, that is, implements the keyword generation method of embodiment 1.
The output device of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
The electronic device may further include a network interface/communication interface, the network interface of the electronic device being for communicating with an external terminal through a network connection. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with the present application, and does not constitute a limitation on the electronic device to which the present application is applied, and a particular electronic device may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes in the keyword generation method according to embodiment 1 may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium, and which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus DRAM (RDRAM), and Direct Rambus DRAM (DRDRAM).
Example 4
An embodiment of the present invention provides a storage medium containing computer-executable instructions which, when executed by a computer processor, implement a keyword generation method including:
acquiring training data, wherein the training data comprises a document, subject terms and domain knowledge graph entities contained in the document, and labeled data of the document;
performing feature extraction on the document, the subject terms, and the domain knowledge graph entities respectively, to obtain document feature information, topic feature information, and domain feature information;
fusing the document feature information, the topic feature information, and the domain feature information to obtain a fusion feature;
performing end-to-end model training with the labeled data of the document and the fusion feature to obtain a trained keyword generation model;
and receiving a document to be predicted, and outputting keywords of the document to be predicted through the keyword generation model.
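As a purely illustrative sketch of the steps above (not the patented implementation), the three inputs can each be encoded into a feature vector and then fused before feeding an end-to-end model. The encoder, the fusion operator (concatenation here), and all dimensions are assumptions chosen only to make the data flow concrete; the description leaves these choices open.

```python
import zlib
import numpy as np

def encode(tokens, dim=8):
    """Toy encoder: a deterministic hash-seeded embedding per token,
    averaged over the token sequence. Stands in for the real encoder."""
    vecs = []
    for tok in tokens:
        seed = zlib.crc32(tok.encode("utf-8"))  # stable across runs
        vecs.append(np.random.default_rng(seed).standard_normal(dim))
    return np.mean(vecs, axis=0)

def fuse(doc_feat, topic_feat, domain_feat):
    """Fusion by concatenation; only one plausible choice of operator."""
    return np.concatenate([doc_feat, topic_feat, domain_feat])

# Hypothetical inputs: a document, its subject terms, and the
# domain-knowledge-graph entities found in it.
doc = ["keyword", "generation", "with", "topic", "signals"]
topics = ["topic", "signals"]
entities = ["keyword", "generation"]

fused = fuse(encode(doc), encode(topics), encode(entities))
print(fused.shape)  # the fusion feature a downstream model would consume
```

The fused vector would then be paired with the labeled data of the document to train the end-to-end keyword generation model.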
Of course, the computer-executable instructions contained in the storage medium provided by the embodiment of the present invention are not limited to the operations of the keyword generation method described above, and may also perform related operations of the keyword generation method provided by any embodiment of the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented by hardware, although the former is preferable in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and which includes several instructions for causing an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the keyword generation method according to the embodiments of the present invention.
It should be noted that, in the embodiment of the keyword generation method above, the included units and modules are divided only according to functional logic, and the division is not limited thereto as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are merely for convenience of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. References to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof in this application are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. References to "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "And/or" describes an association relationship between associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The terms "first," "second," "third," and the like herein are merely used to distinguish similar objects and do not denote a particular ordering of the objects.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.