CN116932686A - Theme mining method and device, electronic equipment and storage medium - Google Patents

Theme mining method and device, electronic equipment and storage medium

Info

Publication number
CN116932686A
Authority
CN
China
Prior art keywords: topic, vector, text information, target text, tensor
Prior art date
Legal status
Granted
Application number
CN202311208307.6A
Other languages: Chinese (zh)
Other versions: CN116932686B (en)
Inventor
刘陆阳
林群阳
张闯
王敏
Current Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202311208307.6A
Publication of CN116932686A
Application granted
Publication of CN116932686B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0495 Quantised networks; Sparse networks; Compressed networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a topic mining method and apparatus, an electronic device, and a storage medium, relating to the technical field of artificial intelligence. The method includes: performing self-supervised training on a decoding network based on target text information, so that the topic keyword distribution tensor of the decoding network is iteratively updated from a dense tensor to a sparse tensor; and, after the trained decoding network is obtained, obtaining the topics corresponding to the target text information and/or the topic keywords corresponding to each topic based on the topic keyword distribution tensor of the trained decoding network, wherein the decoding network is constructed based on a pre-trained language model. On the basis of ensuring the accuracy and comprehensiveness of topic mining, the topic mining method and apparatus, the electronic device and the storage medium can mine topics and/or topic keywords with stronger sparsity based on the decoding network; moreover, the decoding network can be swapped into and migrated to other topic models, improving both the interpretability of the topic models and their efficiency of operation, computation and storage.

Description

Theme mining method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular to a topic mining method and apparatus, an electronic device, and a storage medium.
Background
With the explosive growth of the amount of information, it is difficult to read and understand all text information efficiently by manual means alone; performing topic mining on text information can improve the efficiency of reading and understanding it.
In the prior art, topic mining can be performed on text information by using a topic model constructed based on deep learning, so as to obtain topics in the text information and/or topic keywords of each topic.
However, the topics and topic keywords mined by such a topic model are generally not sparse. Therefore, how to enhance the sparsity of the mined topics and/or topic keywords while ensuring the accuracy and comprehensiveness of topic mining is a technical problem to be solved in the field.
Disclosure of Invention
The invention provides a topic mining method and apparatus, an electronic device and a storage medium, which are used to overcome the defect in the prior art that the topics and topic keywords mined by a topic model constructed based on deep learning are not sufficiently sparse, and to enhance the sparsity of the mined topics and/or topic keywords while ensuring the accuracy and comprehensiveness of topic mining.
The invention provides a topic mining method, which comprises the following steps:
acquiring target text information;
performing self-supervision training on a decoding network based on the target text information, so that the topic keyword distribution tensor of the decoding network is iteratively updated from a dense tensor to a sparse tensor;
after the trained decoding network is obtained, obtaining topics corresponding to the target text information and/or topic keywords corresponding to the topics based on the topic keyword distribution tensor of the trained decoding network;
wherein the decoding network is constructed based on a pre-trained language model.
According to the topic mining method provided by the invention, the decoding network includes: an initialization module and a vector conversion module;
the self-supervision training is performed on the decoding network based on the target text information, and the self-supervision training comprises the following steps:
obtaining model parameters of the decoding network output by the initialization module, wherein the model parameters comprise topic keyword distribution tensors, and the initialization module is used for initializing the model parameters of the decoding network;
inputting the topic keyword distribution tensor of the decoding network into the vector conversion module, and obtaining a weighted topic keyword context representation vector corresponding to the decoding network output by the vector conversion module;
Calculating the value of a loss function of the decoding network based on the target text information and the weighted topic keyword context representation vector;
and under the condition that the value of the loss function is not converged, updating the model parameters of the decoding network, and repeating the step of calculating the value of the loss function of the decoding network until the value of the loss function is converged, thereby obtaining the trained decoding network.
According to the topic mining method provided by the invention, the vector conversion module includes: a sparse unit, a textualization unit and a weighting unit;
the inputting of the topic keyword distribution tensor of the decoding network into the vector conversion module and the obtaining of the weighted topic keyword context representation vector corresponding to the decoding network output by the vector conversion module include:
inputting the topic keyword distribution tensor of the decoding network into the sparse unit, and performing sparsification processing on the topic keyword distribution tensor of the decoding network by the sparse unit to further obtain a sparsified topic keyword distribution tensor output by the sparse unit;
inputting the sparse topic keyword distribution tensor into a textualization unit, textualizing the sparse topic keyword distribution tensor by the textualization unit to obtain a topic keyword sequence, and generating a context representation tensor of the topic keyword sequence based on the topic keyword sequence to further obtain the context representation tensor of the topic keyword sequence output by the textualization unit;
And inputting the sparse topic keyword distribution tensor and the context representation tensor of the topic keyword sequence into the weighting unit, and carrying out weighting processing on the context representation tensor of the topic keyword sequence by the weighting unit based on the sparse topic keyword distribution tensor, so as to obtain a weighted topic keyword context representation vector output by the weighting unit.
According to the topic mining method provided by the invention, the sparse unit performing sparsification processing on the topic keyword distribution tensor of the decoding network specifically includes:
sorting, based on the topic keyword distribution tensor of the decoding network, the topic keywords corresponding to each topic in descending order of occurrence frequency to obtain a topic keyword sequence corresponding to each topic;
determining the topic keywords with the target quantity before ranking in the topic keywords corresponding to each topic and the index positions of the topic keywords with the target quantity before ranking based on the topic keyword sequence corresponding to each topic;
and based on the topic keywords of the target number before ranking in the topic keywords corresponding to each topic and the index positions of the topic keywords of the target number before ranking, performing sparsification processing on the topic keyword distribution tensor of the decoding network according to the non-zero element coordinate representation format of the sparse tensor to obtain the sparsified topic keyword distribution tensor.
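As an illustration of this per-topic top-N selection, the following sketch (a minimal example with hypothetical sizes, assuming the distribution tensor is held as a PyTorch tensor) extracts the highest-probability keyword values and their index positions for every topic:

```python
import torch

K, V, N = 3, 10, 4  # hypothetical sizes: 3 topics, vocabulary of 10 terms, keep top-4 keywords

beta = torch.rand(K, V)                          # dense topic keyword distribution tensor
values, indices = torch.topk(beta, k=N, dim=1)   # per-topic descending sort, top-N kept
# values[k]  : probabilities of the N highest-frequency keywords of topic k
# indices[k] : index positions of those keywords in the vocabulary
```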
According to the topic mining method provided by the invention, the textualization unit includes: a first pre-trained language model;
the step of generating the context representation tensor of the topic keyword sequence by the textualization unit based on the topic keyword sequence comprises the following specific steps:
inputting the topic keyword sequence into a first pre-trained language model;
and acquiring a context representation tensor of the topic keyword sequence output by the first pre-training language model.
According to the topic mining method provided by the invention, the first pre-trained language model includes any one of a Transformer model, a Sentence-BERT model and a BERT-base model.
According to the topic mining method provided by the invention, the weighting unit performing weighting processing on the context representation tensor of the topic keyword sequence based on the sparsified topic keyword distribution tensor specifically includes:
acquiring a normalized weight tensor of the topic keywords based on the sparse topic keyword distribution tensor;
and calculating the product of the normalized weight tensor of the topic keywords and the context representation tensor of the topic keyword sequence, and, after the product is obtained, eliminating the redundant dimensions in the product to obtain the weighted topic keyword context representation vector.
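A minimal sketch of this weighting step, assuming PyTorch tensors and hypothetical sizes, could look as follows: the normalized keyword weights are multiplied with the context representation tensor and the redundant middle dimension is squeezed away.

```python
import torch

K, N, D = 3, 4, 8                  # hypothetical sizes: topics, keywords per topic, embedding dim

values = torch.rand(K, N)          # top-N keyword probabilities per topic (non-zero values)
context = torch.rand(K, N, D)      # context representation tensor of the keyword sequences

weights = values / values.sum(dim=1, keepdim=True)   # normalized topic keyword weight tensor
weighted = torch.bmm(weights.unsqueeze(1), context)  # product, shape [K, 1, D]
topic_context = weighted.squeeze(1)                  # eliminate the redundant dimension -> [K, D]
```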
According to the topic mining method provided by the invention, the calculating the value of the loss function of the decoding network based on the target text information and the weighted topic keyword context representation vector comprises the following steps:
performing text conversion on the target text information to convert the target text information into a term number sequence;
inputting the term number sequence into a second pre-trained language model, and obtaining a text vector of the target text information output by the second pre-trained language model;
inputting the text vector into a first linear network and a second linear network respectively, and obtaining a mean vector of the target text information output by the first linear network and a variance vector of the target text information output by the second linear network;
performing re-parameterization and normalization calculation on the mean vector and the variance vector to obtain a normalized topic component vector of the target text information;
and calculating the value of the loss function of the decoding network based on the normalized topic component vector of the target text information, the mean vector, the variance vector and the weighted topic keyword context representation vector.
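The re-parameterization and normalization step can be sketched as below (an illustrative assumption in PyTorch, not the patent's exact formulation): the two linear networks produce a mean and a log-variance vector, a latent vector is sampled via the re-parameterization trick, and a softmax yields the normalized topic component vector.

```python
import torch
import torch.nn.functional as F

K, H = 3, 16                       # hypothetical sizes: number of topics, text-vector dimension

text_vec = torch.rand(1, H)        # text vector from the second pre-trained language model
mean_net = torch.nn.Linear(H, K)   # first linear network  -> mean vector
var_net = torch.nn.Linear(H, K)    # second linear network -> (log-)variance vector

mu = mean_net(text_vec)
log_var = var_net(text_vec)

eps = torch.randn_like(mu)                   # re-parameterization trick
z = mu + eps * torch.exp(0.5 * log_var)      # sampled latent topic vector
theta = F.softmax(z, dim=1)                  # normalized topic component vector of the text
```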
According to the topic mining method provided by the invention, before calculating the value of the loss function of the decoding network based on the normalized topic component vector of the target text information, the mean vector, the variance vector and the weighted topic keyword context representation vector, the method further includes:
performing bag-of-words processing on the target text information to obtain a bag-of-words representation vector of the target text information;
the calculating the value of the loss function of the decoding network based on the normalized topic component vector, the mean vector, the variance vector and the weighted topic keyword context representation vector of the target text information comprises the following steps:
reconstructing a bag-of-word representation vector of the target text information based on the normalized topic component vector of the target text information and the weighted topic keyword context representation vector to obtain a bag-of-word representation vector of the target text information after reconstruction;
and calculating the value of the loss function based on the mean vector, the variance vector, the bag-of-words representation vector of the target text information and the bag-of-words representation vector of the reconstructed target text information.
According to the topic mining method provided by the invention, reconstructing the bag-of-words representation vector of the target text information based on the normalized topic component vector of the target text information and the weighted topic keyword context representation vector includes:
inputting the weighted topic keyword context representation vector into a multi-layer perceptron neural network model, and obtaining a calculation result output by the multi-layer perceptron neural network model;
and calculating the product of the calculation result and the normalized topic component vector of the target text information as the reconstructed bag-of-words representation vector of the target text information.
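A minimal sketch of this reconstruction step (hypothetical shapes and a deliberately small multi-layer perceptron; the patent does not specify the MLP architecture):

```python
import torch

K, D, V = 3, 8, 10                 # hypothetical sizes: topics, context dimension, vocabulary size

topic_context = torch.rand(K, D)   # weighted topic keyword context representation vectors
theta = torch.softmax(torch.rand(1, K), dim=1)   # normalized topic component vector of the text

mlp = torch.nn.Sequential(         # multi-layer perceptron mapping context vectors to the vocabulary
    torch.nn.Linear(D, V),
    torch.nn.Softmax(dim=1),
)

decoded = mlp(topic_context)       # [K, V] per-topic word distributions (the "calculation result")
bow_recon = theta @ decoded        # [1, V] reconstructed bag-of-words representation vector
```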
According to the topic mining method provided by the invention, the loss function of the decoding network comprises a reconstruction loss function and a regularization loss function;
the calculating the value of the loss function based on the mean vector, the variance vector, the bag-of-words representation vector of the target text information and the bag-of-words representation vector of the reconstructed target text information includes:
calculating the value of the regularization loss function based on the word bag representation vector of the target text information and the word bag representation vector after the target text information is reconstructed, and calculating the value of the reconstruction loss function based on the mean vector and the variance vector;
and calculating the product of the value of the regularization loss function and the first target weight, and, after the product is obtained, taking the sum of the product and the value of the reconstruction loss function as the value of the loss function of the decoding network.
According to the topic mining method provided by the invention, the first target weight is dynamically adjusted based on a cyclic annealing strategy.
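One common cyclical annealing schedule is sketched below; the concrete cycle length, ramp shape and maximum weight are illustrative assumptions rather than the patent's exact strategy.

```python
def cyclic_annealing_weight(step, cycle_length=1000, ramp_ratio=0.5, max_weight=1.0):
    """Within each cycle the weight ramps linearly from 0 to max_weight over the first
    ramp_ratio of the cycle, then stays at max_weight until the next cycle starts."""
    pos = (step % cycle_length) / cycle_length
    return max_weight if pos >= ramp_ratio else max_weight * pos / ramp_ratio

# usage sketch inside a training loop:
# target_weight = cyclic_annealing_weight(global_step)
# loss = reconstruction_term + target_weight * regularization_term
```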
According to the topic mining method provided by the invention, the second pre-trained language model includes a Sentence-BERT model.
According to the topic mining method provided by the invention, calculating the value of the loss function of the decoding network based on the normalized topic component vector of the target text information, the mean vector, the variance vector and the weighted topic keyword context representation vector includes:
reconstructing the text vector of the target text information based on the normalized topic component vector of the target text information and the weighted topic keyword context representation vector to obtain the text vector of the reconstructed target text information;
And calculating the value of the loss function based on the mean value vector, the variance vector, the text vector of the target text information and the text vector after the target text information is reconstructed.
According to the topic mining method provided by the invention, the loss function of the decoding network comprises a reconstruction loss function and a regularization loss function;
the calculating the value of the loss function based on the mean vector, the variance vector, the text vector of the target text information and the text vector after the reconstruction of the target text information comprises the following steps:
calculating the value of the regularization loss function based on the text vector of the target text information and the text vector after the target text information is reconstructed, and calculating the value of the reconstruction loss function based on the mean vector and the variance vector;
and calculating the product of the value of the regularization loss function and the second target weight, and, after the product is obtained, taking the sum of the product and the value of the reconstruction loss function as the value of the loss function of the decoding network.
According to the topic mining method provided by the invention, the second target weight is dynamically adjusted based on a cyclic annealing strategy.
According to the topic mining method provided by the invention, acquiring the target text information includes:
acquiring original text information;
and carrying out data processing on the original text information, removing abnormal information in the original text information, and obtaining the target text information, wherein the data processing comprises at least one of spelling check, part-of-speech analysis and spelling error correction.
The invention also provides a topic mining apparatus, including:
the text acquisition module is used for acquiring target text information;
the model training module is used for performing self-supervision training on the decoding network based on the target text information so as to enable the topic keyword distribution tensor of the decoding network to be iteratively updated into a sparse tensor from a dense tensor;
the topic mining module is used for acquiring topics corresponding to the target text information and/or topic keywords corresponding to the topics based on topic keyword distribution tensor of the trained decoding network after the trained decoding network is acquired;
wherein the decoding network is constructed based on a pre-trained language model.
The invention also provides an electronic device, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the topic mining method described in any one of the above.
The invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the topic mining method described in any one of the above.
The invention also provides a computer program product including a computer program which, when executed by a processor, implements the topic mining method described in any one of the above.
According to the topic mining method and apparatus, the electronic device and the storage medium provided by the invention, self-supervised training is performed on the decoding network based on the target text information, and after the trained decoding network is obtained, the topics in the target text information and the topic keywords corresponding to each topic are obtained based on the model parameters of the trained decoding network. On the basis of ensuring the accuracy and comprehensiveness of topic mining, topics and/or topic keywords with stronger sparsity can thus be mined based on the decoding network; moreover, the decoding network can be swapped into and migrated to other topic models, improving the interpretability of the topic models as well as their efficiency of operation, computation and storage.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings described below show some embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a topic model constructed based on a variational auto-encoder in the related art;
FIG. 2 is a schematic diagram of a topic model constructed based on a pre-trained language model in the related art;
FIG. 3 is a schematic flow chart of the topic mining method provided by the present invention;
FIG. 4 is a schematic structural diagram of a decoding network in the topic mining method provided by the present invention;
FIG. 5 is a schematic flow chart of calculating a weighted topic keyword context representation vector corresponding to the decoding network in the topic mining method provided by the present invention;
FIG. 6 is one of the application scenarios of the topic mining method provided by the present invention;
FIG. 7 is a second application scenario of the topic mining method provided by the present invention;
FIG. 8 is a schematic flow chart of performing self-supervised training on the decoding network based on target text information in the topic mining method provided by the present invention;
FIG. 9 is a schematic structural diagram of the topic mining apparatus provided by the present invention;
fig. 10 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description of the invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
It should be noted that the topic model is a model for mining topics of text information and topic keyword distribution of each topic.
With the continuous development of deep learning technology and pre-training language model technology, a topic model constructed based on a deep neural network and a pre-training language model gradually becomes a research hot spot in the topic mining field.
Unlike topic models constructed based on Bayesian directed probabilistic graphical models, topic models constructed based on deep neural networks and pre-trained language models can well replace the statistical information used in the Bayesian models with the dense information obtained from the pre-trained model, and can even achieve small-sample or zero-sample learning relying only on the knowledge contained in the pre-trained model. Compared with topic models constructed based on Bayesian directed probabilistic graphical models, topic models constructed based on deep neural networks and pre-trained language models have extended the development of topic models in the fields of short text, small-sample learning and zero-sample learning.
Fig. 1 is a schematic diagram of a topic model constructed based on a variational auto-encoder in the related art. Fig. 2 is a schematic diagram of a topic model constructed based on a pre-trained language model in the related art. As shown in fig. 1, the input of the topic model constructed based on the variational auto-encoder is a bag-of-words representation vector of the text, and the output is a reconstructed bag-of-words representation vector. As shown in fig. 2, the input of the topic model constructed based on the pre-trained language model is a text sequence, and the output is a reconstructed bag-of-words representation vector.
The bag-of-words (Bag of Words) approach is a commonly used text processing method that converts text information into numerical form so that machine learning algorithms can process it. In the bag-of-words approach, the text is treated as an unordered collection, each word is treated as an independent feature, and the number or frequency of occurrences of each word in the text is counted.
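A tiny bag-of-words example in plain Python (illustrative only) makes the idea concrete:

```python
from collections import Counter

text = "the topic model mines the topic keywords"
tokens = text.split()                    # the text is treated as an unordered collection of words
vocabulary = sorted(set(tokens))         # each distinct word becomes an independent feature
counts = Counter(tokens)

bow_vector = [counts[word] for word in vocabulary]
# vocabulary: ['keywords', 'mines', 'model', 'the', 'topic']
# bow_vector: [1, 1, 1, 2, 2]            # occurrence count of each word in the text
```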
In the related art, a topic model constructed based on a deep neural network and a pre-trained language model replaces the multi-layer perceptron in the encoding network with the pre-trained language model. Compared with a topic model constructed based on a neural network alone, such a model avoids the information-sparsity problem of bag-of-words input in the short-text field, well fuses the knowledge contained in the pre-trained model into the input of the training text, and improves model performance to a certain extent.
The topic model constructed based on the deep neural network and the pre-trained language model replaces the sparse bag-of-words vector input used by the topic model constructed based on the variational auto-encoder with dense context encoding vector inputs based on the Transformer model, so that the obtained sentence vectors or document vectors contain a certain amount of pre-training corpus knowledge, which improves the signal-to-noise ratio of the model input and thus the model performance.
However, the topic model constructed based on the deep neural network and the pre-trained language model still uses the decoding network structure of the variational auto-encoder (Variational Auto-Encoder, VAE) framework, that is, only a single layer of multinomial distribution, or a decomposed multinomial distribution, is used to represent the topic keyword distribution.
A sparse tensor is a tensor in which most elements are 0 and only a small number of elements are non-zero. A dense tensor is a tensor in which most elements are non-zero and only a small number are zero. Sparse tensors are widely used in numerical analysis and scientific computing; for example, with the continuous expansion of data scale, diversified data representation and sparse feature extraction have gradually become research hotspots for improving model interpretability and data representation efficiency.
In deep learning, various parameters of the deep neural network are often various dense tensors, but using sparse tensors to represent the model can improve the efficiency of model operation, calculation and storage while improving the interpretability of the model.
The topic model constructed based on the deep neural network and the pre-training language model has the following defects: in one aspect, in the topic model constructed based on the deep neural network and the pre-trained language model, a linear layer plus Softmax activation function is typically used to model the distribution of topic keywords, while the input and output of the linear layer are typically two-dimensional dense tensors. Therefore, in the topic model constructed based on the deep neural network and the pre-training language model, the distribution tensor of the topic keywords is also usually a two-dimensional dense tensor, and the sparsity of the topic keywords obtained based on the topic model constructed based on the deep neural network and the pre-training language model is not strong.
However, in practical applications not all topic keywords are useful; often only a few topic keywords are needed to represent a given topic. Representing the topic keyword distribution with a sparse tensor in the topic model therefore allows the topic keywords of each topic and their occurrence probabilities to be obtained more efficiently, and topic keywords with stronger sparsity to be obtained.
On the other hand, the topic model constructed based on the deep neural network and the pre-trained language model uses only one layer of multinomial distribution, or a decomposed multinomial distribution, to represent the topic keyword distribution. Since the term distributions in a multinomial are independent of each other, such a topic model inevitably cannot take the contextual relevance between topic keywords into account, and the topic keywords obtained from it therefore often can hardly reflect the contextual relevance between them.
Aiming at the above defects of topic models constructed based on deep neural networks and pre-trained language models in the related art, the invention provides a topic mining method. The method can dynamically extract, according to a configurable number of topic keywords, the several topic keywords with the largest occurrence probability in each topic; convert them into a keyword text sequence and a sparse tensor composed of the occurrence probabilities of the topic keywords; then convert the topic keyword text sequence into a topic keyword vector sequence using a pre-trained model; and finally weight the vector sequence with the corresponding probability values to obtain a context representation of the topic keyword distribution. The improvement of the topic mining method provided by the invention mainly concerns the decoding network of the topic model, which can be swapped into and migrated to other neural topic models and document models to improve the performance of related business document modeling and topic mining tasks.
Fig. 3 is a schematic flow chart of the topic mining method provided by the invention. The topic mining method of the present invention is described below in conjunction with fig. 3. As shown in fig. 3, the method includes: step 301, obtaining target text information.
It should be noted that the execution body of the embodiments of the present invention is a topic mining apparatus.
Specifically, the target text information is an object to be mined in the subject mining method. Based on the topic mining method provided by the invention, text mining can be carried out on the target text information, and topic keywords corresponding to each topic in the target text information are obtained.
In the embodiment of the invention, the target text information can be acquired in various modes, for example: according to the embodiment of the invention, the target text information sent by other electronic equipment can be received; alternatively, the target text information may be obtained after the original text information is subjected to data processing. The embodiment of the invention is not limited to a specific mode for acquiring the target text information.
As an alternative embodiment, obtaining the target text information includes: and acquiring the original text information.
Specifically, in the embodiment of the invention, the text which needs to be subject mined and analyzed can be determined to be the original text information based on the actual requirement.
In the embodiment of the invention, the original text information can be acquired in a data query mode, or the original text information sent by other electronic equipment can be received in the embodiment of the invention.
And carrying out data processing on the original text information, removing abnormal information in the original text information, and obtaining target text information, wherein the data processing comprises at least one of spell checking, part-of-speech analysis and spelling error correction.
Specifically, after the original text information is obtained, data processing such as spell checking, part-of-speech analysis, spelling error correction and the like can be performed on the original text information based on the spaCy text processing library, and after abnormal information in the original text information is removed, target text information can be obtained.
The spaCy text processing library is an NLP (natural language processing) library based on Python and Cython. It can be used for tagging, parsing, named entity recognition, text classification and multi-task learning with pre-trained transformers such as BERT, and provides production-ready training systems as well as simple model packaging, deployment and workflow management.
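A minimal preprocessing sketch with spaCy is shown below. It assumes the small English pipeline is installed (python -m spacy download en_core_web_sm) and only illustrates tokenization, part-of-speech-based filtering and lemmatization; spell checking and correction would require an additional component and are not shown.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def clean(raw_text: str) -> str:
    """Remove tokens that are likely noise and return the normalized target text."""
    doc = nlp(raw_text)
    kept = [
        tok.lemma_.lower()
        for tok in doc
        if tok.is_alpha and not tok.is_stop   # drop punctuation, numbers, stop words, etc.
    ]
    return " ".join(kept)

target_text = clean("Topic models  mine   topics from raw, noisy documents!!")
```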
Step 302, performing self-supervision training on the decoding network based on the target text information, so that the topic keyword distribution tensor of the decoding network is iteratively updated from a dense tensor to a sparse tensor; wherein the decoding network is constructed based on a pre-trained language model.
It should be noted that, the model parameters of the decoding network in the topic model include topic keyword distribution tensors. In a conventional topic model, the topic keyword distribution tensor is usually a dense tensor, and the parameter size of the topic keyword distribution tensor is proportional to the length of the vocabulary.
However, for any topic, in practical application only the several topic keywords with the highest occurrence frequency among the topic keywords corresponding to that topic need to be attended to, rather than every topic keyword corresponding to the topic.
Therefore, sparsifying the topic keyword distribution tensor of the decoding network not only satisfies the requirements of practical application, but also simplifies the model parameter scale of the decoding network, improving the efficiency of running, computation and storage of the decoding network while improving its interpretability.
Specifically, the decoding network in the embodiment of the invention is a network model constructed based on a pre-training language model. The model parameters of the decoding network include a topic keyword distribution tensor. The topic keyword distribution tensor can be used to describe each topic keyword corresponding to each topic, the frequency of occurrence of each topic keyword, the position of each topic keyword, and the like.
After the target text information is acquired, the target text information can be used as a data base, and the decoding network is subjected to self-supervision training according to the loss function of the decoding network, so that the trained decoding network is obtained.
The loss function of the decoding network may be determined based on a variance vector of the subject distribution of the target text information, a bag-of-word representation vector of the target text information, and a bag-of-word representation vector after reconstruction of the target text information; the variance vector of the subject distribution of the target text information, the bag-of-words representation vector of the target text information and the bag-of-words representation vector of the reconstructed target text information are obtained through numerical calculation, deep learning and other modes based on the target text information.
The loss function of the decoding network can also be determined based on the variance vector of the subject distribution of the target text information, the text vector of the target text information and the text vector after the reconstruction of the target text information; the variance vector of the subject distribution of the target text information, the text vector of the target text information and the text vector after reconstruction of the target text information are obtained through numerical calculation, deep learning and other modes based on the target text information.
Before the decoding network is subjected to self-supervision training, the topic keyword distribution tensor of the decoding network is a dense tensor. The process of self-supervision training is also the process of iterative updating of the model parameters of the decoding network. According to the embodiment of the invention, the self-supervision training is carried out on the decoding network based on the target text information, so that the topic keyword distribution tensor of the decoding network can be iteratively updated from the dense tensor to the sparse tensor, and the trained topic keyword distribution tensor of the decoding network can describe each topic keyword corresponding to each topic in the target text information, the occurrence frequency of each topic keyword, the position of each topic keyword and the like.
Step 303, after obtaining the trained decoding network, obtaining a topic corresponding to the target text information and/or a topic keyword corresponding to the topic based on a topic keyword distribution tensor of the trained decoding network.
Specifically, after the trained decoding network is obtained, the topic keyword distribution tensor of the trained decoding network can be based on at least one of data query, numerical calculation and mathematical statistics, so as to obtain the topic in the target text information and/or the topic keyword corresponding to each topic.
It will be appreciated that the number of subjects in the target text information may be one or more. The number of topic keywords corresponding to each topic may also be one or more.
According to the embodiment of the invention, the decoding network is subjected to self-supervision training based on the target text information, after the trained decoding network is obtained, the topic keywords corresponding to the topics in the target text information are obtained based on the model parameters of the trained decoding network, so that the topics and/or the topic keywords with stronger sparsity can be obtained based on the decoding network mining on the basis of ensuring the accuracy and the comprehensiveness of the topic mining, the decoding network can be replaced and migrated into other topic models, and the operating, calculating and storing efficiencies of the topic models can be improved while the interpretability of the topic models is improved.
Fig. 4 is a schematic structural diagram of a decoding network in the topic mining method provided by the present invention. As shown in fig. 4, as an alternative embodiment, the decoding network 401 includes: an initialization module 402 and a vector conversion module 403;
based on the target text information, performing self-supervised training on the decoding network includes: obtaining model parameters of the decoding network output by the initialization module, wherein the model parameters include a topic keyword distribution tensor, and the initialization module is used for initializing the model parameters of the decoding network. In particular, the embodiment of the invention may use β to denote the topic keyword distribution tensor of the decoding network.
An initialization module 402 in the decoding network 401 may initialize model parameters of the decoding network.
After the initialization module 402 initializes the model parameters of the decoding network, the topic keyword distribution tensor of the decoding network output by the initialization module 402 can be obtained as β ∈ R^(K×V), where K denotes the predefined number of topics, V denotes the vocabulary length, k ∈ {1, 2, ..., K} identifies a topic, and β_k denotes the distribution of the k-th topic over the vocabulary.
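The initialization can be sketched as follows (hypothetical sizes; Xavier initialization is an assumption, since the patent does not name a particular initializer):

```python
import torch

K, V = 20, 5000    # hypothetical: predefined number of topics, vocabulary length

# Dense topic keyword distribution tensor created by the initialization module;
# beta[k] is the (unnormalized) distribution of the k-th topic over the vocabulary.
beta = torch.nn.Parameter(torch.empty(K, V))
torch.nn.init.xavier_uniform_(beta)
```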
And inputting the topic keyword distribution tensor of the decoding network into a vector conversion module, and obtaining a weighted topic keyword context representation vector corresponding to the decoding network output by the vector conversion module.
Specifically, after the topic keyword distribution tensor β of the decoding network output by the initialization module 402 is obtained, β can be input into the vector conversion module 403.
Based on the topic keyword distribution tensor β of the decoding network, the vector conversion module 403 can obtain, by numerical calculation, the weighted topic keyword context representation vector E corresponding to the decoding network, where the weighted topic keyword context representation vector E can be used to represent the contextual associations between topic keywords.
And calculating the value of the loss function of the decoding network based on the target text information and the weighted topic keyword context representation vector.
Specifically, after the weighted topic keyword context representation vector E is acquired, the value of the loss function of the decoding network can be calculated numerically based on the target text information and the weighted topic keyword context representation vector E.
And under the condition that the value of the loss function is not converged, updating the model parameters of the decoding network, and repeating the step of calculating the value of the loss function until the value of the loss function is converged, thereby obtaining the trained decoding network.
Specifically, after obtaining the value of the loss function of the decoding network, it may be determined whether the value of the loss function of the decoding network converges.
If the value of the loss function of the decoding network does not converge, the model parameters of the decoding network are updated, the updated weighted topic keyword context representation vector E is obtained based on the updated topic keyword distribution tensor β of the decoding network, and the value of the loss function of the decoding network is then calculated based on the target text information and the updated weighted topic keyword context representation vector E; this is repeated until the value of the loss function of the decoding network converges, and the trained decoding network is obtained.
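The train-until-convergence procedure can be sketched as a generic loop; decoder and compute_loss below are hypothetical stand-ins for the decoding network and the loss described above, and the convergence criterion is a simplifying assumption.

```python
import torch

def train_decoder(decoder, texts, compute_loss, lr=1e-3, tol=1e-4, max_steps=10_000):
    optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
    previous = float("inf")
    for _ in range(max_steps):
        loss = compute_loss(decoder, texts)    # value of the decoding-network loss function
        optimizer.zero_grad()
        loss.backward()                        # gradients drive the model-parameter update
        optimizer.step()
        if abs(previous - loss.item()) < tol:  # stop once the loss value has converged
            break
        previous = loss.item()
    return decoder
```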
In the embodiment of the invention, after the topic keyword distribution tensor of the decoding network output by the initialization module of the decoding network is obtained, it is input into the vector conversion module, so that the vector conversion module obtains the weighted topic keyword context representation vector corresponding to the decoding network based on the topic keyword distribution tensor. The value of the loss function of the decoding network is then calculated based on the target text information and the weighted topic keyword context representation vector; if the value of the loss function does not converge, the model parameters of the decoding network are updated and the step of calculating the value of the loss function is repeated until it converges, yielding the trained decoding network. In this way, on the basis of mining topic keywords with stronger sparsity from the text information based on the decoding network, the contextual relevance between topic keywords can also be better reflected.
As an alternative embodiment, as shown in fig. 4, the vector conversion module includes: a sparse unit 404, a textualization unit 405 and a weighting unit 406.
Fig. 5 is a schematic flow chart of calculating the weighted topic keyword context representation vector corresponding to the decoding network in the topic mining method provided by the invention. As shown in fig. 5, inputting the topic keyword distribution tensor of the decoding network into the vector conversion module and obtaining the weighted topic keyword context representation vector corresponding to the decoding network output by the vector conversion module includes: inputting the topic keyword distribution tensor of the decoding network into the sparse unit, and performing sparsification processing on the topic keyword distribution tensor of the decoding network by the sparse unit, thereby obtaining the sparsified topic keyword distribution tensor output by the sparse unit;
specifically, the topic keyword distribution tensor of the decoding network output by the initialization module 402 is obtainedThe topic keyword distribution tensor of the decoding network can then be->The sparse unit 404 is input.
The sparse unit 404 can perform sparsification processing on the topic keyword distribution tensor β of the decoding network by numerical calculation, thereby obtaining and outputting the sparsified topic keyword distribution tensor β_sparse.
As an optional embodiment, the specific step of the sparse unit performing the sparse processing on the topic keyword distribution tensor of the decoding network includes: and based on the topic keyword distribution tensor of the decoding network, ordering the topic keywords corresponding to each topic in a descending order according to the order of the occurrence frequency from high to low, and obtaining a topic keyword sequence corresponding to each topic.
Specifically, since the topic keyword distribution tensor of the decoding network is β ∈ R^(K×V), the occurrence frequency of each topic keyword corresponding to the k-th topic can be obtained based on β_k, the distribution of the k-th topic over the vocabulary.
After the occurrence frequency of each topic keyword corresponding to the k-th topic is acquired, the topic keywords corresponding to the k-th topic are sorted in descending order of occurrence frequency, and the topic keyword sequence corresponding to the k-th topic is obtained.
And determining the topic keywords with the target number before ranking and the index positions of the topic keywords with the target number before ranking in the topic keywords corresponding to each topic based on the topic keyword sequence corresponding to each topic.
It should be noted that, in the embodiment of the invention, N may be used to denote the target number, where N is a positive integer. N can be dynamically configured according to actual demand, and can also be modified during training to balance the number of topic keywords in the contextual representation.
In the topic mining method provided by the invention, the number N of topic keywords can thus be configured dynamically as required, and different numbers of keywords can be used for different topics during training according to the scenario. Because the topic keywords are encoded by a pre-trained language model, the subsequent network units do not need to be adjusted according to the number of topic keywords, which improves the compatibility of the model components.
Specifically, after the topic keyword sequence corresponding to the k-th topic is obtained, the top-N topic keywords among the topic keywords corresponding to the k-th topic can be obtained as w_k = (w_{k,1}, w_{k,2}, ..., w_{k,N}), where w_{k,1} denotes the topic keyword ranked first among the topic keywords corresponding to the k-th topic, w_{k,2} denotes the topic keyword ranked second, and, by analogy, w_{k,N} denotes the topic keyword ranked N-th.
Similarly, after the topic keyword sequence corresponding to the k-th topic is obtained, the index positions of the top-N topic keywords can be obtained as idx_k = (idx_{k,1}, idx_{k,2}, ..., idx_{k,N}), where idx_{k,1} denotes the index position of the topic keyword ranked first among the topic keywords corresponding to the k-th topic, idx_{k,2} denotes the index position of the topic keyword ranked second, and, by analogy, idx_{k,N} denotes the index position of the topic keyword ranked N-th.
And carrying out sparsification processing on the topic keyword distribution tensor of the decoding network according to the non-zero element coordinate representation format of the sparse tensor based on the topic keywords of the target number before ranking in the topic keywords corresponding to each topic and the index positions of the topic keywords of the target number before ranking, so as to obtain the sparse topic keyword distribution tensor.
Specifically, after the top-N topic keywords w_k = (w_{k,1}, w_{k,2}, ..., w_{k,N}) corresponding to the k-th topic are obtained, their probability values are normalized to obtain the normalized top-N topic keyword probabilities p_k = (p_{k,1}, p_{k,2}, ..., p_{k,N}) corresponding to the k-th topic.
After the normalized top-N topic keyword probabilities p_k and the index positions idx_k of the top-N topic keywords corresponding to each topic are obtained, the topic keyword distribution tensor β is sparsified according to the general non-zero-element coordinate representation format of sparse tensors (Coordinates Sparse Tensor, COO), and the sparsified topic keyword distribution tensor β_sparse is obtained. The specific calculation formula is:
β_sparse = COO(indices, values), where indices = {(k, idx_{k,n}) | k = 1, ..., K; n = 1, ..., N} and values = {p_{k,n}}   (1)
where indices denotes the non-zero positions of the non-zero elements in the sparse tensor, and values denotes the values of the non-zero elements in the sparse tensor (i.e. the normalized probabilities of the topic keywords).
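In PyTorch, the COO-format sparsification of formula (1) can be sketched as follows (hypothetical sizes; the per-topic normalization of the kept values is shown explicitly):

```python
import torch

K, V, N = 3, 10, 4
beta = torch.rand(K, V)

values, cols = torch.topk(beta, k=N, dim=1)         # top-N probabilities and vocabulary indices
values = values / values.sum(dim=1, keepdim=True)   # per-topic normalization of the kept values

rows = torch.arange(K).repeat_interleave(N)         # topic index of every non-zero element
indices = torch.stack([rows, cols.reshape(-1)])     # non-zero coordinates, shape [2, K*N]

beta_sparse = torch.sparse_coo_tensor(indices, values.reshape(-1), size=(K, V))
```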
The sparsified topic keyword distribution tensor β_sparse is input into the textualization unit 405; the textualization unit 405 textualizes β_sparse to obtain the topic keyword text sequences, and then generates the context representation tensor of the topic keyword sequences based on the topic keyword sequences, thereby obtaining the context representation tensor of the topic keyword sequences output by the textualization unit 405.
Specifically, after the sparsified topic keyword distribution tensor β_sparse is obtained, β_sparse can be input into the textualization unit 405.
It should be noted that the topic keyword distribution tensor and the sparsified topic keyword distribution tensor have the same definition in each dimension, but the sparsified topic keyword distribution tensor contains only the retained non-zero elements. Therefore, in the case of a large vocabulary, the textualization unit 405 can textualize the sparsified topic keyword distribution tensor according to the mapping relationship (id2word) between dimensions and terms in the dictionary, that is, convert the sparsified topic keyword distribution tensor into a topic keyword text sequence consisting, for each topic, of the target number of topic keywords with the highest occurrence frequency.
After the textualization unit 405 obtains the topic keyword text sequence, the context representation tensor of the topic keyword text sequence can be obtained, for example, by means of numerical calculation or deep learning.
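A minimal sketch of the textualization step, assuming an id2word dictionary mapping vocabulary indices to terms (the helper name textualize is hypothetical):

def textualize(sparse_beta, id2word):
    # Convert the sparsified topic keyword distribution tensor into one keyword text sequence per topic.
    coalesced = sparse_beta.coalesce()
    topic_ids, word_ids = coalesced.indices()        # coordinates of the non-zero elements
    sequences = [[] for _ in range(sparse_beta.shape[0])]
    for topic, word in zip(topic_ids.tolist(), word_ids.tolist()):
        sequences[topic].append(id2word[word])       # map dimension index to dictionary term
    return [" ".join(words) for words in sequences]  # abstract "document" of keywords per topic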
As an alternative embodiment, the textualization unit comprises: a first pre-trained language model;
The specific step of the textualization unit generating a context representation tensor of the topic keyword sequence based on the topic keyword sequence comprises: inputting the topic keyword sequence into a first pre-trained language model;
and obtaining a context representation tensor of the topic keyword sequence output by the first pre-training language model.
Specifically, after the topic keyword text sequence is obtained, the topic keyword text sequence can be taken as an abstract document formed by a string of keywords and input into the first pre-training language model, so as to obtain the context representation tensor of the topic keyword text sequence output by the first pre-training language model.
As an alternative embodiment, the first pre-trained language model includes any one of a Transformer model, a Sentence-BERT model, and a BERT-base model.
Alternatively, the first pre-trained language model in embodiments of the present invention may be a Transformer model.
The Transformer model is a neural network model based on the Self-Attention mechanism, which can perform context learning by tracking the relations within sequence data.
The context representation tensor C of the topic keyword text sequence S can be expressed by the following formula:

C = Transformer(S)    (2)
optionally, the first pre-trained language model in the embodiment of the present invention may also be a Sentence-BERT model or a BERT-base model.
The Sentence-BERT model is a pre-trained BERT-based Siamese (twin) network capable of obtaining semantically meaningful sentence and passage vectors. Sentence-BERT was mainly proposed to solve the problems that BERT-based semantic similarity retrieval incurs a huge time cost and that its sentence representations are not suitable for unsupervised tasks such as clustering and sentence similarity calculation. Sentence-BERT uses a Siamese network structure to acquire vector representations of sentence pairs and then pre-trains a similarity model to obtain the final Sentence-BERT model.
The BERT-base model is a neural network model formed by a multi-layer bidirectional Transformer encoder.
It should be noted that, in the case where the first pre-training language model is a Sentence-BERT model or a BERT-base model, the context representation tensor of the topic keyword text sequence contains one vector of dimension D per topic keyword, where D denotes the vector output dimension of the first pre-trained language model.
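For illustration only, obtaining the context representation tensor with a generic pre-trained encoder could look as follows; the checkpoint name bert-base-uncased is an assumption (the embodiment merely requires a first pre-training language model), and aligning sub-word tokens back to the topic keywords is omitted here:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def contextualize(keyword_sequences):
    # keyword_sequences: one keyword text sequence (string) per topic
    batch = tokenizer(keyword_sequences, padding=True, truncation=True, return_tensors="pt")
    outputs = encoder(**batch)
    return outputs.last_hidden_state   # (topics, tokens, D) contextual vectors, D = model output dimension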
And inputting the sparse topic keyword distribution tensor and the context representation tensor of the topic keyword sequence into a weighting unit, and carrying out weighting processing on the context representation tensor of the topic keyword sequence by the weighting unit based on the sparse topic keyword distribution tensor, so as to obtain a weighted topic keyword context representation vector output by the weighting unit.
Specifically, after the context representation tensor of the topic keyword text sequence is obtained, the context representation tensor of the topic keyword text sequence is input into the weighting unit 406.
The weighting unit 406 weights the context representation tensor of the topic keyword text sequence by means of numerical calculation based on the sparsified topic keyword distribution tensor, so as to obtain and output the weighted topic keyword context representation vector corresponding to the decoding network.
As an alternative embodiment, the specific step of weighting the context representation tensor of the topic keyword sequence by the weighting unit 406 based on the sparsified topic keyword distribution tensor includes: acquiring a normalized weight tensor of the topic keywords based on the sparsified topic keyword distribution tensor.
Specifically, after the normalized weights of the top-ranked target number of topic keywords corresponding to each topic are obtained, the retained keyword weights of all topics can be normalized row by row along the keyword dimension, so as to obtain the normalized weight tensor W of the topic keywords.
And calculating the product of the normalized weight tensor of the topic keyword and the context representation tensor of the topic keyword sequence, and eliminating redundant dimensions in the product of the normalized weight tensor of the topic keyword and the context representation tensor of the topic keyword sequence after obtaining the product of the normalized weight tensor of the topic keyword and the context representation tensor of the topic keyword sequence, so as to obtain the weighted topic keyword context representation vector.
Specifically, after the normalized weight tensor W of the topic keywords is obtained, the product of the normalized weight tensor W and the context representation tensor C of the topic keyword text sequence can be calculated, and the redundant dimension eliminated, so as to obtain the weighted topic keyword context representation vector E; the specific calculation formula is as follows:

E_k = Σ_i W_k,i · C_k,i    (3)

where k indexes the topics and i runs over the retained topic keyword positions; the summation over i eliminates the redundant keyword dimension.
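A minimal sketch of the weighting step of formula (3), assuming the normalized weight tensor W has shape (K, t) and the context tensor C has shape (K, t, D):

import torch

def weight_contexts(W: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    # Weighted sum over the keyword dimension; the redundant dimension t is eliminated.
    return torch.einsum("kt,ktd->kd", W, C)   # (K, D) weighted topic keyword context representation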
It should be noted that the decoding network in the topic mining method provided by the invention is compatible with traditional topic models: the topic-word distribution tensor corresponding to a baseline model can be directly substituted for the topic keyword distribution tensor β of the decoding network.
As an alternative embodiment, calculating a value of a loss function of the decoding network based on the target text information and the weighted topic keyword context representation vector, comprises: and performing text conversion on the target text information, and converting the target text information into a word term number sequence.
Specifically, the embodiment of the invention can use a Tokenizer to convert the target text information into a term number sequence; a Tokenizer is a commonly used text processing tool that converts text into a sequence of numbers.
And inputting the vocabulary item numbering sequence into the second pre-training language model, and obtaining the text vector of the target text information output by the second pre-training language model.
Specifically, after the target text information is converted into the term number sequence, the term number sequence is input into the second pre-training language model, so as to obtain the text vector of the target text information output by the second pre-training language model.
As an alternative embodiment, the second pre-trained language model comprises: sentence-BERT model.
Alternatively, the second pre-trained language model in embodiments of the present invention may be a Sentence-BERT model.
After the term number sequence is input into the second pre-trained language model, the second pre-trained language model can obtain the text vector h of the target text information based on the term number sequence; the specific calculation formula is as follows:

h = SBERT(term number sequence)    (4)
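An illustrative sketch of formula (4) using the sentence-transformers library; the checkpoint name all-MiniLM-L6-v2 is an assumption, and the library performs the Tokenizer conversion internally:

from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")

def embed_target_text(texts):
    # texts: list of target text information strings; returns text vectors h of shape (N, D)
    return sbert.encode(texts, convert_to_tensor=True)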
and respectively inputting the text vectors into a first linear network and a second linear network, and acquiring a mean value vector of target text information subject distribution output by the first linear network and a variance vector of target text information subject distribution output by the second linear network.
Specifically, after the text vector h of the target text information is acquired, the text vector h is input into the first linear network and the second linear network respectively.
The first linear network can obtain and output the mean vector μ of the topic distribution of the target text information based on the text vector h; the specific calculation formula is as follows:

μ = f_μ(h)    (5)
The second linear network can obtain and output the variance vector σ of the topic distribution of the target text information based on the text vector h; the specific calculation formula is as follows:

σ = f_σ(h)    (6)
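The two linear networks of formulas (5) and (6) can be sketched as follows; representing the variance through its logarithm is an implementation assumption commonly used for numerical stability:

import torch.nn as nn

class TopicPosteriorHead(nn.Module):
    def __init__(self, d_model: int, n_topics: int):
        super().__init__()
        self.f_mu = nn.Linear(d_model, n_topics)       # first linear network -> mean vector of topic distribution
        self.f_logvar = nn.Linear(d_model, n_topics)   # second linear network -> (log-)variance vector

    def forward(self, h):
        return self.f_mu(h), self.f_logvar(h)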
The mean vector and the variance vector are re-parameterized and normalized to obtain the normalized topic component vector of the target text information.
Specifically, after the mean vector μ of the topic distribution of the target text information output by the first linear network and the variance vector σ output by the second linear network are acquired, the mean vector μ and the variance vector σ can be re-parameterized (Reparameterization) to obtain the topic component vector z of the target text information sampled from the variational distribution; the specific calculation formula is as follows:

z = μ + σ · ε    (7)
where ε denotes random noise sampled from a standard normal distribution, used to generate Gaussian-distributed samples.
After the topic component vector z of the target text information is obtained, the topic component vector z can be normalized to obtain the normalized topic component vector θ of the target text information; the specific calculation formula is as follows:

θ = softmax(z)    (8)
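A minimal sketch of the reparameterization and normalization of formulas (7) and (8), assuming the variance is carried as a log-variance vector:

import torch
import torch.nn.functional as F

def reparameterize_and_normalize(mu, logvar):
    eps = torch.randn_like(mu)                 # random noise sampled from a standard normal distribution
    z = mu + torch.exp(0.5 * logvar) * eps     # formula (7): topic component vector from the variational distribution
    return F.softmax(z, dim=-1)                # formula (8): normalized topic component vector theta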
and calculating the value of the loss function of the decoding network based on the topic component vector, the mean vector, the variance vector and the weighted topic keyword context representation vector after target text information normalization.
Specifically, after the normalized topic component vector θ of the target text information, the mean vector μ of the topic distribution of the target text information, the variance vector σ of the topic distribution of the target text information and the weighted topic keyword context representation vector E are obtained, the value of the loss function of the decoding network can be calculated by means of numerical calculation based on these quantities.
The embodiment of the invention can sparsify the dense topic keyword distribution tensor based on a quick-sort algorithm, express each topic as a text sequence formed by its topic keywords, and then vectorize the sparse topic keyword representation with a pre-training language model; the topic keyword context vectors produced by the pre-training language model contain contextual semantics and position information.
The embodiment of the invention also weights the topic keyword context vectors with the probability weights of the topic keyword distribution, which not only ensures that keywords with higher probability receive higher weight, but also lets the gradient propagate through the whole decoder network so that the model can be trained directly with gradient-based optimization methods.
It should be noted that the decoding network in the embodiment of the present invention is still differentiable, and the gradient may be back-propagated along the flow represented by the dashed arrow in fig. 4, so that the decoding network may be trained by a gradient-based optimization method.
According to the embodiment of the invention, the normalized topic component vector of the target text information, the mean value vector of the topic distribution of the target text information and the variance vector of the topic distribution of the target text information are obtained based on the target text information, the weighted topic keyword context representation vector is obtained based on the topic keyword distribution tensor of the decoding network, and further the value of the loss function of the decoding network is obtained by calculation based on the normalized topic component vector of the target text information, the mean value vector of the topic distribution of the target text information, the variance vector of the topic distribution of the target text information and the weighted topic keyword context representation vector, so that the topic keyword distribution tensor of the encoding network can be thinned through the calculation of the value of the loss function, and the context relevance among topic keywords in the topic keyword distribution tensor of the encoding network is improved.
Fig. 6 is one of application scenarios of the topic mining method provided by the present invention. As shown in fig. 6, as an alternative embodiment, before calculating the value of the loss function of the decoding network based on the normalized topic component vector, the mean vector, the variance vector, and the weighted topic keyword context representation vector of the target text information, the method further includes: and carrying out word bagging processing on the target text information to obtain word bagging representation vectors of the target text information.
It should be noted that this part of the topic mining method in the embodiment of the invention is suitable for scenarios where the vocabulary of the target text information is small (no more than 20,000 terms).
Specifically, the embodiment of the invention denotes the bag-of-words representation vector of the target text information as x_bow.
The bag-of-words processing comprises the following steps: word segmentation, in which the target text information is split into sets of words or terms; vocabulary construction, in which all words appearing in the target text information are counted and a vocabulary is built; and vectorization, in which, for each text in the target text information, a vector representation with the same length as the vocabulary is constructed according to the number of occurrences or the frequency of the vocabulary words. Text vectorization converts all texts into vector form, each vector representing one text and each element representing the number or frequency of occurrences of the corresponding word.
The result of the bag-of-words processing is to convert the target text information into the digitized bag-of-words representation vector x_bow, so that the target text information can be applied to various machine learning algorithms, such as classification, clustering, regression, and the like.
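For illustration, the word segmentation, vocabulary construction and vectorization steps can be sketched with scikit-learn; CountVectorizer is one possible choice, not the only one:

from sklearn.feature_extraction.text import CountVectorizer

def bag_of_words(texts):
    vectorizer = CountVectorizer()              # word segmentation and vocabulary construction
    x_bow = vectorizer.fit_transform(texts)     # (N texts, V vocabulary terms) count matrix
    return x_bow.toarray(), vectorizer.vocabulary_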
Calculating a value of a loss function of a decoding network based on a subject component vector, a mean vector, a variance vector and a weighted subject keyword context representation vector of target text information after normalization, comprising: reconstructing the bag-of-words representation vector of the target text information based on the normalized topic component vector and the weighted topic keyword context representation vector of the target text information to obtain the bag-of-words representation vector of the target text information after reconstruction.
Specifically, after the bag-of-words representation vector x_bow of the target text information is obtained, the bag-of-words representation vector can be reconstructed by means of numerical calculation, deep learning and the like based on the normalized topic component vector θ of the target text information and the weighted topic keyword context representation vector E, so as to obtain the reconstructed bag-of-words representation vector of the target text information.
As an alternative embodiment, reconstructing the bag-of-words representation vector of the target text information based on the normalized topic component vector and the weighted topic keyword context representation vector of the target text information to obtain the reconstructed bag-of-words representation vector of the target text information, including: and inputting the context representation vector of the weighted subject keyword into the multi-layer perceptron neural network model to obtain a calculation result output by the multi-layer perceptron neural network model.
Specifically, after the weighted topic keyword context representation vector E is input into the multi-layer perceptron neural network model, the calculation result output by the multi-layer perceptron (Multilayer Perceptron, MLP) neural network model can be obtained.
The multi-layer perceptron neural network model is a widely used neural network model comprising at least one hidden layer and an output layer. By training its weight parameters, the multi-layer perceptron can learn a nonlinear function, and it has the advantages of higher calculation accuracy, better generalization capability and a more flexible training mode.
And calculating the product of the calculated result and the subject component vector normalized by the target text information, and taking the product as a word bag representation vector reconstructed by the target text information.
Specifically, after the calculation result output by the multi-layer perceptron neural network model is obtained, the product of this calculation result and the normalized topic component vector θ of the target text information is used as the reconstructed bag-of-words representation vector of the target text information; the specific calculation formula is as follows:

x̂_bow = θ · MLP(E)    (9)
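An illustrative sketch of formula (9); the hidden layer size and the final softmax over the vocabulary are assumptions added here so that the reconstructed vector is a proper distribution:

import torch
import torch.nn as nn

class BowReconstructor(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, vocab_size))

    def forward(self, theta, E):
        # theta: (N, K) normalized topic component vectors; E: (K, D) weighted keyword contexts
        word_scores = self.mlp(E)                            # MLP calculation result, shape (K, V)
        return torch.softmax(theta @ word_scores, dim=-1)    # formula (9): reconstructed bag-of-words vector (N, V)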
and calculating the value of the loss function based on the normalized topic component vector, the average value vector of the target text information, the word bag representation vector of the target text information and the reconstructed word bag representation vector of the target text information.
Specifically, after the reconstructed bag-of-words representation vector of the target text information is obtained, the value of the loss function of the decoding network can be calculated by means of numerical calculation based on the mean vector μ of the topic distribution of the target text information, the variance vector σ of the topic distribution of the target text information, the bag-of-words representation vector x_bow of the target text information and the reconstructed bag-of-words representation vector.
As an alternative embodiment, the loss function of the decoding network comprises a reconstruction loss function and a regularization loss function.
Specifically, in the embodiment of the invention, the loss function of the decoding network includes a reconstruction loss function and a regularization loss function.
Calculating the value of the loss function based on the mean vector, the variance vector, the bag-of-words representation vector of the target text information and the reconstructed bag-of-words representation vector of the target text information comprises: calculating the value of the reconstruction loss function based on the bag-of-words representation vector of the target text information and the reconstructed bag-of-words representation vector, and calculating the value of the regularization loss function based on the mean vector and the variance vector.
Specifically, the reconstruction loss function can be expressed by the following formula:

L_rec = CE(x_bow, x̂_bow)    (10)

The regularization loss function can be expressed by the following formula:

L_reg = KL(N(μ, σ²) ‖ N(0, I))    (11)
After the product of the value of the regularization loss function and the first target weight is obtained, the sum of this product and the value of the reconstruction loss function is calculated as the value of the loss function of the decoding network.
Specifically, the loss function of the decoding network can be expressed by the following formula:

L = L_rec + λ₁ · L_reg    (12)

where λ₁ denotes the first target weight.
It should be noted that, in the embodiment of the present invention, the reconstruction loss function here is a cross-entropy (CE) loss defined in word-frequency (bag-of-words, BOW) space.
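A sketch of the loss of formulas (10)-(12) under the standard variational assumptions (cross-entropy reconstruction in word-frequency space and a Gaussian KL regularizer); the closed forms below are not reproduced from the original formulas and are given only for illustration:

import torch

def decoding_network_loss(x_bow, x_bow_hat, mu, logvar, lambda_1):
    rec = -(x_bow * torch.log(x_bow_hat + 1e-10)).sum(dim=1).mean()          # formula (10): CE in BOW space
    reg = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()   # formula (11): KL divergence
    return rec + lambda_1 * reg                                              # formula (12)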
As an alternative embodiment, the first target weight is dynamically adjusted based on a cyclical annealing strategy.
Specifically, since a Transformer network with strong fitting capability is used in the decoding network, the KL-vanishing problem may occur, degrading the performance of the decoding network. To address this problem, the embodiment of the present invention dynamically adjusts the first target weight λ₁ of the regularization loss function based on a cyclical annealing (Cyclical annealing) strategy.
Specifically, the cyclical annealing parameters are determined according to actual requirements: the cycle length (cycle_length), the rising proportion (prop) and the maximum number of cycles (maxiter). Within each cycle, the first target weight λ₁ is increased from 0 as a linear function with fixed slope and zero bias during the first prop fraction of the cycle until it reaches 1, and is then kept at 1 for the remaining steps of the cycle; after cycle_length steps, λ₁ is reset to zero and the next cycle is started. The whole cyclical annealing process lasts for maxiter cycles in total, and after the annealing process reaches maxiter cycles, λ₁ is kept at 1.
It should be noted that the cyclical annealing strategy is only effective during the training of the coding network. The parameters of the model may be optimized with respect to the loss function using the gradient-based Adam optimization algorithm.
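One possible parameterization of the cyclical annealing schedule described above; the parameter names (cycle_length, prop, maxiter) follow the description, but their concrete interaction here is an assumption:

def cyclical_annealing_weight(step: int, cycle_length: int, prop: float, maxiter: int) -> float:
    # After maxiter full cycles the weight stays at 1.
    if step >= maxiter * cycle_length:
        return 1.0
    position = (step % cycle_length) / cycle_length   # position inside the current cycle, in [0, 1)
    return min(position / prop, 1.0)                  # linear ramp over the first `prop` fraction, then held at 1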
According to the embodiment of the invention, the value of the loss function of the decoding network is obtained by calculating the value of the loss function of the decoding network based on the word bag representation vector after target text information reconstruction, the mean value vector of target text information topic distribution and the variance vector of target text information topic distribution after the word bag representation vector after target text information reconstruction is obtained based on the topic component vector after target text information normalization and the weighted topic keyword context representation vector, so that the topic keyword distribution tensor of the encoding network can be thinned under the scene with smaller word list of the target text information, and the context relevance among topic keywords in the topic keyword distribution tensor of the encoding network is improved.
Fig. 7 is a second application scenario of the topic mining method provided by the present invention. As shown in fig. 7, as an alternative embodiment, calculating a value of a loss function of the decoding network based on the normalized topic component vector, the mean vector, the variance vector, and the weighted topic keyword context representation vector of the target text information includes: reconstructing the text vector of the target text information based on the normalized topic component vector and the weighted topic keyword context representation vector of the target text information to obtain the reconstructed text vector of the target text information.
It should be noted that, in the case where the vocabulary of the text is too large (more than 20,000 terms), the signal-to-noise ratio of the bag-of-words representation will be low, and training the coding network based on the bag-of-words representation vector is no longer suitable. The topic mining method in the embodiment of the invention is therefore also applicable to scenarios where the vocabulary of the target text information is large (more than 20,000 terms).
Specifically, based on the normalized topic component vector θ of the target text information and the weighted topic keyword context representation vector E, the text vector h of the target text information can be reconstructed by means of numerical calculation or deep learning, so as to obtain the reconstructed text vector of the target text information.
As an alternative embodiment, reconstructing the text vector of the target text information based on the normalized topic component vector and the weighted topic keyword context representation vector of the target text information to obtain the reconstructed text vector of the target text information, including: and calculating the product of the normalized topic component vector of the target text information and the weighted topic keyword context representation vector to serve as a text vector after the target text information is reconstructed.
Specifically, the reconstructed text vector of the target text information can be calculated by the following formula:

ĥ = θ · E    (13)
and calculating the value of the loss function based on the mean value vector, the variance vector, the text vector of the target text information and the text vector after the reconstruction of the target text information.
Specifically, after the reconstructed text vector of the target text information is obtained, the value of the loss function of the decoding network can be calculated by means of numerical calculation based on the mean vector μ, the variance vector σ, the text vector h of the target text information and the reconstructed text vector ĥ.
As an alternative embodiment, the loss function of the decoding network comprises a reconstruction loss function and a regularization loss function.
Specifically, in the embodiment of the invention, the loss function of the decoding network includes a reconstruction loss function and a regularization loss function.
Calculating the value of the loss function of the decoding network based on the mean vector, the variance vector, the text vector of the target text information and the reconstructed text vector of the target text information includes: calculating the value of the reconstruction loss function based on the text vector of the target text information and the reconstructed text vector, and calculating the value of the regularization loss function based on the mean vector and the variance vector.
Specifically, the reconstruction loss function can be expressed by the following formula:

L_rec = MSE(h, ĥ)    (14)

The regularization loss function can be expressed by the following formula:

L_reg = KL(N(μ, σ²) ‖ N(0, I))    (15)
After the product of the value of the regularization loss function and the second target weight is obtained, the sum of this product and the value of the reconstruction loss function is calculated as the value of the loss function of the decoding network.
Specifically, the loss function of the decoding network can be expressed by the following formula:

L = L_rec + λ₂ · L_reg    (16)

where λ₂ denotes the second target weight.
It should be noted that, in the embodiment of the present invention, the reconstruction loss function here is a mean-squared-error (MSE) loss defined in semantic space.
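For comparison with the small-vocabulary case, a sketch of the semantic-space loss of formulas (14)-(16); the MSE and KL closed forms below are assumptions consistent with the description rather than reproductions of the original formulas:

import torch
import torch.nn.functional as F

def semantic_space_loss(h, h_hat, mu, logvar, lambda_2):
    rec = F.mse_loss(h_hat, h)                                               # formula (14): MSE in semantic space
    reg = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()   # formula (15): KL divergence
    return rec + lambda_2 * reg                                              # formula (16)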
As an alternative embodiment, the second target weight is dynamically adjusted based on a cyclical annealing strategy.
It should be noted that, in the embodiment of the present invention, the process of dynamically adjusting the second target weight λ₂ based on the cyclical annealing strategy is the same as the process of dynamically adjusting the first target weight λ₁ based on the cyclical annealing strategy, and is not described in detail again.
According to the embodiment of the invention, the value of the loss function of the decoding network is obtained by calculating the text vector after the target text information is reconstructed based on the normalized topic component vector and the weighted topic keyword context representation vector of the target text information, the text vector after the target text information is reconstructed, the mean value vector of the topic distribution of the target text information and the variance vector of the topic distribution of the target text information, so that the topic keyword distribution tensor of the coding network can be thinned under the scene with larger vocabulary of the target text information, and the context relevance among topic keywords in the topic keyword distribution tensor of the coding network is improved.
The topic mining method provided by the invention can be applied to different typical scenarios; in the scenario where the vocabulary of the target text information is large (more than 20,000 terms), a finely preprocessed bag-of-words vector does not need to be provided as input, and the original text and the topic keyword sequence can be input directly, which further lowers the threshold for applying the decoding network.
Fig. 8 is a schematic flow chart of self-supervision training on a coding network based on target text information in the subject mining method provided by the invention. The flow of self-supervised training of the coding network based on the target text information is shown in fig. 8.
The invention provides a topic mining method based on a pre-training model. According to the topic mining method provided by the invention, the dynamic and configurable modeling of the topic keyword context information is realized by redesigning the decoding network and reconstructing the loss item, so that the topic mining capability of the decoding network can be improved. The topic mining method provided by the invention can realize dynamic and configurable mining of sparse topics and topic keywords.
In addition, because the interpretability of the pre-training language model is poorer than that of probabilistic statistical models, the topic mining method provided by the invention provides a coding network in which the pre-training language model is combined with the sparse topic tensor representation, which can be used as a basic component for interpreting self-encoding pre-training language models.
The topic mining method provided by the invention is used for representing topic keywords as text sequences and inputting the text sequences as a pre-training language model to obtain topic keyword context representation, and the method is also suitable for representing some abstractions such as user tag sets, concept keyword sets and the like, and has wide application prospects in the fields of text mining, recommendation systems and information retrieval.
Fig. 9 is a schematic structural view of the topic mining apparatus provided by the present invention. The topic mining apparatus provided by the present invention will be described below with reference to fig. 9; the topic mining apparatus described below and the topic mining method provided by the present invention described above may be referred to correspondingly with each other. As shown in fig. 9, the apparatus includes: a text acquisition module 901, a model training module 902, and a topic mining module 903.
A text acquisition module 901, configured to acquire target text information;
the model training module 902 is configured to perform self-supervision training on the decoding network based on the target text information, so that the topic keyword distribution tensor of the decoding network is iteratively updated from a dense tensor to a sparse tensor;
the topic mining module 903 is configured to obtain a topic corresponding to the target text information and/or a topic keyword corresponding to the topic based on a topic keyword distribution tensor of the trained decoding network after the trained decoding network is obtained;
wherein the decoding network is constructed based on a pre-trained language model.
Specifically, the text acquisition module 901, the model training module 902, and the topic mining module 903 are electrically connected.
According to the topic mining device, the decoding network is subjected to self-supervision training based on the target text information, after the trained decoding network is obtained, topic keywords corresponding to topics in the target text information are obtained based on the model parameters of the trained decoding network, and on the basis of ensuring the accuracy and the comprehensiveness of topic mining, topics and/or topic keywords with stronger sparsity can be obtained based on the decoding network mining, and the decoding network can be replaced and migrated to other topic models, so that the operation, calculation and storage efficiency of the topic models can be improved while the interpretability of the topic models is improved.
Fig. 10 illustrates a physical structure diagram of an electronic device, as shown in fig. 10, which may include: a processor 1010, a communication interface (Communications Interface) 1020, a memory 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. Processor 1010 may invoke logic instructions in memory 1030 to perform a subject matter mining method comprising: acquiring target text information; performing self-supervision training on the decoding network based on the target text information so as to enable the topic keyword distribution tensor of the decoding network to be iteratively updated from a dense tensor to a sparse tensor; after the trained decoding network is obtained, obtaining a theme corresponding to the target text information and/or a theme keyword corresponding to the theme based on a theme keyword distribution tensor of the trained decoding network; wherein the decoding network is constructed based on a pre-trained language model.
Further, the logic instructions in the memory 1030 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the subject mining method provided by the methods above, the method comprising: acquiring target text information; performing self-supervision training on the decoding network based on the target text information so as to enable the topic keyword distribution tensor of the decoding network to be iteratively updated from a dense tensor to a sparse tensor; after the trained decoding network is obtained, obtaining a theme corresponding to the target text information and/or a theme keyword corresponding to the theme based on a theme keyword distribution tensor of the trained decoding network; wherein the decoding network is constructed based on a pre-trained language model.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the subject mining method provided by the above methods, the method comprising: acquiring target text information; performing self-supervision training on the decoding network based on the target text information so as to enable the topic keyword distribution tensor of the decoding network to be iteratively updated from a dense tensor to a sparse tensor; after the trained decoding network is obtained, obtaining a theme corresponding to the target text information and/or a theme keyword corresponding to the theme based on a theme keyword distribution tensor of the trained decoding network; wherein the decoding network is constructed based on a pre-trained language model.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (20)

1. A method of topic mining, comprising:
acquiring target text information;
performing self-supervision training on a decoding network based on the target text information, so that the topic keyword distribution tensor of the decoding network is iteratively updated from a dense tensor to a sparse tensor;
after the trained decoding network is obtained, obtaining a theme corresponding to the target text information and/or a theme keyword corresponding to the theme based on a theme keyword distribution tensor of the trained decoding network;
wherein the decoding network is constructed based on a pre-trained language model.
2. The topic mining method of claim 1, wherein the decoding network includes: an initialization module and a vector conversion module;
The self-supervision training is performed on the decoding network based on the target text information, and the self-supervision training comprises the following steps:
obtaining model parameters of the decoding network output by the initialization module, wherein the model parameters comprise topic keyword distribution tensors, and the initialization module is used for initializing the model parameters of the decoding network;
inputting the topic keyword distribution tensor of the decoding network into the vector conversion module, and obtaining a weighted topic keyword context representation vector corresponding to the decoding network output by the vector conversion module;
calculating the value of a loss function of the decoding network based on the target text information and the weighted topic keyword context representation vector;
and under the condition that the value of the loss function is not converged, updating the model parameters of the decoding network, and repeating the step of calculating the value of the loss function of the decoding network until the value of the loss function is converged, thereby obtaining the trained decoding network.
3. The topic mining method of claim 2, wherein the vector conversion module includes: a sparse unit, a textualization unit and a weighting unit;
inputting the topic keyword distribution tensor of the decoding network into the vector conversion module, and obtaining a weighted topic keyword context representation vector corresponding to the decoding network output by the vector conversion module, wherein the weighted topic keyword context representation vector comprises:
Inputting the topic keyword distribution tensor of the decoding network into the sparse unit, and performing sparsification processing on the topic keyword distribution tensor of the decoding network by the sparse unit to further obtain a sparsified topic keyword distribution tensor output by the sparse unit;
inputting the sparse topic keyword distribution tensor into a textualization unit, textualizing the sparse topic keyword distribution tensor by the textualization unit to obtain a topic keyword sequence, and generating a context representation tensor of the topic keyword sequence based on the topic keyword sequence to further obtain the context representation tensor of the topic keyword sequence output by the textualization unit;
and inputting the sparse topic keyword distribution tensor and the context representation tensor of the topic keyword sequence into the weighting unit, and carrying out weighting processing on the context representation tensor of the topic keyword sequence by the weighting unit based on the sparse topic keyword distribution tensor, so as to obtain a weighted topic keyword context representation vector output by the weighting unit.
4. The topic mining method according to claim 3, wherein the specific step of the sparse unit performing the sparse processing on the topic keyword distribution tensor of the decoding network includes:
Based on the topic keyword distribution tensor of the decoding network, ordering the topic keywords corresponding to each topic in a descending order according to the order of the occurrence frequency from high to low, and obtaining a topic keyword sequence corresponding to each topic;
determining the topic keywords with the target quantity before ranking in the topic keywords corresponding to each topic and the index positions of the topic keywords with the target quantity before ranking based on the topic keyword sequence corresponding to each topic;
and based on the topic keywords of the target number before ranking in the topic keywords corresponding to each topic and the index positions of the topic keywords of the target number before ranking, performing sparsification processing on the topic keyword distribution tensor of the decoding network according to the non-zero element coordinate representation format of the sparse tensor to obtain the sparsified topic keyword distribution tensor.
5. The topic mining method of claim 4, wherein the textualization unit includes: a first pre-trained language model;
the step of generating the context representation tensor of the topic keyword sequence by the textualization unit based on the topic keyword sequence comprises the following specific steps:
Inputting the topic keyword sequence into a first pre-trained language model;
and acquiring a context representation tensor of the topic keyword sequence output by the first pre-training language model.
6. The topic mining method of claim 5, wherein the first pre-trained language model comprises any one of a Transformer model, a Sentence-BERT model, and a BERT-base model.
7. The topic mining method according to claim 3, wherein the weighting unit performs a specific step of weighting the context representation tensor of the topic keyword sequence based on the thinned topic keyword distribution tensor, including:
acquiring a normalized weight tensor of the topic keywords based on the sparse topic keyword distribution tensor;
and calculating the product of the normalized weight tensor of the topic keyword and the context representation tensor of the topic keyword sequence, and eliminating redundant dimensions in the product of the normalized weight tensor of the topic keyword and the context representation tensor of the topic keyword sequence after obtaining the product result of the normalized weight tensor of the topic keyword and the context representation tensor of the topic keyword sequence, so as to obtain the weighted topic keyword context representation vector.
8. The topic mining method of claim 2, wherein said calculating a value of a penalty function of the decoding network based on the target text information and the weighted topic keyword context representation vector comprises:
performing text conversion on the target text information, and converting the target text information into a word term number sequence;
inputting the vocabulary item numbering sequence into a second pre-training language model, and obtaining a text vector of the target text information output by the second pre-training language model;
respectively inputting the text vector into a first linear network and a second linear network, and acquiring a mean vector of the target text information subject distribution output by the first linear network and a variance vector of the target text information subject distribution output by the second linear network;
performing re-parameterization and normalization calculation on the mean vector and the variance vector to obtain a subject component vector normalized by the target text information;
and calculating the value of the loss function of the decoding network based on the normalized topic component vector of the target text information, the mean vector, the variance vector and the weighted topic keyword context representation vector.
9. The topic mining method of claim 8, wherein prior to calculating the value of the loss function of the decoding network based on the normalized topic component vector, the mean vector, the variance vector, and the weighted topic keyword context representation vector of the target text information, the method further comprises:
performing word bagging processing on the target text information to obtain word bagging representation vectors of the target text information;
the calculating the value of the loss function of the decoding network based on the normalized topic component vector, the mean vector, the variance vector and the weighted topic keyword context representation vector of the target text information comprises the following steps:
reconstructing a bag-of-word representation vector of the target text information based on the normalized topic component vector of the target text information and the weighted topic keyword context representation vector to obtain a bag-of-word representation vector of the target text information after reconstruction;
and calculating the value of the loss function based on the mean vector, the variance vector, the bag-of-words representation vector of the target text information and the bag-of-words representation vector of the reconstructed target text information.
10. The method according to claim 9, wherein reconstructing the bag-of-words representation vector of the target text information based on the normalized topic component vector of the target text information and the weighted topic keyword context representation vector to obtain the reconstructed bag-of-words representation vector of the target text information comprises:
inputting the context representation vector of the weighted subject keyword into a multi-layer perceptron neural network model, and obtaining a calculation result output by the multi-layer perceptron neural network model;
and calculating the product of the calculated result and the subject component vector normalized by the target text information to be used as a bag-of-word representation vector reconstructed by the target text information.
11. The topic mining method of claim 9, wherein the loss function of the decoding network includes a reconstruction loss function and a regularization loss function;
the calculating the value of the loss function based on the mean vector, the variance vector, the bag-of-words representation vector of the target text information and the bag-of-words representation vector of the reconstructed target text information includes:
calculating the value of the reconstruction loss function based on the word bag representation vector of the target text information and the word bag representation vector after the target text information is reconstructed, and calculating the value of the regularization loss function based on the mean vector and the variance vector;
and after obtaining the product of the value of the regularization loss function and the first target weight, calculating the sum of this product and the value of the reconstruction loss function as the value of the loss function of the decoding network.
12. The topic mining method of claim 11, wherein the first target weight is dynamically adjusted based on a cyclical annealing strategy.
13. The topic mining method of claim 8, wherein the second pre-trained language model includes: sentence-BERT model.
14. The topic mining method of claim 8, wherein the calculating a value of a loss function of the decoding network based on the normalized topic component vector, the mean vector, the variance vector, and the weighted topic keyword context representation vector includes:
reconstructing the text vector of the target text information based on the normalized topic component vector of the target text information and the weighted topic keyword context representation vector to obtain the text vector of the reconstructed target text information;
And calculating the value of the loss function based on the mean value vector, the variance vector, the text vector of the target text information and the text vector after the target text information is reconstructed.
15. The topic mining method of claim 14, wherein the loss function of the decoding network includes a reconstruction loss function and a regularization loss function;
the calculating the value of the loss function based on the mean vector, the variance vector, the text vector of the target text information and the text vector after the reconstruction of the target text information comprises the following steps:
calculating the value of the reconstruction loss function based on the text vector of the target text information and the text vector after the target text information is reconstructed, and calculating the value of the regularization loss function based on the mean vector and the variance vector;
and after obtaining the product of the value of the regularization loss function and the second target weight, calculating the sum of this product and the value of the reconstruction loss function as the value of the loss function of the decoding network.
16. The topic mining method of claim 15, wherein the second target weight is dynamically adjusted based on a cyclical annealing strategy.
17. The method for mining a subject according to any one of claims 1 to 16, wherein the acquiring the target text information includes:
acquiring original text information;
and carrying out data processing on the original text information, removing abnormal information in the original text information, and obtaining the target text information, wherein the data processing comprises at least one of spelling check, part-of-speech analysis and spelling error correction.
18. A topic mining device, comprising:
the text acquisition module is used for acquiring target text information;
the model training module is used for performing self-supervision training on the decoding network based on the target text information so as to enable the topic keyword distribution tensor of the decoding network to be iteratively updated into a sparse tensor from a dense tensor;
the topic mining module is used for acquiring topics corresponding to the target text information and/or topic keywords corresponding to the topics based on topic keyword distribution tensor of the trained decoding network after the trained decoding network is acquired;
Wherein the decoding network is constructed based on a pre-trained language model.
19. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the subject mining method of any of claims 1 to 17 when the program is executed by the processor.
20. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the subject mining method according to any of claims 1 to 17.
CN202311208307.6A 2023-09-19 2023-09-19 Theme mining method and device, electronic equipment and storage medium Active CN116932686B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311208307.6A CN116932686B (en) 2023-09-19 2023-09-19 Theme mining method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311208307.6A CN116932686B (en) 2023-09-19 2023-09-19 Theme mining method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116932686A true CN116932686A (en) 2023-10-24
CN116932686B CN116932686B (en) 2024-01-23

Family

ID=88377559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311208307.6A Active CN116932686B (en) 2023-09-19 2023-09-19 Theme mining method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116932686B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117436443A (en) * 2023-12-19 2024-01-23 苏州元脑智能科技有限公司 Model construction method, text generation method, device, equipment and medium


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291181A (en) * 2018-12-10 2020-06-16 百度(美国)有限责任公司 Representation learning for input classification via topic sparse autoencoder and entity embedding
CN110941721A (en) * 2019-09-28 2020-03-31 国家计算机网络与信息安全管理中心 Short text topic mining method and system based on variational self-coding topic model
KR20220131050A (en) * 2021-03-19 2022-09-27 경희대학교 산학협력단 System for summarizing documents based on topic keyword and method thereof
US20230169271A1 (en) * 2021-11-30 2023-06-01 Adobe Inc. System and methods for neural topic modeling using topic attention networks
CN114692605A (en) * 2022-04-20 2022-07-01 东南大学 Keyword generation method and device fusing syntactic structure information
CN114817508A (en) * 2022-05-27 2022-07-29 重庆理工大学 Sparse graph and multi-hop attention fused session recommendation system
CN115293817A (en) * 2022-08-17 2022-11-04 广州华多网络科技有限公司 Advertisement text generation method and device, equipment, medium and product thereof
CN116306584A (en) * 2023-02-23 2023-06-23 华中师范大学 Neural topic modeling method, system and equipment with strong interpretability based on enhanced network
CN116415593A (en) * 2023-02-28 2023-07-11 北京市农林科学院 Research front identification method, system, electronic equipment and storage medium
CN116628192A (en) * 2023-03-16 2023-08-22 中国电建集团昆明勘测设计研究院有限公司 Text theme representation method based on Seq2Seq-Attention
CN116383521A (en) * 2023-05-19 2023-07-04 苏州浪潮智能科技有限公司 Subject word mining method and device, computer equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
来日凭君发遣: "[Keyword Mining] Combining LDA + Word2Vec + TextRank for Keyword Extraction", HTTPS://BLOG.CSDN.NET/QQ_42491242/ARTICLE/DETAILS/105006758, pages 1 - 8 *
沈一辉: "Variational Autoencoders: A New Method for Generating Chinese Titles", HTTPS://BAIJIAHAO.BAIDU.COM/S?ID=1770051137725375690, pages 1 - 3 *
脱婷: "Research on Short-Text Representation and Classification Based on Context and Sparsity Constraints", China Master's Theses Full-text Database, Information Science and Technology, no. 6, pages 13 - 22 *
若年封尘: "An Introduction to NLP Text Summarization", HTTPS://BLOG.CSDN.NET/ZAG666/ARTICLE/DETAILS/128249011, pages 1 - 14 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117436443A (en) * 2023-12-19 2024-01-23 苏州元脑智能科技有限公司 Model construction method, text generation method, device, equipment and medium
CN117436443B (en) * 2023-12-19 2024-03-15 苏州元脑智能科技有限公司 Model construction method, text generation method, device, equipment and medium

Also Published As

Publication number Publication date
CN116932686B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN107291693B (en) Semantic calculation method for improved word vector model
US20210141798A1 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
JP2021528796A (en) Neural network acceleration / embedded compression system and method using active sparsification
CN110851604B (en) Text classification method and device, electronic equipment and storage medium
US11397892B2 (en) Method of and system for training machine learning algorithm to generate text summary
CN108419094A (en) Method for processing video frequency, video retrieval method, device, medium and server
CN110879938A (en) Text emotion classification method, device, equipment and storage medium
CN116932686B (en) Theme mining method and device, electronic equipment and storage medium
CN114676234A (en) Model training method and related equipment
CN111475622A (en) Text classification method, device, terminal and storage medium
US11694034B2 (en) Systems and methods for machine-learned prediction of semantic similarity between documents
CN114896983A (en) Model training method, text processing device and computer equipment
CN113204640B (en) Text classification method based on attention mechanism
Sanjanaashree et al. Joint layer based deep learning framework for bilingual machine transliteration
CN116680575B (en) Model processing method, device, equipment and storage medium
CN113535912A (en) Text association method based on graph convolution network and attention mechanism and related equipment
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN114757310B (en) Emotion recognition model and training method, device, equipment and readable storage medium thereof
US20240037335A1 (en) Methods, systems, and media for bi-modal generation of natural languages and neural architectures
US20230281400A1 (en) Systems and Methods for Pretraining Image Processing Models
CN112035662B (en) Text processing method and device, computer equipment and storage medium
CN114692610A (en) Keyword determination method and device
CN115066690A (en) Search normalization-activation layer architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant