CN111176800A - Training method and device of document topic generation model - Google Patents

Training method and device of document topic generation model

Info

Publication number
CN111176800A
Authority
CN
China
Prior art keywords
word
document
processes
matrix
mpi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910605567.4A
Other languages
Chinese (zh)
Inventor
涂小刚
于东海
孙仕杰
李永安
高品
李本利
魏万敬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910605567.4A
Publication of CN111176800A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806: Task transfer initiation or dispatching
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/54: Interprogram communication
    • G06F9/546: Message passing systems or structures, e.g. queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a training method and a training device for a document topic generation model. The method comprises the following steps: starting a Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task; the MPI process creates an input named pipe and an output named pipe; the Spark task writes input data to the MPI process through the input named pipe, wherein the input data is used for training the LDA; the MPI process completes the training of the LDA using the input data to obtain a training result; and the MPI process sends the training result to the Spark task through the output named pipe. The invention solves the technical problems of slow training and high memory consumption in document topic generation models.

Description

Training method and device of document topic generation model
Technical Field
The invention relates to the field of computers, and in particular to a training method and a training device for a document topic generation model.
Background
LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model comprising word, topic and document layers, and an unsupervised machine learning technique. It is a classic algorithm in the field of machine learning and can be used to identify the topic information hidden in a large-scale document set or corpus. In recent years, with the explosion of data volume and the increase of model complexity in machine learning, a single machine can no longer cope with such workloads due to resource limitations. Various distributed implementations of LDA have appeared in industry: for example, LDA may be implemented with Gibbs sampling, or implemented on GraphX based on the Expectation-Maximization (EM) algorithm. However, these implementations train slowly, consume a large amount of memory, and perform poorly.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for training a document topic generation model, which at least solve the technical problems of slow training and high memory consumption in document topic generation models.
According to an aspect of the embodiments of the present invention, there is provided a training method for a document topic generation model, including: starting a Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task; the MPI process creates an input named pipe and an output named pipe; the Spark task writes input data to the MPI process through the input named pipe, wherein the input data is used for training the LDA; the MPI process completes the training of the LDA using the input data to obtain a training result; and the MPI process sends the training result to the Spark task through the output named pipe.
According to another aspect of the embodiments of the present invention, there is also provided a training apparatus for a document topic generation model, including: a generating module, configured to start a Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task; a creating module, configured to create an input named pipe and an output named pipe through the MPI process; an input module, configured to control the Spark task to write input data to the MPI process through the input named pipe, wherein the input data is used for training the LDA; a training module, configured to complete the training of the LDA using the input data through the MPI process to obtain a training result; and a sending module, configured to control the MPI process to send the training result to the Spark task through the output named pipe.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium in which a computer program is stored, wherein the computer program is configured to execute the above training method of the document topic generation model when run.
According to another aspect of the embodiments of the present invention, there is also provided an electronic apparatus, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above training method of the document topic generation model through the computer program.
In the embodiment of the invention, a Message Passing Interface (MPI) process of the LDA is started in a big data processing framework Spark task, and the Spark task writes input data to the MPI process through the input named pipe created by the MPI process, so that the MPI process can train the LDA using the input data and send the training result back to the Spark task. Peripheral data access and task scheduling are implemented on Spark to ensure usability, while the training of the LDA itself is implemented on MPI to ensure performance. By combining Spark and MPI, performance and usability are achieved at the same time, realizing fast training, low memory consumption and high usability for the document topic generation model LDA, thereby solving the technical problems of slow training and high memory consumption in document topic generation models.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a diagram illustrating an application environment of an alternative method for training a document topic generation model according to an embodiment of the invention;
FIG. 2 is a flowchart illustrating an alternative method for training a document topic generation model according to an embodiment of the invention;
FIG. 3 is a schematic structural diagram of a big data processing framework Spark task according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the training of an LDA according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart diagram illustrating an alternative method for training a document topic generation model according to an embodiment of the invention;
FIG. 6 is a flowchart illustrating an alternative method for training a document topic generation model according to an embodiment of the invention;
FIG. 7 is a schematic structural diagram of an alternative training apparatus for a document topic generation model according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an alternative electronic apparatus according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, a training method of a document topic generation model is provided. Optionally, as an optional implementation, the training method of the document topic generation model may be applied, but is not limited, to the environment shown in FIG. 1.
The terminal device 102 may execute, through the processor 104, step S110 of starting a Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task; S112, the MPI process creates an input named pipe and an output named pipe; S114, the Spark task writes input data to the MPI process through the input named pipe, wherein the input data is used for training the LDA; S116, the MPI process completes the training of the LDA using the input data to obtain a training result; and S118, the MPI process sends the training result to the Spark task through the output named pipe. The terminal device 102 may complete the training of the LDA by performing the above steps. The terminal device 102 may store the input data, the trained LDA, and the like in the memory 106, and may display the trained LDA and the like via the display 108.
Optionally, in this embodiment, the training method of the document topic generation model may be applied, but is not limited, to the terminal device 102, to assist an application client in training the LDA. The application client may, but is not limited to, run in the terminal device 102, which may be a PC, a tablet computer, a notebook computer, a mobile phone, or another terminal device that supports running the application client. The above is merely an example, and this embodiment is not limited thereto.
Optionally, as an optional implementation, as shown in FIG. 2, the training method of the document topic generation model includes:
step S202, starting a Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task;
step S204, the MPI process creates an input named pipe and an output named pipe;
step S206, the Spark task writes input data to the MPI process through the input named pipe, wherein the input data is used for training the LDA;
step S208, the MPI process completes the training of the LDA using the input data to obtain a training result;
and step S210, the MPI process sends the training result to the Spark task through the output named pipe.
In the embodiment of the invention, a Message Passing Interface (MPI) process of the LDA is started in a big data processing framework Spark task. Spark is a general-purpose cluster computing platform and in-memory parallel computing framework that can be used to build large-scale, low-latency data analysis applications; MPI (Message Passing Interface) is a message passing standard for parallel programming across processes, on a single machine or across multiple machines. The embodiment of the invention effectively combines Spark and MPI: the LDA is trained through MPI, which ensures the performance of the core LDA algorithm, while the input data and the training result are transferred through Spark tasks, which provides peripheral data access and task scheduling and ensures usability. Moreover, since the MPI process of the LDA is started inside a Spark task, the whole system is a Spark application and can therefore be deployed and executed on any platform that supports Spark, such as Hadoop (a software framework for distributed processing of large amounts of data) or Kubernetes.
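To make this pattern concrete, the following is a minimal, hedged PySpark sketch of the Spark side only: each task launches an MPI worker and exchanges data with it through two named pipes. The worker binary name (lda_mpi_worker) and the pipe paths are hypothetical, and how the per-task workers join a single MPI world (e.g. via an external launcher) is elided; this is an illustration of the pipe-based exchange, not the patent's exact implementation.

```python
# Hypothetical sketch: one Spark partition <-> one MPI worker via named pipes.
import os
import subprocess
import time
from pyspark import SparkContext

def run_partition(index, records):
    in_pipe = f"/tmp/lda_in_{index}"    # input named pipe (Spark -> MPI)
    out_pipe = f"/tmp/lda_out_{index}"  # output named pipe (MPI -> Spark)

    # In the patent's design the MPI process creates both pipes itself,
    # so the Spark task simply waits until they exist.
    proc = subprocess.Popen(["lda_mpi_worker", in_pipe, out_pipe])  # hypothetical binary
    while not (os.path.exists(in_pipe) and os.path.exists(out_pipe)):
        time.sleep(0.1)

    # Write this partition's training data into the input pipe.
    with open(in_pipe, "w") as f:
        for rec in records:
            f.write(rec + "\n")

    # Read the training result (e.g. matrix rows) from the output pipe.
    with open(out_pipe) as f:
        result = f.read().splitlines()
    proc.wait()
    return iter(result)

sc = SparkContext(appName="lda-on-mpi-sketch")
corpus = sc.textFile("hdfs:///corpus").repartition(8)   # illustrative path
results_rdd = corpus.mapPartitionsWithIndex(run_partition)
results_rdd.count()  # trigger execution of all partitions
```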
Optionally, starting the Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task includes: starting n MPI processes in n Spark tasks, where one MPI process is started in each Spark task, and the n MPI processes form a distributed MPI program. As shown in FIG. 3, the big data processing framework Spark job 302 may include n Spark tasks 304, one MPI process 306 is started in each Spark task 304, and all the MPI processes 306 in the Spark tasks 304 together form a distributed LDA program in which the MPI processes 306 implement the LDA training through message passing. In the embodiment of the present invention, in each Spark task, the MPI process 306 creates an input named pipe and an output named pipe; the distributed LDA program is scheduled and executed by Spark, and each MPI process 306 exchanges data with its Spark task 304 through the input named pipe and the output named pipe. Because the LDA is trained by the distributed MPI program formed by the n MPI processes, the training performance can be ensured.
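For the MPI side, here is a small hedged sketch using mpi4py: each rank creates its own pair of named pipes and then takes one of the two roles described below (Word or Document). The split point m, the pipe paths, and the two helper loops are assumptions for illustration, not from the patent.

```python
# Hypothetical sketch of the MPI side: pipe creation and role assignment.
import os
from mpi4py import MPI

def serve_word_topic_shard(comm, rank, m):
    # Word-process model-service loop (respond to pull/update messages); elided.
    pass

def train_documents(comm, docs, m):
    # Document-process iterative Gibbs training loop; elided
    # (see the sweep sketch later in this description).
    pass

comm = MPI.COMM_WORLD
rank, n = comm.Get_rank(), comm.Get_size()
m = n // 2  # number of Word processes; the split is a configuration choice

in_pipe = f"/tmp/lda_in_{rank}"
out_pipe = f"/tmp/lda_out_{rank}"
for path in (in_pipe, out_pipe):
    if not os.path.exists(path):
        os.mkfifo(path)  # create the named pipe used to talk to the Spark task

if rank < m:
    # Ranks 0..m-1: Word processes serving word-topic parameter shards.
    serve_word_topic_shard(comm, rank, m)
else:
    # Ranks m..n-1: Document processes read training data from the input pipe.
    with open(in_pipe) as f:
        docs = f.read().splitlines()
    train_documents(comm, docs, m)
```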
Optionally, the n MPI processes include m Word processes and n-m Document processes, and the Spark tasks writing the input data to the MPI processes through the input named pipes includes: the n-m Spark tasks corresponding to the n-m Document processes write the input data to the n-m Document processes through the n-m input named pipes created by the n-m Document processes, where each of the n-m Spark tasks writes its input data to one of the n-m Document processes through one of the n-m input named pipes, and the m Spark tasks corresponding to the m Word processes are set not to write input data to the m Word processes.
In the embodiment of the present invention, the n MPI processes include a Word process group and a Document process group. The input data for training the LDA is divided into n partitions, and n Spark tasks are started correspondingly; an LDA core program, i.e., an MPI process, is started in each Spark task. A Word process runs in each of the first m Spark tasks, and the m Word processes form the Word process group; a Document process runs in each of the last n-m Spark tasks, and the n-m Document processes form the Document process group. Among the n partitions, the first m partitions carry no data and the last n-m partitions carry the data. The first m tasks correspond to the Word process group and are allocated no data, so their writes are empty; the last n-m tasks correspond to the Document process group and write the input data for training into their input named pipes, and after a Document process reads the training data, iterative training starts, thereby training the LDA. As shown in FIG. 4, in the embodiment of the present invention, the Word process group 402 provides a model service, equivalent to a parameter server, for responding to parameter pull and update requests. Each Word process 404 maintains a part of the word-topic model parameters of the corpus. In other words, the word-topic model parameters may be partitioned, with each Word process 404 maintaining one partition of the word-topic model parameters, so that the word-topic model parameters of the entire corpus are distributed among the different Word processes 404. For example, the word id is taken modulo the number of Word processes, so that during Document process training, the Word process serving the model parameters of a given word can be located by its word id, as shown in the sketch below. The word-topic model here refers to the word-topic matrix, and a word-topic model parameter is the parameter corresponding to one partition of the partitioned word-topic matrix. It can be understood that the doc-topic model in the embodiment of the present invention refers to the doc-topic matrix, and a doc-topic model parameter is the parameter corresponding to one partition of the partitioned doc-topic matrix. In the embodiment of the present invention, the Document process group 406 is responsible for document training, equivalent to the worker nodes, and trains on the input data; training may proceed in a data-parallel manner, that is, each Document process 408 loads training documents (doc) through its input named pipe, the data loaded by different Document processes 408 is only a part of the whole training data, and each Document process 408 trains a part of the doc-topic model parameters. The doc-topic model here refers to the doc-topic matrix. When training finishes, for example when the maximum number of iterations is reached, each Word process outputs its word-topic model parameters through its output named pipe and each Document process outputs its doc-topic model parameters through its output named pipe. Since each Document process trains only a part of the doc-topic matrix, the final doc-topic matrix is obtained from the doc-topic model parameters output by all the Document processes, and similarly the final word-topic matrix is obtained from the word-topic model parameters output by all the Word processes.
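A minimal sketch of the modulo routing just described (the function name is illustrative; the patent only specifies "word id modulo the number of Word processes"):

```python
# Hypothetical sketch: word-topic parameters are sharded over the m Word
# processes (MPI ranks 0..m-1) by taking the word id modulo m.
def word_owner_rank(word_id: int, m: int) -> int:
    """Return the MPI rank of the Word process holding this word's parameters."""
    return word_id % m

# Example: with m = 4 Word processes, word id 10 is served by rank 2.
assert word_owner_rank(10, 4) == 2
```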
In the embodiment of the invention, after the MPI processes are started, each first creates its input and output named pipes; each Word process then initializes a word-topic matrix shard and starts the model service, and each Document process starts waiting for input data on its input named pipe so as to train the doc-topic model parameters. Because the input data is used for training by the Document processes, a Word process updates the word-topic matrix only according to the message notifications sent by the Document processes and needs no input data of its own. Therefore, the m Spark tasks corresponding to the m Word processes are set not to write input data to the m Word processes, which avoids feeding data to the Word processes. In the embodiment of the invention, after all the MPI processes are started, the Spark tasks write the training data into the input named pipes; because the first m Spark tasks are set not to write input data to the m Word processes, the first m Spark tasks are allocated no data, and the last n-m Spark tasks write the input data used for training, thereby enabling training based on the input data.
Optionally, the MPI processes completing the training of the LDA using the input data to obtain the training result includes: the n-m Document processes read the input data; and the n-m Document processes train the LDA using the read input data to obtain a final word-topic matrix and a final doc-topic matrix, where the training result includes the final word-topic matrix and the final doc-topic matrix, each row of the final word-topic matrix represents the weight of one word of the corpus in each topic, and each row of the final doc-topic matrix represents the weight of one document in each topic. In the embodiment of the invention, a Document process reads input data to train the LDA and obtains a trained doc-topic matrix, i.e., the final doc-topic matrix. During training, a Document process pulls the word-topic matrix shards held by the Word processes and updates its doc-topic matrix according to the word-topic matrix and the input data. Meanwhile, after updating the doc-topic matrix, the Document process notifies the Word processes through MPI messages to update the word-topic matrix. When training finishes, the Word processes output the updated word-topic matrix, i.e., the final word-topic matrix, through their output named pipes.
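The patent does not spell out the sampling formula; for reference, the standard collapsed Gibbs sampling update that such a Document process would apply when reassigning the topic of word w at position i of document d is (a textbook formula, not quoted from the patent):

```latex
P\left(z_{d,i} = k \mid \mathbf{z}^{\neg(d,i)}, \mathbf{w}\right)
\;\propto\;
\left(n_{d,k}^{\neg(d,i)} + \alpha\right)
\cdot
\frac{n_{k,w}^{\neg(d,i)} + \beta}{n_{k}^{\neg(d,i)} + V\beta}
```

Here n_{d,k} counts the topic-k assignments in document d (a row of the doc-topic matrix), n_{k,w} counts the assignments of word w to topic k (an entry of the word-topic matrix), n_k is the total count for topic k, V is the vocabulary size, alpha and beta are the Dirichlet priors, and the superscript means the current assignment is excluded from the counts.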
In the embodiment of the invention, a Document process notifies the Word processes through MPI messages to update the word-topic matrix; at the same time, because the Document process needs to update its own doc-topic matrix according to the word-topic matrix shards held by the Word processes, the Document process also sends pull messages to the Word processes. Here, a Word process updates the word-topic matrix in response to the model-update and pull messages sent by the Document processes, and sends the updated word-topic matrix to the Document process. The word-topic matrix (matrix size = number of words × number of topics) represents the weight of each word of the corpus in each topic, and the doc-topic matrix (matrix size = number of documents × number of topics) represents the weight of each document of the document library in each topic. A Document process derives intermediate data from the input data and the word-topic matrix pulled from the Word processes, where the intermediate data indicates the weight of each word in each topic. The Document process notifies the Word processes through messages so as to send them the intermediate data, and a Word process can then update the word-topic matrix according to the current word-topic matrix and the intermediate data. For example, the word-topic matrix and the intermediate data may be summed to update the weight of each word in each topic; the summation may be a weighted summation.
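A minimal sketch of this merge step, assuming dense NumPy shards and a simple (optionally weighted) additive merge; the patent does not fix the exact arithmetic, so the function below is illustrative:

```python
# Hypothetical sketch: a Word process folds intermediate word-topic counts
# received from a Document process into its shard of the word-topic matrix.
import numpy as np

def apply_update(word_topic_shard: np.ndarray,
                 intermediate: np.ndarray,
                 weight: float = 1.0) -> np.ndarray:
    """word_topic_shard: (words_in_shard, num_topics) weights.
    intermediate: same shape, the delta computed by a Document process."""
    return word_topic_shard + weight * intermediate

shard = np.zeros((3, 2))                            # 3 words in this shard, 2 topics
delta = np.array([[1., 0.], [0., 2.], [1., 1.]])    # counts from one Document process
shard = apply_update(shard, delta)                  # per-topic word weights updated
```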
Optionally, the MPI processes sending the training result to the Spark tasks through the output named pipes includes: the m Word processes write the final word-topic matrix into the m output named pipes created by the m Word processes, and the n-m Document processes write the final doc-topic matrix into the n-m output named pipes created by the n-m Document processes; the m Spark tasks corresponding to the m Word processes read the m output named pipes to obtain the final word-topic matrix, and the n-m Spark tasks corresponding to the n-m Document processes read the n-m output named pipes to obtain the final doc-topic matrix. In the embodiment of the present invention, since each Document process holds only a part of the documents of the corpus, it maintains the topic distribution of that part of the documents during training, that is, it trains a part of the doc-topic matrix. Likewise, each Word process holds only part of the word-topic model parameters of the corpus, and the word-topic model parameters of the entire corpus can be evenly distributed over the different Word processes. Therefore, to obtain the final word-topic matrix and the final doc-topic matrix, each Word process must write its part of the final word-topic matrix into its output named pipe, and each Document process must write its part of the final doc-topic matrix into its output named pipe, so that the final word-topic matrix is assembled from the word-topic matrices output by the m Word processes and the final doc-topic matrix is assembled from the doc-topic matrices output by the n-m Document processes.
Optionally, after the m Spark tasks corresponding to the m Word processes read the m output named pipes to obtain the final word-topic matrix, and the n-m Spark tasks corresponding to the n-m Document processes read the n-m output named pipes to obtain the final doc-topic matrix, the method further includes: the m Spark tasks write the final word-topic matrix into an external storage system, and the n-m Spark tasks write the final doc-topic matrix into the external storage system. In the embodiment of the invention, the final word-topic matrix and the final doc-topic matrix are written into the external storage system through the Spark tasks, so that the trained matrices can conveniently be used later.
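As a hedged illustration of this persistence step (the patent only says "external storage system"; the output format, tagging convention, and paths below are all assumptions), continuing the earlier Spark-side sketch where results_rdd holds the result lines read from the output pipes:

```python
# Hypothetical sketch: persist the assembled matrices to external storage.
# Assumes each result line is tagged "word ..." or "doc ..." by the worker.
word_topic_rows = results_rdd.filter(lambda r: r.startswith("word"))
doc_topic_rows = results_rdd.filter(lambda r: r.startswith("doc"))
word_topic_rows.saveAsTextFile("hdfs:///models/lda/word_topic")  # illustrative path
doc_topic_rows.saveAsTextFile("hdfs:///models/lda/doc_topic")
```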
Optionally, the n-m Document processes training the LDA using the read input data to obtain the final word-topic matrix and the final doc-topic matrix includes:
each Document process of the n-m Document processes performs the following operations, where the Document process performing them is referred to as the current Document process:
the current Document process reads document data from the current input named pipe created by the current Document process and caches the document data in memory, where the Word processes have started the model service and listen for the model-update and pull messages sent by the current Document process;
the current Document process randomly selects a corresponding topic for each word in the document data, updates the local doc-topic matrix, and notifies the Word processes through a model-update message to update the word-topic matrix, where a Word process receives the model-update message sent by the current Document process and updates the word-topic matrix in response to it;
the current Document process pulls word-topic parameters from the Word processes, updates the local doc-topic matrix using the Gibbs sampling algorithm, and notifies the Word processes through MPI messages to update the word-topic matrix, where the Word processes wait for the current Document process to pull the word-topic parameters and to update the word-topic matrix.
In the embodiment of the invention, the Document process reads document data from its input named pipe and caches it in memory, and the Word processes start the model service for listening for the model-update and pull messages sent by the Document processes. Here, the Document process randomly selects a corresponding topic for each word in the document data based on the input data and the word-topic matrix pulled from the Word processes. It can be understood that when the Document process subsequently notifies a Word process through a model-update message to update the word-topic matrix, the message carries the topics the Document process selected for each word, which makes it convenient for the Word process to update the word-topic matrix. In the embodiment of the invention, a Word process listens through the model service for the model-update and pull messages sent by the Document processes, updates the word-topic matrix accordingly, and sends the updated word-topic matrix as parameters to the Document process to facilitate the Document process's next round of training.
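A compact, hedged sketch of one Gibbs sweep over a cached document, using the collapsed Gibbs update given earlier. Dense NumPy arrays are assumed, all names are illustrative, and the MPI pull/update messaging with the Word processes is elided:

```python
# Hypothetical sketch of one collapsed Gibbs sweep inside a Document process.
import numpy as np

def gibbs_sweep(doc_words, z, doc_topic, word_topic, topic_totals,
                alpha, beta, rng):
    """doc_words: word ids of one document; z: current topic per position;
    doc_topic: (num_topics,) counts for this document (row of doc-topic matrix);
    word_topic: (vocab, num_topics) parameters pulled from the Word processes;
    topic_totals: (num_topics,) total counts per topic."""
    V = word_topic.shape[0]
    for i, w in enumerate(doc_words):
        k_old = z[i]
        # Remove the current assignment from all counts.
        doc_topic[k_old] -= 1
        word_topic[w, k_old] -= 1
        topic_totals[k_old] -= 1
        # Collapsed Gibbs conditional (see the formula above).
        p = (doc_topic + alpha) * (word_topic[w] + beta) / (topic_totals + V * beta)
        k_new = rng.choice(len(p), p=p / p.sum())
        # Add the new assignment back; the deltas would later be sent to
        # the owning Word processes as model-update messages.
        z[i] = k_new
        doc_topic[k_new] += 1
        word_topic[w, k_new] += 1
        topic_totals[k_new] += 1
    return z

rng = np.random.default_rng(0)  # example generator for the sweep
```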
Embodiments of the present invention will be described below with reference to FIG. 5 and FIG. 6.
S502, load the training data and repartition it. For example, if the MPI LDA kernel has n processes, that is, n MPI processes of which m are Word processes and n-m are Document processes, the training data, that is, the input data, may be divided into n partitions, the same number as the MPI processes, where the first m partitions correspond to the Word processes and are allocated no data, and the last n-m partitions correspond to the Document processes and share all the training data evenly. For example, with a custom partition function, the partition id of each piece of data can be computed by the formula key % (n - m) + m, as sketched below. S504, schedule and execute the distributed MPI core program: based on the repartitioned training data, the MPI-based LDA core program is scheduled and executed. S506, obtain the output and write it into an external storage system: after the MPI LDA core program finishes training, the output can be obtained and written into the external storage system so that the training result can be used later.
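A hedged PySpark sketch of this repartitioning step; it assumes integer record keys (as the formula suggests) and reuses the corpus RDD from the earlier sketch, with illustrative values for n and m:

```python
# Hypothetical sketch: repartition keyed training data so that partitions
# 0..m-1 (the Word processes) stay empty and partitions m..n-1 (the
# Document processes) share the data evenly, via key % (n - m) + m.
n, m = 8, 3  # total MPI processes / Word processes; illustrative values

def lda_partitioner(key: int) -> int:
    return key % (n - m) + m  # always lands in a Document partition

keyed = corpus.zipWithIndex().map(lambda t: (t[1], t[0]))  # (key, record)
repartitioned = keyed.partitionBy(n, lda_partitioner)
```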
For the above training process, the training data, that is, the input data, has been divided into n partitions, and accordingly n Spark tasks are started for execution. As shown in FIG. 6, the Spark task executes S602 to launch an MPI process; after starting, the MPI process executes step S604 to create the input and output named pipes. In each Spark task, each Word process initializes a word-topic matrix and starts the model service, and each Document process starts waiting for data on its input named pipe. After waiting for all the MPI processes to start, the Spark tasks execute S606 and write the training data into the input named pipes created by the MPI processes; since the first m Spark tasks are allocated no data, they skip this directly, and only the last n-m Spark tasks write training data. Correspondingly, the n-m Document processes execute S608 to read the input named pipes and store the data in memory, and execute S610 to iteratively train the LDA on the input data. After training finishes, S612 is executed: the training result is written into the output named pipes and thereby delivered to the Spark tasks. The Spark tasks execute S614, read the training results from the output named pipes and, once all the output named pipes have been read, store the training results into an external storage system, completing the training and yielding the final training result.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiments of the present invention, there is also provided a training apparatus for a document topic generation model, which is used for implementing the above training method of the document topic generation model. As shown in FIG. 7, the apparatus includes: a generating module 702, configured to start a Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task; a creating module 704, configured to create an input named pipe and an output named pipe through the MPI process; an input module 706, configured to control the Spark task to write input data to the MPI process through the input named pipe, where the input data is used for training the LDA; a training module 708, configured to complete the training of the LDA using the input data through the MPI process to obtain a training result; and a sending module 710, configured to control the MPI process to send the training result to the Spark task through the output named pipe.
In the embodiment of the invention, a Message Passing Interface (MPI) process of the LDA is started in a big data processing framework Spark task. As noted above, Spark is a general-purpose cluster computing platform and in-memory parallel computing framework for building large-scale, low-latency data analysis applications, and MPI (Message Passing Interface) is a message passing standard for parallel programming across processes. The embodiment of the invention effectively combines Spark and MPI: the LDA is trained through MPI, which ensures the performance of the core LDA algorithm, while the input data and the training result are transferred through Spark tasks, which provides peripheral data access and task scheduling and ensures usability. Moreover, since the MPI process of the LDA is started inside a Spark task, the whole system is a Spark application and can therefore be deployed and executed on any platform that supports Spark, such as Hadoop (a software framework for distributed processing of large amounts of data) or Kubernetes.
Optionally, the generating module includes: a starting unit, configured to start n MPI processes in n Spark tasks, where one MPI process is started in each Spark task and the n MPI processes form a distributed MPI program. Here, the big data processing framework Spark job may include n Spark tasks, one MPI process is started in each Spark task, and all the MPI processes in the Spark tasks form a distributed LDA program in which the MPI processes implement the LDA training through message passing. In the embodiment of the invention, in each Spark task, the MPI process creates an input named pipe and an output named pipe, the distributed LDA program is scheduled and executed by Spark, and the MPI process exchanges data with its Spark task through the input named pipe and the output named pipe.
Optionally, the n MPI processes include m Word processes and n-m Document processes, and the input module includes: an input unit, configured to control the n-m Spark tasks corresponding to the n-m Document processes to write the input data to the n-m Document processes through the n-m input named pipes created by the n-m Document processes, where each of the n-m Spark tasks writes its input data to one of the n-m Document processes through one of the n-m input named pipes, and the m Spark tasks corresponding to the m Word processes are set not to write input data to the m Word processes. In the embodiment of the present invention, the n MPI processes include a Word process group and a Document process group. The input data for training the LDA is divided into n partitions, and n Spark tasks are started correspondingly; an LDA core program, i.e., an MPI process, is started in each Spark task; a Word process runs in each of the first m Spark tasks, forming the Word process group, and a Document process runs in each of the last n-m Spark tasks, forming the Document process group. Among the n partitions, the first m partitions carry no data and the last n-m partitions carry the data; the first m tasks correspond to the Word process group and are allocated no data, so their writes are empty, while the last n-m tasks correspond to the Document process group and write the input data for training into the input named pipes, and after a Document process reads the training data, iterative training starts, thereby training the LDA.
In the embodiment of the invention, after the MPI processes are started, each first creates its input and output named pipes; each Word process then initializes a word-topic matrix shard and starts the model service, and each Document process starts waiting for input data on its input named pipe so as to train the doc-topic model parameters. Because the input data is used for training by the Document processes, a Word process updates the word-topic matrix only according to the message notifications sent by the Document processes and needs no input data of its own. Therefore, the m Spark tasks corresponding to the m Word processes are set not to write input data to the m Word processes, which avoids feeding data to the Word processes. In the embodiment of the invention, after all the MPI processes are started, the Spark tasks write the training data into the input named pipes; because the first m Spark tasks are set not to write input data to the m Word processes, the first m Spark tasks are allocated no data, and the last n-m Spark tasks write the input data used for training, thereby enabling training based on the input data.
Optionally, the training module includes: a first reading unit, configured to read the input data through the n-m Document processes; and a training unit, configured to train the LDA using the read input data through the n-m Document processes to obtain a final word-topic matrix and a final doc-topic matrix, where the training result includes the final word-topic matrix and the final doc-topic matrix, each row of the final word-topic matrix represents the weight of one word of the corpus in each topic, and each row of the final doc-topic matrix represents the weight of one document in each topic. In the embodiment of the invention, a Document process reads input data to train the LDA and obtains a trained doc-topic matrix, i.e., the final doc-topic matrix. During training, a Document process pulls the word-topic matrix shards held by the Word processes and updates its doc-topic matrix according to the word-topic matrix and the input data. Meanwhile, after updating the doc-topic matrix, the Document process notifies the Word processes through MPI messages to update the word-topic matrix. When training finishes, the Word processes output the updated word-topic matrix, i.e., the final word-topic matrix, through their output named pipes.
In the embodiment of the invention, a Document process notifies the Word processes through MPI messages to update the word-topic matrix; at the same time, because the Document process needs to update its own doc-topic matrix according to the word-topic matrix shards held by the Word processes, the Document process also sends pull messages to the Word processes. Here, a Word process updates the word-topic matrix in response to the model-update and pull messages sent by the Document processes, and sends the updated word-topic matrix to the Document process. The word-topic matrix (matrix size = number of words × number of topics) represents the weight of each word of the corpus in each topic, and the doc-topic matrix (matrix size = number of documents × number of topics) represents the weight of each document of the document library in each topic.
Optionally, the sending module includes: a writing unit, configured to write the final word-topic matrix into the m output named pipes created by the m Word processes through the m Word processes, and to write the final doc-topic matrix into the n-m output named pipes created by the n-m Document processes through the n-m Document processes; and a second reading unit, configured to read the m output named pipes through the m Spark tasks corresponding to the m Word processes to obtain the final word-topic matrix, and to read the n-m output named pipes through the n-m Spark tasks corresponding to the n-m Document processes to obtain the final doc-topic matrix. In the embodiment of the present invention, since each Document process holds only a part of the documents of the corpus, it maintains the topic distribution of that part of the documents during training, that is, it trains a part of the doc-topic matrix; each Word process holds only part of the word-topic model parameters of the corpus, and the word-topic model parameters of the entire corpus can be evenly distributed over the different Word processes. Therefore, to obtain the final word-topic matrix and the final doc-topic matrix, each Word process must write its part of the final word-topic matrix into its output named pipe, and each Document process must write its part of the final doc-topic matrix into its output named pipe, so that the final word-topic matrix is assembled from the word-topic matrices output by the m Word processes and the final doc-topic matrix is assembled from the doc-topic matrices output by the n-m Document processes.
Optionally, the apparatus further includes: a writing module, configured to write the final word-topic matrix into an external storage system through the m Spark tasks, and to write the final doc-topic matrix into the external storage system through the n-m Spark tasks. In the embodiment of the invention, the final word-topic matrix and the final doc-topic matrix are written into the external storage system through the Spark tasks, so that the trained matrices can conveniently be used later.
Optionally, the training unit is specifically configured to: perform the following operations through each Document process of the n-m Document processes, where the Document process performing them is referred to as the current Document process: read document data from the current input named pipe created by the current Document process through the current Document process and cache the document data in memory, where the Word processes have started the model service and listen for the model-update and pull messages sent by the current Document process; randomly select a corresponding topic for each word in the document data through the current Document process, update the local doc-topic matrix, and notify the Word processes through a model-update message to update the word-topic matrix, where a Word process receives the model-update message sent by the current Document process and updates the word-topic matrix in response to it; and pull word-topic parameters from the Word processes through the current Document process, update the local doc-topic matrix using the Gibbs sampling algorithm, and notify the Word processes through MPI messages to update the word-topic matrix, where the Word processes wait for the current Document process to pull the word-topic parameters and to update the word-topic matrix. In the embodiment of the invention, the Document process reads document data from its input named pipe and caches it in memory, and the Word processes start the model service for listening for the model-update and pull messages sent by the Document processes. Here, the Document process randomly selects a corresponding topic for each word in the document data based on the input data and the word-topic matrix pulled from the Word processes. It can be understood that when the Document process subsequently notifies a Word process through a model-update message to update the word-topic matrix, the message carries the topics selected for each word, which makes it convenient for the Word process to update the word-topic matrix. In the embodiment of the invention, a Word process listens through the model service for the model-update and pull messages sent by the Document processes, updates the word-topic matrix accordingly, and sends the updated word-topic matrix as parameters to the Document process to facilitate the Document process's next round of training.
According to another aspect of the embodiments of the present invention, there is further provided an electronic apparatus for implementing the above training method of the document topic generation model. As shown in FIG. 8, the electronic apparatus includes a memory 802 and a processor 804, the memory 802 stores a computer program, and the processor 804 is configured to execute the steps in any of the above method embodiments through the computer program.
Optionally, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps through the computer program:
S1, starting a Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task;
S2, the MPI process creates an input named pipe and an output named pipe;
S3, the Spark task writes input data to the MPI process through the input named pipe, where the input data is used for training the LDA;
S4, the MPI process completes the training of the LDA using the input data to obtain a training result;
and S5, the MPI process sends the training result to the Spark task through the output named pipe.
Alternatively, as can be understood by those skilled in the art, the structure shown in FIG. 8 is only illustrative; the electronic apparatus may also be a terminal device such as a Mobile Internet Device (MID), a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palm computer, or a PAD. FIG. 8 does not limit the structure of the electronic apparatus; for example, the electronic apparatus may include more or fewer components (such as network interfaces) than shown in FIG. 8, or have a different configuration from that shown in FIG. 8.
The memory 802 may be used to store software programs and modules, such as the program instructions/modules corresponding to the training method and apparatus of the document topic generation model in the embodiment of the present invention; the processor 804 executes various functional applications and data processing by running the software programs and modules stored in the memory 802, that is, implements the above training method of the document topic generation model. The memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 802 may further include memory located remotely from the processor 804, which may be connected to the terminal over a network; examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 802 may be, but is not limited to being, specifically configured to store information such as the input data. As an example, as shown in FIG. 8, the memory 802 may include, but is not limited to, the generating module 702, the creating module 704, the input module 706, the training module 708, and the sending module 710 of the above training apparatus of the document topic generation model. In addition, the memory 802 may further include, but is not limited to, other module units of the training apparatus of the document topic generation model, which are not described again in this example.
Optionally, the transmission device 806 is configured to receive or transmit data via a network. Examples of the network may include wired and wireless networks. In one example, the transmission device 806 includes a Network Interface Controller (NIC) that can be connected to a router via a network cable so as to communicate with the internet or a local area network. In another example, the transmission device 806 is a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In addition, the electronic device further includes: a display 808 for displaying the input data and the like; and a connection bus 810 for connecting the respective module parts in the above-described electronic apparatus.
According to a further aspect of embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1, starting a Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task;
S2, the MPI process creates an input named pipe and an output named pipe;
S3, the Spark task writes input data to the MPI process through the input named pipe, where the input data is used for training the LDA;
S4, the MPI process completes the training of the LDA using the input data to obtain a training result;
and S5, the MPI process sends the training result to the Spark task through the output named pipe.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (15)

1. A training method of a document topic generation model, characterized by comprising the following steps:
starting a Message Passing Interface (MPI) process of the document topic generation model LDA in a big data processing framework Spark task;
the MPI process creates an input named pipe and an output named pipe;
the Spark task writes input data to the MPI process through the input named pipe, wherein the input data is used for training the LDA;
the MPI process completes the training of the LDA using the input data to obtain a training result;
and the MPI process sends the training result to the Spark task through the output named pipe.
2. The method according to claim 1, wherein starting the Message Passing Interface (MPI) process of the document topic generation model LDA in the big data processing framework Spark task comprises:
starting n MPI processes in n Spark tasks, wherein one MPI process is started in each Spark task, and the n MPI processes form a distributed MPI program.
3. The method according to claim 2, wherein the n MPI processes comprise m Word processes and n-m Document processes, and wherein the Spark tasks writing input data to the MPI processes through the input named pipes comprises:
the n-m Spark tasks corresponding to the n-m Document processes writing the input data to the n-m Document processes through the n-m input named pipes created by the n-m Document processes, wherein each of the n-m Spark tasks writes the input data to one of the n-m Document processes through one of the n-m input named pipes, and the m Spark tasks corresponding to the m Word processes are set not to write the input data to the m Word processes.
4. The method of claim 3, wherein the MPI process uses the input data to complete the training of the LDA, resulting in a training result comprising:
the n-m Document processes read the input data;
the n-m Document processes use the read input data to train the LDA to obtain a final word-topic matrix and a final doc-topic matrix, wherein the training result comprises the final word-topic matrix and the final doc-topic matrix, each row of the final word-topic matrix represents the weight of one word in the corpus in each topic, and each row of the final doc-topic matrix represents the weight of one document in each topic.
5. The method of claim 4, wherein the MPI process sending the training result to the Spark task through the output named pipe comprises:
the m Word processes write the final word-topic matrix into m output named pipes created by the m Word processes, and the n-m Document processes write the final doc-topic matrix into n-m output named pipes created by the n-m Document processes;
and the m Spark tasks corresponding to the m Word processes read the m output named pipes to obtain the final word-topic matrix, and the n-m Spark tasks corresponding to the n-m Document processes read the n-m output named pipes to obtain the final doc-topic matrix.
6. The method of claim 5, wherein after the m Spark tasks corresponding to the m Word processes read the m output named pipes to obtain the final word-topic matrix and the n-m Spark tasks corresponding to the n-m Document processes read the n-m output named pipes to obtain the final doc-topic matrix, the method further comprises:
and the m Spark tasks write the final word-topic matrix into an external storage system, and the n-m Spark tasks write the final doc-topic matrix into the external storage system.
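As a minimal sketch of this persistence step, assuming the rows read back from the output named pipes are available as RDDs and HDFS serves as the external storage system (the paths and stand-in rows below are illustrative only):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Stand-in rows for what the Spark tasks read from the output named pipes;
# real rows would carry per-topic weights for each word or document.
word_topic_rows = sc.parallelize(["word_1 0.3 0.7", "word_2 0.9 0.1"])
doc_topic_rows = sc.parallelize(["doc_1 0.6 0.4"])

# Persist the final matrices to the external storage system (placeholder paths).
word_topic_rows.saveAsTextFile("hdfs:///models/lda/word_topic")
doc_topic_rows.saveAsTextFile("hdfs:///models/lda/doc_topic")
```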
7. The method of claim 4, wherein the training of the LDA by the n-m Document processes using the read input data to obtain a final word-topic matrix and a final doc-topic matrix comprises:
each Document process of the n-m Document processes performs the following operations, wherein a Document process performing the following operations is referred to as the current Document process:
the current Document process reads document data from the current input named pipe created by the current Document process and caches the document data in memory, wherein the Word process starts a model service and listens for the model update and model pull messages sent by the current Document process;
the current Document process randomly selects a topic for each word in the document data, updates the local doc-topic matrix, and notifies the Word process through a model update message to update the word-topic matrix, wherein the Word process is configured to receive the model update message sent by the current Document process and to update the word-topic matrix in response to the model update message;
the current Document process pulls word-topic parameters from the Word process, updates the local doc-topic matrix by using a Gibbs sampling algorithm, and notifies the Word process through an MPI message to update the word-topic matrix, wherein the Word process is configured to wait for the current Document process to pull the word-topic parameters and to update the word-topic matrix.
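For reference, the per-token update behind the Gibbs sampling step of claim 7 is the standard collapsed-Gibbs full conditional; in this single-process sketch the word-topic counts are held locally, whereas in the claim they are pulled from and pushed back to the Word process:

```python
import numpy as np

def gibbs_sweep(docs, z, doc_topic, word_topic, topic_total, alpha, beta):
    """One collapsed-Gibbs sweep. doc_topic: (D, K) counts; word_topic:
    (V, K) counts; topic_total: (K,) tokens per topic; z: per-token topics."""
    V, K = word_topic.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the token's current assignment from all counts.
            doc_topic[d, k] -= 1; word_topic[w, k] -= 1; topic_total[k] -= 1
            # Full conditional: p(k) ∝ (n_dk + α) · (n_wk + β) / (n_k + V·β)
            p = (doc_topic[d] + alpha) * (word_topic[w] + beta) \
                / (topic_total + V * beta)
            k = np.random.choice(K, p=p / p.sum())
            z[d][i] = k
            # Add the resampled assignment back into the counts.
            doc_topic[d, k] += 1; word_topic[w, k] += 1; topic_total[k] += 1
```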
8. A training device for a document theme generation model, characterized by comprising:
a generating module, configured to start a Message Passing Interface (MPI) process of a document theme generation model (LDA) in a big data processing framework Spark task;
a creation module, configured to create an input named pipe and an output named pipe through the MPI process;
an input module, configured to control the Spark task to input the input data to the MPI process through the input named pipe, wherein the input data is used for training the LDA;
a training module, configured to complete the training of the LDA by using the input data through the MPI process to obtain a training result;
and a sending module, configured to control the MPI process to send the training result to the Spark task through the output named pipe.
9. The apparatus of claim 8, wherein the generating module comprises:
and a starting unit, configured to start n MPI processes in n Spark tasks, wherein one MPI process is started in each Spark task, and the n MPI processes form a distributed MPI program.
10. The apparatus of claim 9, wherein the n MPI processes comprise m Word processes and n-m Document processes, and wherein the input module comprises:
an input unit, configured to control the n-m Spark tasks corresponding to the n-m Document processes to input the input data to the n-m Document processes through n-m input named pipes created by the n-m Document processes, wherein each of the n-m Spark tasks inputs the input data to one of the n-m Document processes through one of the n-m input named pipes, and the m Spark tasks corresponding to the m Word processes are set not to input the input data to the m Word processes.
11. The apparatus of claim 10, wherein the training module comprises:
a first reading unit, configured to read the input data through the n-m Document processes;
a training unit, configured to train the LDA with the read input data through the n-m Document processes to obtain a final word-topic matrix and a final doc-topic matrix, wherein the training result comprises the final word-topic matrix and the final doc-topic matrix, each row of the final word-topic matrix represents the weight of one word in the corpus in each topic, and each row of the final doc-topic matrix represents the weight of one document in each topic.
12. The apparatus of claim 11, wherein the sending module comprises:
a writing unit, configured to write the final word-topic matrix into the m output named pipes created by the m Word processes through the m Word processes, wherein the n-m Document processes write the final doc-topic matrix into the n-m output named pipes created by the n-m Document processes;
a second reading unit, configured to read the m output named pipes through the m Spark tasks corresponding to the m Word processes to obtain the final word-topic matrix, and to read the n-m output named pipes through the n-m Spark tasks corresponding to the n-m Document processes to obtain the final doc-topic matrix.
13. The apparatus of claim 12, further comprising:
and a writing module, configured to write the final word-topic matrix into an external storage system through the m Spark tasks, wherein the n-m Spark tasks write the final doc-topic matrix into the external storage system.
14. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program, when executed, performs the method of any one of claims 1 to 7.
15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.
CN201910605567.4A 2019-07-05 2019-07-05 Training method and device of document theme generation model Pending CN111176800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910605567.4A CN111176800A (en) 2019-07-05 2019-07-05 Training method and device of document theme generation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910605567.4A CN111176800A (en) 2019-07-05 2019-07-05 Training method and device of document theme generation model

Publications (1)

Publication Number Publication Date
CN111176800A (en) 2020-05-19

Family

ID=70655345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910605567.4A Pending CN111176800A (en) 2019-07-05 2019-07-05 Training method and device of document theme generation model

Country Status (1)

Country Link
CN (1) CN111176800A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140059054A1 (en) * 2011-05-11 2014-02-27 Zhiyuan Liu Parallel generation of topics from documents
CN104598601A (en) * 2015-01-27 2015-05-06 北京齐尔布莱特科技有限公司 Method, device and calculating equipment for classifying users and content
US20170075991A1 (en) * 2015-09-14 2017-03-16 Xerox Corporation System and method for classification of microblog posts based on identification of topics
CN106055543A (en) * 2016-05-23 2016-10-26 南京大学 Spark-based training method of large-scale phrase translation model
US20180034840A1 (en) * 2016-07-29 2018-02-01 Accenture Global Solutions Limited Network security analysis system
CN109558482A (en) * 2018-07-27 2019-04-02 中山大学 A kind of parallel method of the text cluster model PW-LDA based on Spark frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
EASTON: "WLDA-Spark与MPI碰撞的火花" ["WLDA: the spark from the collision of Spark and MPI"], pages 1-4, Retrieved from the Internet <URL: https://www.codercto.com/a/52166.html> *

Similar Documents

Publication Publication Date Title
US11762697B2 (en) Method and apparatus for scheduling resource for deep learning framework
TWI803663B (en) A computing device and computing method
US10547682B2 (en) Dynamically scaling application components using microservices
JP2022511716A (en) Decentralized deep learning
CN105786405B (en) A kind of online upgrading method, apparatus and system
CN107590001A (en) Load-balancing method and device, storage medium, electronic equipment
CN106503791A (en) System and method for the deployment of effective neutral net
CN111105006B (en) Deep learning network training system and method
CN102834806B (en) System architecture management equipment, system architecture management method and program
JP7454529B2 (en) Distributed model training device and method, electronic device, storage medium, and computer program
CN111552550A (en) Task scheduling method, device and medium based on GPU (graphics processing Unit) resources
CN111143039B (en) Scheduling method and device of virtual machine and computer storage medium
CN109754072B (en) Processing method of network offline model, artificial intelligence processing device and related products
JP2023036774A (en) Access control method of shared memory, access control device of shared memory, electronic apparatus, and autonomous vehicle
CN102622348A (en) Method and device enabling plurality of windows to perform analytical display on network page simultaneously
CN108369538A (en) Download vision assets
CN111427665A (en) Quantum application cloud platform and quantum computing task processing method
CN105324795A (en) Coalescing graphics operations
CN111176800A (en) Training method and device of document theme generation model
CN111899149A (en) Image processing method and device based on operator fusion and storage medium
CN114995770B (en) Data processing method, device, equipment, system and readable storage medium
CN115378937B (en) Distributed concurrency method, device, equipment and readable storage medium for tasks
CN115361382A (en) Data processing method, device, equipment and storage medium based on data group
CN112734005B (en) Method and device for determining prediction model, electronic equipment and storage medium
CN110325980A (en) The expansion technique of user interface rear end cluster for the application of database binding type

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination