CN112069807A - Text data theme extraction method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN112069807A (application CN202011251689.7A)
- Authority
- CN
- China
- Prior art keywords
- text data
- topic
- theme
- training
- text
- Prior art date
- 2020-11-11
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/08—Insurance
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Finance (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Accounting & Taxation (AREA)
- Economics (AREA)
- General Business, Economics & Management (AREA)
- Technology Law (AREA)
- Strategic Management (AREA)
- Marketing (AREA)
- Development Economics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application belongs to the technical field of big data and relates to a method, an apparatus, computer equipment and a storage medium for extracting the theme of text data. The method comprises: performing a word segmentation operation on the text data; and inputting all the participles (segmented words) into an LDA model, obtained by semi-supervised training based on Gibbs sampling, for theme extraction. The extraction process comprises: randomly assigning a theme to each participle; taking each participle in turn as the first current word and calculating, from all participles other than the first current word, the probability that each randomly assigned theme generates the first current word; taking the theme with the highest probability of generating the first current word as the new theme of the first current word; and repeating the step of updating the themes of all participles until the theme distribution is stable, then outputting the themes of the text data. The application also relates to blockchain technology: the private information in the text data can be stored in a blockchain. The themes output by the method reflect the text content more objectively, and occupy little space in storage.
Description
Technical Field
The present application relates to the field of big data technologies, and in particular, to a method and an apparatus for extracting a topic of text data, a computer device, and a storage medium.
Background
In the era of big data, data mainly comprises structured data and unstructured data. Traditional structured data, such as the time, place and user information stored in a relational database, is relatively convenient to process and has mature processing methods; unstructured data, such as work logs, is more complex to process than structured data.
Text data such as logs is often huge in volume. Reading logs one by one manually wastes a large amount of human resources, and human subjectivity easily introduces errors in understanding, so the value of the data cannot be fully exploited; moreover, storing and managing the data itself consumes substantial computer resources. How to process text data so that it is convenient to read while saving storage space has therefore become an urgent problem to be solved.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method and an apparatus for extracting a topic of text data, a computer device, and a storage medium, so as to solve the problem of how to process text data to make the text data convenient for reading and save storage space.
In order to solve the above technical problem, an embodiment of the present application provides a method for extracting a topic of text data, which adopts the following technical solution:
a method for extracting a theme of text data includes the following steps:
acquiring text data, and performing word segmentation operation on the text data;
inputting all the participles into a preset LDA model for theme extraction, wherein the preset LDA model is obtained by training in a semi-supervised mode based on Gibbs sampling;
wherein, the theme extraction of the text data through the preset LDA model comprises:
randomly assigning a theme to each participle of the text data; sequentially taking each participle as a first current word, and calculating the probability of each randomly distributed theme of the text data generating the first current word in the text data according to other participles except the first current word in the text data; taking the topic with the highest probability of generating the first current word in the text data as a new topic of the first current word so as to update the topics of all the participles; and repeatedly executing the step of updating the topics of all the participles based on the new topic of each participle until the topic distribution of all the participles is stable, and outputting the topic corresponding to each participle as the topic of the text data when the topic distribution is stable.
In order to solve the above technical problem, an embodiment of the present application further provides a topic extraction apparatus for text data, which adopts the following technical solution:
a subject extraction apparatus of text data, comprising:
the word segmentation module is used for acquiring text data and performing word segmentation operation on the text data;
the topic extraction module is used for inputting all the participles into a preset LDA model for topic extraction, and the preset LDA model is obtained by training in a semi-supervised mode based on Gibbs sampling;
the theme extraction module is specifically configured to, when performing theme extraction on the text data through the preset LDA model: randomly assigning a theme to each participle of the text data; sequentially taking each participle as a first current word, and calculating the probability of each randomly distributed theme of the text data generating the first current word in the text data according to other participles except the first current word in the text data; taking the topic with the highest probability of generating the first current word in the text data as a new topic of the first current word so as to update the topics of all the participles; and repeatedly executing the step of updating the topics of all the participles based on the new topic of each participle until the topic distribution of all the participles is stable, and outputting the topic corresponding to each participle as the topic of the text data when the topic distribution is stable.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solution:
a computer device, comprising a memory in which computer readable instructions are stored and a processor which, when executing the computer readable instructions, implements the steps of the method for topic extraction of text data as described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solution:
a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of a method of topic extraction of textual data as described above.
Compared with the prior art, the method, the device, the computer equipment and the storage medium for extracting the theme of the text data provided by the embodiment of the application have the following beneficial effects:
the LDA model is obtained by semi-supervised training based on Gibbs sampling. Unlike the traditional unsupervised LDA training method, the semi-supervised method uses a small number of pre-labeled samples to determine the number and meaning of the topics, which facilitates training of the model. The topics of text data are mined automatically by the trained LDA model, and the quantified results of topic mining reflect the text content objectively and conveniently, improving the efficiency of reading the text data; meanwhile, the mined topics occupy less space in storage, which helps to save computer resources.
Drawings
In order to illustrate the embodiments of the present application more clearly, a brief description of the drawings required for describing the embodiments is given below. The drawings in the following description correspond to some embodiments of the present application; other drawings may be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for topic extraction of textual data according to the present application;
fig. 3 is a schematic diagram of word distribution of different topics of a log text obtained by the topic extraction method for text data of the present application;
fig. 4 is a schematic diagram of topic distribution of a section of log text obtained by a topic extraction method for text data according to the present application;
FIG. 5 is a schematic diagram of a time-based topic distribution obtained by a topic extraction method for text data according to the present application;
fig. 6 is a schematic structural diagram of an embodiment of a text data subject extraction apparatus according to the present application;
fig. 7 is a schematic structural diagram of another embodiment of a text data subject extraction apparatus according to the present application;
FIG. 8 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and in the claims of the present application or in the drawings described above, are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the embodiments of the present application better understood by those skilled in the art, the technical embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, mpeg compression standard Audio Layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, mpeg compression standard Audio Layer 4), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the method for extracting the theme of the text data provided in the embodiment of the present application is generally executed by a server, and accordingly, the device for extracting the theme of the text data is generally disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continuing reference to FIG. 2, a flow diagram of one embodiment of a method for topic extraction of textual data according to the present application is shown. The method for extracting the theme of the text data comprises the following steps:
s201, acquiring text data, and performing word segmentation operation on the text data;
s202, inputting all the participles into a preset LDA model for theme extraction, wherein the preset LDA model is obtained by training in a semi-supervised mode based on Gibbs sampling;
wherein, the theme extraction of the text data through the preset LDA model comprises:
randomly assigning a theme to each participle of the text data; sequentially taking each participle as a first current word, and calculating the probability of each randomly distributed theme of the text data generating the first current word in the text data according to other participles except the first current word in the text data; taking the topic with the highest probability of generating the first current word in the text data as a new topic of the first current word so as to update the topics of all the participles; and repeatedly executing the step of updating the topics of all the participles based on the new topic of each participle until the topic distribution of all the participles is stable, and outputting the topic corresponding to each participle as the topic of the text data when the topic distribution is stable.
The above steps are explained in the following.
For step S201, in this embodiment, the text data may be read from a database or obtained online in real time, for example a work log submitted by a company employee in real time. The word segmentation operation on the text data may use an existing word segmentation model or tool, such as the jieba segmenter. After segmentation is completed, this embodiment further includes deleting participles (segmented words) irrelevant to the themes to be extracted, such as "today" and "then", while retaining words relevant to a specific business; for example, if the name of a certain product is "million anyhow", that name must not be split into the two words "million" and "anyhow".
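A minimal sketch of this segmentation step is given below for illustration only; it assumes the jieba segmenter, and the stop-word set, the user-dictionary file name and the sample sentence are hypothetical, not part of the claimed method.

```python
import jieba

# Hypothetical resources: a small stop-word set of topic-irrelevant words
# ("today", "then", ...) and a user dictionary listing protected business
# terms (e.g. product names) that must not be split apart.
STOP_WORDS = {"今天", "然后"}
jieba.load_userdict("user_dict.txt")  # one protected term per line

def segment(text: str) -> list[str]:
    """Segment raw text and drop participles irrelevant to theme extraction."""
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]

tokens = segment("今天我向客户推广了重疾险，然后整理了保单。")  # sample log sentence
```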
In this embodiment, text theme extraction is mainly performed on unstructured data, and when the unstructured data is acquired, the method further includes determining whether the acquired unstructured data is text data, and if the unstructured data is non-text data, converting the unstructured data into text data, for example, converting audio and video data into text data in a voice recognition manner, and then performing word segmentation.
For step S202, the purpose of taking each participle in turn as the first current word is that the randomly assigned topic of a participle may be inaccurate; the topic of the participle serving as the first current word needs to be re-confirmed based on the randomly assigned topics of the other participles, so that each participle falls under the correct topic as far as possible.
In this embodiment, the LDA model is trained in a semi-supervised manner based on Gibbs sampling; specifically, a small portion of samples with pre-labeled topics is included in the training set. The process of training and obtaining the preset LDA model in a semi-supervised manner based on Gibbs sampling comprises the following steps:
acquiring a training text set, segmenting the training texts in the training text set, and randomly allocating a theme to each segmentation of each training text, wherein the randomly allocated theme comprises a pre-marked theme; sequentially taking each participle of each training text as a second current word, calculating the theme distribution of each training text according to other participles except the second current word in each training text, calculating the probability of each randomly-distributed theme of each training text for generating the second current word in each training text, and taking the theme with the highest probability of generating the second current word in each training text as a new theme of the second current word so as to update the themes of all the participles in each training text; and repeating the step of updating the topics of all the participles in each training text based on the new topic of each participle in each training text until the topic distribution of all the participles in each training text is stable, outputting the topics corresponding to all the participles of each training text as the topics of each training text when the topic distribution is stable, and outputting the word distribution corresponding to each topic.
In this embodiment, the probability that each randomly assigned theme of a training text generates the second current word is calculated with the following formula:

$$P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{k,\neg i}^{(w_i)} + \beta_{w_i}}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)} + \beta_{v}\right)} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha_{k}}{\sum_{j=1}^{K}\left(n_{m,\neg i}^{(j)} + \alpha_{j}\right)} \qquad (1)$$

In formula (1), $\mathbf{z}_{\neg i}$ represents the theme attribution of the other participles after the second current word $w_i$ is removed, and the formula gives the conditional probability, based on those attributions, that $w_i$ belongs to topic $k$. Since the classical latent Dirichlet allocation (LDA) model is unsupervised, formula (1) differs from the classical LDA model in the themes over which $k$ may range. The training texts in this embodiment include pre-labeled training texts and unlabeled training texts: for labeled samples, only the topics present in the labels are taken, while for unlabeled samples the topics are completely random in the initial state. As training progresses, the topic distributions of the unlabeled and labeled samples eventually stabilize, and the topic distribution of the unlabeled samples comes to contain only the topics present in the labeled samples.

The latter term of formula (1) calculates the topic distribution of the training text, i.e. it is calculated by the following formula:

$$\frac{n_{m,\neg i}^{(j)} + \alpha_{j}}{\sum_{j'=1}^{K}\left(n_{m,\neg i}^{(j')} + \alpha_{j'}\right)} \qquad (2)$$

wherein $n_{m,\neg i}^{(j)}$ represents the number of words attributed to topic $j$ by the training text $m$ (the current word excluded) and $\alpha_{j}$ is the Dirichlet distribution parameter corresponding to topic $j$.

The former term of formula (1) calculates the probability that topic $k$ generates the second current word $w_i$, i.e. it is calculated by the following formula:

$$\frac{n_{k,\neg i}^{(w_i)} + \beta_{w_i}}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)} + \beta_{v}\right)} \qquad (3)$$

wherein $n_{k,\neg i}^{(w_i)}$ is the number of occurrences of participle $w_i$ attributed to topic $k$ (the current word excluded) and $\beta_{w_i}$ is the Dirichlet distribution parameter corresponding to participle $w_i$.
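For illustration, the semi-supervised collapsed Gibbs update of formula (1) might be sketched as follows; the data layout (count matrices, a `labeled_topics` list) and all names are assumptions, not the patent's reference implementation. As the description states, the highest-probability topic is taken for each word rather than sampled.

```python
import numpy as np

def gibbs_pass(docs, z, n_dk, n_kw, n_k, alpha, beta, labeled_topics):
    """One sweep of the semi-supervised collapsed Gibbs update of formula (1).

    docs: list of token-id lists; z: parallel topic assignments;
    n_dk: doc-topic counts (D x K); n_kw: topic-word counts (K x V);
    n_k: per-topic totals (K); alpha: scalar or length-K array;
    beta: length-V array; labeled_topics[d]: allowed topic set for a
    labeled document d, or None for an unlabeled one.
    """
    K = n_kw.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # remove the current word from all counts (the "neg i" statistics)
            n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
            # formula (1): word-likelihood term (3) times doc-topic term (2);
            # the doc-topic denominator is constant over k and can be dropped
            p = (n_kw[:, w] + beta[w]) / (n_k + beta.sum()) * (n_dk[d] + alpha)
            # semi-supervised constraint: labeled docs keep labeled topics only
            if labeled_topics[d] is not None:
                mask = np.zeros(K)
                mask[list(labeled_topics[d])] = 1.0
                p = p * mask
            # the description takes the highest-probability topic, not a sample
            k_new = int(np.argmax(p))
            z[d][i] = k_new
            n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
```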
In the present embodiment, $\alpha$ and $\beta$ are hyper-parameters. The process of training and obtaining the preset LDA model in a semi-supervised manner based on Gibbs sampling further comprises: automatically generating or acquiring multiple preset groups of hyper-parameters, performing model training based on each group of hyper-parameters, comparing the multiple corresponding training results, and selecting from the multiple groups, according to the comparison result, the group of hyper-parameters that gives the preset LDA model the highest extraction accuracy.
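A sketch of this screening over hyper-parameter groups is shown below; the candidate values and the helpers `corpus`, `train_lda` and `accuracy` (which would run the semi-supervised training sketched above and score extraction against held-out pre-labeled samples) are hypothetical.

```python
from itertools import product

# Assumed candidate grid; train_lda and accuracy are hypothetical helpers.
alphas = [0.05, 0.1, 0.5, 1.0]
betas = [0.01, 0.05, 0.1]

results = {
    (a, b): accuracy(train_lda(corpus, alpha=a, beta=b))
    for a, b in product(alphas, betas)
}
best_alpha, best_beta = max(results, key=results.get)
```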
Taking insurance-industry work logs as the training texts, the output of LDA model training is illustrated with reference to Table 1 below. The training output contains all the topics present in the labeled samples, and training also determines the word distribution of each topic; Table 1 shows only part of the topics and part of the participles belonging to them.
Compared with the unsupervised LDA model, the semi-supervised LDA model adopted in this embodiment has the following advantages: the number of topics can be determined from the labeled samples, and the meaning of the topics is fixed in advance, avoiding the arbitrariness of the topic count in unsupervised LDA and the abstractness of each topic's word distribution after training. Meanwhile, adding a small number of labeled samples constrains the training of the model, reduces uncertainty in the training process, and makes an accurate final model easier to obtain. In addition, the labeled samples provide an intuitive and effective means of model evaluation, which benefits quality monitoring and iterative adjustment after the model goes online.
Further, after the preset LDA model is obtained, the method further includes performing iterative training on it. The iterative training includes: receiving text data uploaded online by a user side in real time; after receiving the text data, displaying the topics contained in the preset LDA model to the user side for the user to select; and iteratively training the preset LDA model according to the topics selected by the user and the uploaded text data. The iterative training process is the same as the semi-supervised Gibbs-sampling training process above, and yields a new LDA model. With the scheme of this embodiment, labeled samples can be expanded through online interaction. Taking an insurance agent submitting work logs as an example: after the agent finishes writing a work log, a selection interface pops up listing the existing topics of the trained LDA model; the agent selects the topics contained in the log and submits, forming labeled text data that can be used both to monitor the accuracy of the LDA model and for its iterative updating, as shown in the sketch below. Thus only a small number of samples need to be labeled initially, and further labeled samples are obtained by online interaction with users, reducing the labeling workload while improving the accuracy of the model.
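One hypothetical shape for this feedback loop is sketched here; `segment` and `train_lda` refer to the earlier sketches, and the retraining cadence, function name and parameters are assumptions for illustration only.

```python
RETRAIN_EVERY = 100  # assumed retraining cadence (number of new labeled logs)

def on_log_submitted(log_text, selected_topics, labeled_set, corpus, model):
    """Hypothetical online feedback hook: the agent's topic selections turn
    the submitted log into a labeled sample for monitoring and retraining."""
    labeled_set.append((segment(log_text), set(selected_topics)))
    if len(labeled_set) % RETRAIN_EVERY == 0:
        # the same semi-supervised Gibbs training as above, on the larger set
        model = train_lda(corpus, labeled=labeled_set)
    return model
```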
In this embodiment, the extraction of themes from text data by the preset LDA model is also performed by Gibbs sampling. Compared with the semi-supervised training process, the difference is that the word distribution of each theme is already fixed when themes are extracted with the trained LDA model. During Gibbs-sampling extraction, the probability of each theme in the text data is calculated from all participles other than the first current word based on formula (2), and the conditional probability that different themes generate the first current word in the current text is obtained by combining this with the fixed theme-word distribution probabilities. The theme of the first current word is replaced based on the obtained probabilities to give its new theme, thereby updating the theme corresponding to each participle. Once the theme distribution of the text data is stable, the theme corresponding to each participle at that moment is a theme of the text data.
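One way this extraction stage could look in code is sketched below, with the per-theme word distributions held fixed; the function name, the `phi` array layout, the scalar `alpha` and the iteration count are assumptions.

```python
import numpy as np

def extract_themes(doc, phi, alpha=0.1, n_iters=50):
    """Theme extraction for one new document with the per-theme word
    distributions phi (K x V, learned in training) held fixed."""
    K = phi.shape[0]
    z = np.random.randint(K, size=len(doc))  # randomly assign a theme per word
    n_k = np.bincount(z, minlength=K)        # the document's theme counts
    for _ in range(n_iters):                 # repeat until the distribution stabilizes
        for i, w in enumerate(doc):
            n_k[z[i]] -= 1
            # fixed word distribution times the doc-theme term of formula (2)
            p = phi[:, w] * (n_k + alpha)
            z[i] = int(np.argmax(p))
            n_k[z[i]] += 1
    return z, (n_k + alpha) / (n_k + alpha).sum()
```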
Further, after themes are extracted from the text data through the preset LDA model, the method further includes: determining, according to the theme extraction result of the preset LDA model, the distribution proportion of the participles corresponding to each theme in the text data and their distribution positions in the text data, and generating corresponding display graphs from the distribution positions and proportions, which are sent to a target position for display. Specifically, in this embodiment, while the themes of the text data are output, the word distribution and theme distribution of each theme may be displayed on the target terminal. Taking the work log of an insurance agent as an example, fig. 3 shows the word distribution of different themes of a log text, with words belonging to different themes marked in different colors, which is the result of the LDA model automatically extracting themes from the text data; fig. 4 shows the theme distribution after themes are extracted from a log text. Through the word-distribution and theme-distribution charts, the log content of the insurance agent can be understood intuitively and accurately without reading the log text directly.
Further, in this embodiment, the method also includes: determining the data source of each piece of text data, extracting themes separately for text data from different sources, performing statistical comparative analysis on the quantified data of the extracted themes across multiple dimensions, and performing corresponding data operations on the data source of each piece of text data based on the analysis results. The analysis dimension may be time, region and the like, and the data operation may be message pushing and the like. Taking the insurance agent scenario as an example: after the themes of agents' work logs are quantitatively mined by the LDA model provided in this embodiment, the themes of an agent's work logs over a period of time can be statistically analyzed to learn each agent's overall work situation in that period, and the evolution of an agent's log themes with length of service can also be analyzed. For example, all themes may be classified into sales (including promoting credit cards, promoting insurance, etc.), membership, service (guarantee inspection, disease detection, etc.) and others; as shown in fig. 5, the change of an agent's log theme distribution by month is presented in the form of a report, which facilitates monitoring and management of the agents' work.
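As an assumed illustration of such multi-dimensional statistics, the quantified theme output could be aggregated with pandas as follows; the sample rows and column names are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical quantified output of theme extraction: one row per
# (agent, month, theme) holding that theme's proportion in the agent's logs.
records = [
    ("agent_001", "2020-09", "sales", 0.42),
    ("agent_001", "2020-09", "service", 0.31),
    ("agent_001", "2020-10", "sales", 0.55),
]
df = pd.DataFrame(records, columns=["agent", "month", "theme", "proportion"])

# Per-agent monthly theme distribution, suitable for a report like fig. 5.
monthly = (
    df.groupby(["agent", "month", "theme"])["proportion"]
      .mean()
      .unstack("theme", fill_value=0.0)
)
```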
The theme extraction method for text data provided above obtains an LDA model by semi-supervised training based on Gibbs sampling. Unlike the traditional unsupervised LDA training method, the semi-supervised method uses a small number of pre-labeled samples to determine the number and meaning of the topics, which facilitates model training. The themes of text data are mined automatically by the trained LDA model; the quantified theme-mining results reflect the text content objectively and conveniently, improving the efficiency of reading text data, and multi-dimensional statistical analysis of those results allows the value of the text data to be fully exploited. Meanwhile, the mined themes occupy less space in storage, helping to save computer resources. In addition, after the model goes online, new labeled samples can be obtained through online interaction with users and used for monitoring and iterative updating of the model.
It is emphasized that, to further ensure the privacy and security of the information, the private information related to the text data in the above embodiments may also be stored in a node of a blockchain. The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by computer readable instructions instructing the relevant hardware. The instructions can be stored in a computer readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and not necessarily in sequence; they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
Referring to fig. 6, as an implementation of the method for extracting the theme of the text data shown in fig. 2, the present application provides an embodiment of a device for extracting the theme of the text data, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device may be applied to various electronic devices.
As shown in fig. 6, the theme extraction device for text data according to this embodiment includes a word segmentation module 601 and a theme extraction module 602. The word segmentation module 601 is configured to obtain text data and perform a word segmentation operation on it. The text data may be read from a database or obtained online in real time, for example a work log submitted by a company employee in real time. The word segmentation operation may use an existing word segmentation model or tool, such as the jieba segmenter. After segmentation is completed, this embodiment further includes deleting participles irrelevant to the themes to be extracted, such as "today" and "then", while retaining words relevant to a specific business; for example, if the name of a certain product is "million anyhow", that name must not be split into the two words "million" and "anyhow". In this embodiment, text theme extraction is mainly performed on unstructured data. When acquiring the unstructured data, the method further includes determining whether it is text data and, if it is non-text data, converting it into text data, for example converting audio and video data into text by speech recognition, before performing word segmentation.
The topic extraction module 602 is configured to input all the participles into a preset LDA model for topic extraction, where the preset LDA model is obtained by training in a semi-supervised manner based on Gibbs sampling.
With reference to fig. 7, the topic extraction module includes a topic assignment unit 6021, a calculation unit 6022, an update unit 6023, and a topic output unit 6024, wherein the topic assignment unit 6021 is configured to randomly assign a topic to each participle of the text data; the calculating unit 6022 is configured to take each participle as a first current word in sequence, and calculate, according to other participles except the first current word in the text data, a probability that each randomly allocated topic of the text data generates the first current word in the text data; the updating unit 6023 is configured to take a topic with the highest probability of generating the first current word in the text data as a new topic of the first current word to update topics of all participles, and repeatedly perform updating of topics of all participles based on the new topic of each participle until topic distribution of all participles is stable, and the topic output unit 6024 is configured to output a topic corresponding to each participle as a topic of the text data when topic distribution is stable.
In this embodiment, the purpose of taking each participle in turn as the first current word is that the randomly assigned topic of a participle may be inaccurate; the topic of the participle serving as the first current word needs to be re-confirmed based on the randomly assigned topics of the other participles, so that each participle belongs to the correct topic as far as possible.
In this embodiment, the LDA model is trained based on Gibbs sampling in a semi-supervised manner, specifically, a sample of a small portion of pre-labeled topics is extracted for training. Specifically, as shown in fig. 7, the topic extraction apparatus for text data further includes a model training module 603, where the model training module 603 is configured to obtain the preset LDA model by training in a semi-supervised manner based on Gibbs sampling, and specifically is configured to: acquiring a training text set, segmenting the training texts in the training text set, and randomly allocating a theme to each segmentation of each training text, wherein the randomly allocated theme comprises a pre-marked theme; sequentially taking each participle of each training text as a second current word, calculating the theme distribution of each training text according to other participles except the second current word in each training text, calculating the probability of each randomly-distributed theme of each training text for generating the second current word in each training text, and taking the theme with the highest probability of generating the second current word in each training text as a new theme of the second current word so as to update the themes of all the participles in each training text; repeatedly updating the topics of all the participles in each training text based on the new topic of each participle in each training text until the topic distribution of all the participles in each training text is stable, outputting the topics corresponding to all the participles of each training text as the topics of each training text when the topic distribution is stable, and outputting the word distribution corresponding to each topic. In this embodiment, the specific process of calculating the probability that the randomly assigned topic of each training text generates the second current word in each training text may refer to the method embodiment described above, and is not expanded herein.
Further, after the preset LDA model is obtained, the model training module 603 is further configured to perform iterative training on the preset LDA model. The iterative training includes: receiving text data uploaded online by a user side in real time; after receiving the text data, displaying the topics contained in the preset LDA model to the user side for the user to select; and iteratively training the preset LDA model according to the topics selected by the user and the text data uploaded by the user side. Reference may be made to the method embodiments described above; the details are not expanded here.
In this embodiment, the process by which the theme extraction module 602 extracts themes from text data through the preset LDA model is also performed by Gibbs sampling. Compared with the semi-supervised training process performed by the model training module 603, the difference is that the word distribution of each theme is already fixed when the theme extraction module 602 extracts themes with the trained LDA model: only the probability of each theme in the text data needs to be calculated from all participles other than the first current word, and the conditional probability that different themes generate the first current word in the current text is then obtained by combining this with the fixed theme-word distribution probabilities. The theme of the first current word is replaced based on the obtained probabilities to give its new theme, thereby updating the theme corresponding to each participle. Once the theme distribution of the text data is stable, the theme corresponding to each participle at that moment is a theme of the text data.
Further, the theme extraction module 602 is further configured, after themes have been extracted from the text data through the preset LDA model, to determine, according to the theme extraction result, the distribution proportion of the participles corresponding to each theme in the text data and their distribution positions in the text data, and to generate corresponding display graphs from the distribution positions and proportions for sending to a target position for display. Reference may be made to the method embodiments described above; the details are not expanded here.
Further, the topic extraction module 602 is further configured to determine a data source of each text data, perform topic extraction on the text data from different sources, perform statistical comparison analysis on quantized data of extracted topics in multiple dimensions, and perform corresponding data operation on the data source of each text data based on a comparison analysis result. The analysis latitude may be time, area, etc., and the data operation may be message pushing, etc., which may refer to the above method embodiments and is not expanded herein.
The theme extraction device for text data provided by this application obtains an LDA model by semi-supervised training based on Gibbs sampling. Unlike the traditional unsupervised LDA training method, the semi-supervised method uses a small number of pre-labeled samples to determine the number and meaning of the topics, which facilitates model training. The themes of text data are mined automatically by the trained LDA model; the quantified theme-mining results reflect the text content objectively and conveniently, improving the efficiency of reading text data, and multi-dimensional statistical analysis of those results allows the value of the text data to be fully exploited. Meanwhile, the mined themes occupy less space in storage, helping to save computer resources. In addition, after the model goes online, new labeled samples can be obtained through online interaction with users and used for monitoring and iterative updating of the model.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 8, fig. 8 is a block diagram of a basic structure of a computer device according to the present embodiment. The computer device 8 comprises a memory 81, a processor 82 and a network interface 83 which are mutually connected in a communication manner through a system bus, wherein the memory 81 stores computer readable instructions, and the processor 82 implements the steps of the text data theme extracting method in the above method embodiment when executing the computer readable instructions, and has the beneficial effects corresponding to the text data theme extracting method, which are not expanded herein.
It is noted that only the computer device 8 having the memory 81, the processor 82 and the network interface 83 is shown, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
In the present embodiment, the memory 81 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the storage 81 may be an internal storage unit of the computer device 8, such as a hard disk or a memory of the computer device 8. In other embodiments, the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 8. Of course, the memory 81 may also comprise both an internal storage unit of the computer device 8 and an external storage device thereof. In this embodiment, the memory 81 is generally used for storing an operating system installed in the computer device 8 and various types of application software, such as computer readable instructions corresponding to the above-mentioned theme extraction method of text data. Further, the memory 81 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 82 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 82 is typically used to control the overall operation of the computer device 8. In this embodiment, the processor 82 is configured to execute computer readable instructions stored in the memory 81 or process data, for example, execute computer readable instructions corresponding to a theme extraction method of the text data.
The network interface 83 may comprise a wireless network interface or a wired network interface, and the network interface 83 is generally used for establishing communication connections between the computer device 8 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer-readable storage medium, wherein the computer-readable storage medium stores computer-readable instructions, which are executable by at least one processor, so as to cause the at least one processor to perform the steps of the above-mentioned method for extracting a subject matter of text data, and have the corresponding advantages, which are not expanded herein.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present application may, in essence or in the part contributing to the prior art, be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and including instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents substituted for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the scope of protection of the present application.
Claims (10)
1. A method for extracting a theme of text data, comprising the steps of:
acquiring text data, and performing word segmentation operation on the text data;
inputting all the participles into a preset LDA model for theme extraction, wherein the preset LDA model is obtained by training in a semi-supervised mode based on Gibbs sampling;
wherein, the theme extraction of the text data through the preset LDA model comprises:
randomly assigning a theme to each participle of the text data; sequentially taking each participle as a first current word, and calculating the probability of each randomly distributed theme of the text data generating the first current word in the text data according to other participles except the first current word in the text data; taking the topic with the highest probability of generating the first current word in the text data as a new topic of the first current word so as to update the topics of all the participles; and repeatedly executing the step of updating the topics of all the participles based on the new topic of each participle until the topic distribution of all the participles is stable, and outputting the topic corresponding to each participle as the topic of the text data when the topic distribution is stable.
2. The method for extracting subject matter of text data according to claim 1, wherein the process of training and obtaining the preset LDA model in a semi-supervised manner based on Gibbs sampling comprises:
acquiring a training text set, segmenting the training texts in the training text set, and randomly allocating a theme to each segmentation of each training text, wherein the randomly allocated theme comprises a pre-marked theme;
sequentially taking each participle of each training text as a second current word, calculating the theme distribution of each training text according to other participles except the second current word in each training text, calculating the probability of each randomly-distributed theme of each training text for generating the second current word in each training text, and taking the theme with the highest probability of generating the second current word in each training text as a new theme of the second current word so as to update the themes of all the participles in each training text;
and repeating the step of updating the topics of all the participles in each training text based on the new topic of each participle in each training text until the topic distribution of all the participles in each training text is stable, outputting the topics corresponding to all the participles of each training text as the topics of each training text when the topic distribution is stable, and outputting the word distribution corresponding to each topic.
3. The method of claim 2, wherein calculating the probability that each randomly allocated topic of each training text generates the second current word in that training text specifically adopts the following calculation formula:

$$P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{k,\neg i}^{(w_i)} + \beta_{w_i}}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)} + \beta_{v}\right)} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha_{k}}{\sum_{j=1}^{K}\left(n_{m,\neg i}^{(j)} + \alpha_{j}\right)}$$

wherein $\mathbf{z}_{\neg i}$ represents the topic affiliation of the other participles after the second current word $w_i$ is removed; for labeled training texts, $k$ ranges only over the labeled topics; $n_{m,\neg i}^{(j)}$ represents the number of words attributed to topic $j$ by the training text $m$; $\alpha_{j}$ is the Dirichlet distribution parameter corresponding to topic $j$; $n_{k,\neg i}^{(w_i)}$ is the number of occurrences of participle $w_i$ attributed to topic $k$; and $\beta_{w_i}$ is the Dirichlet distribution parameter corresponding to participle $w_i$.
4. The method of claim 3, wherein $\alpha$ and $\beta$ are hyper-parameters, and wherein the process of training and obtaining the preset LDA model in a semi-supervised manner based on Gibbs sampling further comprises:
automatically generating or acquiring multiple preset groups of hyper-parameters, performing model training based on each group of hyper-parameters, comparing multiple corresponding obtained model training results, and screening a group of hyper-parameters which enable the highest extraction accuracy of the preset LDA model from the multiple groups of hyper-parameters according to the comparison result.
5. The method of any of claims 1 to 4, wherein after obtaining the preset LDA model, the method further comprises iteratively training the preset LDA model, wherein the iteratively training comprises:
receiving text data uploaded by a user side on line in real time, and displaying a theme contained in the preset LDA model to the user side after receiving the text data uploaded by the user side so as to be selected by a user of the user side; and performing iterative training on the preset LDA model according to the theme selected by the user at the user side and the text data uploaded by the user side.
6. The method of claim 1, wherein after the theme extracting the text data by the preset LDA model, the method further comprises:
and determining the distribution proportion of the participles corresponding to each topic in the text data and the distribution positions in the text data according to the topic extraction result of the preset LDA model, and generating corresponding display graphs by the distribution positions and the distribution proportion so as to send the display graphs to a target position for displaying.
7. The method of claim 1, wherein the method further comprises:
determining a data source of each text data, respectively extracting topics of the text data from different sources, performing statistical comparison analysis on quantized data of the extracted topics in multiple dimensions, and performing corresponding data operation on the data source of each text data based on a comparison analysis result.
8. A subject extraction apparatus for text data, comprising:
the word segmentation module is used for acquiring text data and performing word segmentation operation on the text data;
the topic extraction module is used for inputting all the participles into a preset LDA model for topic extraction, and the preset LDA model is obtained by training in a semi-supervised mode based on Gibbs sampling;
the theme extraction module is specifically configured to, when performing theme extraction on the text data through the preset LDA model: randomly assigning a theme to each participle of the text data; sequentially taking each participle as a first current word, and calculating the probability of each randomly distributed theme of the text data generating the first current word in the text data according to other participles except the first current word in the text data; taking the topic with the highest probability of generating the first current word in the text data as a new topic of the first current word so as to update the topics of all the participles; and repeatedly executing the step of updating the topics of all the participles based on the new topic of each participle until the topic distribution of all the participles is stable, and outputting the topic corresponding to each participle as the topic of the text data when the topic distribution is stable.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor which, when executing the computer readable instructions, implements the steps of the method of topic extraction of text data as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon computer-readable instructions which, when executed by a processor, implement the steps of the method for topic extraction of text data as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011251689.7A CN112069807A (en) | 2020-11-11 | 2020-11-11 | Text data theme extraction method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011251689.7A CN112069807A (en) | 2020-11-11 | 2020-11-11 | Text data theme extraction method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112069807A true CN112069807A (en) | 2020-12-11 |
Family
ID=73655049
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011251689.7A Pending CN112069807A (en) | 2020-11-11 | 2020-11-11 | Text data theme extraction method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112069807A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112836507A (en) * | 2021-01-13 | 2021-05-25 | 哈尔滨工程大学 | A method for topic extraction of domain text |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN108280180A (en) * | 2018-01-23 | 2018-07-13 | 北京航空航天大学 | Semi-supervised Hash algorithm based on topic model |
CN111611432A (en) * | 2020-05-29 | 2020-09-01 | 北京酷我科技有限公司 | Singer classification method based on Labeled LDA model |
- 2020-11-11: CN CN202011251689.7A, patent CN112069807A (en), active, Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN108280180A (en) * | 2018-01-23 | 2018-07-13 | 北京航空航天大学 | Semi-supervised Hash algorithm based on topic model |
CN111611432A (en) * | 2020-05-29 | 2020-09-01 | 北京酷我科技有限公司 | Singer classification method based on Labeled LDA model |
Non-Patent Citations (3)
Title |
---|
Wang Jingru et al.: "A Comparative Study on Text Topic Extraction Based on Latent Dirichlet Allocation", Information Science (《情报科学》) *
Zheng Shizhuo et al.: "Applied Research on Text Classification Based on Semi-supervised LDA", Software (《软件》) *
Han Dong et al.: "A Text Classification Method Combining Semi-supervised Learning and the LDA Model", Computer Engineering and Design (《计算机工程与设计》) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112836507A (en) * | 2021-01-13 | 2021-05-25 | 哈尔滨工程大学 | A method for topic extraction of domain text |
CN112836507B (en) * | 2021-01-13 | 2022-12-09 | 哈尔滨工程大学 | A Method of Domain Text Topic Extraction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110825956A (en) | Information flow recommendation method and device, computer equipment and storage medium | |
CN112863683A (en) | Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium | |
CN112613917A (en) | Information pushing method, device and equipment based on user portrait and storage medium | |
CN112686301A (en) | Data annotation method based on cross validation and related equipment | |
CN116703515A (en) | Recommendation method and device based on artificial intelligence, computer equipment and storage medium | |
CN114398466A (en) | Complaint analysis method and device based on semantic recognition, computer equipment and medium | |
CN119201597A (en) | Log parsing method, device, computer equipment and medium based on artificial intelligence | |
CN112069807A (en) | Text data theme extraction method and device, computer equipment and storage medium | |
CN118037455A (en) | Financial data prediction method, device, equipment and storage medium thereof | |
CN112084408A (en) | List data screening method and device, computer equipment and storage medium | |
CN117493563A (en) | Session intention analysis method, device, equipment and storage medium thereof | |
CN114549053B (en) | Data analysis method, device, computer equipment and storage medium | |
CN115544282A (en) | Data processing method, device and equipment based on graph database and storage medium | |
CN116166858A (en) | Information recommendation method, device, equipment and storage medium based on artificial intelligence | |
CN115730603A (en) | Information extraction method, device, equipment and storage medium based on artificial intelligence | |
CN115826973A (en) | List page generation method and device, computer equipment and storage medium | |
CN115545753A (en) | Partner prediction method based on Bayesian algorithm and related equipment | |
CN113902032A (en) | Business data processing method, device, computer equipment and storage medium | |
CN117078406A (en) | Customer loss early warning method and device, computer equipment and storage medium | |
CN118885568A (en) | User behavior prediction method, device, computer equipment and storage medium | |
CN117271790A (en) | Method and device for expanding annotation data, computer equipment and storage medium | |
CN119250040A (en) | Artificial intelligence-based prompt text generation method, device, equipment and medium | |
CN117407750A (en) | Metadata-based data quality monitoring method, device, equipment and storage medium | |
CN117076775A (en) | Information data processing method, information data processing device, computer equipment and storage medium | |
CN115757889A (en) | Data item processing method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20201211 |
RJ01 | Rejection of invention patent application after publication | |