CN104850617A - Short text processing method and apparatus - Google Patents

Short text processing method and apparatus

Info

Publication number
CN104850617A
CN104850617A
Authority
CN
China
Prior art keywords
short text
text set
preprocessed
short
subject
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510250477.XA
Other languages
Chinese (zh)
Other versions
CN104850617B (en)
Inventor
阮星华
张文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510250477.XA priority Critical patent/CN104850617B/en
Publication of CN104850617A publication Critical patent/CN104850617A/en
Application granted granted Critical
Publication of CN104850617B publication Critical patent/CN104850617B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a short text processing method and apparatus. The method comprises: acquiring a first short text set, and preprocessing the first short text set; based on the preprocessed first short text set, executing the processing step of training a topic model LDA by using the preprocessed first short text set to obtain topic probability distribution of short texts in the first short text set; and clustering the topic probability distribution to determine a topic category of the short texts in the first short text set. According to the method and apparatus provided by the present application, the topic category of the short texts is obtained by training the topic model and further clustering the topic probability distribution, so that the purpose of accurately categorizing the short texts is achieved.

Description

Short text processing method and apparatus
Technical field
The present application relates to the field of computer technology, specifically to the field of text processing technology, and particularly to a short text processing method and apparatus.
Background art
With the rapid development of Internet technology, people increasingly publish their own viewpoints or suggestions through various network platforms. For example, users may post film or drama reviews for movie and television products on websites introducing movies and TV series, post product evaluations for purchased or used goods on online shopping platforms, offer opinions and suggestions to service or application operators through suggestion feedback channels, and publish any viewpoint of their own through social platforms such as microblogs. Since these comments, evaluations, or suggestions are mostly brief descriptions containing relatively little textual content, they can all be regarded as short text data.
Facing the massive short text data produced by the rapid development of the Internet, how to accurately categorize short texts and extract practically valuable information from them has become a problem of common concern and research in the Internet industry. In the prior art, short text data can be analyzed by the TF-IDF (Term Frequency-Inverse Document Frequency) method. However, because this method relies entirely on the frequency with which words occur in a document, while the content of short texts is generally brief so that the resulting vector matrix is sparse, the traditional TF-IDF method performs poorly, and its accuracy in distinguishing short texts is low.
Summary of the invention
In view of the above defects or deficiencies in the prior art, it is desirable to provide a scheme for classifying short texts accurately. To achieve one or more of the above objects, the present application provides a short text processing method and apparatus.
In a first aspect, the present application provides a short text processing method, comprising: acquiring a first short text set and preprocessing the first short text set; and, based on the preprocessed first short text set, performing the following processing steps: training a topic model LDA using the preprocessed first short text set to obtain the topic probability distribution of each short text in the first short text set; and clustering the topic probability distributions to determine the topic category of each short text in the first short text set.
In a second aspect, the present application provides a short text processing apparatus, comprising: a first acquisition module for acquiring a first short text set and preprocessing the first short text set; and a processing module for, based on the preprocessed first short text set, driving the following units to perform the following processing steps: a training unit for training a topic model LDA using the preprocessed first short text set to obtain the topic probability distribution of each short text in the first short text set; and a clustering unit for clustering the topic probability distributions to determine the topic category of each short text in the first short text set.
With the short text processing method and apparatus provided by the present application, the acquired first short text set is first preprocessed, the processed data is then used to train the topic model LDA so as to obtain the topic probability distribution of each short text in the set, and finally the topic probability distributions are clustered so that the topic category of each short text can be determined. By first performing topic model training to obtain the topic probability distributions, and then further clustering them, topic categories for distinguishing short text types can be obtained, thereby achieving fast and accurate classification of massive short text data.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent by reading the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is a flowchart of an embodiment of the short text processing method of the present application;
Fig. 2 is a flowchart of another embodiment of the short text processing method of the present application;
Fig. 3 is a schematic diagram of the functional module structure of an embodiment of the short text processing apparatus of the present application;
Fig. 4 is a schematic diagram of the functional module structure of another embodiment of the short text processing apparatus of the present application;
Fig. 5 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of the embodiments of the present application.
Detailed description of the embodiments
The application is described in further detail below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the drawings.
It should be noted that the embodiments in the application and the features in the embodiments may be combined with one another when no conflict arises. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Referring to Fig. 1, a flow 100 of an embodiment of the short text processing method of the present application is shown. This embodiment is mainly illustrated by applying the method to a server of a short text application platform. The short text processing method of this embodiment comprises the following steps:
As shown in Fig. 1, in step 101, a first short text set is acquired, and the first short text set is preprocessed.
In this embodiment, the server may acquire, in various wired or wireless ways, the short text information entered by users on clients. The first short text set is usually the set of short texts available when the method of this embodiment is first applied to process a certain class of short texts. For example, when the user feedback for a certain application needs to be processed, all suggestion feedback from users for this application may be taken as the first short text set. Optionally, only the feedback within a period of time (e.g., within one year) may be acquired as the first short text set, so as to remove short text data of poor timeliness. After the first short text set to be processed is acquired, it may first be preprocessed to improve the efficiency and accuracy of subsequent processing.
In an optional implementation of this embodiment, the above preprocessing comprises performing invalid-data filtering, stop-word removal, stem extraction, and numbering on each short text in the short text set. Invalid-data filtering may filter the short text data to remove invalid short text information therein, for example short texts whose length is below 3 characters or that carry the features of an attempted SQL (Structured Query Language) injection statement. Specifically, whether a short text belongs to invalid data may be judged by a decision tree. After the invalid data in the first short text set is filtered out, a conventional word segmentation method, such as one based on string matching, may be used to segment the remaining short texts into words, and stop words of low expressive value are then removed. Next, stem extraction may be performed, i.e., the common part of different words sharing the same stem is extracted, so that the influence of low-value words on the category judgment of the whole short text is further eliminated. Finally, the words occurring in the first short text set after stem extraction may be indexed, and each word is assigned a number or ID to facilitate subsequent computation.
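As an illustration, the preprocessing pipeline described above (invalid-data filtering, stop-word removal, and word numbering) might be sketched as follows. This is a simplified assumption-laden sketch, not the patented implementation: the regular-expression tokenizer, the tiny English stop-word list, and the SQL pattern stand in for the string-matching segmentation and decision-tree filtering the text mentions.

```python
# Hypothetical sketch of the preprocessing step: invalid-data filtering,
# tokenization, stop-word removal, and numbering each word with an ID.
import re

STOP_WORDS = {"the", "a", "of", "is"}                 # placeholder stop-word list
SQL_PATTERN = re.compile(r"(select|drop|insert)\s", re.IGNORECASE)

def is_invalid(text: str) -> bool:
    """Filter texts that are too short or look like SQL-injection attempts."""
    return len(text) < 3 or bool(SQL_PATTERN.search(text))

def preprocess(short_texts):
    vocab = {}      # word -> numeric ID (the "numbering" step)
    docs = []       # each short text becomes a list of word IDs
    for text in short_texts:
        if is_invalid(text):
            continue
        tokens = [t for t in re.findall(r"\w+", text.lower())
                  if t not in STOP_WORDS]
        docs.append([vocab.setdefault(t, len(vocab)) for t in tokens])
    return docs, vocab

docs, vocab = preprocess(["new version is hard to use",
                          "DROP TABLE users; --", "ok"])
print(len(docs))    # 1: the SQL-like text and the too-short "ok" are filtered out
```

The IDs produced here would serve as the numbered-word representation fed to the model training described next.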
In an optional implementation of this embodiment, stem extraction comprises subject extraction and descriptor extraction. When stem extraction is performed on a short text, subject extraction and descriptor extraction may be carried out. The subject may refer to the object the short text is about, for example the product targeted by suggestion feedback, or the movie or actor corresponding to a film review; it is usually the grammatical subject or object in the short text. Descriptors may be words describing a state, situation, or emotion, usually adjectives or verbs carrying sentiment. Since the place where a user publishes a short text is usually a specific network location, such as the suggestion feedback channel of a certain product, the user may state opinions and suggestions directly without mentioning the related product. Therefore, when stem extraction is performed, a definite subject may not be extractable. For example, if a user's feedback on a product is only the few words "the new version is hard to use", the corresponding subject cannot be extracted directly from this feedback. In that case, the subject may be determined from the source of the first short text set. For example, when the first short text set is obtained from the suggestion feedback of a search application, the subject corresponding to the short texts can be taken to be that search application.
In an optional implementation of this embodiment, when no subject is extracted, the subject may also be determined from the descriptors. Specifically, when no subject is extracted, the descriptors may be analyzed, and the subject determined from the content they describe. For example, although the first short text set is obtained from the suggestion feedback of a search application, the content of some short texts may be "subtitles cannot be matched automatically during movie playback". From the content described by such a short text, it can be determined that the corresponding subject should have a video playback function, so the feedback is likely aimed at a video application rather than the search application; the user may have reported a problem of the video application through the feedback channel of the search application. In that case, the subject of this feedback can be determined, from the descriptors, to be the video application rather than the search application. Determining the subject from descriptors can improve the accuracy of subject determination.
Next, in step 102, based on the preprocessed first short text set, the following processing steps may be performed: training the topic model LDA using the preprocessed first short text set to obtain the topic probability distribution of each short text in the first short text set; and clustering the topic probability distributions to determine the topic category of each short text in the first short text set.
Step 102 in this embodiment may comprise sub-steps 1021 and 1022, in which:
In step 1021, the topic model LDA is trained using the preprocessed first short text set to obtain the topic probability distribution of each short text in the first short text set.
After the preprocessed first short text set is obtained in step 101 above, it may further be used as a data sample to train the topic model LDA (Latent Dirichlet Allocation). LDA can classify documents and words without supervision and can predict the topic distributions of documents and words outside the training set. Unlike general machine learning classification algorithms, LDA's prediction target, the topic distribution, is a quantity that cannot be observed directly in the training set but is one posited by the model, and it is therefore called latent. Precisely because the prediction target is a hidden variable posited by the model itself, and the training set need not provide this quantity, LDA can achieve unsupervised learning.
LDA is a generative model; that is, unlike predicting directly from the observed documents, LDA first assumes a process by which documents are generated and then, from the observed documents, infers what the generative process behind them was. LDA assumes that all documents involve k topics (a topic is in fact a distribution over words): to generate a document, a topic distribution of that document is first generated, and then the set of words is generated; to generate a word, a topic is randomly chosen according to the document's topic distribution, and then a word is randomly chosen according to that topic's distribution over words. Specifically, by means of Gibbs sampling, from the observed samples, i.e., the first short text set, the topic probability distribution θ_m of each short text, the distribution φ_k of each topic over words, and the topic z_{m,n} to which each word belongs can be derived, where m, k, and n index the short texts, topics, and words respectively. Once all θ_m, φ_k, and z_{m,n} are determined, this is equivalent to obtaining the trained LDA model.
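To make the training sub-step concrete, the following sketch uses scikit-learn's LatentDirichletAllocation on a made-up corpus. Note this is a variational implementation rather than the Gibbs sampling described above, and the corpus and topic count are illustrative assumptions; the `fit_transform` output plays the role of the per-document topic probability distributions θ_m.

```python
# Sketch of the LDA training step on a toy short-text corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

short_texts = [
    "new version crashes on startup",
    "app crashes after the update",
    "subtitles do not match during movie playback",
    "video playback has no subtitles",
]

# Bag-of-words counts stand in for the numbered-word representation above.
counts = CountVectorizer().fit_transform(short_texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)   # one topic probability vector per short text

print(theta.shape)                  # (4, 2): 4 short texts, 2 topics
```

Each row of `theta` sums to 1 and is exactly the kind of probability vector the clustering sub-step below operates on.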
In step 1022, the topic probability distributions are clustered to determine the topic category of each short text in the first short text set.
The process of dividing a set of physical or abstract objects into multiple classes composed of similar objects is called clustering. In this embodiment, each topic probability distribution obtained in step 1021 above can be represented in the form of a probability vector. Therefore, clustering the topic probability distributions is equivalent to clustering multiple probability vectors. Specifically, a clustering method common in the prior art may be used to cluster all the topic probability distributions. Optionally, the K-Means clustering method may be used to cluster the topic probability distributions of the short texts. K-Means is a typical partitioning clustering algorithm: each cluster center is represented by the mean of all data in that class, it converges quickly, and it scales to large data sets. After K-Means clustering of the topic probability distributions, the topic probability distribution of each short text is assigned to a specific topic category, which can then serve as the topic category of that short text.
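A minimal sketch of this clustering sub-step follows, under the assumption that the topic probability vectors below stand in for the θ_m a trained LDA model would produce:

```python
# K-Means applied to topic probability vectors; the cluster label of each
# vector becomes the topic category of the corresponding short text.
import numpy as np
from sklearn.cluster import KMeans

# Four illustrative topic probability vectors (rows sum to 1), two clear groups.
theta = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(theta)
labels = kmeans.labels_             # topic category of each short text

# Texts 0 and 1 land in one category, texts 2 and 3 in the other.
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
```

The choice of `n_clusters` is an assumption here; in practice it would reflect how many topic categories the platform wants to distinguish.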
With the short text processing method provided by this embodiment, the acquired first short text set is first preprocessed, the processed data is then used to train the topic model LDA so as to obtain the topic probability distribution of each short text in the set, and finally the topic probability distributions are clustered so that the topic category of each short text can be determined. By first performing topic model training to obtain the topic probability distributions, and then further clustering them, topic categories for distinguishing short text types can be obtained, thereby achieving fast and accurate classification of massive short text data.
Referring further to Fig. 2, a flow 200 of another embodiment of the short text processing method of the present application is shown.
As shown in Fig. 2, in step 201, a first short text set is acquired, and the first short text set is preprocessed.
Next, in step 202, based on the preprocessed first short text set, the following processing steps may be performed: training the topic model LDA using the preprocessed first short text set to obtain the topic probability distribution of each short text in the first short text set; and clustering the topic probability distributions to determine the topic category of each short text in the first short text set.
In this embodiment, step 202 may comprise sub-steps 2021 and 2022, in which:
In step 2021, the topic model LDA is trained using the preprocessed first short text set to obtain the topic probability distribution of each short text in the first short text set.
In step 2022, the topic probability distributions are clustered to determine the topic category of each short text in the first short text set.
In this embodiment, steps 201-202 above are identical to steps 101-102 in Fig. 1 and are not described again here.
Next, in step 203, a newly added second short text set is acquired, and the second short text set is preprocessed.
After the topic category of each short text in the first short text set is determined by training the LDA model in step 202 above, a newly added second short text set may further be acquired. Since short text data is published by users on the network autonomously, after the first short text set has been processed, users will, as time goes on, continue to publish new suggestion feedback, comments, or evaluations. In this embodiment, the newly added short texts may all be added to the second short text set, which is then preprocessed. The preprocessing in this step may be the same as the preprocessing procedure in step 101 of Fig. 1 and is not described again here.
Next, in step 204, it is detected whether the number of new words in the preprocessed second short text set exceeds a predetermined threshold; if so, step 205 is performed, otherwise step 206 is performed.
In this embodiment, after the second short text set is preprocessed, the number of new words in the second short text set may further be counted, where a new word is a word that did not occur in the first short text set. A user may preset a threshold for the number of new words: if the counted number of new words in the second short text set exceeds this threshold, the following step 205 is performed; otherwise, the following step 206 is performed.
In step 205, the preprocessed first short text set and the preprocessed second short text set are combined to serve as the preprocessed first short text set, and the above processing step 202 is performed again.
Since the trained LDA can only recognize words that have occurred in the first short text set and cannot handle newly appearing words, when the number of new words in the second short text set exceeds the predetermined threshold, the previously trained LDA is likely unable to classify the short texts in the second short text set accurately, so the LDA model may be trained again. Specifically, the first short text set preprocessed in step 201 and the second short text set preprocessed in step 203 may be combined and together used as the data sample, i.e., as the preprocessed first short text set, and the above processing step 202 is performed again. By retraining the LDA model and clustering the newly obtained topic probability distributions, the topic category of each short text in both the original first short text set and the newly added second short text set can be determined.
In step 206, the trained LDA is used to determine the topic category of each short text in the second short text set.
In this embodiment, when the number of new words in the second short text set does not exceed the predetermined threshold, the previously trained LDA model can be considered usable for predicting the topic distributions of the documents and words of the second short text set. Therefore, the LDA trained in step 202 above can be used directly to classify each short text in the second short text set and determine their topic categories.
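The branch in steps 204-206 can be illustrated with a short sketch; the function names, the toy vocabulary, and the threshold value are assumptions made for illustration only:

```python
# Count words in a new batch that are absent from the first set's vocabulary,
# then decide whether to retrain LDA or reuse the existing trained model.
def count_new_words(new_docs, vocab):
    """Number of distinct words in new_docs not present in vocab."""
    seen = set()
    for doc in new_docs:
        seen.update(doc)
    return len(seen - set(vocab))

def should_retrain(new_docs, vocab, threshold):
    # Step 204: compare the new-word count against the preset threshold.
    return count_new_words(new_docs, vocab) > threshold

vocab = {"version", "crash", "update"}                     # words seen in set 1
batch = [["subtitle", "playback", "crash"], ["subtitle", "missing"]]

print(count_new_words(batch, vocab))   # 3: subtitle, playback, missing
print(should_retrain(batch, vocab, 2)) # True -> step 205: retrain on combined sets
```

When the check returns False, step 206 applies: the previously trained model classifies the new batch directly.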
Compared with the method shown in Fig. 1, the short text processing method provided by this embodiment can, after the LDA model is trained, further acquire newly added short text data, and can decide, according to the number of new words in the newly added data, whether to retrain the LDA model or to determine the topic categories of the newly added short texts with the previously trained LDA model. This improves the processing accuracy for newly added short texts and extends the range of application of the short text processing method.
With further reference to Fig. 3, a schematic structural diagram of an embodiment of the short text processing apparatus of the present application is shown.
As shown in Fig. 3, the short text processing apparatus 300 of this embodiment comprises a first acquisition module 310 and a processing module 320.
The first acquisition module 310 is configured to acquire a first short text set and preprocess the first short text set.
The processing module 320 is configured to, based on the preprocessed first short text set, drive the following units to perform the following processing steps:
The training unit 321 is configured to train the topic model LDA using the first short text set preprocessed by the first acquisition module 310, to obtain the topic probability distribution of each short text in the first short text set.
The clustering unit 322 is configured to cluster the topic probability distributions obtained by the training unit 321, to determine the topic category of each short text in the first short text set.
In an optional implementation of this embodiment, as shown in Fig. 4, the short text processing apparatus 300 may further comprise:
a second acquisition module 330, configured to acquire a newly added second short text set and preprocess the second short text set;
a detection module 340, configured to detect whether the number of new words in the second short text set preprocessed by the second acquisition module 330 exceeds a predetermined threshold;
a feedback module 350, configured to, when the detection module 340 detects that the number of new words exceeds the predetermined threshold, combine the preprocessed first short text set and the preprocessed second short text set as the preprocessed first short text set and feed it back to the processing module 320; and
a determination module 360, configured to, when the detection module 340 detects that the number of new words does not exceed the predetermined threshold, use the trained LDA to determine the topic category of each short text in the second short text set.
In an optional implementation of this embodiment, the preprocessing comprises performing invalid-data filtering, stop-word removal, stem extraction, and numbering on each short text in the short text set.
In an optional implementation of this embodiment, stem extraction comprises subject extraction and descriptor extraction.
In an optional implementation of this embodiment, the apparatus further comprises:
a subject determination module (not shown), configured to determine the subject from the descriptors when no subject is extracted.
It should be appreciated that all the units or modules recorded in Figs. 3-4 correspond to the steps of the method described with reference to Figs. 1-2. Thus, the operations and features described above for the method also apply to the apparatus of Figs. 3-4 and the units or modules comprised therein, and are not described again here.
With the short text processing apparatus provided by this embodiment, the first acquisition module first preprocesses the acquired first short text set; the processing module then uses the processed data to train the topic model LDA so as to obtain the topic probability distribution of each short text in the set, and clusters the topic probability distributions, so that the topic category of each short text can be determined. By first performing topic model training to obtain the topic probability distributions, and then further clustering them, topic categories for distinguishing short text types can be obtained, thereby achieving fast and accurate classification of massive short text data.
Referring now to Fig. 5, a schematic structural diagram of a computer system 500 suitable for implementing a terminal device or server of the embodiments of the present application is shown.
As shown in Fig. 5, the computer system 500 comprises a central processing unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage portion 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the system 500. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input portion 506 comprising a keyboard, a mouse, and the like; an output portion 507 comprising a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 508 comprising a hard disk and the like; and a communication portion 509 comprising a network interface card such as a LAN card or a modem. The communication portion 509 performs communication processes via a network such as the Internet. A driver 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the driver 510 as needed, so that a computer program read from it can be installed into the storage portion 508 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure comprises a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication portion 509 and/or installed from the removable medium 511.
Process flow diagram in accompanying drawing and block diagram, illustrate according to the architectural framework in the cards of the system of various embodiments of the invention, method and computer program product, function and operation.In this, each square frame in process flow diagram or block diagram can represent a part for module, program segment or a code, and a part for described module, program segment or code comprises one or more executable instruction for realizing the logic function specified.Also it should be noted that at some as in the realization of replacing, the function marked in square frame also can be different from occurring in sequence of marking in accompanying drawing.Such as, in fact the square frame that two adjoining lands represent can perform substantially concurrently, and they also can perform by contrary order sometimes, and this determines according to involved function.Also it should be noted that, the combination of the square frame in each square frame in block diagram and/or process flow diagram and block diagram and/or process flow diagram, can realize by the special hardware based system of the function put rules into practice or operation, or can realize with the combination of specialized hardware and computer instruction.
The modules involved in the embodiments of the present application may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, a processor may be described as comprising a first acquisition module and a processing module. The names of these modules do not, under certain circumstances, limit the modules themselves; for example, the first acquisition module may also be described as "a module for obtaining a first short text set and preprocessing the first short text set".
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus of the above embodiments, or a computer-readable storage medium that exists separately and is not assembled into a terminal. The computer-readable storage medium stores one or more programs, which are used by one or more processors to perform the short text processing method described in the present application.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the particular combination of the above technical features, but should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (10)

1. A short text processing method, characterized by comprising:
obtaining a first short text set, and preprocessing the first short text set;
based on the preprocessed first short text set, performing the following processing steps:
training a topic model LDA using the preprocessed first short text set, to obtain a topic probability distribution of each short text in the first short text set; and
clustering the topic probability distributions, to determine a subject category of each short text in the first short text set.
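The processing steps of claim 1 can be sketched in code as follows. This is only an illustrative outline, not the patent's implementation: scikit-learn's `LatentDirichletAllocation` and `KMeans` stand in for the LDA topic model and the clustering step, and the topic and cluster counts are arbitrary assumptions.

```python
# Illustrative sketch of claim 1: train LDA on a short text set, then
# cluster the per-text topic probability distributions into subject
# categories. Library choice and parameters are assumptions, not from
# the patent.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

def cluster_short_texts(texts, n_topics=5, n_clusters=3, seed=0):
    """Return (topic distributions, subject-category labels) for texts."""
    # Bag-of-words counts stand in for the preprocessed short texts.
    counts = CountVectorizer().fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    # Each row of `theta` is one short text's topic probability distribution.
    theta = lda.fit_transform(counts)
    # Cluster the distributions; each cluster label is a subject category.
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(theta)
    return theta, labels
```

Texts whose topic distributions fall into the same cluster share a subject category; each row of `theta` sums to 1.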
2. The method according to claim 1, characterized by further comprising:
obtaining a newly added second short text set, and performing the preprocessing on the second short text set;
detecting whether the number of new words in the preprocessed second short text set exceeds a predetermined threshold;
if so, combining the preprocessed first short text set and the preprocessed second short text set as the preprocessed first short text set, and performing the processing steps again;
otherwise, using the trained LDA to determine the subject category of each short text in the second short text set.
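The new-word check of claim 2 can be sketched as below. The definition of a "new word" as any whitespace-separated token absent from the vocabulary seen during training is an assumption made for illustration; the patent leaves the detection method open.

```python
# Hedged sketch of claim 2's retraining decision: if the newly added
# short texts contain more previously unseen words than a predetermined
# threshold, the LDA model must be retrained on the combined sets;
# otherwise the trained model is reused for inference.
def needs_retraining(trained_vocab, new_texts, threshold):
    """Return True if the count of new words exceeds the threshold."""
    new_words = set()
    for text in new_texts:
        for token in text.split():
            if token not in trained_vocab:  # assumed definition of "new word"
                new_words.add(token)
    return len(new_words) > threshold
```

When this returns True, the two preprocessed sets are combined and the processing steps (LDA training and clustering) run again; otherwise the existing model assigns subject categories to the second set directly.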
3. The method according to claim 1 or 2, characterized in that the preprocessing comprises performing invalid-data filtering, stop-word removal, stem extraction, and numbering on each short text in a short text set.
4. The method according to claim 3, characterized in that the stem extraction comprises main body extraction and descriptor extraction.
5. The method according to claim 4, characterized by further comprising:
when the main body is not extracted, determining the main body according to the descriptor.
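Claims 3 to 5 together describe the preprocessing pipeline. The sketch below is one hypothetical reading of it: the stop-word list, the rule that a main body is a token from a known list, and the descriptor-to-main-body mapping are all invented for illustration and are not specified by the patent.

```python
# Illustrative preprocessing per claims 3-5: invalid-data filtering,
# stop-word removal, main body / descriptor extraction, and numbering.
# All lookup tables here are hypothetical examples.
STOP_WORDS = {"the", "a", "an", "is", "of"}
KNOWN_MAIN_BODIES = {"app", "phone"}
DESCRIPTOR_TO_MAIN_BODY = {"slow": "app", "battery": "phone"}

def preprocess(texts):
    """Return numbered records with a main body and descriptors."""
    results = []
    for text in texts:
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
        if not tokens:          # invalid-data filtering: drop empty texts
            continue
        main_body = next((t for t in tokens if t in KNOWN_MAIN_BODIES), None)
        descriptors = [t for t in tokens if t != main_body]
        if main_body is None:   # claim 5: derive the main body from a descriptor
            main_body = next((DESCRIPTOR_TO_MAIN_BODY[t] for t in descriptors
                              if t in DESCRIPTOR_TO_MAIN_BODY), None)
        # numbering: sequential id over the surviving short texts
        results.append({"id": len(results), "main_body": main_body,
                        "descriptors": descriptors})
    return results
```

For example, "battery drains" contains no known main body, so the main body "phone" is inferred from the descriptor "battery", as claim 5 describes.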
6. A short text processing apparatus, characterized by comprising:
a first acquisition module, configured to obtain a first short text set and preprocess the first short text set; and
a processing module, configured to, based on the preprocessed first short text set, drive the following units to perform the following processing steps:
a training unit, configured to train a topic model LDA using the preprocessed first short text set, to obtain a topic probability distribution of each short text in the first short text set; and
a clustering unit, configured to cluster the topic probability distributions, to determine a subject category of each short text in the first short text set.
7. The apparatus according to claim 6, characterized by further comprising:
a second acquisition module, configured to obtain a newly added second short text set and perform the preprocessing on the second short text set;
a detection module, configured to detect whether the number of new words in the preprocessed second short text set exceeds a predetermined threshold;
a feedback module, configured to, when the number of new words exceeds the predetermined threshold, combine the preprocessed first short text set and the preprocessed second short text set as the preprocessed first short text set and feed it back to the processing module; and
a determination module, configured to, when the number of new words does not exceed the predetermined threshold, use the trained LDA to determine the subject category of each short text in the second short text set.
8. The apparatus according to claim 6 or 7, characterized in that the preprocessing comprises performing invalid-data filtering, stop-word removal, stem extraction, and numbering on each short text in a short text set.
9. The apparatus according to claim 8, characterized in that the stem extraction comprises main body extraction and descriptor extraction.
10. The apparatus according to claim 9, characterized by further comprising:
a main body determination module, configured to, when the main body is not extracted, determine the main body according to the descriptor.
CN201510250477.XA 2015-05-15 2015-05-15 Short text processing method and processing device Active CN104850617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510250477.XA CN104850617B (en) 2015-05-15 2015-05-15 Short text processing method and processing device


Publications (2)

Publication Number Publication Date
CN104850617A (en) 2015-08-19
CN104850617B CN104850617B (en) 2018-04-20

Family

ID=53850261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510250477.XA Active CN104850617B (en) 2015-05-15 2015-05-15 Short text processing method and processing device

Country Status (1)

Country Link
CN (1) CN104850617B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102346766A (en) * 2011-09-20 2012-02-08 Beijing University of Posts and Telecommunications Method and device for detecting network hot topics found based on maximal clique
CN102831119A (en) * 2011-06-15 2012-12-19 NEC (China) Co., Ltd. Short text clustering equipment and short text clustering method
CN104462286A (en) * 2014-11-27 2015-03-25 Chongqing University of Posts and Telecommunications Microblog topic finding method based on modified LDA


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU, JIA ET AL.: "A Short Text Topic Discovery Method Suitable for Social Networks", Proceedings of the 33rd Chinese Control Conference *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017114019A1 (en) * 2015-12-29 2017-07-06 Guangzhou Shenma Mobile Information Technology Co., Ltd. Keyword recommendation method and system based on latent Dirichlet allocation model
US10685185B2 (en) 2015-12-29 2020-06-16 Guangzhou Shenma Mobile Information Technology Co., Ltd. Keyword recommendation method and system based on latent Dirichlet allocation model
CN106055604A (en) * 2016-05-25 2016-10-26 Nanjing University Short text topic model mining method based on word-network feature expansion
CN106844424A (en) * 2016-12-09 2017-06-13 Ningbo University Text classification method based on LDA
CN107247728A (en) * 2017-05-02 2017-10-13 Beijing Xiaodu Information Technology Co., Ltd. Text processing method, apparatus and computer storage medium
CN110020110A (en) * 2017-09-15 2019-07-16 Tencent Technology (Beijing) Co., Ltd. Media content recommendation method, apparatus and storage medium
CN110020110B (en) * 2017-09-15 2023-04-07 Tencent Technology (Beijing) Co., Ltd. Media content recommendation method, device and storage medium
CN110096695A (en) * 2018-01-30 2019-08-06 Tencent Technology (Shenzhen) Co., Ltd. Hyperlink labeling method and apparatus, and text classification method and apparatus
CN110096695B (en) * 2018-01-30 2023-01-03 Tencent Technology (Shenzhen) Co., Ltd. Hyperlink marking method and device and text classification method and device
CN109086443A (en) * 2018-08-17 2018-12-25 University of Electronic Science and Technology of China Topic-based online clustering method for social media short texts
CN109299280A (en) * 2018-12-12 2019-02-01 Hebei University of Engineering Short text clustering analysis method, device and terminal device
CN111914536A (en) * 2020-08-06 2020-11-10 Beijing Didi Infinity Technology and Development Co., Ltd. Viewpoint analysis method, apparatus, device and storage medium
CN111914536B (en) * 2020-08-06 2021-12-17 Beijing Didi Infinity Technology and Development Co., Ltd. Viewpoint analysis method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN104850617B (en) 2018-04-20

Similar Documents

Publication Publication Date Title
Mubarok et al. Aspect-based sentiment analysis to review products using Naïve Bayes
CN104850617A (en) Short text processing method and apparatus
CN109325165B (en) Network public opinion analysis method, device and storage medium
Desai et al. Techniques for sentiment analysis of Twitter data: A comprehensive survey
US9753916B2 (en) Automatic generation of a speech by processing raw claims to a set of arguments
Chang et al. Research on detection methods based on Doc2vec abnormal comments
CN109271520B (en) Data extraction method, data extraction device, storage medium, and electronic apparatus
CN111309910A (en) Text information mining method and device
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN108388660A An improved e-commerce product pain-point analysis method
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN107688630B Semantic-based weakly supervised microblog multi-emotion dictionary expansion method
CN104199845B Sentiment classification method for online review comments based on agent model
Palomino-Garibay et al. A random forest approach for authorship profiling
US9064009B2 (en) Attribute cloud
CN108959329A Text classification method, apparatus, medium and device
Chumwatana Using sentiment analysis technique for analyzing Thai customer satisfaction from social media
CN109933793B (en) Text polarity identification method, device and equipment and readable storage medium
Sabariah et al. Sentiment analysis on Twitter using the combination of lexicon-based and support vector machine for assessing the performance of a television program
JP2013131075A (en) Classification model learning method, device, program, and review document classifying method
US10614100B2 (en) Semantic merge of arguments
Khemani et al. A review on reddit news headlines with nltk tool
US20140272842A1 (en) Assessing cognitive ability
JP2014099045A (en) Profile estimation device, method, and program
CN103886097A Chinese microblog viewpoint sentence recognition feature extraction method based on adaptive boosting algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant