CN105786791A - Data topic acquisition method and apparatus - Google Patents

Info

Publication number: CN105786791A (application CN201410812266.6A; granted as CN105786791B)
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: theme, sampling, word, sampled data, data
Inventors: 陆中振, 邓雪娇
Assignee (original and current): Shenzhen Tencent Computer Systems Co Ltd
Legal status: Active (granted)

Abstract

The invention discloses a data topic acquisition method. The method comprises: receiving input data to be processed, where the data comprise multiple pieces of sub-data and each piece of sub-data comprises a plurality of words; grouping the data into multiple groups such that the word counts of the groups are close; sampling each word in each group in multiple rounds according to a Gibbs sampling formula, each round yielding a first preset number of sampled data, and selecting the sampled data from each round in turn for iterative calculation; after the number of iterative calculations for each word reaches a second preset number, generating the topic corresponding to each word from the iteration results; and computing the topic of each piece of sub-data from the obtained word topics. The invention further discloses a data topic acquisition apparatus. The method and apparatus shorten the duration of iterative calculation and thereby improve the efficiency of data topic acquisition.

Description

Data topic acquisition method and apparatus
Technical field
The present invention relates to the technical field of data processing, and in particular to a data topic acquisition method and apparatus.
Background technology
LDA (Latent Dirichlet Allocation) is a probabilistic topic model that can be used to identify latent topic information in large document collections or corpora. A topic model determines the relationship between latent variables and observed data through successive iterations, and each iteration is a sampling process. The sampling method of LDA is Gibbs sampling, where the observed data are the words in a document and the latent variable is the topic corresponding to each word; after sampling, every word in the document is reassigned a topic. Sampling is very time-consuming, since every word in every document must be sampled once per iteration. Assuming the number of topics is K, the number of documents is D, and the average number of words per document is W, the sampling complexity of one pass is O(D*W*K).
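To make that cost concrete, the following is a minimal sketch of one collapsed Gibbs sweep over a toy corpus. It is generic LDA Gibbs sampling, not the patent's improved sampler, and the count-array and hyperparameter names are illustrative assumptions. Each sweep touches every word once and evaluates K topic weights per word, matching the O(D*W*K) figure above.

```python
import random

def gibbs_sweep(docs, z, K, V, n_dk, n_kw, n_k, alpha=0.1, beta=0.01, rng=random):
    """One full collapsed Gibbs sweep: resample the topic of every word in
    every document. Each word costs O(K), so a sweep costs O(D*W*K)."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # remove the word's current assignment from the counts
            n_dk[d][k_old] -= 1; n_kw[k_old][w] -= 1; n_k[k_old] -= 1
            # full conditional p(z = k | rest), up to a constant
            weights = [(n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                       for k in range(K)]
            k_new = rng.choices(range(K), weights=weights)[0]
            # record the new assignment
            z[d][i] = k_new
            n_dk[d][k_new] += 1; n_kw[k_new][w] += 1; n_k[k_new] += 1
    return z
```

The per-word topic draw here walks all K weights, which is exactly the cost the patent later reduces by reusing an alias table across several draws.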
In a distributed environment, the input data, consisting of multiple documents (sub-data), are partitioned by document, i.e. each partition contains the same number of document records. When each partition holds many documents, the data distribution is roughly uniform overall, but the number of words per document varies, so the word counts of the largest and smallest partitions still differ noticeably. Because sampling time is linear in the number of words, the existing document-partitioning approach gives each partition a different word count and therefore a different iteration time, producing a "barrel effect" in which the slowest partition dominates: the overall iteration takes longer, and the efficiency of data topic acquisition is low.
Summary of the invention
The embodiments of the present invention provide a data topic acquisition method and apparatus, aiming to solve the problem that the existing document-partitioning approach gives each partition a different word count and a different iteration time, producing a "barrel effect" that lengthens the overall iteration and lowers the efficiency of data topic acquisition.
To achieve the above object, an embodiment of the present invention proposes a data topic acquisition method, comprising the steps of:
receiving input data to be processed, the data comprising multiple pieces of sub-data, each piece of sub-data comprising a plurality of words;
grouping the data to be processed into multiple groups such that the word counts of the groups are close;
sampling each word in each group in multiple rounds according to the Gibbs sampling formula, each round yielding a first preset number of sampled data, and selecting the sampled data from each round in turn for iterative calculation;
after the number of iterative calculations for each word reaches a second preset number, generating the topic corresponding to each word from the iteration results;
computing the topic of each piece of sub-data from the obtained word topics.
To achieve the above object, an embodiment of the present invention further proposes a data topic acquisition apparatus, comprising:
a transceiver module for receiving input data to be processed, the data comprising multiple pieces of sub-data, each piece of sub-data comprising a plurality of words;
a grouping module for grouping the data to be processed into multiple groups such that the word counts of the groups are close;
a sampling module for sampling each word in each group in multiple rounds according to the Gibbs sampling formula, each round yielding a first preset number of sampled data;
an iteration module for selecting the sampled data from each round in turn for iterative calculation;
a topic processing module for generating, after the number of iterative calculations for each word reaches a second preset number, the topic corresponding to each word from the iteration results, and for computing the topic of each piece of sub-data from the obtained word topics.
In the present invention, the data to be processed are grouped so that the word counts of the groups are close; each word in each group is sampled in multiple rounds according to the Gibbs sampling formula, each round yielding a first preset number of sampled data; the sampled data from each round are selected in turn for iterative calculation; and after the number of iterative calculations for each word reaches the second preset number, the topic of each word is generated from the iteration results, and the topic of each piece of sub-data is then generated. This effectively avoids the problem of the existing document-partitioning approach, in which unequal per-partition word counts lead to unequal iteration times, a "barrel effect", overly long iterations, and low data topic acquisition efficiency. The duration of iterative calculation is reduced, and the efficiency of data topic acquisition is thereby improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the hardware architecture involved in the data topic acquisition apparatus of an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the first embodiment of the data topic acquisition method of the present invention;
Fig. 3 is a detailed flowchart of an embodiment of step S30 in Fig. 2;
Fig. 4 is a detailed flowchart of an embodiment of step S33 in Fig. 3;
Fig. 5 is a detailed flowchart of another embodiment of step S33 in Fig. 3;
Fig. 6 is a schematic flowchart of the second embodiment of the data topic acquisition method of the present invention;
Fig. 7 is a schematic flowchart of the third embodiment of the data topic acquisition method of the present invention;
Fig. 8 is a functional block diagram of the first embodiment of the data topic acquisition apparatus of the present invention;
Fig. 9 is a detailed functional block diagram of an embodiment of the sampling module in Fig. 8;
Fig. 10 is a detailed functional block diagram of an embodiment of the iteration module in Fig. 8;
Fig. 11 is a functional block diagram of the second embodiment of the data topic acquisition apparatus of the present invention.
The realization of the objects, functional characteristics, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the invention
It should be understood that the specific embodiments described herein are intended only to explain the present invention and are not intended to limit it.
The primary solution of the embodiments of the present invention is: receive input data to be processed, the data comprising multiple pieces of sub-data, each comprising a plurality of words; group the data into multiple groups such that the word counts of the groups are close; sample each word in each group in multiple rounds according to the Gibbs sampling formula, each round yielding a first preset number of sampled data, and select the sampled data from each round in turn for iterative calculation; after the number of iterative calculations for each word reaches a second preset number, generate the topic corresponding to each word from the iteration results; and compute the topic of each piece of sub-data from the obtained word topics. Because the data are grouped with close word counts per group, the "barrel effect" of the existing document-partitioning approach (unequal word counts per partition, unequal iteration times, overly long iterations, and low topic acquisition efficiency) is effectively avoided; the duration of iterative calculation is reduced, and the efficiency of data topic acquisition is improved.
With the existing document-partitioning approach, the word counts of the partitions differ, so their iteration times differ, producing a "barrel effect": the overall iteration takes longer, and the efficiency of data topic acquisition is low.
An embodiment of the present invention constructs a data topic acquisition apparatus. The apparatus groups the data to be processed so that the word counts of the groups are close, samples each word in each group in multiple rounds according to the Gibbs sampling formula (each round yielding a first preset number of sampled data), and selects the sampled data from each round in turn for iterative calculation; after the number of iterative calculations for each word reaches the second preset number, it generates the topic of each word from the iteration results and then the topic of each piece of sub-data. This effectively avoids the "barrel effect" of the existing document-partitioning approach, reduces the duration of iterative calculation, and improves the efficiency of data topic acquisition.
The data topic acquisition apparatus of this embodiment may be carried on a PC, or on a mobile phone, tablet computer, or other electronic terminal capable of running topic acquisition applications. The hardware architecture involved in the apparatus may be as shown in Fig. 1.
Fig. 1 shows the hardware architecture involved in the data topic acquisition apparatus of an embodiment of the present invention. As shown in Fig. 1, the hardware includes: a processor 301 (for instance a CPU), a network interface 304, a user interface 303, a memory 305, and a communication bus 302. The communication bus 302 realizes the connections and communication between the components of the apparatus. The user interface 303 may include components such as a display screen (Display), keyboard (Keyboard), and mouse; it receives information input by the user and passes the received information to the processor 301 for processing. The display screen may be an LCD display, an LED display, or a touch screen, and displays the data the apparatus needs to show, for instance operation interfaces such as topic acquisition and topic display. Optionally, the user interface 303 may also include standard wired and wireless interfaces. The network interface 304 may optionally include standard wired and wireless interfaces (such as a WI-FI interface). The memory 305 may be a high-speed RAM memory or a non-volatile memory, for instance a disk memory. Optionally, the memory 305 may also be a storage device independent of the aforementioned processor 301. As shown in Fig. 1, the memory 305, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a data topic acquisition program.
In the hardware involved in the data topic acquisition apparatus shown in Fig. 1, the network interface 304 is mainly used to connect to an application platform and communicate data with it; the user interface 303 is mainly used to connect to a client, communicate data with the client, and receive information and instructions input by the client; and the processor 301 may be used to call the data topic acquisition program stored in the memory 305 and perform the following operations:
receiving input data to be processed, the data comprising multiple pieces of sub-data, each piece of sub-data comprising a plurality of words;
grouping the data to be processed into multiple groups such that the word counts of the groups are close;
sampling each word in each group in multiple rounds according to the Gibbs sampling formula, each round yielding a first preset number of sampled data, and selecting the sampled data from each round in turn for iterative calculation;
after the number of iterative calculations for each word reaches a second preset number, generating the topic corresponding to each word from the iteration results;
computing the topic of each piece of sub-data from the obtained word topics.
Further, in one embodiment, the processor 301 calls the data topic acquisition program stored in the memory 305 and may perform the following operations:
building a probability transfer matrix using the Metropolis-Hastings algorithm, and determining the current probability distribution of the probability transfer matrix;
taking the current probability distribution as the proposal distribution, building an alias table (AliasTable), sampling each word in multiple rounds according to the AliasTable, each round yielding a first preset number of sampled data and generating a sample set;
selecting the sampled data in each generated sample set in turn for iterative calculation.
Further, in one embodiment, the processor 301 calls the data topic acquisition program stored in the memory 305 and may perform the following operations:
judging whether unselected sampled data exist in the sample set currently pending iterative calculation;
if no unselected sampled data exist in the sample set, generating a new sample set in the manner described above, and selecting the sampled data in the new sample set in turn for iterative calculation;
if unselected sampled data exist in the sample set, selecting the unselected sampled data in the sample set in turn for iterative calculation.
Further, in one embodiment, the processor 301 calls the data topic acquisition program stored in the memory 305 and may perform the following operations:
selecting a sampled datum from the sample set, its topic being j with probability Q_j;
if the current topic is i, generating a random probability value s according to the Metropolis-Hastings algorithm, and comparing s with the acceptance probability;
if s is less than the acceptance probability, transferring the topic from i to j, taking topic j as the new current topic, and completing one iteration;
if s is not less than the acceptance probability, keeping topic i, taking topic i as the current topic, and completing one iteration;
selecting the sampled data in the sample set in turn to complete the iterative calculations.
Further, in one embodiment, the processor 301 calls the data topic acquisition program stored in the memory 305 and may perform the following operations:
obtaining identical words within the same piece of sub-data;
completing the iterative calculation of the identical words using the same AliasTable, and generating the topic corresponding to each of the identical words from the iteration results.
Further, in one embodiment, the processor 301 calls the data topic acquisition program stored in the memory 305 and may perform the following operations:
obtaining identical words across different pieces of sub-data, and obtaining the factors that are identical in the Gibbs sampling formula;
building the Gibbs sampling formula for the identical words in the different pieces of sub-data from the identical factors, completing the iterative calculation of those words according to the built formula, and generating the topic of each identical word in its corresponding piece of sub-data from the iteration results.
According to the above solution, this embodiment groups the data to be processed so that the word counts of the groups are close, samples each word in each group in multiple rounds according to the Gibbs sampling formula (each round yielding a first preset number of sampled data), selects the sampled data from each round in turn for iterative calculation, and, after the number of iterative calculations for each word reaches the second preset number, generates the topic of each word from the iteration results and then the topic of each piece of sub-data. This avoids the "barrel effect" of the existing document-partitioning approach, reduces the duration of iterative calculation, and improves the efficiency of data topic acquisition.
Based on the above hardware architecture, embodiments of the data topic acquisition method of the present invention are proposed.
As shown in Fig. 2, a first embodiment of the data topic acquisition method of the present invention is proposed. The data topic acquisition method includes:
Step S10: receiving input data to be processed, the data comprising multiple pieces of sub-data, each comprising a plurality of words;
When the topic of data needs to be obtained, the data to be processed are input through the operation interface provided by the LDA topic model. The data comprise at least one piece of sub-data, i.e. at least one document, and each piece of sub-data comprises at least one word, such as "article", "probability", or "sample". After the LDA topic model is opened, it detects the input and receives the data to be processed entered at the operation interface. In other embodiments of the present invention, multiple sets of data may also be input at once and processed by multiple topic models.
Step S20: grouping the data to be processed into multiple groups such that the word counts of the groups are close;
After the data to be processed are received, they are grouped into multiple groups whose word counts are close, where "close" means that the word counts are identical or that the difference in word count between any two groups is less than a preset value, such as 2 or 1. Grouping the data while keeping the word counts of the groups close effectively prevents large differences in per-group word counts from causing different per-group calculation times and low computational efficiency; this improves computational efficiency and in turn makes topic acquisition faster and more efficient.
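The patent does not prescribe a particular partition algorithm for step S20; one possible sketch is a greedy longest-first assignment, a standard load-balancing heuristic offered here as an assumption rather than the patent's method:

```python
import heapq

def balance_groups(docs, num_groups):
    """Greedily assign each document to the currently lightest group so
    that per-group word counts come out close."""
    # min-heap of (current word count, group index)
    heap = [(0, g) for g in range(num_groups)]
    groups = [[] for _ in range(num_groups)]
    # place the longest documents first for a tighter balance
    for doc in sorted(docs, key=len, reverse=True):
        count, g = heapq.heappop(heap)
        groups[g].append(doc)
        heapq.heappush(heap, (count + len(doc), g))
    return groups
```

With a tolerance check afterwards (difference below the preset value of 1 or 2), the same routine could verify that the grouping meets the closeness condition described above.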
Step S30: sampling each word in each group in multiple rounds according to the Gibbs sampling formula, each round yielding a first preset number of sampled data, and selecting the sampled data from each round in turn for iterative calculation;
This embodiment obtains the topic of each word by Gibbs sampling. Under Gibbs sampling, every word in every piece of sub-data must be sampled, i.e. one topic is selected from a preset number K of topics; the probability of each of the K topics can be calculated according to the Gibbs sampling formula, and a topic is drawn by simulating the CDF with a random number. The first preset number may be 5, 4, 3, or similar, and is configured as required; the smaller it is set, the higher the iteration accuracy but the lower the efficiency, so a reasonably balanced value, for example 3, can be chosen according to actual requirements. The second preset number, denoted m, may be 300, 500, or similar. The number of sampling rounds is set as required, for instance 50 or 100, and equals the second preset number divided by the first: for example, with a second preset number of 500 and a first preset number of 5, each word must be sampled in 100 rounds.
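The round count in the example above is simply the quotient of the two preset numbers:

```python
# Second preset number m: total iterative calculations per word.
# First preset number n: sampled data produced per sampling round.
# Each round's n samples each drive one iteration, so m / n rounds are needed.
m, n = 500, 5
rounds = m // n
print(rounds)  # 100 sampling rounds per word
```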
Specifically, with reference to Fig. 3, the process of sampling each word in each group in multiple rounds according to the Gibbs sampling formula, each round yielding a first preset number of sampled data, and selecting the sampled data from each round in turn for iterative calculation, may include:
Step S31: building a probability transfer matrix using the Metropolis-Hastings algorithm, and determining the current probability distribution of the probability transfer matrix;
The Metropolis-Hastings algorithm (hereinafter the M-H algorithm) is used to build the probability transfer matrix. Its acceptance formula for a proposed move from topic i to topic j is:

alpha(i -> j) = min(1, (p_j * q_i) / (p_i * q_j))

where p_i is the stationary distribution, i.e. the current topic probability distribution, which determines the transfer direction of the M-H algorithm; q_i is the proposal distribution, which is held fixed and can therefore be sampled with the Alias Method; and alpha is the entry of the probability transfer matrix, which determines the efficiency of a single word's iteration. If p_i and q_i are the same distribution, the transfer matrix is most efficient, and the transfer ratio alpha = 1.
Step S32: taking the current probability distribution as the proposal distribution, building an alias table (AliasTable), sampling each word in multiple rounds according to the AliasTable, each round yielding a first preset number of sampled data and generating a sample set;
To balance the Alias Method's requirement on q_i against transfer efficiency, the current topic probability p_i is used as q_i in the iterative process of each word in a document: the AliasTable is built, and each word is sampled in multiple rounds, each round yielding a first preset number (denoted n) of sampled data; the data obtained from each round of sampling form a sample set.
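Step S32's AliasTable can be sketched with Vose's alias method, a standard construction (assumed here, not quoted from the patent) whose costs match the figures cited later in the text: O(K) to build, O(1) per draw.

```python
import random

def build_alias_table(probs):
    """Vose's alias method: O(K) construction for a discrete distribution,
    then O(1) per draw. Returns the (prob, alias) tables."""
    K = len(probs)
    scaled = [p * K for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    prob, alias = [0.0] * K, [0] * K
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:        # leftovers are numerically ~1
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """One O(1) draw: pick a column, then the column's own value or its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Because each draw is O(1), one table built from a word's topic distribution can be reused for all n samples of a round, which is the source of the speedup analyzed below.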
Step S33: selecting the sampled data in each generated sample set in turn for iterative calculation.
After sampling, the sampled data from each round are selected in turn for iterative calculation. Sampled data are taken one by one from the data of each round, each sampled datum completing one iterative calculation; once the sampled data from the current round are exhausted, further sampled data are selected for iteration, until the number of iterative calculations for each word reaches the second preset number.
Specifically, with reference to Fig. 4, the process of selecting the sampled data in each generated sample set in turn for iterative calculation may include:
Step S331: judging whether unselected sampled data exist in the sample set currently pending iterative calculation;
When sampled data need to be selected from the sample set for iterative calculation, it is judged whether unselected sampled data exist in the set. Specifically, a mark may be set on each selected sampled datum, so that whether a datum carries the mark decides whether it has been selected: a datum carrying the mark has been selected, and one without it has not. Alternatively, a sampled datum may be deleted once selected; then, if any sampled data remain in the sample set, unselected sampled data exist, and if the set is empty, none do. Other suitable methods may also be used to make this judgment.
Step S332: if no unselected sampled data exist in the sample set, generating a new sample set in the manner described above, and selecting the sampled data in the new sample set in turn for iterative calculation;
If no unselected sampled data exist in the sample set, each word in each group is sampled again according to the Gibbs sampling formula to obtain a first preset number of sampled data, a new sample set is generated from them, and the sampled data in the new sample set are iterated over in turn.
Step S333: if unselected sampled data exist in the sample set, selecting them in turn for iterative calculation.
If unselected sampled data exist in the sample set, they continue to be selected in turn to complete the iterative calculations.
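Steps S331 to S333 amount to consuming a pool of pre-drawn samples and refilling it whenever it runs dry. A minimal generator-based sketch, using the deletion variant described above (the function and parameter names are illustrative):

```python
def sample_pool(draw, n):
    """Yield sampled data one at a time, refilling a pool of n fresh draws
    whenever no unselected samples remain (steps S331-S333)."""
    pool = []
    while True:
        if not pool:                  # no unselected sampled data left: resample
            pool = [draw() for _ in range(n)]
        yield pool.pop(0)             # select (and delete) one sampled datum
```

The draw callback stands in for one Gibbs/alias sample; the pool corresponds to the sample set, refilled with the first preset number n of fresh draws on demand.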
Specifically, with reference to Fig. 5, the process of selecting the sampled data obtained in each round in turn for iterative calculation may include:
Step S334: selecting a sampled datum from the sample set, its topic being j with probability Q_j;
Step S335: if the current topic is i, generating a random probability value s according to the Metropolis-Hastings algorithm, and comparing s with the acceptance probability;
The probability transfer matrix involves an acceptance probability and a transition probability. After each word in each group has been sampled according to the Gibbs sampling formula to obtain a first preset number of sampled data, and while unselected sampled data remain in the sample set, a sampled datum is selected from the set; its topic is j, and the probability of topic j is Q_j. If the current topic is i, a random probability value s is generated according to the M-H algorithm and compared with the acceptance probability.
Step S336: if s is less than the acceptance probability, the topic transfers from i to j, topic j becomes the new current topic, and one iteration is completed;
Step S337: if s is not less than the acceptance probability, the topic remains i, topic i remains the current topic, and one iteration is completed;
Step S338: selecting the sampled data in the sample set in turn to complete the iterative calculations.
If s is less than the acceptance probability, the topic transfers from i to j, topic j becomes the new current topic, one iteration is completed, and the current topic changes; if s is not less than the acceptance probability, the topic remains i, topic i is retained as the current topic, and one iteration is completed. The unselected sampled data in the sample set are then selected in turn to complete the iterative calculations in the manner described above.
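Steps S334 to S337 can be sketched as a single Metropolis-Hastings step with an independence proposal; in the sketch below, p stands for the current topic distribution and q for the proposal distribution behind the AliasTable (the names and the list representation are assumptions):

```python
import random

def mh_step(i, j, p, q, rng=random):
    """One M-H step: from current topic i, a proposed topic j (drawn from q)
    is accepted with probability min(1, p[j]*q[i] / (p[i]*q[j]))."""
    accept = min(1.0, (p[j] * q[i]) / (p[i] * q[j]))
    s = rng.random()            # the random probability value s
    return j if s < accept else i
```

When q equals p the acceptance probability is exactly 1, matching the earlier remark that identical distributions give the most efficient transfer matrix.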
In the above process, building the AliasTable costs O(K) time, and the table can be reused for the first preset number (n) of draws. Each calculation in steps S334 to S337 costs O(1), so the total cost of the n iterative calculations, including the table build, is O(K); the amortized complexity per word iteration is therefore O(K/n), a theoretical performance gain of a factor of n. Moreover, the Q_i used to build the AliasTable is preferably the topic probability distribution from at most n rounds earlier; once the iteration stabilizes, q_i converges toward the current topic distribution p_j, so iteration efficiency is very high. By improving the Gibbs sampling process in this way, iteration efficiency is improved, and in turn the efficiency of topic acquisition.
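Under those assumptions, the amortized figure works out as follows for, say, K = 1000 topics and n = 5 reuses:

```python
# One O(K) alias-table build is reused for n O(1) draws, so the amortized
# cost per draw is (K + n) / n, i.e. O(K/n) when n is much smaller than K.
K, n = 1000, 5
amortized = (K + n) / n
print(amortized)  # 201.0 operations per draw instead of ~1000
```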
Step S40: after the number of iterative calculations for each word reaches the second preset times, the topic corresponding to each word is generated from the iteration results;
After the number of iterative calculations for each word reaches the second preset times, the topics generated by the iterative calculations are collected, and the topic corresponding to each word is generated from them. For example, the topic with the highest probability among the collected topics may be selected as the topic of the word, or the topic that occurs most often among the collected topics may be selected as the topic of the word.
Step S50: according to the obtained topic of each word, the topic of each subdata is calculated and generated.
Specifically, the topics of the words corresponding to each subdata are obtained; according to the topic probabilities of those words, the topic with the highest probability is selected as the topic of the subdata, and the topics of the other subdata are obtained in the same way. Alternatively, the topic of each subdata may be calculated from the topics of its words in a manner configured by the user.
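The selection rules above can be sketched as follows; a minimal illustration using the most-frequent-topic rule for both the per-word and the per-subdata choice, with hypothetical function names:

```python
from collections import Counter

def word_topic(iteration_topics):
    # Topic assigned to one word: the topic that occurred most often
    # across its iterative calculations (one of the options in the text).
    return Counter(iteration_topics).most_common(1)[0][0]

def subdata_topic(word_topics):
    # Topic of a subdata (document): the most frequent topic among the
    # topics of its words.
    return Counter(word_topics).most_common(1)[0][0]
```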
In this embodiment, the data to be processed are grouped so that the number of words in each group is similar; each word in each group is sampled multiple times according to the Gibbs sampling formula, each sampling obtaining the sampled data of the first preset times, and the sampled data obtained by each sampling are selected in turn for iterative calculation. After the number of iterative calculations for each word reaches the second preset times, the topic of each word is generated from the iteration results, and the topic of each subdata is then generated. This effectively avoids the existing approach of partitioning documents into blocks, in which the blocks contain different numbers of words and therefore take different amounts of time to iterate, producing a "bucket effect" that lengthens the iterative calculation and lowers the efficiency of topic acquisition. The duration of the iterative calculation is reduced, and the efficiency of topic acquisition is improved accordingly.
Further, based on the first embodiment of the above data topic acquisition method, a second embodiment of the data topic acquisition method of the present invention is proposed. As shown in Figure 6, after step S50, the method may further include:
Step S60: identical words in the same subdata are obtained;
Identical words in the same subdata are obtained, together with the AliasTable corresponding to those identical words.
Step S70: the iterative calculations of the identical words are completed according to the same AliasTable, and the topic corresponding to each of the identical words is generated separately from the iteration results.
For identical words in the same subdata, the Gibbs sampling formula is identical, so a shared AliasTable can be used to complete their iterative calculations, and the topic corresponding to each of the identical words is generated separately from the iteration results; the generated topics may be the same or different, depending on the iteration results. By using the same AliasTable for identical words in the same subdata, the performance gain is proportional to the word repetition rate within the subdata: the more repetitions, the greater the performance improvement and the higher the efficiency.
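Sharing one AliasTable among occurrences of the same word amounts to a simple cache keyed by the word; a sketch, with `build_table` standing in for whatever table construction the implementation uses (hypothetical names):

```python
def get_alias_table(word, cache, build_table):
    # Reuse one AliasTable for all occurrences of the same word in a
    # subdata: build it on first sight, then return the cached table,
    # so repeated words pay the O(K) build cost only once.
    if word not in cache:
        cache[word] = build_table(word)
    return cache[word]
```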
Further, based on the second embodiment of the above data topic acquisition method, a third embodiment of the data topic acquisition method of the present invention is proposed. As shown in Figure 7, after step S50, the method may further include:
Step S80: identical words in different subdata are obtained, together with the factors that are identical in the Gibbs sampling formulas;
Step S90: the Gibbs sampling formulas of the identical words in the different subdata are built from the identical factors, the iterative calculations of those words are completed according to the built Gibbs sampling formulas, and the topic of each identical word in its corresponding subdata is generated from the iteration results.
For identical words in different subdata, the Gibbs sampling formulas adopted are different, but some factors of the formulas are the same, and an AliasTable can be built from this shared part. The performance gain is related to the repetition rate of words across the set of subdata, that is, to the number of times a word repeats: the more repetitions, the greater the performance improvement and the higher the efficiency.
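As an illustration of such a shared factor (an assumption for illustration, since the patent does not spell the formula out), the collapsed Gibbs conditional of standard LDA factorizes into a document-specific part and a word-specific part; the word-specific part is the same for every occurrence of word w in any document and can therefore back a shared alias table:

```latex
p(z = k \mid w, d) \;\propto\;
\underbrace{(n_{dk} + \alpha)}_{\text{document-specific}}
\cdot
\underbrace{\frac{n_{kw} + \beta}{n_{k} + V\beta}}_{\text{shared across documents}}
```

Here \(n_{dk}\) counts topic k in document d, \(n_{kw}\) counts word w under topic k, \(n_k\) is the total count of topic k, V the vocabulary size, and \(\alpha, \beta\) the Dirichlet priors.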
The data topic acquisition methods of the first to third embodiments above may all be executed by a terminal. Further, the methods may be implemented by a client installed in the terminal (such as data topic acquisition software), where the terminal may include, but is not limited to, electronic devices such as notebook computers, mobile phones, tablet computers, or PDAs (Personal Digital Assistants).
Accordingly, a preferred embodiment of the data topic acquisition apparatus of the present invention is proposed. With reference to Fig. 8, the data topic acquisition apparatus includes a sending and receiving module 10, a grouping module 20, a sampling module 30, an iteration module 40, and a topic processing module 50.
The sending and receiving module 10 is configured to receive input data to be processed, the data to be processed including multiple subdata, each subdata including multiple words;
When the topic of data needs to be obtained, the data to be processed are input through an operation interface provided by an LDA topic model. The data to be processed include at least one subdata, i.e., at least one document; each subdata includes at least one word, a word being, for instance, "article", "probability", or "sample". After the LDA topic model is opened, the input is monitored and the data to be processed input at the operation interface are received. In other embodiments of the present invention, multiple sets of data to be processed may also be input at once and processed by multiple topic models.
The grouping module 20 is configured to group the data to be processed into multiple groups, the numbers of words in the grouped groups being similar;
After the data to be processed are received, they are grouped to generate multiple groups in which the numbers of words are similar; "similar" means that the numbers of words are identical, or that the difference in the number of words between any two groups is less than a preset value, which may be 2, 1, and so on. Grouping the data to be processed while keeping the number of words in each group similar effectively prevents the situation where large differences in word counts between groups produce different calculation durations per group and thus low computational efficiency; computational efficiency is improved, so the topic acquisition duration is shorter and the efficiency higher.
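One way to realize such balanced grouping (a sketch under the assumption that a greedy assignment is acceptable) is to place each document into the group that currently holds the fewest words:

```python
def group_documents(docs, num_groups):
    # docs: list of documents, each a list of words.
    # Greedily place the largest documents first into the group with
    # the fewest words so far, keeping per-group word counts similar.
    groups = [[] for _ in range(num_groups)]
    counts = [0] * num_groups
    for doc in sorted(docs, key=len, reverse=True):
        g = counts.index(min(counts))
        groups[g].append(doc)
        counts[g] += len(doc)
    return groups
```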
The sampling module 30 is configured to sample each word in each group multiple times according to the Gibbs sampling formula, each sampling obtaining the sampled data of the first preset times;
In this embodiment, the topic of each word is obtained by sampling in the Gibbs sampling manner. Under Gibbs sampling, every word in every subdata must be sampled, that is, one of a predetermined number (K) of topics is selected for it; the probability of each of the K topics can be calculated according to the Gibbs sampling formula, and a topic is then drawn by simulating the CDF with a random number. The first preset times may be 5, 4, 3, and so on, and is configured as needed: the smaller it is set, the higher the iteration accuracy but the lower the efficiency, so a reasonably balanced value, for example 3, can be set according to actual requirements. The second preset times, denoted m, may be 300, 500, and so on. The number of sampling rounds is likewise configured as needed, for instance 50, 100, and so on, and equals the quotient of the second preset times divided by the first preset times; for example, if the second preset times is 500 and the first preset times is 5, each word needs to be sampled 100 times.
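The "simulate a topic via the CDF and a random number" step can be sketched as follows; an illustrative inverse-CDF draw, where the probability vector would come from the Gibbs sampling formula:

```python
import random

def sample_topic(probs, u=None):
    # Draw one topic index from an (unnormalized) probability vector
    # by walking its cumulative distribution with a uniform random
    # number, as described for the Gibbs sampling step above.
    if u is None:
        u = random.random()
    total = sum(probs)
    acc = 0.0
    for k, p in enumerate(probs):
        acc += p / total
        if u < acc:
            return k
    return len(probs) - 1  # guard against floating-point round-off
```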
Specifically, with reference to Fig. 9, the sampling module 30 includes a construction unit 31, a determination unit 32, and a sampling unit 33.
The construction unit 31 is configured to build a probability transfer matrix using the Metropolis-Hastings algorithm;
The determination unit 32 is configured to determine the current probability distribution of the probability transfer matrix;
The Metropolis-Hastings algorithm (hereinafter the M-H algorithm) is adopted to build the probability transfer matrix; in its standard form, the acceptance probability for a move from topic i to topic j can be written as
α(i → j) = min{1, (p_j · q_i) / (p_i · q_j)}
where p_i is the stationary distribution, i.e., the current topic probability distribution, which determines the transfer direction of the M-H algorithm; q_i is the transfer (proposal) probability distribution, which is held invariant and can therefore be sampled with the Alias Method; and the acceptance term of the probability transfer matrix determines the efficiency of a single word iteration. If p_i and q_i are the same distribution, the transfer matrix is most efficient, and the acceptance ratio is 1.
The construction unit 31 is further configured to take the current probability distribution as the transfer probability distribution and build an alias table (AliasTable);
The sampling unit 33 is configured to sample each word multiple times according to the AliasTable, each sampling obtaining the sampled data of the first preset times and generating a sampling set;
To reconcile the Alias Method with the transfer-efficiency requirement on q_i, in the iterative process of each word in a document the current topic probability p_i is taken as q_i and the AliasTable is built from it; each word is then sampled multiple times, each sampling obtaining the sampled data of the first preset times (denoted n), and the data obtained by each sampling generate a sampling set.
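A compact sketch of the alias method assumed here (Vose's variant): building the table costs O(K), after which each draw costs O(1), which is what makes reusing one table for n samples pay off. The names are illustrative, not the patent's.

```python
import random

def build_alias_table(probs):
    # Vose's alias method: O(K) preprocessing of a probability vector
    # into (prob, alias) tables that support O(1) sampling.
    k = len(probs)
    scaled = [p * k / sum(probs) for p in probs]
    prob, alias = [0.0] * k, [0] * k
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:  # leftovers have weight 1 up to round-off
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    # O(1) draw: pick a column uniformly, then take it or its alias.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```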
The iteration module 40 is configured to select, in turn, the sampled data in each generated sampling set for iterative calculation.
After sampling, the sampled data obtained by each sampling are selected in turn for iterative calculation: sampled data are selected one by one from each sampling's results, each sampled datum undergoing one iterative calculation; after the sampled data obtained by the current sampling have all been iterated, further sampled data are selected for iterative calculation, until the number of iterative calculations for each word reaches the second preset times.
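The overall schedule for one word can be sketched as follows: m total iterations performed in m/n rounds, each round rebuilding the proposal from the current topic distribution, drawing n proposals from it (the sampling set), and feeding each through one accept/reject step. `topic_probs`, `draw`, and `mh_accept` are hypothetical hooks for the current distribution, the alias-table draw, and the M-H step.

```python
def iterate_word(init_topic, m, n, topic_probs, draw, mh_accept):
    # m: second preset times (total iterations per word);
    # n: first preset times (samples per round / table reuses).
    topic = init_topic
    for _ in range(m // n):
        q = topic_probs(topic)                   # distribution backing the table
        proposals = [draw(q) for _ in range(n)]  # the "sampling set"
        for j in proposals:
            topic = mh_accept(topic, j, q)       # one iterative calculation
    return topic
```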
Specifically, with reference to Figure 10, the iteration module 40 includes a judging unit 41 and an iteration unit 42.
The judging unit 41 is configured to judge whether unselected sampled data exist in the sampling set currently pending iterative calculation;
When sampled data need to be selected from the sampling set for iterative calculation, it is judged whether unselected sampled data exist in the sampling set. Specifically, a flag may be set on each selected sampled datum, so that whether a sampled datum has been selected is judged by whether it carries the flag: if it carries the flag it has been selected, and if not it has not. Alternatively, a sampled datum may be deleted from the set once it has been selected, so that if any sampled data remain in the sampling set there are unselected sampled data, and if none remain there are not. Other suitable ways of judging whether unselected sampled data exist in the sampling set may also be adopted.
The iteration unit 42 is configured to, if no unselected sampled data exist in the sampling set, generate a new sampling set in the manner described above and select the sampled data in the new sampling set in turn for iterative calculation;
If no unselected sampled data exist in the sampling set, each word in each group is sampled according to the Gibbs sampling formula to obtain the sampled data of the first preset times, a new sampling set is generated from the sampled data, and the sampled data in the new sampling set are iterated in turn.
The iteration unit 42 is further configured to, if unselected sampled data exist in the sampling set, select the unselected sampled data in the sampling set in turn for iterative calculation.
If unselected sampled data exist in the sampling set, the unselected sampled data in the sampling set continue to be selected in turn to complete the iterative calculation.
Further, the iteration unit 42 is further configured to select a sampled datum from the sampling set, its topic being j with probability Q_j.
The topic processing module 50 is configured to, if the current topic is i, draw a probability value s at random according to the Metropolis-Hastings algorithm and compare s with the acceptance probability;
The probability transfer matrix involves both an acceptance probability and a transition probability. After each word in each group has been sampled according to the Gibbs sampling formula to obtain the sampled data of the first preset times, and while the sampling set still contains unselected sampled data, a sampled datum is selected from the sampling set; its topic is j, and the probability of topic j is Q_j. If the current topic is i, a probability value s is drawn at random according to the M-H algorithm, and s is compared with the acceptance probability.
The topic processing module 50 is further configured to, if the probability value s is greater than the acceptance probability, transfer the topic from i to j, take topic j as the new current topic, and complete one iterative calculation; and, if the probability value s is not greater than the acceptance probability, keep the topic at i, take topic i as the new current topic, and complete one iterative calculation;
The iteration unit 42 is further configured to select the sampled data in the sampling set in turn to complete the iterative calculation.
If the probability value s is greater than the acceptance probability, the topic transfers from i to j: topic j becomes the new current topic, the topic has changed, and one iterative calculation is completed. If s is not greater than the acceptance probability, the topic remains i: topic i is retained as the current topic, and one iterative calculation is completed. The unselected sampled data in the sampling set are then selected in turn and iterated in the same manner.
In the above process, building the AliasTable costs O(K), where K is the predetermined number of topics, and the table can be reused for the first preset times (n) of samples. The per-sample selection and acceptance steps each cost O(1), so the n iterative calculations cost O(K) in total and the amortized complexity per word iteration is O(K/n), a theoretical n-fold performance gain. In addition, the distribution Q_i used to build the AliasTable is at most n rounds behind the current topic probability distribution; once the iteration stabilizes, q_i converges towards the current topic distribution p_j, so the iteration efficiency is very high. By improving the Gibbs sampling process in this way, the iteration efficiency is improved, and thus the efficiency of topic acquisition is improved.
The topic processing module 50 is configured to, after the number of iterative calculations for each word reaches the second preset times, generate the topic corresponding to each word from the iteration results.
After the number of iterative calculations for each word reaches the second preset times, the topics generated by the iterative calculations are collected, and the topic corresponding to each word is generated from them. For example, the topic with the highest probability among the collected topics may be selected as the topic of the word, or the topic that occurs most often among the collected topics may be selected as the topic of the word.
The topic processing module 50 is further configured to calculate and generate the topic of each subdata from the obtained topics of the words.
Specifically, the topics of the words corresponding to each subdata are obtained; according to the topic probabilities of those words, the topic with the highest probability is selected as the topic of the subdata, and the topics of the other subdata are obtained in the same way. Alternatively, the topic of each subdata may be calculated from the topics of its words in a manner configured by the user.
In this embodiment, the data to be processed are grouped so that the number of words in each group is similar; each word in each group is sampled multiple times according to the Gibbs sampling formula, each sampling obtaining the sampled data of the first preset times, and the sampled data obtained by each sampling are selected in turn for iterative calculation. After the number of iterative calculations for each word reaches the second preset times, the topic of each word is generated from the iteration results, and the topic of each subdata is then generated. This effectively avoids the existing approach of partitioning documents into blocks, in which the blocks contain different numbers of words and therefore take different amounts of time to iterate, producing a "bucket effect" that lengthens the iterative calculation and lowers the efficiency of topic acquisition. The duration of the iterative calculation is reduced, and the efficiency of topic acquisition is improved accordingly.
Further, based on the first embodiment of the above data topic acquisition apparatus, a second embodiment of the data topic acquisition apparatus of the present invention is proposed. As shown in Figure 11, the data topic acquisition apparatus further includes:
An acquisition module 60, configured to obtain identical words in the same subdata;
Identical words in the same subdata are obtained, together with the AliasTable corresponding to those identical words.
The iteration unit 42 is further configured to complete the iterative calculations of the identical words according to the same AliasTable;
The topic processing module 50 is further configured to generate, separately from the iteration results, the topic corresponding to each of the identical words.
For identical words in the same subdata, the Gibbs sampling formula is identical, so a shared AliasTable can be used to complete their iterative calculations, and the topic corresponding to each of the identical words is generated separately from the iteration results; the generated topics may be the same or different, depending on the iteration results. By using the same AliasTable for identical words in the same subdata, the performance gain is proportional to the word repetition rate within the subdata: the more repetitions, the greater the performance improvement and the higher the efficiency.
Further, the acquisition module 60 is further configured to obtain identical words in different subdata and obtain the factors that are identical in the Gibbs sampling formulas;
The construction unit 31 is further configured to build the Gibbs sampling formulas of the identical words in the different subdata from the identical factors;
The iteration unit 42 is further configured to complete the iterative calculations of those words according to the built Gibbs sampling formulas;
The topic processing module 50 is further configured to generate, from the iteration results, the topic of each identical word in its corresponding subdata.
For identical words in different subdata, the Gibbs sampling formulas adopted are different, but some factors of the formulas are the same, and an AliasTable can be built from this shared part. The performance gain is related to the repetition rate of words across the set of subdata, that is, to the number of times a word repeats: the more repetitions, the greater the performance improvement and the higher the efficiency.
It should be noted that, herein, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. In the absence of further limitation, an element defined by the statement "including a ..." does not exclude the existence of other identical elements in the process, method, article, or apparatus that includes that element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the preferable implementation. Based on such an understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

Claims (12)

1. A data topic acquisition method, characterised in that it includes the steps of:
receiving input data to be processed, the data to be processed including multiple subdata, each subdata including multiple words;
grouping the data to be processed to generate multiple groups, the numbers of words in the grouped groups being similar;
sampling each word in each group multiple times according to a Gibbs sampling formula, each sampling obtaining sampled data of a first preset times, and selecting, in turn, the sampled data obtained by each sampling for iterative calculation;
after the number of iterative calculations for each word reaches a second preset times, generating, from the iteration results, the topic corresponding to each word; and
calculating and generating, from the obtained topics of the words, the topic of each subdata.
2. The data topic acquisition method of claim 1, characterised in that the step of sampling each word in each group multiple times according to the Gibbs sampling formula, each sampling obtaining the sampled data of the first preset times, and selecting in turn the sampled data obtained by each sampling for iterative calculation includes:
building a probability transfer matrix using the Metropolis-Hastings algorithm, and determining the current probability distribution of the probability transfer matrix;
taking the current probability distribution as the transfer probability distribution, building an alias table (AliasTable), and sampling each word multiple times according to the AliasTable, each sampling obtaining the sampled data of the first preset times and generating a sampling set; and
selecting, in turn, the sampled data in each generated sampling set for iterative calculation.
3. The data topic acquisition method of claim 2, characterised in that the step of selecting in turn the sampled data in each generated sampling set for iterative calculation includes:
judging whether unselected sampled data exist in the sampling set currently pending iterative calculation;
if no unselected sampled data exist in the sampling set, generating a new sampling set in the manner described above, and selecting in turn the sampled data in the new sampling set for iterative calculation; and
if unselected sampled data exist in the sampling set, selecting in turn the unselected sampled data in the sampling set for iterative calculation.
4. The data topic acquisition method of claim 2, characterised in that the step of selecting in turn the sampled data obtained by each sampling for iterative calculation includes:
selecting a sampled datum from the sampling set, its topic being j with probability Q_j;
if the current topic is i, drawing a probability value s at random according to the Metropolis-Hastings algorithm, and comparing s with an acceptance probability;
if the probability value s is greater than the acceptance probability, transferring the topic from i to j, taking topic j as the new current topic, and completing one iterative calculation;
if the probability value s is not greater than the acceptance probability, keeping the topic at i, taking topic i as the new current topic, and completing one iterative calculation; and
selecting in turn the sampled data in the sampling set to complete the iterative calculation.
5. The data topic acquisition method of claim 2, characterised in that, after the step of calculating and generating, from the obtained topics of the words, the topic of each subdata, the method further includes:
obtaining identical words in the same subdata; and
completing the iterative calculations of the identical words according to the same AliasTable, and generating, separately from the iteration results, the topic corresponding to each of the identical words.
6. The data topic acquisition method of any one of claims 2 to 5, characterised in that, after the step of calculating and generating, from the obtained topics of the words, the topic of each subdata, the method further includes:
obtaining identical words in different subdata, and obtaining the factors that are identical in the Gibbs sampling formulas; and
building the Gibbs sampling formulas of the identical words in the different subdata from the identical factors, completing the iterative calculations of those words according to the built Gibbs sampling formulas, and generating, from the iteration results, the topic of each identical word in its corresponding subdata.
7. A data topic acquisition apparatus, characterised in that it includes:
a sending and receiving module, configured to receive input data to be processed, the data to be processed including multiple subdata, each subdata including multiple words;
a grouping module, configured to group the data to be processed to generate multiple groups, the numbers of words in the grouped groups being similar;
a sampling module, configured to sample each word in each group according to the Gibbs sampling formula, each sampling obtaining the sampled data of the first preset times;
an iteration module, configured to select, in turn, the sampled data obtained by each sampling for iterative calculation; and
a topic processing module, configured to, after the number of iterative calculations for each word reaches the second preset times, generate, from the iteration results, the topic corresponding to each word, and further configured to calculate and generate, from the obtained topics of the words, the topic of each subdata.
8. The data topic acquisition apparatus of claim 7, characterised in that the sampling module includes a construction unit, a determination unit, and a sampling unit,
the construction unit being configured to build a probability transfer matrix using the Metropolis-Hastings algorithm, and the determination unit being configured to determine the current probability distribution of the probability transfer matrix;
the construction unit being further configured to take the current probability distribution as the transfer probability distribution and build an alias table (AliasTable);
the sampling unit being configured to sample each word multiple times according to the AliasTable, each sampling obtaining the sampled data of the first preset times and generating a sampling set; and
the iteration module being further configured to select, in turn, the sampled data in each generated sampling set for iterative calculation.
9. The data topic acquisition apparatus of claim 8, characterised in that the iteration module includes a judging unit and an iteration unit,
the judging unit being configured to judge whether unselected sampled data exist in the sampling set currently pending iterative calculation;
the iteration unit being configured to, if no unselected sampled data exist in the sampling set, generate a new sampling set in the manner described above and select in turn the sampled data in the new sampling set for iterative calculation; and
the iteration unit being further configured to, if unselected sampled data exist in the sampling set, select in turn the unselected sampled data in the sampling set for iterative calculation.
10. The data topic acquisition apparatus of claim 9, characterised in that the iteration unit is further configured to select a sampled datum from the sampling set, its topic being j with probability Q_j, and, if the current topic is i, to draw a probability value s at random according to the Metropolis-Hastings algorithm and compare s with an acceptance probability;
the topic processing module being further configured to, if the probability value s is greater than the acceptance probability, transfer the topic from i to j, take topic j as the new current topic, and complete one iterative calculation, and, if the probability value s is not greater than the acceptance probability, keep the topic at i, take topic i as the new current topic, and complete one iterative calculation; and
the iteration unit being further configured to select in turn the sampled data in the sampling set to complete the iterative calculation.
11. The data topic acquisition apparatus of claim 9 or 10, characterised in that it further includes:
an acquisition module, configured to obtain identical words in the same subdata;
the iteration unit being further configured to complete the iterative calculations of the identical words according to the same AliasTable; and
the topic processing module being further configured to generate, separately from the iteration results, the topic corresponding to each of the identical words.
12. The data topic acquisition apparatus of claim 11, characterised in that the acquisition module is further configured to obtain identical words in different subdata and obtain the factors that are identical in the Gibbs sampling formulas;
the construction unit being further configured to build the Gibbs sampling formulas of the identical words in the different subdata from the identical factors;
the iteration unit being further configured to complete the iterative calculations of those words according to the built Gibbs sampling formulas; and
the topic processing module being further configured to generate, from the iteration results, the topic of each identical word in its corresponding subdata.
CN201410812266.6A 2014-12-23 2014-12-23 Data topic acquisition method and apparatus Active CN105786791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410812266.6A CN105786791B (en) 2014-12-23 2014-12-23 Data topic acquisition method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410812266.6A CN105786791B (en) 2014-12-23 2014-12-23 Data topic acquisition method and apparatus

Publications (2)

Publication Number Publication Date
CN105786791A true CN105786791A (en) 2016-07-20
CN105786791B CN105786791B (en) 2019-07-05

Family

ID=56378012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410812266.6A Active CN105786791B (en) 2014-12-23 2014-12-23 Data subject acquisition methods and device

Country Status (1)

Country Link
CN (1) CN105786791B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090042A (en) * 2016-11-23 2018-05-29 北京京东尚科信息技术有限公司 For identifying the method and apparatus of text subject
CN105786791B (en) * 2014-12-23 2019-07-05 深圳市腾讯计算机系统有限公司 Data subject acquisition methods and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102687166A (en) * 2009-12-31 2012-09-19 诺基亚公司 Methods and apparatuses for user interest modeling
US20130159254A1 (en) * 2011-12-14 2013-06-20 Yahoo! Inc. System and methods for providing content via the internet
WO2013138859A1 (en) * 2012-03-23 2013-09-26 Bae Systems Australia Limited System and method for identifying and visualising topics and themes in collections of documents
CN103365974A (en) * 2013-06-28 2013-10-23 百度在线网络技术(北京)有限公司 Semantic disambiguation method and system based on related words topic
CN103942340A (en) * 2014-05-09 2014-07-23 电子科技大学 Microblog user interest recognizing method based on text mining

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786791B (en) * 2014-12-23 2019-07-05 深圳市腾讯计算机系统有限公司 Data subject acquisition methods and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AARON LI et al.: "Reducing the Sampling Complexity of Topic Models", 20TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD) *
LEI Juyang: "Structure Learning of Dynamic Systems in Complex Environments", China Doctoral Dissertations Full-text Database, Engineering Science and Technology II *


Also Published As

Publication number Publication date
CN105786791B (en) 2019-07-05


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant