CN109857942A

CN109857942A - Method, apparatus, device and storage medium for processing documents

Info

Publication number: CN109857942A
Application number: CN201910194822.0A
Authority: CN
Inventors: 李健
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-03-14
Filing date: 2019-03-14
Publication date: 2019-06-07

Abstract

The present disclosure relates to a method, apparatus, device, and storage medium for processing documents. According to an example implementation, a document processing method is provided. In the method, a group of words included in a group of documents is determined. A group of pseudo-documents is generated based on the group of documents, where a pseudo-document in the group describes the association between one word in the group of words and the other words in the group. Based on a keyword that specifies a target aspect of the group of documents and on the group of pseudo-documents, a probability distribution of the association between each word in the group of words and the keyword is determined. Based on the probability distribution, at least one topic that is involved in the group of documents and associated with the target aspect is determined. With this implementation, the at least one topic under the target aspect specified by the keyword can be determined more accurately.

Description

Method, apparatus, device and storage medium for processing documents

Technical field

Implementations of the present disclosure relate generally to document processing, and more particularly to a method, apparatus, device, and computer storage medium for determining the topics of a group of documents under a specified aspect.

Background art

With the development of computer technology, more and more types of documents have emerged. In particular, as social networks and e-commerce networks have become part of everyday life, people can edit documents and publish their own comments via these network platforms. Faced with massive numbers of documents from the network or other media, how to mine the topics involved in these documents more accurately has become a technical problem.

Summary of the invention

According to example implementations of the present disclosure, a scheme for document processing is provided.

In a first aspect of the present disclosure, a document processing method is provided. In the method, a group of words included in a group of documents is obtained. A group of pseudo-documents is generated based on the group of documents, where a pseudo-document in the group of pseudo-documents describes the association between one word in the group of words and the other words in the group of words. Based on a keyword specifying a target aspect of the group of documents and on the group of pseudo-documents, a probability distribution of the association between each word in the group of words and the keyword is determined. Based on the probability distribution, at least one topic that is involved in the group of documents and associated with the target aspect is determined.

In a second aspect of the present disclosure, a document processing apparatus is provided. The apparatus includes: an obtaining module configured to obtain a group of words included in a group of documents; a generation module configured to generate a group of pseudo-documents based on the group of documents, where a pseudo-document in the group of pseudo-documents describes the association between one word in the group of words and the other words in the group of words; a determining module configured to determine, based on a keyword specifying a target aspect of the group of documents and on the group of pseudo-documents, a probability distribution of the association between each word in the group of words and the keyword; and a topic module configured to determine, based on the probability distribution, at least one topic that is involved in the group of documents and associated with the target aspect.

In a third aspect of the present disclosure, a device is provided. The device includes one or more processors and a storage apparatus for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to the first aspect of the present disclosure.

In a fourth aspect of the present disclosure, a computer-readable medium storing a computer program is provided. The program, when executed by a processor, implements the method according to the first aspect of the present disclosure.

It should be understood that the content described in this Summary is not intended to identify key or essential features of implementations of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the description below.

Brief description of the drawings

The above and other features, advantages, and aspects of the implementations of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, in which:

Fig. 1 schematically shows a diagram of the relationship between documents, words, and topics;

Fig. 2 schematically shows a diagram of the relationship between a particular document, the words included in the particular document, and topics;

Fig. 3 schematically shows a block diagram of a technical solution for document processing according to an example implementation of the present disclosure;

Fig. 4 schematically shows a flowchart of a method for document processing according to an example implementation of the present disclosure;

Fig. 5A and Fig. 5B each schematically show a block diagram of determining the co-occurrence of words based on a sliding window according to an example implementation of the present disclosure;

Fig. 6 schematically shows a block diagram of the format of a pseudo-document according to an example implementation of the present disclosure;

Fig. 7 schematically shows a block diagram of determining, based on a probability distribution model, the probability distribution of the association between the words included in a group of pseudo-documents and a keyword according to an example implementation of the present disclosure;

Fig. 8 schematically shows a block diagram of the parameters in the probability distribution model according to an example implementation of the present disclosure;

Fig. 9 schematically shows a block diagram of a document processing apparatus according to an example implementation of the present disclosure; and

Fig. 10 shows a block diagram of a computing device capable of implementing multiple implementations of the present disclosure.

Detailed description

Implementations of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain implementations of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the implementations set forth here; rather, these implementations are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and implementations of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.

In the description of implementations of the present disclosure, the term "includes" and similar terms should be understood as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one implementation" or "the implementation" should be understood as "at least one implementation". The terms "first", "second", and so on may refer to different or identical objects. Other explicit and implicit definitions may also be included below.

Various schemes for determining the topics of a group of documents have already been proposed. For example, the concept of a topic model has been proposed, and the topics involved in a group of documents can be determined based on a topic model. However, a topic model performs a complete analysis of the full text of the documents across all aspects in order to mine all topics. The relationship between documents, words, and topics is first described with reference to Fig. 1.

Fig. 1 schematically shows a diagram 100 of the relationship between documents, words, and topics. Fig. 1 shows a group of documents 110, which may be, for example, articles from the network or other media, comments posted by individual users in a forum, and so on. Each document in the group of documents 110 may include a different number of words 130, 132, ..., and 134. Here, a topic refers to a topic of the semantic structure represented by the words in a document, that is, the subject discussed by the words included in the document. As shown in Fig. 1, the group of documents 110 may involve multiple topics, for example topic 120, ..., and topic 122. Further, topic 120, ..., and topic 122 may each involve different words. For example, topic 120 may involve words 130, 132, and 134, while topic 122 may involve words 130 and 132.

Hereinafter, more details of documents, topics, and words are described with reference to Fig. 2. Fig. 2 schematically shows a diagram 200 of the relationship between a particular document, the words included in the particular document, and topics. Fig. 2 shows a particular document 210, which includes the text: "The two camera manufacturers Nikon (Nikon) and Canon (Canon) compete with each other in the market, and the products of the two manufacturers each have their own advantages. For example, for the camera screen, the clarity ...".

The concept of a topic model has already been proposed. By analyzing document 210 based on a topic model, the multiple topics 220, 222, and 224 involved in document 210 can be obtained. For example, document 210 may involve the following three topics: topic 220 "Nikon", topic 222 "Canon", and topic 224 "screen". Further, the words associated with each topic can also be determined based on the topic model. For example, topic 220 involves words such as "Nikon" (in its Chinese and English forms), topic 222 involves words such as "Canon" (in its Chinese and English forms), and topic 224 involves words such as "screen" and "clarity".

However, a topic model performs a complete analysis of all words of the documents across all aspects in order to mine all topics. If the topics under a given target aspect are desired, the topics relevant to the target aspect must then be searched for among all mined topics. The topics obtained from a topic model are therefore coarse and cannot describe the target aspect in detail. For example, to analyze further topics under the target aspect "screen" in a group of documents about cameras, all aspects involved in the group of documents must first be obtained and then filtered by "screen". How to process documents at a finer granularity, so as to obtain one or more topics under a specified target aspect, has thus become a problem to be solved.

To at least partially address the deficiencies of the above technical solutions, according to example implementations of the present disclosure, a technical solution is provided that processes documents to determine the topics under a target aspect involved in the documents. It will be understood that, unlike conventional technical solutions that determine document topics based on a topic model, a topic here refers to a topic under a predetermined target aspect. Here, a topic refers to a topic of the semantic structure represented by the words in the documents, that is, a multinomial probability distribution over words: under a given topic, the words with high probability semantically express the meaning of that topic.

Hereinafter, example implementations of the present disclosure are described in general terms with reference to Fig. 3. Fig. 3 schematically shows a block diagram 300 of a technical solution for document processing according to an example implementation of the present disclosure. As shown in Fig. 3, a group of words 310 included in the group of documents 110 is first determined. It will be understood that the group of words 310 here consists of all words included in all documents in the group of documents 110. This example implementation introduces the concept of a pseudo-document: a group of pseudo-documents 320 can be generated based on the group of documents 110. Each pseudo-document in the group of pseudo-documents 320 describes the association between one word in the group of words 310 and the other words in the group of words 310.

Further, based on the group of pseudo-documents 320 and a keyword 330 of a specified target aspect, a probability distribution 340 of the association between each word in the group of pseudo-documents 320 and the keyword 330 can be determined. Here, the keyword 330 is a keyword that specifies the target aspect; that is, the keyword 330 specifies the aspect to which the topics to be determined in the group of documents 110 belong. For example, if the group of documents 110 discusses camera-related content and the topics related to the camera's "screen" are to be determined, the keyword may be "screen". As another example, if the topics related to the camera's "weight" in the group of documents 110 are to be determined, the keyword may be "weight".

Then, based on the probability distribution 340, at least one topic 350 that is involved in the group of documents 110 and associated with the target aspect can be determined. Specifically, assuming the keyword 330 of the target aspect is "screen", one or more topics under the target aspect can be determined from the group of documents 110. For example, the following topics under the "screen" aspect may be determined: picture, menu, and imaging.

At least one topic that is involved in the group of documents 110 and associated with the target aspect is thus determined based on the probability distribution 340. For the target aspect "screen", for example, the topics may cover multiple sub-aspects such as the "picture" shown on the screen, the "menu" shown on the screen, and the "imaging" of the screen. In this way, the topics under the target aspect in the group of documents can be determined more accurately.

Hereinafter, more details of document processing are described with reference to Fig. 4.

Fig. 4 schematically shows a flowchart of a method 400 for document processing according to an example implementation of the present disclosure. At block 410, a group of words 310 included in the group of documents 110 is determined. The group of documents 110 (for example, N documents) is the group of documents to be analyzed. Each document in the group of documents 110 may include a different number of words, and the group of words 310 is the totality of the words in all documents. Assuming the N documents in the group of documents 110 include N1, N2, ..., and Nn words respectively, the group of words 310 then includes M words, where M = N1+N2+…+Nn.

According to example implementations of the present disclosure, text processing may be performed on each document in the group of documents 110 to extract semantically meaningful words from the group of documents 110 as the group of words 310. It will be understood that the text processing here may involve filtering out redundant words, words without actual meaning, and other unnecessary components from the documents, and then extracting the semantically meaningful words as the words in the group of words 310. This ensures that the basis on which document processing is performed truly reflects the document content and carries actual semantic meaning.
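The text processing described above could be sketched as follows. This is only an illustration under stated assumptions: the disclosure does not prescribe a tokenizer or a filter list, so the regular-expression tokenizer, the small `STOP_WORDS` set, and the function name `extract_words` are all hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller,
# language-specific list (this one is an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "for", "to"}


def extract_words(document: str) -> list[str]:
    """Return the semantically meaningful words of one document, in order."""
    # Lowercase, then keep runs of letters, digits, and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    # Drop words that carry no actual meaning for topic mining.
    return [t for t in tokens if t not in STOP_WORDS]


documents = [
    "The screen of the camera is sharp.",
    "Color and clarity of the picture are good.",
]
words_per_document = [extract_words(d) for d in documents]
```

The per-document word lists produced here play the role of the input to the later co-occurrence counting.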

At block 420, a group of pseudo-documents 320 is generated based on the group of documents 110. It will be understood that a pseudo-document in the group of pseudo-documents 320 describes the association between one word in the group of words 310 and the other words in the group of words. The number of pseudo-documents in the group of pseudo-documents 320 equals the number of words in the group of words 310. In other words, each word corresponds to one pseudo-document, so a total of M pseudo-documents can be generated from a group of words 310 that includes M words.

According to example implementations of the present disclosure, a corresponding pseudo-document may be generated one by one for each word in the group of documents 110. For example, a corresponding first pseudo-document may be generated for a first word in the group of words 310. Specifically, a co-occurrence frequency between the first word and multiple other words in the group of words 310 may be determined based on the co-occurrences of the first word with those other words. The pseudo-document associated with the first word in the group of pseudo-documents 320 may then be generated based on the co-occurrence frequency.

For example, assume the word "picture" is the first word in the group of words 310 and the other words include "color", ..., "camera lens", and so on. It can then be determined whether "picture" co-occurs with the other words "color", ..., "camera lens". If it does, the co-occurrence frequency of "picture" and the other word is increased. For example, if "picture" and "color" co-occur twice, the co-occurrence frequency is set to 2. If the words do not co-occur, the co-occurrence frequency is set to 0. The co-occurrence frequencies can be stored using the data structure shown in Table 1 below.

Table 1: Co-occurrence frequency of words

	Picture	Color	…	Camera lens
Picture	0	2	…	1
Color	2	0	…	1
…	…	…	0	…
Camera lens	1	1	…	0

Table 1 includes M+1 rows (numbered row 0 to row M), where rows 1 to M each represent one of the M words. Table 1 likewise includes M+1 columns (numbered column 0 to column M), where columns 1 to M each represent one of the M words. As shown in Table 1, the value at the intersection of row i and column j indicates the frequency with which the i-th word and the j-th word of the M words co-occur. For example, the value 2 at the intersection of "picture" and "color" indicates that the words "picture" and "color" co-occur with a frequency of 2.

By performing the process described above for each of the M words, the frequency with which any two of the M words co-occur can be obtained, yielding the co-occurrence frequencies shown in Table 1. It will be understood that Table 1 above is only one illustrative example of storing the co-occurrence frequencies. According to example implementations of the present disclosure, other data structures may also be used to store them, for example a matrix or another form of storage.
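One possible alternative to the table layout is a nested mapping keyed by word. The sketch below is an assumption for illustration (the names `cooc` and `add_cooccurrence` do not come from the disclosure); it stores the frequencies symmetrically and fills in the values visible in Table 1.

```python
from collections import defaultdict

# cooc[w1][w2] holds how often w1 and w2 co-occur; unseen pairs default to 0.
cooc: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))


def add_cooccurrence(w1: str, w2: str, count: int = 1) -> None:
    """Record `count` co-occurrences of w1 and w2 in both directions."""
    if w1 == w2:
        return  # the diagonal of Table 1 stays 0
    cooc[w1][w2] += count
    cooc[w2][w1] += count


# Values taken from the cells shown in Table 1.
add_cooccurrence("picture", "color", 2)
add_cooccurrence("picture", "camera lens")
add_cooccurrence("color", "camera lens")
```

A sparse mapping like this avoids allocating an M by M table when most word pairs never co-occur.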

It will be understood that the meaning of "co-occurrence" can be defined by different rules. For example, one rule may specify that two words co-occur if they appear in the same paragraph. As another example, a rule may specify that two words co-occur if they appear in the same sentence. According to example implementations of the present disclosure, other rules may also be specified to define co-occurrence; for example, whether two words co-occur may be determined from the distance between them.
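As an illustration of the sentence-level rule, the sketch below counts each distinct pair of words once per sentence in which both appear. The naive sentence splitter, the letters-only tokenizer, and the name `sentence_cooccurrences` are assumptions made for this example only.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrences(text: str) -> Counter:
    """Count unordered word pairs that appear in the same sentence."""
    counts: Counter = Counter()
    # Naive sentence split on '.', '!' and '?' (an assumption for illustration).
    for sentence in re.split(r"[.!?]+", text):
        # Sorting gives each pair one canonical (alphabetical) tuple form.
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        # Each distinct pair counts once per sentence containing both words.
        counts.update(combinations(words, 2))
    return counts


pairs = sentence_cooccurrences("Screen clarity matters. Screen color matters too.")
```

A paragraph-level rule would follow the same pattern with a paragraph splitter in place of the sentence splitter.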

According to example implementations of the present disclosure, co-occurrence may mean that the distance between two words is less than a predetermined distance. The distance here may be the number of words between the two words. Alternatively, the distance may be determined from the difference between the positions of the two words. According to example implementations of the present disclosure, the length of a sliding window may be set according to the predetermined distance, and the co-occurrence frequency determined based on the sliding window. Each document in the group of documents 110 may be scanned with a sliding window of predetermined length. For example, the predetermined length may be set to 10 or another value, and the N documents scanned one by one with the sliding window. It should be understood that the predetermined length "10" here is the number of words included in the sliding window. Although individual words may contain different numbers of characters, the sliding window slides in units of words. For example, the sliding step may be set to one or more words.

Hereinafter, more details of the sliding window are described with reference to Fig. 5A and Fig. 5B. The sliding window may first be placed at the start of document 210 and slid toward the end of document 210. If two words are determined to co-occur within the current extent of the sliding window, the co-occurrence frequency of those two words is increased. Fig. 5A schematically shows a block diagram 500A of determining the co-occurrence of words based on a sliding window according to an example implementation of the present disclosure. Fig. 5A shows the sliding window 510A located in the middle of document 210 after several slides. In this example, the words "screen" and "clarity" are both within the sliding window 510A, so the co-occurrence frequency of "screen" and "clarity" is increased by 1. With a sliding window, the co-occurrence frequency of each pair of words can be determined simply and efficiently.

After the words in the sliding window 510A have been processed, the sliding window 510A may be moved by a predetermined step (for example, by the position of one word). For example, the sliding window 510A may be advanced by one word to reach the position shown in Fig. 5B. Fig. 5B schematically shows a block diagram 500B of determining the co-occurrence of words based on a sliding window according to an example implementation of the present disclosure. In Fig. 5B, the sliding window 510B still includes the words "screen" and "clarity", so the co-occurrence frequency of the two words may be increased by 1 again. The sliding window 510B may then be advanced further, and the co-occurrence frequencies of other words determined in a similar manner. After all N documents have been scanned, the co-occurrence frequency of any two of the M words (as shown in Table 1) can be generated.
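The sliding-window scan described above could be sketched as follows. The function name `sliding_window_counts` and its parameters are hypothetical; the sketch assumes that every distinct pair inside the window is incremented by 1 at each window position, so a pair that remains in the window after a slide is counted again, as in the Fig. 5A and Fig. 5B example.

```python
from collections import Counter
from itertools import combinations


def sliding_window_counts(words: list[str], window: int = 10, step: int = 1) -> Counter:
    """Count co-occurring word pairs using a window measured in words."""
    counts: Counter = Counter()
    # Last start index at which a full window still fits (0 for short documents).
    last_start = max(len(words) - window, 0)
    for start in range(0, last_start + 1, step):
        # Sorting gives each pair one canonical (alphabetical) tuple form.
        in_window = sorted(set(words[start:start + window]))
        # Increment every distinct pair inside the current window by 1.
        counts.update(combinations(in_window, 2))
    return counts


doc = ["screen", "clarity", "color", "picture"]
counts = sliding_window_counts(doc, window=3)
```

With a window of 3 words, "clarity" and "color" fall inside both window positions and are counted twice, while "picture" and "screen" never share a window. With a step larger than 1, trailing words may not be covered by a final window; handling that case is left out of this sketch.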

Based on the co-occurrence frequencies shown in Table 1, a corresponding pseudo-document can be generated for each word in the group of words 310, thereby generating the group of pseudo-documents 320. The format of a pseudo-document is first described with reference to Fig. 6, which schematically shows a block diagram 600 of the format of a pseudo-document according to an example implementation of the present disclosure. As shown in Fig. 6, a pseudo-document may include two parts: a document head 610 and a document body 620. The document head 610 indicates the word for which the pseudo-document is generated, and the document body 620 includes the other words in the group of words 310 that co-occur with the word in the document head 610. In this way, the document head 610 indicates in a simple manner which word the pseudo-document is generated for.
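The two-part format could be represented, for example, by a small record type. The class name `PseudoDocument` and its field names are assumptions chosen to mirror the document head 610 and document body 620; the disclosure does not specify a concrete data type.

```python
from dataclasses import dataclass, field


@dataclass
class PseudoDocument:
    # Document head: the word this pseudo-document is generated for.
    head: str
    # Document body: the other words that co-occur with `head`.
    body: list[str] = field(default_factory=list)


lens_doc = PseudoDocument(head="camera lens", body=["picture", "color"])
empty_doc = PseudoDocument(head="weight")  # a word with no co-occurrences yet
```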

According to example implementations of the present disclosure, the word used as the basis of comparison may be added to the document head 610 of the pseudo-document, and the words that co-occur with that word may be added to the document body 620 of the pseudo-document. Taking the last row of Table 1 above, "camera lens", as an example, the word "camera lens" is the basis of comparison and is therefore added to the document head 610. The words "picture" and "color" co-occur with "camera lens" and are therefore added to the document body 620. In this way, the pseudo-document schematically shown in Table 2 below is obtained.

Table 2: Pseudo-document for the word "camera lens"

Document head	Document body
Camera lens	Picture, color ...

According to example implementations of the present disclosure, the frequency with which words co-occur needs to be considered when adding words to the document body 620: words are added to the document body 620 according to their co-occurrence frequency. As shown in the first data row of Table 1, "color" and "picture" co-occur 2 times, while "camera lens" and "picture" co-occur 1 time. Accordingly, "color" should be added to the document body 620 twice and "camera lens" once. The pseudo-document for the word "picture" is therefore as shown in Table 3 below.

Table 3: Pseudo-document for the word "picture"

Document head	Document body
Picture	Color, color ..., camera lens
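Building a document body from co-occurrence frequencies could be sketched as below. The function name `build_pseudo_document` and the nested-dictionary input are hypothetical; only the frequencies shown in Table 1 are used, and the elided entries are left out.

```python
def build_pseudo_document(word: str, cooc: dict[str, dict[str, int]]) -> tuple[str, list[str]]:
    """Return (head, body), repeating each co-occurring word by its frequency."""
    body: list[str] = []
    for other, freq in cooc.get(word, {}).items():
        body.extend([other] * freq)  # a frequency of 0 adds nothing
    return word, body


# Frequencies taken from the cells shown in Table 1.
cooc = {
    "picture": {"color": 2, "camera lens": 1},
    "color": {"picture": 2, "camera lens": 1},
    "camera lens": {"picture": 1, "color": 1},
}
head, body = build_pseudo_document("picture", cooc)
```

For the word "picture" this yields a body containing "color" twice and "camera lens" once, matching Table 3.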

According to example implementations of the present disclosure, the words included in the document body 620 are unordered. In other words, the document body is a collection of words in which the order of the words does not matter. The pseudo-document generated for the word "camera lens", shown in Table 2, can therefore equally be represented as shown in Table 4 below.

Table 4: Pseudo-document for the word "camera lens"

Document head	Document body
Camera lens	Color, picture ...

In this implementation, only whether words co-occur needs to be considered, not the order in which they appear. In addition, the document body 620 may contain the same word multiple times, indicating that the word co-occurs with the word in the document head 610 multiple times. In this way, the words related to the target aspect in the group of documents 110 can be determined more efficiently.
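An unordered body that may still contain repeated words behaves like a multiset. As one possible representation (an assumption, not something the disclosure specifies), Python's `collections.Counter` compares equal regardless of insertion order while keeping the repeat counts.

```python
from collections import Counter

body_a = Counter(["picture", "color"])  # body as listed in Table 2
body_b = Counter(["color", "picture"])  # same body as listed in Table 4
same_document = body_a == body_b        # order of the words is irrelevant

# Table 3: "color" appears twice because it co-occurs with "picture" twice.
picture_body = Counter(["color", "color", "camera lens"])
```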

Returning to Fig. 4, at block 430, a probability distribution 340 of the association between each word in the group of words 310 and the keyword 330 is determined based on the keyword 330 specifying the target aspect of the group of documents 110 and on the group of pseudo-documents 320. Specifically, according to example implementations of the present disclosure, a probability distribution model is proposed that describes the probability distribution of words under the target aspect specified by the keyword 330. The probability distribution model includes multiple parameters that affect the association between words and the target aspect specified by the keyword 330. By adjusting the values of these parameters, the process of determining topics can be controlled more flexibly. More details of the probability distribution model are described below with reference to Fig. 7.

Fig. 7 schematically shows a block diagram of the parameters in a probability distribution model 700 according to an example implementation of the present disclosure. The specific meaning of each parameter in the probability distribution model 700 is described below with reference to Fig. 7. As shown in Fig. 7, the probability distribution model 700 may include multiple parameters. Parameter N denotes the number of documents in the group of documents 110. Parameter K denotes how many topics are to be obtained under the aspect specified by the keyword 330. For example, if the keyword is "screen" and K is set to 3, then 3 topics under the "screen" aspect can be obtained using the probability distribution model 700.

As shown in Fig. 7, for the N documents, each document d is associated with a Bernoulli distribution π_d. This distribution is generated by its conjugate prior, a Beta distribution with parameter γ, and indicates the degree of relevance between the document and the target aspect. In addition, there are N multinomial distributions Θ_d, which follow a Dirichlet distribution with parameter α; each Θ_d denotes the multinomial distribution of document d over the target aspect.

An indicator variable r may be used to indicate whether an input word is related to the target aspect. When r=1, the word is related to the target aspect and is generated by the multinomial distribution associated with the keyword of the target aspect. When r=0, the word is unrelated to the target aspect. It will be understood that, because the purpose of the present disclosure is to obtain the topics under the target aspect specified by the keyword 330, the words in each document that are unrelated to the target aspect can be generated by that document's multinomial distribution over words. In addition, a relevance prior variable x is introduced: x=1 indicates that document d includes a word from the relevant keyword set S, in which case document d is considered fully relevant to the target aspect.
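To make the roles of π_d, Θ_d, r, and x more concrete, the following sketch samples them for a single document. It is an illustration under several assumptions that the text does not settle: the Beta prior is taken to be symmetric, Beta(γ, γ); a document with x=1 is given π_d = 1; and the function names are hypothetical. It does not perform the model training described in the disclosure.

```python
import random


def sample_document_variables(k: int, alpha: float, gamma: float,
                              has_seed_keyword: bool, rng: random.Random):
    """Sample pi_d and Theta_d for one document (illustrative only)."""
    # x = 1: the document contains a word from the keyword set S and is treated
    # as fully relevant (assumed here to mean pi_d = 1). Otherwise pi_d is drawn
    # from an assumed symmetric Beta(gamma, gamma) prior.
    pi_d = 1.0 if has_seed_keyword else rng.betavariate(gamma, gamma)
    # Theta_d ~ Dirichlet(alpha) over K topics, via normalised Gamma draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    theta_d = [g / total for g in draws]
    return pi_d, theta_d


def sample_relevance(pi_d: float, rng: random.Random) -> int:
    """Sample the indicator r: 1 if a word is tied to the target aspect."""
    return 1 if rng.random() < pi_d else 0


rng = random.Random(0)
pi_d, theta_d = sample_document_variables(k=3, alpha=1.0, gamma=1.0,
                                          has_seed_keyword=False, rng=rng)
r = sample_relevance(pi_d, rng)
pi_seed, _ = sample_document_variables(k=3, alpha=1.0, gamma=1.0,
                                       has_seed_keyword=True, rng=rng)
```

In a full model, a word with r=1 would then be drawn from an aspect-specific word distribution and a word with r=0 from the document's own word distribution; those distributions are not modelled in this sketch.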

By configuring the values of the multiple parameters and training the probability distribution model with the words in the group of pseudo-documents 320 and the keyword 330, the probability distribution of the association between each word in the group of words 310 and the keyword 330 can be obtained. Fig. 8 schematically shows a block diagram 800 of determining the probability distribution of the association between words and the keyword 330 based on the probability distribution model 700 according to an example implementation of the present disclosure. Using the operations shown in Fig. 8, the probability distribution of the group of words 310 in the group of documents 110 under the target aspect specified by the keyword 330 can be obtained.

Specifically, the group of pseudo-documents 320 and the keyword 330 can be used as input to train the probability distribution model 700 shown in Fig. 7, so as to obtain the corresponding probability distribution 340. Continuing the example above, assuming the keyword 330 is "screen", the probability distribution of each of the M words under the "screen" aspect can be obtained based on the probability distribution model 700. In other words, each of the M words has a corresponding probability that indicates how likely the word is to be associated with the "screen" aspect.

According to example implementations of the present disclosure, the number of topics to be determined can also be specified in advance; the parameter K above denotes the number of topics. If 3 topics under the "screen" aspect are desired, K may be set to 3; if 4 topics are desired, K may be set to 4. Then, following the method described above, the probability distribution of words under each topic of the aspect specified by the keyword 330 can be obtained. Table 5 below schematically shows an example of the probability distribution of words in one topic:

Table 5: Example of the probability distribution of words in one topic

Serial number	Word	Probability
1	Picture	0.002
2	Color	0.001
…	…	…
M	Camera lens	0.0005

As shown in Table 5, the first column gives the serial number of each of the M words, the second column gives the word itself, and the third column gives the probability of the word in the topic. Although Table 5 only schematically shows the probability distribution of the words in one topic, when K=3, three probability distributions can be obtained, one per topic, each in a format similar to Table 5. It will be understood that the specific probability values in the third column differ from topic to topic.

Returning to Fig. 4, at block 440, at least one topic 350 that is involved in the group of documents 110 and associated with the target aspect is determined based on probability distribution 340. Specifically, a topic can be determined from the probability distribution of the words under that topic. According to an example implementation of the present disclosure, the words can be sorted based on probability distribution 340, and a topic among the at least one topic is then determined based on the sorted words.

For the probability distribution under one topic as shown in Table 5, the words can be sorted in descending order of the probability values in the third column, yielding the sorted probability distribution shown in Table 6.

Table 6: Probability distribution after sorting

Ranking	Word	Probability
1	Picture	0.002
2	Color	0.001
3	Camera lens	0.0005
…	…	…

As shown in Table 6, the first column gives the rank of each word when sorted by probability, the second column gives the word among the M words, and the third column gives the corresponding probability. After sorting by probability, the word "camera lens", originally in the last row of Table 5, moves to rank 3 in Table 6. The three top-ranked words under this topic are then "picture", "color" and "camera lens". The details of the topic can therefore be determined from the top-ranked words. In this example, the topic obtained from Table 6 may relate to content about the color of the picture and the camera lens. It will be understood that, because the top-ranked words are more closely related to the topic, a topic determined in this way will be more accurate.
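The sorting step described above can be illustrated with a minimal Python sketch. The function name `top_words` and the dictionary representation of a per-topic distribution are assumptions made for illustration only; they are not part of the disclosure. The probabilities are the three values listed in Table 5, and the elided rows are simply left out.

```python
def top_words(word_probs, n=3):
    """Return the n (word, probability) pairs with the highest probability.

    word_probs is assumed to map each word to its probability under
    a single topic (hypothetical representation, for illustration).
    """
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Values taken from Table 5, deliberately given out of order.
table5 = {"Camera lens": 0.0005, "Picture": 0.002, "Color": 0.001}

for rank, (word, prob) in enumerate(top_words(table5, n=3), start=1):
    print(rank, word, prob)
```

With K topics, the same function would be applied once to each of the K per-topic distributions, and the top-ranked words of each would be used to describe that topic.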

It will be understood that, although Tables 5 and 6 give only a simple example with the probabilities of 3 words, in a real application the group of documents 110 may contain thousands of words or more. In that case, Tables 5 and 6 obtained by the method described above will contain more rows, each giving the probability of one of the M words.

According to an example implementation of the present disclosure, a larger number of words can also be selected from the sorted probability distribution. For example, the words ranked in the top 10 can be selected. Assume that the input keyword is "screen" and that K=3 is set to obtain 3 topics. Table 7 below shows the 3 topics under the target aspect "screen" obtained from the probability distribution: picture, menu, imaging. Only the 10 words with the highest probability in each of the three topics are shown, and underlined words are words unrelated to the target aspect.

Table 7: Three topics under the "screen" aspect

According to an example implementation of the present disclosure, for the same group of documents 110, assume that the input keyword 330 is "weight" and that K=3 is set; the 3 topics shown in Table 8 below can then be generated: lens, battery, carrying. Only the 10 words with the highest probability in each of the three topics are shown, and underlined words are words unrelated to the target aspect "weight".

Table 8: Three topics under the "weight" aspect

According to an example implementation of the present disclosure, the association between each word in the group of documents 110 and keyword 330 can be fully taken into account, and only the one or more topics under the target aspect specified by keyword 330 are generated. In this way, the inability of existing topic models to specify a target aspect can be remedied. Further, with an example implementation of the present disclosure, the number of topics can be specified: by setting the value of K in the probability distribution model, the one or more topics under the specified target aspect can be determined at a finer granularity.

Multiple implementations of the method 400 for processing documents have been described in detail above. According to an example implementation of the present disclosure, an apparatus for processing documents is also provided, which is described in detail below with reference to Fig. 9. Fig. 9 schematically shows a block diagram of a document processing apparatus 900 according to an example implementation of the present disclosure. As shown in Fig. 9, the apparatus 900 includes: an obtaining module 910 configured to obtain a group of words included in a group of documents; a generation module 920 configured to generate a group of pseudo-documents based on the group of documents, a pseudo-document in the group of pseudo-documents describing an association between a word in the group of words and other words in the group of words; a determining module 930 configured to determine, based on the group of pseudo-documents and a keyword specifying a target aspect of the group of documents, a probability distribution of association between each word in the group of words and the keyword; and a topic module 940 configured to determine, based on the probability distribution, at least one topic that is involved in the group of documents and associated with the target aspect.

According to an example implementation of the present disclosure, the generation module 920 includes: a pseudo-document generation module configured to generate, from the group of documents, a first pseudo-document associated with a first word in the group of words.

According to an example implementation of the present disclosure, the pseudo-document generation module includes: a frequency determining module configured to determine a co-occurrence frequency between the first word and multiple other words in the group of words based on co-occurrences between the first word and the multiple other words; and an establishing module configured to establish, based on the co-occurrence frequency, the first pseudo-document in the group of pseudo-documents that is associated with the first word.

According to an example implementation of the present disclosure, the determining module 930 includes: a scan module configured to scan each document in the group of documents with a sliding window of predetermined length; an increasing module configured to increase the co-occurrence frequency in response to determining that the first word co-occurs with a word among the multiple other words within the current range of the sliding window; and a moving module configured to move the sliding window by a predetermined step.

According to an example implementation of the present disclosure, the establishing module includes: a document head generation module configured to add the first word to the first pseudo-document as the document head of the first pseudo-document; and a document body generation module configured to add to the first pseudo-document a second word that co-occurs with the first word, as the document body of the first pseudo-document.

According to an example implementation of the present disclosure, the document body generation module includes: an adding module configured to add the second word to the first pseudo-document based on the co-occurrence frequency.

According to an example implementation of the present disclosure, the words included in the document body are unordered.
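The sliding-window scan and the head/body structure of a pseudo-document described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the function names, the default window length and step, and the choice to represent the unordered document body as a `collections.Counter` (a bag holding each co-occurring word together with its co-occurrence frequency) are all hypothetical. Documents are assumed to be already split into word lists.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(documents, window=3, step=1):
    """Count how often each pair of distinct words shares a sliding window.

    documents: list of token lists (assumed to be pre-tokenized).
    window, step: illustrative values for the predetermined window
    length and step; the source does not fix them.
    """
    counts = defaultdict(Counter)
    for tokens in documents:
        last_start = max(len(tokens) - window, 0)
        for start in range(0, last_start + 1, step):
            # A set is used so a word repeated inside one window counts once.
            in_window = set(tokens[start:start + window])
            for head in in_window:
                for other in in_window:
                    if other != head:
                        counts[head][other] += 1
    return counts


def build_pseudo_documents(counts):
    """Map each head word to an unordered bag of its co-occurring words."""
    return {head: Counter(body) for head, body in counts.items()}


docs = [
    ["screen", "color", "bright"],
    ["screen", "color", "lens", "heavy"],
]
counts = cooccurrence_counts(docs, window=3, step=1)
pseudo = build_pseudo_documents(counts)
print(pseudo["screen"])
```

In this sketch the head word is the dictionary key and the body carries no word order, which matches the statement that the words in the document body are unordered. Note that with a step larger than 1 the last few words of a document may fall outside every window.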

According to an example implementation of the present disclosure, the determining module 930 includes: an acquisition module configured to obtain a probability distribution model describing the association between words and the keyword; and a training module configured to train the probability distribution model based on the group of words in the group of pseudo-documents and the keyword, so as to obtain the probability distribution associating each word in the group of words with the keyword.

According to an example implementation of the present disclosure, the training module includes: a quantity obtaining module configured to obtain the quantity of the at least one topic associated with the target aspect; and a quantity-based training module configured to obtain, based on the quantity and the probability distribution model, at least one probability distribution corresponding to the quantity.

According to an example implementation of the present disclosure, the topic module 940 includes: a sorting module configured to sort the multiple words based on the probability distribution; and an identification module configured to identify a topic among the at least one topic based on the sorted words.

According to an example implementation of the present disclosure, the obtaining module 910 includes: a text processing module configured to perform text processing on the documents in the group of documents, so as to extract semantically meaningful words from the group of documents as the group of words.
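As a rough illustration of the text-processing step, the following Python sketch tokenizes each document and drops a few function words. The source does not state which text-processing operations are used, so everything here is an assumption: the regular-expression tokenizer (suitable only for space-delimited text), the tiny `STOP_WORDS` list, and the function name `extract_words` are hypothetical. Text in a language without word delimiters would need a word-segmentation step instead.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it"}


def extract_words(documents):
    """Return (tokenized documents, sorted vocabulary) for raw text documents."""
    tokenized = []
    for text in documents:
        # Lowercase and keep alphanumeric runs as candidate words.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        tokenized.append([t for t in tokens if t not in STOP_WORDS])
    vocabulary = sorted({t for doc in tokenized for t in doc})
    return tokenized, vocabulary


tokenized, vocabulary = extract_words(["The screen is bright.", "Color of the screen"])
print(tokenized)
print(vocabulary)
```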

Fig. 10 shows a block diagram of a computing device 1000 that can implement multiple implementations of the present disclosure. Device 1000 can be used to implement the method described in Fig. 4. As shown, device 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processing according to computer program instructions stored in read-only memory (ROM) 1002 or loaded from storage unit 1008 into random access memory (RAM) 1003. RAM 1003 can also store the various programs and data required for the operation of device 1000. CPU 1001, ROM 1002 and RAM 1003 are connected to one another through bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.

Multiple components in device 1000 are connected to I/O interface 1005, including: an input unit 1006, such as a keyboard or mouse; an output unit 1007, such as various types of displays and loudspeakers; a storage unit 1008, such as a magnetic disk or optical disc; and a communication unit 1009, such as a network card, modem or wireless communication transceiver. Communication unit 1009 allows device 1000 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

Processing unit 1001 performs the methods and processing described above, such as method 400. For example, in some implementations, method 400 can be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as storage unit 1008. In some implementations, part or all of the computer program can be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by CPU 1001, one or more steps of method 400 described above can be performed. Alternatively, in other implementations, CPU 1001 can be configured to perform method 400 in any other appropriate manner (for example, by means of firmware).

According to an example implementation of the present disclosure, a computer-readable storage medium having a computer program stored thereon is provided. The method described in the present disclosure is implemented when the program is executed by a processor.

The functions described herein can be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.

Program code for implementing the method of the present disclosure can be written in any combination of one or more programming languages. The program code can be provided to a processor or controller of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The program code can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.

In the context of the present disclosure, a machine-readable medium can be a tangible medium that can contain or store a program for use by, or in combination with, an instruction execution system, apparatus or device. The machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any appropriate combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the foregoing.

In addition, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although the discussion above contains several specific implementation details, these should not be construed as limitations on the scope of the present disclosure. Certain features described in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features described in the context of a single implementation can also be implemented in multiple implementations, either separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims

1. A document processing method, comprising:

obtaining a group of words included in a group of documents;

generating a group of pseudo-documents based on the group of documents, a pseudo-document in the group of pseudo-documents describing an association between a word in the group of words and other words in the group of words;

determining, based on the group of pseudo-documents and a keyword specifying a target aspect of the group of documents, a probability distribution of association between each word in the group of words and the keyword; and

determining, based on the probability distribution, at least one topic that is involved in the group of documents and associated with the target aspect.

2. The method according to claim 1, wherein generating the group of pseudo-documents based on the group of documents comprises:

generating, from the group of documents, a first pseudo-document associated with a first word in the group of words, comprising:

determining a co-occurrence frequency between the first word and multiple other words in the group of words based on co-occurrences between the first word and the multiple other words; and

establishing, based on the co-occurrence frequency, the first pseudo-document in the group of pseudo-documents that is associated with the first word.

3. The method according to claim 2, wherein determining the co-occurrence frequency comprises:

scanning each document in the group of documents with a sliding window of predetermined length;

in response to determining that the first word co-occurs with a word among the multiple other words within the current range of the sliding window, increasing the co-occurrence frequency; and

moving the sliding window by a predetermined step.

4. The method according to claim 2, wherein establishing, based on the co-occurrence frequency, the first pseudo-document in the group of pseudo-documents that is associated with the first word comprises:

adding the first word to the first pseudo-document as the document head of the first pseudo-document; and

adding to the first pseudo-document a second word that co-occurs with the first word, as the document body of the first pseudo-document.

5. The method according to claim 4, wherein adding to the first pseudo-document the second word that co-occurs with the first word comprises:

adding the second word to the first pseudo-document based on the co-occurrence frequency.

6. The method according to claim 4, wherein the words included in the document body are unordered.

7. The method according to claim 1, wherein determining the probability distribution of association between each word in the group of words and the keyword comprises:

obtaining a probability distribution model describing the association between words and the keyword; and

training the probability distribution model based on the group of words in the group of pseudo-documents and the keyword, so as to obtain the probability distribution associating each word in the group of words with the keyword.

8. The method according to claim 7, wherein obtaining the probability distribution further comprises:

obtaining the quantity of the at least one topic associated with the target aspect; and

obtaining, based on the quantity and the probability distribution model, at least one probability distribution corresponding to the quantity.

9. The method according to claim 1, wherein determining the at least one topic that is involved in the group of documents and associated with the specified aspect comprises:

sorting the multiple words based on the probability distribution; and

identifying a topic among the at least one topic based on the sorted words.

10. The method according to claim 1, wherein obtaining the group of words included in the group of documents comprises:

performing text processing on the documents in the group of documents, so as to extract semantically meaningful words from the group of documents as the group of words.

11. A document processing apparatus, comprising:

an obtaining module configured to obtain a group of words included in a group of documents;

a generation module configured to generate a group of pseudo-documents based on the group of documents, a pseudo-document in the group of pseudo-documents describing an association between a word in the group of words and other words in the group of words;

a determining module configured to determine, based on the group of pseudo-documents and a keyword specifying a target aspect of the group of documents, a probability distribution of association between each word in the group of words and the keyword; and

a topic module configured to determine, based on the probability distribution, at least one topic that is involved in the group of documents and associated with the target aspect.

12. The apparatus according to claim 11, wherein the generation module comprises:

a pseudo-document generation module configured to generate, from the group of documents, a first pseudo-document associated with a first word in the group of words, comprising:

a frequency determining module configured to determine a co-occurrence frequency between the first word and multiple other words in the group of words based on co-occurrences between the first word and the multiple other words; and

an establishing module configured to establish, based on the co-occurrence frequency, the first pseudo-document in the group of pseudo-documents that is associated with the first word.

13. The apparatus according to claim 12, wherein the determining module comprises:

a scan module configured to scan each document in the group of documents with a sliding window of predetermined length;

an increasing module configured to increase the co-occurrence frequency in response to determining that the first word co-occurs with a word among the multiple other words within the current range of the sliding window; and

a moving module configured to move the sliding window by a predetermined step.

14. The apparatus according to claim 12, wherein the establishing module comprises:

a document head generation module configured to add the first word to the first pseudo-document as the document head of the first pseudo-document; and

a document body generation module configured to add to the first pseudo-document a second word that co-occurs with the first word, as the document body of the first pseudo-document.

15. The apparatus according to claim 14, wherein the document body generation module comprises:

an adding module configured to add the second word to the first pseudo-document based on the co-occurrence frequency.

16. The apparatus according to claim 14, wherein the words included in the document body are unordered.

17. The apparatus according to claim 11, wherein the determining module comprises:

an acquisition module configured to obtain a probability distribution model describing the association between words and the keyword; and

a training module configured to train the probability distribution model based on the group of words in the group of pseudo-documents and the keyword, so as to obtain the probability distribution associating each word in the group of words with the keyword.

18. The apparatus according to claim 17, wherein the training module comprises:

a quantity obtaining module configured to obtain the quantity of the at least one topic associated with the target aspect; and

a quantity-based training module configured to obtain, based on the quantity and the probability distribution model, at least one probability distribution corresponding to the quantity.

19. The apparatus according to claim 11, wherein the topic module comprises:

a sorting module configured to sort the multiple words based on the probability distribution; and

an identification module configured to identify a topic among the at least one topic based on the sorted words.

20. The apparatus according to claim 1, wherein the obtaining module further comprises:

a text processing module configured to perform text processing on the documents in the group of documents, so as to extract semantically meaningful words from the group of documents as the group of words.

21. A document processing device, the device comprising:

one or more processors; and

a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 10.

22. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 10.