CN109857942A - Method, apparatus, device and storage medium for processing documents
- Publication number: CN109857942A
- Application number: CN201910194822.0A
- Authority: CN (China)
- Legal status: Pending (the listed status is an assumption and is not a legal conclusion)
Abstract
The present disclosure relates to a method, apparatus, device and storage medium for processing documents. According to an example implementation, a document processing method is provided. In the method, a group of words included in a group of documents is determined. A group of pseudo-documents is generated based on the group of documents, where a pseudo-document in the group describes the association between one word in the group of words and the other words in the group. Based on a keyword that specifies a target aspect of the group of documents and on the group of pseudo-documents, a probability distribution of the association between each word in the group of words and the keyword is determined. Based on the probability distribution, at least one topic that the group of documents involves and that is associated with the target aspect is determined. With this implementation, at least one topic under the target aspect specified by the keyword can be determined more accurately.
Description
Technical field
Implementations of the present disclosure relate generally to document processing and, more particularly, to a method, apparatus, device and computer storage medium for determining the topics of a group of documents under a specified aspect.
Background
With the development of computer technology, more and more types of documents have appeared. In particular, as social networks and e-commerce networks have entered people's lives, people can edit documents and post their own comments via these network platforms. Faced with massive numbers of documents from networks or other media, how to mine the topics involved in those documents more accurately has become a technical problem.
Summary of the invention
According to example implementations of the present disclosure, a scheme for document processing is provided.
In a first aspect of the present disclosure, a document processing method is provided. In the method, a group of words included in a group of documents is obtained. A group of pseudo-documents is generated based on the group of documents, where a pseudo-document in the group describes the association between one word in the group of words and the other words in the group. Based on a keyword that specifies a target aspect of the group of documents and on the group of pseudo-documents, a probability distribution of the association between each word in the group of words and the keyword is determined. Based on the probability distribution, at least one topic that the group of documents involves and that is associated with the target aspect is determined.
In a second aspect of the present disclosure, a document processing apparatus is provided. The apparatus includes: an acquisition module configured to obtain a group of words included in a group of documents; a generation module configured to generate a group of pseudo-documents based on the group of documents, where a pseudo-document in the group describes the association between one word in the group of words and the other words in the group; a determining module configured to determine, based on a keyword that specifies a target aspect of the group of documents and on the group of pseudo-documents, a probability distribution of the association between each word in the group of words and the keyword; and a topic module configured to determine, based on the probability distribution, at least one topic that the group of documents involves and that is associated with the target aspect.
In a third aspect of the present disclosure, a device is provided. The device includes one or more processors and a storage apparatus for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer-readable medium storing a computer program is provided. When executed by a processor, the program implements the method according to the first aspect of the present disclosure.
It should be understood that the content described in this Summary is not intended to identify key or essential features of the implementations of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the description below.
Brief description of the drawings
The above and other features, advantages and aspects of the implementations of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, in which:
Fig. 1 schematically shows a diagram of the relationship among documents, words and topics;
Fig. 2 schematically shows a diagram of the relationship among a particular document, the words included in the particular document, and topics;
Fig. 3 schematically shows a block diagram of a technical solution for document processing according to example implementations of the present disclosure;
Fig. 4 schematically shows a flowchart of a method for document processing according to example implementations of the present disclosure;
Fig. 5A and Fig. 5B each schematically show a block diagram of determining the co-occurrence of words based on a sliding window according to example implementations of the present disclosure;
Fig. 6 schematically shows a block diagram of the format of a pseudo-document according to example implementations of the present disclosure;
Fig. 7 schematically shows a block diagram of determining, based on a probability distribution model, the probability distribution of the association between the words included in a group of pseudo-documents and a keyword according to example implementations of the present disclosure;
Fig. 8 schematically shows a block diagram of the parameters in the probability distribution model according to example implementations of the present disclosure;
Fig. 9 schematically shows a block diagram of a document processing apparatus according to example implementations of the present disclosure; and
Fig. 10 shows a block diagram of a computing device capable of implementing multiple implementations of the present disclosure.
Detailed description
Implementations of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain implementations of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be realized in various forms and should not be construed as limited to the implementations set forth here; rather, these implementations are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and implementations of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
In the description of the implementations of the present disclosure, the term "include" and similar terms should be understood as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one implementation" or "the implementation" should be understood as "at least one implementation". The terms "first", "second" and the like may refer to different or identical objects. Other explicit and implicit definitions may also appear below.
A variety of schemes for determining the topics of a group of documents have already been proposed. For example, the concept of a topic model has been proposed, and the topics involved in a group of documents can be determined based on a topic model. However, a topic model performs a complete analysis of the full text of the documents across all aspects in order to mine all topics. The relationship among documents, words and topics is described first with reference to Fig. 1.
Fig. 1 schematically shows a diagram 100 of the relationship among documents, words and topics. Fig. 1 shows a group of documents 110, which may be, for example, articles from networks or other media, user comments in forums, and so on. Each document in the group of documents 110 may include a different number of words 130, 132, … and 134. Here, a topic refers to the subject of the semantic structure expressed by the words in a document, that is, the subject matter that the words in the document discuss. As shown in Fig. 1, the group of documents 110 may involve multiple topics, for example topic 120, … and topic 122. Further, topic 120, … and topic 122 may each involve different words. For example, topic 120 may involve words 130, 132 and 134, while topic 122 may involve words 130 and 132.
More details of documents, topics and words are described below with reference to Fig. 2. Fig. 2 schematically shows a diagram 200 of the relationship among a particular document, the words included in the particular document, and topics. Fig. 2 shows a particular document 210, which includes the text: "Nikon and Canon, two camera manufacturers, hold competing positions in the market, and the products of the two manufacturers each have their own advantages. For example, as for the camera screen, the clarity ...".
By analyzing document 210 based on a topic model, the multiple topics 220, 222 and 224 involved in document 210 can be obtained. For example, document 210 may involve the following three topics: topic 220 "Nikon", topic 222 "Canon" and topic 224 "screen". Further, the words associated with each topic can also be determined based on the topic model. For example, topic 220 involves words such as "Nikon" (in both its Chinese and English forms), topic 222 involves words such as "Canon" (likewise in both forms), and topic 224 involves words such as "screen" and "clarity".
However, a topic model performs a complete analysis of all the words of a document across all aspects in order to mine all topics. If the topics under a given target aspect are desired, the topics related to the target aspect must be searched for among all the mined topics. The topics obtained from a topic model are therefore coarse and cannot describe the target aspect in detail. For example, to analyze further topics under the target aspect "screen" in a group of documents about cameras, it is first necessary to obtain all the aspects involved in the group of documents and then to filter those aspects based on "screen". How to process documents at a finer granularity, so as to obtain one or more topics under a specified target aspect, has thus become a problem to be solved.
In order to at least partially address the deficiencies of the above technical solutions, example implementations of the present disclosure provide a technical solution that processes documents to determine the topics under a target aspect involved in the documents. It will be understood that, unlike conventional solutions that determine document topics based on a topic model, a topic here refers to a topic under a predetermined target aspect. Here, a topic refers to the subject of the semantic structure expressed by the words in the documents, that is, a multinomial probability distribution over words: the words that have a high probability under a topic express the semantic meaning of that topic.
Example implementations of the present disclosure are outlined below with reference to Fig. 3. Fig. 3 schematically shows a block diagram 300 of a technical solution for document processing according to example implementations of the present disclosure. As shown in Fig. 3, a group of words 310 included in a group of documents 110 is first determined. It will be understood that the group of words 310 consists of all the words included in all the documents in the group of documents 110. This example implementation introduces the concept of a pseudo-document: a group of pseudo-documents 320 can be generated based on the group of documents 110. A pseudo-document in the group of pseudo-documents 320 describes the association between one word in the group of words 310 and the other words in the group of words 310.
Further, based on the group of pseudo-documents 320 and a keyword 330 that specifies the target aspect, a probability distribution 340 of the association between each word in the group of pseudo-documents 320 and the keyword 330 can be determined. Here, the keyword 330 is the keyword that specifies the target aspect, that is, the aspect to which the topics to be determined in the group of documents 110 belong. For example, if the group of documents 110 discusses camera-related content and the topics related to the camera "screen" are to be determined, the keyword may be "screen". As another example, if the topics related to the "weight" of the camera are to be determined, the keyword may be "weight".
Then, based on the probability distribution 340, at least one topic 350 that the group of documents 110 involves and that is associated with the target aspect can be determined. Specifically, assuming the keyword 330 of the target aspect is "screen", one or more topics under that target aspect can be determined from the group of documents 110. For example, the topics determined under the "screen" aspect may be: picture, menu and imaging.
In other words, for the target aspect "screen", the topics may involve multiple sub-aspects such as the "picture" shown on the screen, the "menu" shown on the screen, and the "imaging" of the screen. In this way, the topics under the target aspect in a group of documents can be determined more accurately.
More details of the document processing are described below with reference to Fig. 4.
Fig. 4 schematically shows a flowchart of a method 400 for document processing according to example implementations of the present disclosure. At block 410, a group of words 310 included in a group of documents 110 is determined. The group of documents 110 (for example, N documents) represents the documents to be analyzed. Each document in the group of documents 110 may include a different number of words, and the group of words 310 refers to the totality of the words in all the documents. Assuming the N documents in the group of documents 110 include N_1, N_2, … and N_N words respectively, the group of words 310 includes M words, where M = N_1 + N_2 + … + N_N.
According to example implementations of the present disclosure, text processing may be performed on each document in the group of documents 110 so as to extract the words that carry semantic meaning from the group of documents 110 as the group of words 310. It will be understood that the text processing here may involve filtering redundant words, words without actual semantic meaning, and other unnecessary components out of the documents, and then extracting the semantically meaningful words as the words in the group of words 310. In this way, the basis on which the document processing is performed truly reflects the document content and has actual semantic meaning.
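By way of illustration only, the text processing described above might be sketched as follows. The stop-word list, the regular expression and the function names `extract_words` and `extract_group_of_words` are assumptions introduced for this sketch and do not come from the disclosure; a practical system would use a language-appropriate tokenizer and a fuller stop-word list.

```python
import re

# Hypothetical stop-word list (an assumption for this sketch only).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "for", "in", "to", "has", "on"}


def extract_words(document):
    """Lower-case a document, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def extract_group_of_words(documents):
    """Apply the same text processing to every document in the group."""
    return [extract_words(document) for document in documents]


docs = ["The screen of the camera is clear.", "The lens has a good color."]
words_per_doc = extract_group_of_words(docs)
```

Keeping the words of each document in a separate list preserves word order, which a later distance-based co-occurrence rule would need.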
At block 420, a group of pseudo-documents 320 is generated based on the group of documents 110. It will be understood that a pseudo-document in the group of pseudo-documents 320 describes the association between one word in the group of words 310 and the other words in the group. The number of pseudo-documents in the group of pseudo-documents 320 is the same as the number of words in the group of words 310. In other words, each word corresponds to one pseudo-document, so a total of M pseudo-documents can be generated from a group of words 310 that includes M words.
According to example implementations of the present disclosure, a corresponding pseudo-document may be generated one by one for each word in the group of documents 110. For example, a corresponding first pseudo-document may be generated for a first word in the group of words 310. Specifically, the co-occurrence frequency between the first word and multiple other words in the group of words 310 may be determined based on the co-occurrence of the first word with those other words. Further, the pseudo-document in the group of pseudo-documents 320 that is associated with the first word may be generated based on the co-occurrence frequency.
For example, assume that the word "picture" is the first word in the group of words 310 and that the other words include "color", …, "lens" and so on. It can then be determined whether the word "picture" co-occurs with the other words "color", …, "lens". If they co-occur, the frequency with which "picture" and the other word co-occur is increased. For example, if "picture" and "color" co-occur twice, the co-occurrence frequency is set to 2. If they do not co-occur, the co-occurrence frequency is set to 0. The co-occurrence frequencies may be stored using the data structure shown in Table 1 below.
Table 1: Frequency with which words co-occur
Word | Picture | Color | … | Lens |
Picture | 0 | 2 | … | 1 |
Color | 2 | 0 | … | 1 |
… | … | … | 0 | … |
Lens | 1 | 1 | … | 0 |
Table 1 includes M+1 rows (numbered from row 0 to row M), where rows 1 to M each represent one of the M words. Table 1 includes M+1 columns (numbered from column 0 to column M), where columns 1 to M each represent one of the M words. As shown in Table 1, the value at the intersection of row i and column j indicates the frequency with which the i-th word and the j-th word of the M words co-occur. For example, the value 2 at the intersection of the words "picture" and "color" indicates that "picture" and "color" co-occur with a frequency of 2.
By performing the process described above for each of the M words, the frequency with which any two of the M words co-occur can be obtained, and the co-occurrence frequencies shown in Table 1 can thereby be determined. It will be understood that Table 1 above is merely an example of how the co-occurrence frequencies may be stored; according to example implementations of the present disclosure, other data structures may also be used. For example, the frequencies may be stored in a matrix or in another manner.
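As one possible alternative to a full table, a minimal sketch of a sparse store for the co-occurrence frequencies is given below. The nested-dictionary layout and the helper name `add_cooccurrence` are assumptions for illustration; the disclosure only states that a table, a matrix or another structure may be used.

```python
from collections import defaultdict

# Nested mapping: cooccurrence[w1][w2] holds the co-occurrence frequency.
# Pairs that were never recorded read as 0.
cooccurrence = defaultdict(lambda: defaultdict(int))


def add_cooccurrence(word_a, word_b, count=1):
    """Record `count` co-occurrences symmetrically, as in Table 1."""
    cooccurrence[word_a][word_b] += count
    cooccurrence[word_b][word_a] += count


# Values taken from Table 1.
add_cooccurrence("picture", "color", 2)
add_cooccurrence("picture", "lens", 1)
add_cooccurrence("color", "lens", 1)
```

Because most word pairs in a large vocabulary never co-occur, a sparse mapping of this kind avoids allocating all M × M cells.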
It will be understood that the meaning of "co-occurrence" may be defined by different rules. For example, one rule may specify that two words co-occur if they both appear in the same paragraph. As another example, a rule may specify that two words co-occur if they both appear in the same sentence. According to example implementations of the present disclosure, other rules may also be specified to define co-occurrence; for example, whether two words co-occur may be determined based on the distance between them.
According to example implementations of the present disclosure, co-occurrence means that the distance between two words is less than a predetermined distance. The distance here may be the number of words between the two words. Alternatively, the distance may be determined from the difference between the positions of the two words. According to example implementations of the present disclosure, the length of a sliding window may be set according to the predetermined distance, and the co-occurrence frequency may be determined based on the sliding window. Each document in the group of documents 110 may be scanned with a sliding window of predetermined length. For example, the predetermined length may be set to 10 or another value, and the sliding window is used to scan each of the N documents one by one. It should be understood that the predetermined length "10" here may be the number of words included in the sliding window. Although each word may include a different number of characters, the sliding window slides in units of words. For example, the sliding step may be set to one or more words.
More details of the sliding window are described below with reference to Fig. 5A and Fig. 5B. The sliding window may first be placed at the beginning of document 210 and then slid toward the end of document 210. If two words are determined to co-occur within the current range of the sliding window, the co-occurrence frequency of those two words is increased. Fig. 5A shows a block diagram 500A of determining the co-occurrence of words based on a sliding window according to example implementations of the present disclosure. Fig. 5A shows the sliding window 510A located at a middle position of document 210 after sliding several times. In this example, the words "screen" and "clarity" are both located in sliding window 510A, so the co-occurrence frequency of "screen" and "clarity" is increased by 1. With a sliding window, the frequency with which words co-occur can be determined in a simple and efficient manner.
After each word in sliding window 510A has been processed, sliding window 510A may be moved by a predetermined step (for example, by the position of 1 word). For example, sliding window 510A may be moved backward by 1 word to reach the position shown in Fig. 5B. Fig. 5B schematically shows a block diagram 500B of determining the co-occurrence of words based on a sliding window according to example implementations of the present disclosure. In Fig. 5B, sliding window 510B still includes the words "screen" and "clarity", so the co-occurrence frequency of the two words may be increased by 1 again. Sliding window 510B may then be moved backward further, and the co-occurrence frequencies of other words are determined in a similar manner. After all N documents have been scanned, the co-occurrence frequency of any two of the M words can be generated (as shown in Table 1).
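The sliding-window scan described above might be sketched as follows. This is a sketch under stated assumptions rather than the disclosed implementation: the function name `count_cooccurrences`, the default window length and step, and the choice to count each distinct pair once per window position are all illustrative.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(words, window=10, step=1):
    """Slide a window over one document's words, in units of words.

    Every window position that contains two distinct words increases the
    co-occurrence frequency of that pair by 1, so a pair that stays inside
    the window over several positions is counted several times.
    Keys are alphabetically ordered (word_a, word_b) tuples.
    """
    counts = Counter()
    last_start = max(len(words) - window, 0)
    for start in range(0, last_start + 1, step):
        in_window = sorted(set(words[start:start + window]))
        for word_a, word_b in combinations(in_window, 2):
            counts[(word_a, word_b)] += 1
    return counts


counts = count_cooccurrences(["nikon", "canon", "screen", "clarity"], window=3)
```

In this small example "canon" and "screen" fall inside both window positions and are therefore counted twice, while "nikon" and "clarity" never share a window. With a step larger than 1, the final words of a document may not be covered by any window, so a step of one word is the simplest choice.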
Based on the co-occurrence frequencies shown in Table 1, a corresponding pseudo-document may be generated for each word in the group of words 310, thereby generating the group of pseudo-documents 320. The format of a pseudo-document is described first with reference to Fig. 6, which schematically shows a block diagram 600 of the format of a pseudo-document according to example implementations of the present disclosure. As shown in Fig. 6, a pseudo-document may include two parts: a document head 610 and a document body 620. The document head 610 indicates the word for which the pseudo-document is generated, and the document body 620 includes the other words in the group of words 310 that co-occur with the word in the document head 610. In this way, the document head 610 indicates in a simple manner which word the pseudo-document was generated for.
According to example implementations of the present disclosure, the word that serves as the basis of comparison may be added to the document head 610 of the pseudo-document, and the words that co-occur with that word may be added to the document body 620 of the pseudo-document. Taking the last row "lens" in Table 1 above as an example, the word "lens" is the word that serves as the basis of comparison, so "lens" may be added to the document head 610. The words "picture" and "color" co-occur with the word "lens", so "picture" and "color" may be added to the document body 620. In this way, the pseudo-document schematically shown in Table 2 below is obtained.
Table 2: Pseudo-document for the word "lens"
Document head | Document body |
Lens | Picture, color, … |
According to example implementations of the present disclosure, the frequency with which words co-occur needs to be taken into account when adding words to the document body 620: words are added to the document body 620 based on the co-occurrence frequency. As shown in the first row of Table 1, "color" and "picture" co-occur 2 times, and "lens" and "picture" co-occur 1 time. In this case, "color" should be added to the document body 620 twice and "lens" once. The pseudo-document for the word "picture" is therefore as shown in Table 3 below.
Table 3: Pseudo-document for the word "picture"
Document head | Document body |
Picture | Color, color, …, lens |
According to example implementations of the present disclosure, the words included in the document body 620 are unordered. In other words, the document body is a collection of multiple words, and the order of the words does not matter. The pseudo-document generated for the word "lens" in Table 2 may therefore equally be represented as shown in Table 4 below.
Table 4: Pseudo-document for the word "lens"
Document head | Document body |
Lens | Color, picture, … |
In this implementation, only whether words co-occur needs to be considered, and the relative order of the words does not. Moreover, the document body 620 may include the same word multiple times, to indicate that the word co-occurs with the word in the document head 610 multiple times. In this way, the words in the group of documents 110 that are related to the target aspect can be determined more efficiently.
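The construction of pseudo-documents from co-occurrence frequencies might be sketched as follows, using only the values given in Table 1. The function name `build_pseudo_documents` and the choice of a `Counter` as an unordered multiset for the document body are assumptions for illustration.

```python
from collections import Counter


def build_pseudo_documents(cooccurrence):
    """Map each head word to an unordered document body.

    `cooccurrence` maps a head word to {other word: co-occurrence frequency}.
    The body is a multiset in which each co-occurring word appears as many
    times as its frequency; word order carries no meaning.
    """
    pseudo_documents = {}
    for head, neighbours in cooccurrence.items():
        pseudo_documents[head] = Counter(
            {word: freq for word, freq in neighbours.items() if freq > 0 and word != head}
        )
    return pseudo_documents


# Co-occurrence frequencies from Table 1 (elided rows and columns omitted).
table_1 = {
    "picture": {"picture": 0, "color": 2, "lens": 1},
    "color": {"picture": 2, "color": 0, "lens": 1},
    "lens": {"picture": 1, "color": 1, "lens": 0},
}
pseudo_documents = build_pseudo_documents(table_1)
picture_body = sorted(pseudo_documents["picture"].elements())
```

Here `picture_body` contains "color" twice and "lens" once, matching Table 3, and the body for "lens" contains "picture" and "color" once each, matching Tables 2 and 4.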
Returning to Fig. 4, at block 430, based on the keyword 330 that specifies the target aspect of the group of documents 110 and on the group of pseudo-documents 320, the probability distribution 340 of the association between each word in the group of words 310 and the keyword 330 is determined. Specifically, example implementations of the present disclosure propose a probability distribution model that describes the probability distribution of words under the target aspect specified by the keyword 330. The probability distribution model includes multiple parameters that affect the association between words and the target aspect specified by the keyword 330. In this way, the process of determining topics can be controlled more flexibly by adjusting the values of the parameters. More details of the probability distribution model are described below with reference to Fig. 7.
Fig. 7 schematically shows a block diagram of the parameters in a probability distribution model 700 according to example implementations of the present disclosure. The specific meaning of each parameter in the probability distribution model 700 is described below with reference to Fig. 7. As shown in Fig. 7, the probability distribution model 700 may include multiple parameters. Parameter N indicates the number of documents in the group of documents 110. Parameter K indicates how many topics are to be obtained under the aspect specified by the keyword 330. For example, if the keyword is "screen" and K is set to 3, then 3 topics under the "screen" aspect can be obtained using the probability distribution model 700.
As shown in Fig. 7, for the N documents, each document d involves a Bernoulli distribution π_d, which is generated from its conjugate prior, a Beta distribution with parameter γ, and which indicates the degree of relevance between the document and the target aspect. In addition, there are N multinomial distributions Θ_d, which obey a Dirichlet distribution with parameter α; each Θ_d represents the multinomial distribution of document d over the target aspect.
An indicator variable r may be used to indicate whether an input word is related to the target aspect. When r = 1, the word is related to the target aspect and is generated by the multinomial distribution of the target aspect over words. When r = 0, the word is unrelated to the target aspect. It will be understood that, since the purpose of the present disclosure is to obtain the topics under the target aspect specified by the keyword 330, the words in each document that are unrelated to the target aspect may be generated by that document's multinomial distribution over words. In addition, a relevance prior variable x is introduced; x = 1 indicates that document d includes a word from the set S of relevant keywords, in which case document d is considered fully relevant to the target aspect.
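The relationships among the variables described above can be summarized compactly. The following is only a sketch under stated assumptions: the text names π_d, Θ_d, γ, α, r, x and S, but the draw of r from π_d and the per-document subscript on x are inferred from the prose, and the symbols for the two word-level multinomial distributions are not recoverable from the text, so they are left unnamed.

```latex
\begin{align*}
\pi_d &\sim \mathrm{Beta}(\gamma), & d &= 1, \dots, N \\
\Theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, N \\
r \mid \pi_d &\sim \mathrm{Bernoulli}(\pi_d) && \text{(assumed: drawn once per word of document } d\text{)} \\
x_d &= 1 && \text{if document } d \text{ contains a word in } S
\end{align*}
```

Under this reading, a word with r = 1 is generated by the multinomial distribution of the target aspect over words, and a word with r = 0 is generated by the multinomial distribution over words of the document it belongs to.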
By configuring the values of the multiple parameters and training the probability distribution model with the words in the group of pseudo-documents 320 and the keyword 330, the probability distribution of the association between each word in the group of words 310 and the keyword 330 can be obtained. Fig. 8 schematically shows a block diagram 800 of determining the probability distribution of the association between words and the keyword 330 based on the probability distribution model 700 according to example implementations of the present disclosure. With the operations shown in Fig. 8, the probability distribution of the group of words 310 in the group of documents 110 under the target aspect specified by the keyword 330 can be obtained.
Specifically, the group of pseudo-documents 320 and the keyword 330 may be used as input to train the probability distribution model 700 shown in Fig. 7, so as to obtain the corresponding probability distribution 340. Continuing the example above, assuming the keyword 330 is "screen", the probability distribution of each of the M words under the "screen" aspect can be obtained based on the probability distribution model 700. In other words, each of the M words has a corresponding probability, which indicates the likelihood that the word is associated with the "screen" aspect.
According to example implementations of the present disclosure, the number of topics to be determined may also be specified in advance; the parameter K above indicates the number of topics. If 3 topics under the "screen" aspect are desired, parameter K may be set to 3. If 4 topics under the "screen" aspect are desired, parameter K may be set to 4. Then, according to the method described above, the probability distribution of the words under each topic of the aspect specified by the keyword 330 can be obtained. Table 5 below schematically shows an example of the probability distribution of words in one topic:
Table 5: Example of the probability distribution of words in one theme
No. | Word | Probability |
---|---|---|
1 | Picture | 0.002 |
2 | Color | 0.001 |
… | … | … |
M | Camera lens | 0.0005 |
As shown in Table 5, the first column gives the serial number of each of the M words, the second column gives the word itself, and the third column gives the probability of the word in the theme. Although Table 5 only shows the probability distribution of the words in one theme, when K=3, 3 probability distributions can be obtained, one for each of the 3 themes, each in a format similar to Table 5. It will be understood that the specific probability values in the third column will differ from theme to theme.
Returning to Fig. 4, at block 440, at least one theme 350 involved in the group of documents 110 and associated with the target aspect is determined based on the probability distribution 340. Specifically, one theme can be determined based on the probability distribution of the words under that theme. According to an example implementation of the present disclosure, the multiple words can be sorted based on the probability distribution 340, and a theme among the at least one theme is then determined based on the sorted words.
For the probability distribution under one theme as shown in Table 5, the words can be sorted in descending order of the probability values in the third column, to obtain the sorted probability distribution shown in Table 6.
Table 6: Probability distribution after sorting
Rank | Word | Probability |
---|---|---|
1 | Picture | 0.002 |
2 | Color | 0.001 |
3 | Camera lens | 0.0005 |
… | … | … |
As shown in Table 6, the first column gives the rank of each word when sorted by probability, the second column gives the word among the M words, and the third column gives the corresponding probability. After sorting by probability in Table 6, the word "camera lens", originally in the last row of Table 5, moves to rank 3. The 3 top-ranked words under this theme are now "picture", "color", and "camera lens". The details of the theme can therefore be determined from the top-ranked words. In this example, the theme obtained from Table 6 can be taken to concern picture color and the camera lens. It will be understood that, since the top-ranked words are the ones most closely related to the theme, a theme determined in this way will be more accurate.
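The sorting step can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the implementation of the disclosure: the function name `rank_words` and the `top_n` parameter are hypothetical, the three probabilities are taken from Table 5, and the "filler word" entry is an invented stand-in for the elided rows.

```python
def rank_words(word_probabilities, top_n=3):
    """Return the top_n (word, probability) pairs in descending probability order."""
    ranked = sorted(word_probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Values for "picture", "color", and "camera lens" come from Table 5.
# "filler word" and its probability are hypothetical, standing in for the elided rows.
table_5 = {
    "picture": 0.002,
    "color": 0.001,
    "filler word": 0.0001,
    "camera lens": 0.0005,
}

top_words = [word for word, _ in rank_words(table_5, top_n=3)]
print(top_words)  # ['picture', 'color', 'camera lens']
```

With K themes, the same function would be applied once per theme distribution, and a larger `top_n` (for example 10) yields a longer list of representative words for each theme.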
It will be understood that, although Tables 5 and 6 give only a simple example with the probabilities of 3 words, in a practical application the group of documents 110 may contain thousands of words or more. Tables 5 and 6 obtained by the method described above will then contain more rows, each giving the probability of one of the M words.
According to an example implementation of the present disclosure, a larger number of words can also be selected from the sorted probability distribution. For example, the words ranked in the top 10 can be selected. Assume the input keyword is "screen" and K=3 is set to obtain 3 themes. Table 7 below shows the 3 themes obtained under the target aspect "screen" based on the probability distributions: picture, menu, and imaging. Only the 10 words with the highest probability in each of the three themes are shown, and underlined words are words unrelated to the target aspect.
Table 7: Three themes under the "screen" aspect
According to an example implementation of the present disclosure, for the same group of documents 110, assume the input keyword 330 is "weight" and K=3 is set. The 3 themes shown in Table 8 below can then be generated: lens, battery, and carrying. Only the 10 words with the highest probability in each of the three themes are shown, and underlined words are words unrelated to the target aspect "weight".
Table 8: Three themes under the "weight" aspect
According to an example implementation of the present disclosure, the association between each word in the group of documents 110 and the keyword 330 can be fully taken into account, and only the one or more themes under the target aspect specified by the keyword 330 are generated. In this way, the shortcoming of existing topic models, which cannot specify a target aspect, can be remedied. Further, with an example implementation of the present disclosure, the number of themes can be specified: by setting the value of K in the probability distribution model, one or more themes under the specified target aspect can be determined at a finer granularity.
Multiple implementations of the method 400 for processing documents have been described in detail above. According to an example implementation of the present disclosure, an apparatus for processing documents is also provided, described in detail below with reference to Fig. 9. Fig. 9 schematically shows a block diagram of a document processing apparatus 900 according to an example implementation of the present disclosure. As shown in Fig. 9, the apparatus 900 includes: an obtaining module 910 configured to obtain a group of words included in a group of documents; a generation module 920 configured to generate a group of pseudo-documents based on the group of documents, a pseudo-document in the group of pseudo-documents describing the association between a word in the group of words and other words in the group of words; a determining module 930 configured to determine, based on a keyword specifying a target aspect of the group of documents and on the group of pseudo-documents, the probability distribution of the association between each word in the group of words and the keyword; and a topic module 940 configured to determine, based on the probability distribution, at least one theme involved in the group of documents and associated with the target aspect.
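The way the four modules hand data to one another can be sketched in Python as follows. This is a structural illustration only: the class name `DocumentProcessor`, the field names, and the callable signatures are assumptions chosen for the sketch, and each stage is injected as a function rather than implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class DocumentProcessor:
    """Hypothetical wiring of the four modules of apparatus 900."""
    obtain_words: Callable[[Sequence[str]], List[str]]                       # module 910
    generate_pseudo_documents: Callable[[Sequence[str], List[str]], List[dict]]  # module 920
    determine_distributions: Callable[[str, List[dict], List[str]], List[Dict[str, float]]]  # module 930
    determine_themes: Callable[[List[Dict[str, float]]], List[List[str]]]    # module 940

    def run(self, documents: Sequence[str], keyword: str) -> List[List[str]]:
        # Each stage consumes the output of the previous one.
        words = self.obtain_words(documents)
        pseudo_documents = self.generate_pseudo_documents(documents, words)
        distributions = self.determine_distributions(keyword, pseudo_documents, words)
        return self.determine_themes(distributions)
```

Passing the stages in as callables keeps the sketch independent of any particular probability distribution model; a concrete system would supply its own trained model for the third stage.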
According to an example implementation of the present disclosure, the generation module 920 includes: a pseudo-document generation module configured to generate a first pseudo-document associated with a first word in the group of words in the group of documents.
According to an example implementation of the present disclosure, the pseudo-document generation module includes: a frequency determining module configured to determine, based on co-occurrence between the first word and multiple other words in the group of words, a co-occurrence frequency between the first word and the multiple other words; and an establishing module configured to establish, based on the co-occurrence frequency, the first pseudo-document in the group of pseudo-documents associated with the first word.
According to an example implementation of the present disclosure, the determining module 930 includes: a scanning module configured to scan each document in the group of documents with a sliding window of predetermined length; an increasing module configured to increase the co-occurrence frequency in response to determining that the first word co-occurs, within the current range of the sliding window, with a word among the multiple other words; and a moving module configured to move the sliding window by a predetermined step.
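The sliding-window scan can be sketched in Python as follows. This is a minimal sketch under assumptions: documents are already tokenized into word lists, a pair is counted once for every window position in which both words appear, and the names `cooccurrence_counts`, `window`, and `step` are hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(documents, window=5, step=1):
    """Count, for each word, how often every other word shares a window with it.

    documents: list of token lists.
    Returns a mapping counts[first_word][other_word] -> co-occurrence frequency.
    Assumption: with step > 1, trailing tokens may fall outside the last window.
    """
    counts = defaultdict(Counter)
    for tokens in documents:
        last_start = max(len(tokens) - window, 0)
        # Move the window across the document by the predetermined step.
        for start in range(0, last_start + 1, step):
            span = tokens[start:start + window]
            for i, first in enumerate(span):
                for j, other in enumerate(span):
                    if i != j and first != other:
                        counts[first][other] += 1
    return counts


counts = cooccurrence_counts([["screen", "picture", "color"]], window=2, step=1)
print(counts["picture"]["color"])  # 1
```

Other counting conventions are possible, for example counting each pair at most once per document; the disclosure does not fix this detail in the passage above.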
According to an example implementation of the present disclosure, the establishing module includes: a document head generation module configured to add the first word to the first pseudo-document as the document head of the first pseudo-document; and a document body generation module configured to add, to the first pseudo-document, a second word that co-occurs with the first word as the document body of the first pseudo-document.
According to an example implementation of the present disclosure, the document body generation module includes: an adding module configured to add the second word to the first pseudo-document based on the co-occurrence frequency.
According to an example implementation of the present disclosure, the words included in the document body are unordered.
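A pseudo-document with a head and an unordered body can be sketched in Python as follows. The sketch assumes that "adding based on the co-occurrence frequency" means each co-occurring word appears in the body as many times as it co-occurred with the head word; that reading, the function name `build_pseudo_document`, and the example counts are assumptions for illustration.

```python
from collections import Counter


def build_pseudo_document(head_word, cooccurrence):
    """Build a pseudo-document for head_word.

    cooccurrence: mapping of co-occurring word -> co-occurrence frequency.
    The body is a Counter (a multiset), so it carries frequencies but no word order.
    """
    body = Counter({word: count for word, count in cooccurrence.items() if count > 0})
    return {"head": head_word, "body": body}


# Hypothetical co-occurrence counts for the head word "screen".
pseudo_document = build_pseudo_document("screen", {"picture": 2, "color": 1, "battery": 0})
print(sorted(pseudo_document["body"].elements()))  # ['color', 'picture', 'picture']
```

Using a multiset for the body reflects the statement that the words in the document body are unordered while still preserving how strongly each word is tied to the head word.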
According to an example implementation of the present disclosure, the determining module 930 includes: an acquisition module configured to obtain a probability distribution model describing the association between words and keywords; and a training module configured to train the probability distribution model based on the group of words in the group of pseudo-documents and the keyword, so as to obtain the probability distribution associating each word in the group of words with the keyword.
According to an example implementation of the present disclosure, the training module includes: a quantity obtaining module configured to obtain the number of the at least one theme associated with the target aspect; and a quantity-based training module configured to obtain, based on the number and the probability distribution model, at least one probability distribution in that number.
According to an example implementation of the present disclosure, the topic module 940 includes: a sorting module configured to sort the multiple words based on the probability distribution; and an identification module configured to identify, based on the sorted words, a theme among the at least one theme.
According to an example implementation of the present disclosure, the obtaining module 910 includes: a text processing module configured to perform text processing on the documents in the group of documents, so as to extract words having semantic meaning from the group of documents as the group of words.
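The text-processing step can be sketched in Python as follows. The sketch makes simplifying assumptions that are not part of the disclosure: the input is English-like text that can be split on non-alphanumeric characters (Chinese text would need a word segmenter instead), "words having semantic meaning" is approximated by dropping a small, illustrative stopword list, and all names are hypothetical.

```python
import re

# Illustrative stopword list only; a real system would use a fuller resource.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it"}


def extract_words(document):
    """Lowercase the document, split it into tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [token for token in tokens if token not in STOPWORDS]


def build_vocabulary(documents):
    """Collect the distinct extracted words across all documents, in first-seen order."""
    vocabulary = []
    seen = set()
    for document in documents:
        for word in extract_words(document):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


print(extract_words("The screen is bright and the color is vivid."))
# ['screen', 'bright', 'color', 'vivid']
```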
Fig. 10 shows a block diagram of a computing device 1000 that can implement multiple implementations of the present disclosure. The device 1000 can be used to implement the method described in Fig. 4. As shown, the device 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processing according to computer program instructions stored in a read-only memory (ROM) 1002 or loaded from a storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 can also store various programs and data required for the operation of the device 1000. The CPU 1001, ROM 1002, and RAM 1003 are connected to one another via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Multiple components of the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard or mouse; an output unit 1007, such as various types of displays and loudspeakers; a storage unit 1008, such as a magnetic disk or optical disc; and a communication unit 1009, such as a network card, modem, or wireless communication transceiver. The communication unit 1009 allows the device 1000 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The processing unit 1001 performs the methods and processing described above, such as the method 400. For example, in some implementations, the method 400 can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some implementations, part or all of the computer program can be loaded into and/or installed on the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the CPU 1001, one or more steps of the method 400 described above can be performed. Alternatively, in other implementations, the CPU 1001 can be configured to perform the method 400 in any other appropriate manner (for example, by means of firmware).
According to an example implementation of the present disclosure, a computer-readable storage medium having a computer program stored thereon is provided. The program, when executed by a processor, implements the method described in the present disclosure.
The functions described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on a chip (SOC), complex programmable logic devices (CPLD), and so on.
Program code for implementing the method of the present disclosure can be written in any combination of one or more programming languages. The program code can be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The program code can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In addition, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are contained in the discussion above, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.
Claims (22)
1. A document processing method, comprising:
obtaining a group of words included in a group of documents;
generating a group of pseudo-documents based on the group of documents, a pseudo-document in the group of pseudo-documents describing an association between a word in the group of words and other words in the group of words;
determining, based on a keyword specifying a target aspect of the group of documents and on the group of pseudo-documents, a probability distribution of the association between each word in the group of words and the keyword; and
determining, based on the probability distribution, at least one theme involved in the group of documents and associated with the target aspect.
2. The method according to claim 1, wherein generating the group of pseudo-documents based on the group of documents comprises:
generating a first pseudo-document associated with a first word in the group of words in the group of documents, comprising:
determining, based on co-occurrence between the first word and multiple other words in the group of words, a co-occurrence frequency between the first word and the multiple other words; and
establishing, based on the co-occurrence frequency, the first pseudo-document in the group of pseudo-documents associated with the first word.
3. The method according to claim 2, wherein determining the co-occurrence frequency comprises:
scanning each document in the group of documents with a sliding window of predetermined length;
in response to determining that the first word co-occurs, within the current range of the sliding window, with a word among the multiple other words, increasing the co-occurrence frequency; and
moving the sliding window by a predetermined step.
4. The method according to claim 2, wherein generating, based on the co-occurrence frequency, the first pseudo-document in the group of pseudo-documents associated with the first word comprises:
adding the first word to the first pseudo-document as the document head of the first pseudo-document; and
adding, to the first pseudo-document, a second word that co-occurs with the first word as the document body of the first pseudo-document.
5. The method according to claim 4, wherein adding, to the first pseudo-document, the second word that co-occurs with the first word comprises:
adding the second word to the first pseudo-document based on the co-occurrence frequency.
6. The method according to claim 4, wherein the words included in the document body are unordered.
7. The method according to claim 1, wherein determining the probability distribution of the association between each word in the group of words and the keyword comprises:
obtaining a probability distribution model describing the association between words and keywords; and
training the probability distribution model based on the group of words in the group of pseudo-documents and the keyword, so as to obtain the probability distribution associating each word in the group of words with the keyword.
8. The method according to claim 7, wherein obtaining the probability distribution further comprises:
obtaining the number of the at least one theme associated with the target aspect; and
obtaining, based on the number and the probability distribution model, at least one probability distribution in that number.
9. The method according to claim 1, wherein determining the at least one theme involved in the group of documents and associated with the specified aspect comprises:
sorting the multiple words based on the probability distribution; and
identifying, based on the sorted multiple words, a theme among the at least one theme.
10. The method according to claim 1, wherein obtaining the group of words included in the group of documents comprises:
performing text processing on the documents in the group of documents, so as to extract words having semantic meaning from the group of documents as the group of words.
11. A document processing apparatus, comprising:
an obtaining module configured to obtain a group of words included in a group of documents;
a generation module configured to generate a group of pseudo-documents based on the group of documents, a pseudo-document in the group of pseudo-documents describing an association between a word in the group of words and other words in the group of words;
a determining module configured to determine, based on a keyword specifying a target aspect of the group of documents and on the group of pseudo-documents, a probability distribution of the association between each word in the group of words and the keyword; and
a topic module configured to determine, based on the probability distribution, at least one theme involved in the group of documents and associated with the target aspect.
12. The apparatus according to claim 11, wherein the generation module comprises:
a pseudo-document generation module configured to generate a first pseudo-document associated with a first word in the group of words in the group of documents, comprising:
a frequency determining module configured to determine, based on co-occurrence between the first word and multiple other words in the group of words, a co-occurrence frequency between the first word and the multiple other words; and
an establishing module configured to establish, based on the co-occurrence frequency, the first pseudo-document in the group of pseudo-documents associated with the first word.
13. The apparatus according to claim 12, wherein the determining module comprises:
a scanning module configured to scan each document in the group of documents with a sliding window of predetermined length;
an increasing module configured to increase the co-occurrence frequency in response to determining that the first word co-occurs, within the current range of the sliding window, with a word among the multiple other words; and
a moving module configured to move the sliding window by a predetermined step.
14. The apparatus according to claim 12, wherein the establishing module comprises:
a document head generation module configured to add the first word to the first pseudo-document as the document head of the first pseudo-document; and
a document body generation module configured to add, to the first pseudo-document, a second word that co-occurs with the first word as the document body of the first pseudo-document.
15. The apparatus according to claim 14, wherein the document body generation module comprises:
an adding module configured to add the second word to the first pseudo-document based on the co-occurrence frequency.
16. The apparatus according to claim 14, wherein the words included in the document body are unordered.
17. The apparatus according to claim 11, wherein the determining module comprises:
an acquisition module configured to obtain a probability distribution model describing the association between words and keywords; and
a training module configured to train the probability distribution model based on the group of words in the group of pseudo-documents and the keyword, so as to obtain the probability distribution associating each word in the group of words with the keyword.
18. The apparatus according to claim 17, wherein the training module comprises:
a quantity obtaining module configured to obtain the number of the at least one theme associated with the target aspect; and
a quantity-based training module configured to obtain, based on the number and the probability distribution model, at least one probability distribution in that number.
19. The apparatus according to claim 11, wherein the topic module comprises:
a sorting module configured to sort the multiple words based on the probability distribution; and
an identification module configured to identify, based on the sorted multiple words, a theme among the at least one theme.
20. The apparatus according to claim 11, wherein the obtaining module further comprises:
a text processing module configured to perform text processing on the documents in the group of documents, so as to extract words having semantic meaning from the group of documents as the group of words.
21. A document processing device, comprising:
one or more processors; and
a storage apparatus for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 10.
22. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910194822.0A CN109857942A (en) | 2019-03-14 | 2019-03-14 | For handling the method, apparatus, equipment and storage medium of document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109857942A (en) | 2019-06-07 |
Family
ID=66900886
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201910194822.0A CN109857942A (en) | Pending | 2019-03-14 | 2019-03-14 | For handling the method, apparatus, equipment and storage medium of document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109857942A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427488A (en) * | 2019-07-30 | 2019-11-08 | 北京明略软件系统有限公司 | The processing method and processing device of document |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140108327A1 (en) * | 2011-04-19 | 2014-04-17 | Yahoo! Inc. | System and method for mining tags using social endorsement networks |
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN105955957A (en) * | 2016-05-05 | 2016-09-21 | 北京邮电大学 | Determining method and device for aspect score in general comment of merchant |
CN107220232A (en) * | 2017-04-06 | 2017-09-29 | 北京百度网讯科技有限公司 | Keyword extracting method and device, equipment and computer-readable recording medium based on artificial intelligence |
CN108710611A (en) * | 2018-05-17 | 2018-10-26 | 南京大学 | A kind of short text topic model generation method of word-based network and term vector |
CN108763390A (en) * | 2018-05-18 | 2018-11-06 | 浙江新能量科技股份有限公司 | Fine granularity subject distillation method based on sliding window technique |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |