CN103870440B - A kind of text data processing method and device - Google Patents
A kind of text data processing method and device
- Publication number: CN103870440B
- Application number: CN201210534859.1A
- Authority: CN (China)
- Prior art keywords: text, question, answer, answer text, parameter
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a text data processing method and device, applied in an interactive question answering system in which at least one question text is stored and each question text corresponds to at least one answer text, in order to improve the accuracy with which the relevance between an answer text and its question text is measured. The text data processing method includes: receiving a new answer text submitted by a user for any question text; segmenting the new answer text into words to obtain all the words it contains; and determining, from all the words the new answer text contains and a first preset algorithm, the response variable parameter corresponding to the new answer text, where the first preset algorithm is determined from a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a text data processing method and device.
Background art
With the development of network communication technology, obtaining answers to questions from the network has become an effective means of solving problems. For the same question, however, many answers may exist on the network, and how to measure which answer is more accurate, that is, the relevance between an answer and its question, has become a research hotspot.
At present, this relevance is usually measured by computing the text similarity between the question and the answer. In general, however, questions are brief and contain few words, so a semantic gap exists between a question and its corresponding answers; measuring relevance with traditional text similarity therefore produces large errors.
Summary of the invention
The embodiments of the present invention provide a text data processing method, in order to improve the accuracy with which the relevance between an answer text and a question text is measured.
An embodiment of the present invention provides a text data processing method, applied in an interactive question answering system in which at least one question text is stored and each question text corresponds to at least one answer text, including:
receiving a new answer text submitted by a user for any question text;
segmenting the new answer text into words to obtain all the words it contains;
determining, from all the words the new answer text contains and a first preset algorithm, the response variable parameter corresponding to the new answer text, where the first preset algorithm is determined from a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.
An embodiment of the present invention provides a text data processing device, applied in an interactive question answering system in which at least one question text is stored and each question text corresponds to at least one answer text, including:
a receiving unit, configured to receive a new answer text submitted by a user for any question text;
a segmentation unit, configured to segment the new answer text into words to obtain all the words it contains;
a determining unit, configured to determine, from all the words the new answer text contains and a first preset algorithm, the response variable parameter corresponding to the new answer text, where the first preset algorithm is determined from a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.
With the text data processing method provided by the embodiments of the present invention, a received new answer text to be evaluated is segmented into the words it contains, and from all the words the new answer text contains, a preset algorithm determines the response variable parameter corresponding to the new answer text. This response variable parameter reflects the degree of matching between the new answer text and the question text: the larger the determined response variable parameter, the better the answer text matches the question text; conversely, the smaller it is, the worse the match.
Other features and advantages of the present invention will be set forth in the following description, will in part become apparent from the description, or may be learned by practicing the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description, claims, and accompanying drawings.
Brief description of the drawings
The accompanying drawings described herein are provided for a further understanding of the present invention and constitute a part of it; the schematic embodiments of the present invention and their descriptions are used to explain the invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic diagram of the document generation process under the LDA model in the prior art;
Fig. 2 is a schematic diagram of the document generation process under the sLDA model in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of the text data processing method in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the text data processing device in an embodiment of the present invention.
Detailed description of the invention
In order to improve the accuracy with which the matching between an answer text and a question text is measured, embodiments of the present invention provide a text data processing method and device.
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present invention, not to limit it, and that in the absence of conflict the embodiments of the present invention and the features in the embodiments may be combined with each other.
To better understand the embodiments of the present invention, the Latent Dirichlet Allocation (LDA) topic model is introduced first. LDA is a typical directed probabilistic graphical model with a clear three-level hierarchy: the document-collection level, the document level, and the word level. The LDA model can identify hidden topic information in large-scale document collections. For a collection of question-answer documents, for example, each question-answer pair can be regarded as one document; each document represents a probability distribution over several topics, and each topic represents a probability distribution over many words. Each document can therefore be generated by the following process: 1) for the document, draw a topic from its topic distribution; 2) draw a word from the word distribution corresponding to the drawn topic; 3) repeat this process until every word of the document has been generated. Put more formally, each document corresponds to a multinomial distribution over T topics (T is given in advance, for example by repeated experiments), denoted θ, and each topic in turn corresponds to a multinomial distribution over the V words of the vocabulary, denoted φ, where the vocabulary consists of the distinct words of all documents in the collection. θ and φ have Dirichlet prior distributions with hyperparameters α and β respectively. For each word of a document d, a topic z is drawn from the multinomial distribution θ corresponding to the document, and then a word w is drawn from the multinomial distribution φ corresponding to topic z; repeating this process N_d times, where N_d is the total number of words contained in document d, generates document d. As shown in Fig. 1, this generation process can be represented by a graphical model: shaded circles represent observable variables, unshaded circles represent latent variables, a box represents repeated sampling, and the number of repetitions is given at the lower right corner of the box.
The LDA model introduced above is an unsupervised topic model. Compared with the unsupervised model, the supervised topic model sLDA introduces a response variable parameter; introducing the response variable parameter improves the accuracy of topic extraction, and thereby further improves the accuracy with which the matching between an answer text and a question text is measured.
In the embodiments of the present invention, a question text together with one answer text is regarded as one document, and the above response variable parameter is a parameter describing whether the answer text is the best answer text for the question text. If the answer text is the best answer for the question text, then the answer text and the question text have a high similarity in their topic distributions, and the corresponding response variable parameter is set to 1; otherwise, the answer text and the question text have little co-occurrence in their topic distributions, and the corresponding response variable parameter is set to 0. Based on the relation between the response variable and the question text-answer text pair, the latent topics in existing documents can be better discovered, and the training result on existing documents can then be used to determine the response variable parameter for a new question text-answer text pair.
On this basis, in the embodiments of the present invention, a calculation model for the response variable parameter between an answer text and a question text is determined from the training result on known question text-answer texts (i.e. existing history answer texts). For a new answer text corresponding to a question text, the response variable parameter between the new answer text and the question text can then be determined from this calculation model: the higher the response variable parameter, the more relevant the new answer text is to the question text, and the more likely it is to be the best answer; conversely, the lower it is, the less relevant the answer text is to the question text — the new answer text may be an irrelevant answer text, or even a spam answer text.
The following introduces how a document d is generated under the sLDA topic model. Similarly to the LDA model, in the sLDA model document d is assumed to have a multinomial distribution over topics; each word contained in d is drawn jointly from its topic z and the topic-word distributions β; and the response variable parameter b is drawn, given the topics z, from a normal distribution with parameters η and σ. In the sLDA model, the generative model of a document can therefore be divided into the following three parts:
1) For document d, its topic mixture θ is drawn from a Dirichlet distribution with parameter α; the Dirichlet distribution is the conjugate prior of the multinomial distribution. If a k-dimensional random vector θ follows a Dirichlet distribution, then the k components θ_1, θ_2, ..., θ_k of θ take continuous nonnegative values and θ_1 + θ_2 + ... + θ_k = 1. In specific implementations, since the same question text may have multiple known answer texts, and the question text forms one document with each answer text, multiple documents may exist for the same question; topic sampling is carried out for each document, and all documents are traversed to determine all topics, finally giving each document a topic probability distribution θ | α ~ Dir(α). For example, suppose each document is composed of 3 topics; then θ represents the probability with which each topic occurs, for example {1/6, 2/6, 3/6}. Different documents have different θ, and θ can be used to judge the similarity of documents.
2) For each word w contained in document d, its generation process can be divided into the following two steps: the first step is sampling the topic z, which follows the multinomial distribution on θ, i.e. z | θ ~ Mult(θ); the second step is sampling the word w, which is drawn given topic z and the topic-word parameters β, i.e. w | z, β ~ Mult(β).
3) The response variable parameter b, given the topics z, follows a normal distribution with parameters η and σ, i.e. b | z_{1:N}, η, σ² ~ N(η^T z̄, σ²), where z̄ = (1/N) Σ_{n=1}^{N} z_n, and N represents the number of words included in the question text and the known history answer text corresponding to that question text.
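The three parts above can be sketched as code. This is an illustrative sketch under the stated distributions, not the patent's implementation; the function name and the toy parameters in the usage below are invented for the example.

```python
import random

def generate_slda_document(theta, beta, eta, sigma, n_words, rng):
    """Sketch of the sLDA generative process for one document.

    theta      : topic mixture for the document (components sum to 1)
    beta       : K x V per-topic word distributions
    eta, sigma : parameters of the normal response model
    Returns (words, topics, b), where b ~ N(eta . z_bar, sigma^2) and
    z_bar is the empirical topic frequency vector (1/N) * sum_n z_n.
    """
    K = len(theta)

    def sample_categorical(probs):
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return len(probs) - 1

    # part 2): draw a topic for each word, then a word from that topic
    topics = [sample_categorical(theta) for _ in range(n_words)]
    words = [sample_categorical(beta[z]) for z in topics]

    # part 3): response b depends on the empirical topic frequencies z_bar
    z_bar = [topics.count(k) / n_words for k in range(K)]
    mean = sum(e * z for e, z in zip(eta, z_bar))
    b = rng.gauss(mean, sigma)
    return words, topics, b
```

With η = [1, 0] and a small σ, b tracks the fraction of words assigned to topic 0, which is exactly the coupling between topics and the response that sLDA adds over LDA.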
As shown in Fig. 2, in the generation steps of the above document the variables w and b can be obtained directly from known documents, while the other parameters α, β, η, σ² require parameter estimation. Preferably, in the embodiments of the present invention, the expectation-maximization (EM) algorithm can be used for parameter estimation. The EM algorithm is an iterative algorithm for solving maximum likelihood estimates when the observed data are incomplete, or for approximately maximizing a posterior distribution; it adds "hidden variables" on the basis of the observed data, thereby replacing one complicated calculation with a series of simple maximizations. The EM algorithm alternates between two calculation steps: the first step computes an expectation (E), using the existing estimates of the hidden variables to compute their expected likelihood; the second step maximizes (M), computing the parameter values that maximize the likelihood found in the E step. The parameter estimates found in the M step are then used in the next E-step calculation, and this process alternates continuously until it converges to a value. The two steps are introduced below for the embodiments of the present invention:
The first step: expectation estimation
For a given question-answer pair (i.e. one document), the prior probability of the latent variables θ and z can be determined by formula (1):
In formula (1), the calculation is normalized by the marginal probability of the observable word variables w and the response variable parameter b in the known document. As with the unsupervised topic model LDA, the likelihood of each latent variable in formula (1) is difficult to compute exactly, so in the embodiments of the present invention a variational method is used to approximately estimate the prior probability of the latent variables. The objective function of this variational analysis is shown in formula (2):
where the expectation with respect to the variational distribution q is computed as shown in formula (3):
where γ is a K-dimensional Dirichlet parameter vector, representing a distribution over K elements, and the expected probability distribution of the response variable under a given topic can be obtained by calculation, as shown in formula (4):
In formula (4), E[Z] and E[ZZ^T] can be determined according to formulas (5) and (6) respectively.
Through the above formulas (3)-(6), the variational variables φ and γ in formula (2) can be determined, and thereby the prior probability.
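For orientation, in the published sLDA formulation (which this section follows), the expectations in formula (4) have a closed form in terms of the per-word variational topic distributions φ_n — an assumption here, since the patent's formula images give the authoritative versions:

```latex
\mathbb{E}[\bar{Z}] \;=\; \frac{1}{N}\sum_{n=1}^{N} \phi_n,
\qquad
\mathbb{E}[\bar{Z}\bar{Z}^{\top}]
\;=\; \frac{1}{N^{2}}\Big(\sum_{n=1}^{N}\sum_{m \neq n} \phi_n \phi_m^{\top}
\;+\; \sum_{n=1}^{N} \operatorname{diag}(\phi_n)\Big).
```

The diagonal correction in the second moment arises because each z_n is a one-hot indicator, so E[z_n z_n^T] = diag(φ_n) rather than φ_n φ_n^T.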
The second step: parameter maximization
In the parameter maximization step, the parameter values of β, η and σ² in the calculation model are determined by maximizing the likelihood over each question text-answer text pair (i.e. each document); the calculations of the β, η and σ² values are shown in formulas (7), (8) and (9) respectively:
where D represents the number of question-answer pairs in the known documents, N represents the number of distinct words contained in the training data set, and A is a D×K matrix, each row of which corresponds to one question-answer document.
In the above process, each iteration inputs α and β in the first step and computes the likelihood function — that is, the variational inference process (variational inference actually approximates the posterior distribution with a function) — and then maximizes this function in the second step to obtain new α and β. Iterating in this way until convergence yields the final α and β values. Thus, by repeatedly iterating the expectation estimation and parameter maximization calculations, the topic distribution of each question-answer pair in the known documents can be learned, and the parameter estimation results of each latent variable obtained.
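The alternation just described can be sketched as a generic variational EM loop. This is a skeleton only: the patent's actual E and M steps are given by formulas (1)-(6) and (7)-(9), so they are passed in here as caller-supplied callables, and all names are invented for the example.

```python
def fit_slda_em(documents, responses, init_params, e_step, m_step,
                max_iter=50, tol=1e-4):
    """Skeleton of the variational EM loop described above.

    e_step(params, documents, responses) -> (variational_stats, bound)
        stands in for formulas (1)-(6): estimate the latent variables
        and the variational bound under the current parameters.
    m_step(variational_stats, documents, responses) -> new_params
        stands in for formulas (7)-(9): re-estimate the parameters.
    The loop alternates the two until the bound stops improving.
    """
    params = init_params
    prev_bound = float("-inf")
    for _ in range(max_iter):
        stats, bound = e_step(params, documents, responses)
        params = m_step(stats, documents, responses)
        if abs(bound - prev_bound) < tol:
            break
        prev_bound = bound
    return params, bound
```

With a toy E step whose bound is the negative squared error and a toy M step that returns the mean response, the loop converges to the mean after a few iterations, illustrating the alternation without committing to the patent's formulas.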
In specific implementations, an interactive question answering system may store a large number of question texts, and more than one answer text may exist for a question text; therefore, multiple question text-answer text pairs may exist for each question text. By applying the sLDA model to the stored question text-answer text pairs, the latent variable parameters α, β, η, σ² can be estimated to obtain concrete parameter values.
On this basis, an embodiment of the present invention provides a text data processing method that can be applied in an interactive question answering system in which at least one question text is stored and each question text corresponds to at least one answer text. As shown in Fig. 3, the method may comprise the following steps:
S301, receiving a new answer text submitted by a user for any question text;
S302, segmenting the new answer text into words to obtain all the words the new answer text contains;
In specific implementations, the obtained answer text can be segmented, and all the words the answer text contains determined from the segmentation result.
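Step S302 might look like the following. The patent does not name a segmenter, so this is only a stand-in: real Chinese answer texts would need a dictionary-based word segmenter, while here lowercase alphanumeric runs are extracted as "words" for illustration; the function name is invented.

```python
import re

def segment_answer(text):
    """Placeholder tokenizer for step S302: split an answer text into words.

    Lowercases the text and extracts alphanumeric runs. A production
    system would substitute a proper word segmenter here.
    """
    return re.findall(r"[a-z0-9]+", text.lower())
```

The resulting word list is what the first preset algorithm consumes in step S303.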
S303, determining, from all the words the new answer text contains and a first preset algorithm, the response variable parameter between the answer text and the question text.
Here the first preset algorithm is determined from a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.
In specific implementations, the response variable parameter corresponding to the new answer text can be determined according to formula (10):
where z ranges over the topics included in the question text and the at least one stored answer text corresponding to that question text; w is all the words the new answer text contains; and α, β, η, σ² are the parameters determined from the question text, the at least one stored answer text corresponding to it, and a second preset algorithm. The second preset algorithm can be, but is not limited to, the EM (expectation-maximization) algorithm.
According to formula (10), the response variable parameter between a certain new answer text and the question text can be determined: the larger the response variable parameter, the higher the relevance between the answer and the question text. Further, the answer text with the highest relevance will be regarded as the best answer corresponding to the question text; conversely, the smaller the response variable parameter, the lower the relevance between the answer and the question text — the answer may be an irrelevant answer, or even a spam answer.
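The prediction in formula (10) can be sketched as follows. This assumes the standard sLDA approximation E[b | w, α, β, η, σ²] ≈ η·z̄, where z̄ averages the per-word variational topic distributions from a fitted model — the patent's exact formula is given only as an image, and the function and argument names are invented.

```python
def predict_response(word_topic_probs, eta):
    """Approximate formula (10): expected response for a new answer text.

    word_topic_probs : one length-K probability vector per word of the new
                       answer text (the E-step's variational distributions,
                       assumed to come from a trained sLDA model).
    eta              : length-K regression coefficients from training.
    Returns eta . z_bar with z_bar = (1/N) sum_n phi_n.
    """
    n = len(word_topic_probs)
    k = len(eta)
    # z_bar: average topic weight across the words of the answer text
    z_bar = [sum(phi[j] for phi in word_topic_probs) / n for j in range(k)]
    return sum(e * z for e, z in zip(eta, z_bar))
```

A larger returned value corresponds to a higher relevance between the new answer text and the question text, as described above.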
Preferably, in specific implementations, after the response variable parameter corresponding to each answer text is determined, the match parameter corresponding to that response variable parameter can also be looked up in pre-established mapping relations between response variable parameters and match parameters; the match parameter represents the value of the new answer text.
For example, in a particular application, the response variable parameter can be mapped to different score values; such a score value can reflect the reference value of a certain answer text, or serve as an evaluation score for the user who submitted the new answer text.
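One hypothetical such mapping — the patent only says a pre-established mapping exists, so the linear clamp-and-rescale below and its range parameters are invented for illustration:

```python
def response_to_score(b, b_min=0.0, b_max=1.0, score_max=100):
    """Map a response variable parameter to a display score.

    Clamps b into [b_min, b_max] and rescales it linearly to
    [0, score_max], so larger response values give higher scores.
    """
    clamped = max(b_min, min(b_max, b))
    return round((clamped - b_min) / (b_max - b_min) * score_max)
```

In practice the mapping could equally be a lookup table keyed on response-parameter ranges, as the "mapping relations" wording suggests.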
The embodiments of the present invention can be applied to the following two scenarios:
Scenario one: training an answer relevance evaluation model on solved questions with the supervised topic model sLDA.
First, several solved question texts and all the answer texts corresponding to them are taken out. Each solved question text includes one best answer text (for example, the label corresponding to this answer text is "best answer") and some other irrelevant answer texts (for example, the label corresponding to such an answer text is "other answer"). From the solved question texts, multiple question-answer text pairs (Question-Answer Pairs) can be generated; the response variable of the best answer text is labeled 1, and the response variables of the other answer texts are labeled 0. The obtained question-answer text pairs are then trained with the supervised topic model sLDA, the parameters α, β, η, σ² are estimated, and a trained model M is obtained. For a new question-answer pair, for example the pair "How do I install a Linux virtual machine under a win7 system?" - "To install a virtual machine you need to: download vmware, which is virtualization software; after running vmware, install the virtual system from a Linux image file. If someone else has an installed virtual machine file, you can copy and use it directly (not the hard-disk-installed virtual files)", the trained model M is used and the above formula (10) is calculated to obtain the response variable parameter between this answer text and the question text; the higher the value, the more relevant the answer, and conversely the less relevant.
Scenario two: application to the best answers of forum posts.
For a certain forum, some posts in the database that have replies (follow-up posts) are selected as training data, and the replies in the posts are then labeled: the response variable of a best reply is labeled 1, and the response variables of the other replies are labeled 0. Finally, the collected "post-reply" text pairs are trained with the supervised topic model sLDA, the parameters α, β, η, σ² are estimated, and a trained model M is obtained. A new reply to a post can be combined with the post into a "post-reply" text pair; using the trained model M, the response variable parameter of this "post-reply" pair is calculated by formula (10); the larger the value, the better the reply, and conversely the worse.
Based on the same inventive concept, an embodiment of the present invention further provides a text data processing device. Since the principle by which the device solves the problem is similar to that of the above text data processing method, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.
Fig. 4 is a schematic structural diagram of the text data processing device provided by an embodiment of the present invention, which includes:
a receiving unit 401, configured to receive a new answer text submitted by a user for any question text;
a segmentation unit 402, configured to segment the new answer text into words to obtain all the words it contains;
a determining unit 403, configured to determine, from all the words the new answer text contains and a first preset algorithm, the response variable parameter corresponding to the new answer text.
Here the first preset algorithm is determined from a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.
In specific implementations, the determining unit 403 may be used to determine the response variable parameter corresponding to the new answer text according to the following formula, where E[b | w, α, β, η, σ²] is the response variable parameter corresponding to the answer text; z ranges over the topics included in the question text and the at least one stored answer text corresponding to that question text; w is all the words the new answer text contains; and α, β, η, σ² are the parameters determined from the question text, the at least one stored answer text corresponding to it, and a second preset algorithm.
In specific implementations, the determining unit 403 may be used to determine z̄ = (1/N) Σ_{n=1}^{N} z_n, where N represents the number of words included in the question text and the at least one stored answer text corresponding to that question text.
In specific implementations, the text processing device provided by the embodiment of the present invention may further include:
a lookup unit, configured to determine, from the response variable parameter determined by the determining unit and the pre-established mapping relations between response variable parameters and match parameters, the match parameter between this answer text and the question text, where the match parameter represents the value of the new answer text.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, and optical memory) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific way, such that the instructions stored in the computer-readable memory produce a manufactured article including an instruction device, the instruction device realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, such that a series of operation steps are performed on the computer or other programmable equipment to produce computer-implemented processing, so that the instructions executed on the computer or other programmable equipment provide steps for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art, once they learn of the basic inventive concept, may make further changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to encompass such changes and modifications.
Claims (7)
1. A text data processing method, applied in an interactive question answering system, wherein the interactive question answering system stores at least one question text and each question text corresponds to at least one answer text, the method comprising:
receiving a new answer text submitted by a user for any question text;
performing word segmentation on the new answer text to obtain all the words contained in the new answer text; and
determining, according to all the words contained in the new answer text and a first preset algorithm, a response variable parameter corresponding to the new answer text, wherein the first preset algorithm is determined according to the question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text;
wherein the response variable parameter corresponding to the new answer text is determined by the formula E[b|w,α,β,η,σ²] = ηᵀ E[z̄|w,α,β], wherein:
E[b|w,α,β,η,σ²] is the response variable parameter corresponding to the new answer text;
z denotes the topics included in the at least one answer text, stored in the interactive question answering system, corresponding to the question text;
w denotes all the words contained in the new answer text; and
α, β, η and σ² are parameters determined according to the question text, the at least one answer text corresponding to that question text stored in the interactive question answering system, and a second preset algorithm.
2. the method for claim 1, it is characterised in that determine according to below equation Wherein:
N represents that in described question text and described mutual question answering system, this question text of storage is corresponding extremely
The quantity of the word included in a few answer text, znRepresent described question text and described mutual question and answer
The n-th word institute included at least one answer text that in system, this question text of storage is corresponding is right
The theme answered.
3. the method for claim 1, it is characterised in that described second preset algorithm includes expectation
Maximize EM algorithm.
4. the method for claim 1, it is characterised in that also include:
According to described response variable parameter, reflecting between response variable parameter and the match parameter pre-build
Penetrate in relation, search the match parameter that described response variable parameter is corresponding, described match parameter represent described newly
The value of answer text.
5. A text data processing device, applied in an interactive question answering system, wherein the interactive question answering system stores at least one question text and each question text corresponds to at least one answer text, the device comprising:
a receiving unit, configured to receive a new answer text submitted by a user for any question text;
a word segmentation unit, configured to perform word segmentation on the new answer text to obtain all the words contained in the new answer text; and
a determining unit, configured to determine, according to all the words contained in the new answer text and a first preset algorithm, a response variable parameter corresponding to the new answer text, wherein the first preset algorithm is determined according to the question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text;
wherein the determining unit is specifically configured to determine the response variable parameter corresponding to the new answer text according to the formula E[b|w,α,β,η,σ²] = ηᵀ E[z̄|w,α,β], wherein:
E[b|w,α,β,η,σ²] is the response variable parameter corresponding to the new answer text;
z denotes the topics included in the at least one answer text, stored in the interactive question answering system, corresponding to the question text;
w denotes all the words contained in the new answer text; and
α, β, η and σ² are parameters determined according to the question text, the at least one answer text corresponding to that question text stored in the interactive question answering system, and a second preset algorithm.
6. The device of claim 5, wherein the determining unit is specifically configured to determine z̄ according to the formula z̄ = (1/N) Σ_{n=1}^{N} z_n, wherein:
N represents the quantity of words included in the at least one answer text, stored in the interactive question answering system, corresponding to the question text, and z_n represents the topic corresponding to the n-th word included in the at least one answer text, stored in the interactive question answering system, corresponding to the question text.
7. The device of claim 5, further comprising:
a lookup unit, configured to look up, according to the response variable parameter, in a pre-established mapping relation between response variable parameters and match parameters, the match parameter between the answer text and the question text, wherein the match parameter represents the value of the new answer text.
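To make the claimed computation concrete, the following is a minimal sketch of the scoring pipeline the claims describe: averaging per-word topic variables into z̄ (claims 2 and 6), taking the inner product ηᵀz̄ as the response variable parameter (claims 1 and 5, which resemble the prediction step of a supervised-LDA-style model), and mapping that value to a match parameter through a pre-built lookup table (claims 4 and 7). This is an illustrative approximation, not the patented implementation: the function names, the toy topic expectations, and the threshold table are all hypothetical, and the inference that would produce the per-word topic expectations (the EM-based second preset algorithm of claim 3) is assumed to have run already.

```python
import numpy as np
from bisect import bisect_right

def response_variable(word_topic_expectations, eta):
    """Approximate E[b|w] as eta^T z_bar.

    word_topic_expectations: (num_words, num_topics) array of per-word
    topic posteriors, assumed already inferred by a trained model.
    eta: (num_topics,) regression weights fitted on stored Q&A pairs.
    """
    # z_bar: empirical mean of the per-word topic variables (claim 2)
    z_bar = word_topic_expectations.mean(axis=0)
    # response variable parameter: inner product with the weights eta
    return float(eta @ z_bar)

def match_parameter(b, thresholds, labels):
    """Map a response variable to a match parameter via a pre-built
    table (claim 4). Requires len(labels) == len(thresholds) + 1."""
    return labels[bisect_right(thresholds, b)]

# Toy example: a 3-word answer, 2 topics; eta favors topic 0.
expectations = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 0.0]])
eta = np.array([0.9, 0.3])
b = response_variable(expectations, eta)  # (2/3)*0.9 + (1/3)*0.3 = 0.7
grade = match_parameter(b, thresholds=[0.3, 0.6],
                        labels=["low", "medium", "high"])
print(round(b, 6), grade)  # 0.7 high
```

The lookup table stands in for the claimed "pre-established mapping relation": any monotone bucketing of the response variable into answer-value grades would fit the claim language equally well.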
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210534859.1A CN103870440B (en) | 2012-12-12 | A kind of text data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103870440A CN103870440A (en) | 2014-06-18 |
CN103870440B true CN103870440B (en) | 2016-11-30 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6282534B1 (en) * | 1998-03-13 | 2001-08-28 | Intel Corporation | Reverse content indexing |
CN1794233A (en) * | 2005-12-28 | 2006-06-28 | 刘文印 | Network user interactive asking answering method and its system |
CN101118554A (en) * | 2007-09-14 | 2008-02-06 | 中兴通讯股份有限公司 | Intelligent interactive request-answering system and processing method thereof |
CN101369265A (en) * | 2008-01-14 | 2009-02-18 | 北京百问百答网络技术有限公司 | Method and system for automatically generating semantic template of problem |
CN102521239A (en) * | 2011-11-14 | 2012-06-27 | 江苏联著实业有限公司 | Question-answering information matching system and method based on OWL (web ontology language) for Internet |
Non-Patent Citations (3)
Title |
---|
An answer ranking model based on syntactic analysis and statistical methods; Li Bo et al.; Journal of Chinese Information Processing; 2009-03-31; Vol. 23, No. 2; pp. 23-28 * |
An improved answer extraction method based on pattern matching; Zhan Xuegang et al.; Information Studies: Theory & Application; 2009-09-30; Vol. 32, No. 9; pp. 105-108 * |
Research and implementation of question and answer recommendation mechanisms in Q&A communities; Qu Mingcheng; China Masters' Theses Full-text Database, Information Science and Technology; 2010-08-15; Vol. 2010, No. 8; pp. I138-926 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111353037B (en) | Topic generation method and device and computer readable storage medium | |
CN108563703A (en) | A kind of determination method of charge, device and computer equipment, storage medium | |
Sukhija et al. | The recent state of educational data mining: A survey and future visions | |
CN110941723A (en) | Method, system and storage medium for constructing knowledge graph | |
Hromkovič | Theoretical computer science: introduction to Automata, computability, complexity, algorithmics, randomization, communication, and cryptography | |
CN102439597A (en) | Parameter deducing method, computing device and system based on potential dirichlet model | |
CN102831119B (en) | Short text clustering Apparatus and method for | |
CN106599194A (en) | Label determining method and device | |
Mostaeen et al. | Clonecognition: machine learning based code clone validation tool | |
Nababan et al. | Determination feasibility of poor household surgery by using weighted product method | |
CN112131587A (en) | Intelligent contract pseudo-random number security inspection method, system, medium and device | |
CN107133218A (en) | Trade name intelligent Matching method, system and computer-readable recording medium | |
CN103870440B (en) | A kind of text data processing method and device | |
Yang et al. | Prototype-guided pseudo labeling for semi-supervised text classification | |
DeLaVina | Some history of the development of Graffiti | |
CN103514194B (en) | Determine method and apparatus and the classifier training method of the dependency of language material and entity | |
Li et al. | Evaluating indicators of answer quality in social Q&A websites | |
Ma et al. | Selecting test inputs for DNNs using differential testing with subspecialized model instances | |
Capuano et al. | LIA: an Intelligent Advisor for e-Learning | |
CN107102543A (en) | The forming method and device of a kind of energy router anti-interference controller | |
CN106897436A (en) | A kind of academic research hot keyword extracting method inferred based on variation | |
CN106971306A (en) | The recognition methods of product problem and system | |
Labutov et al. | Optimally Discriminative Choice Sets in Discrete Choice Models: Application to Data-Driven Test Design | |
CN104281670B (en) | The real-time incremental formula detection method and system of a kind of social networks event | |
Ramathulasi et al. | Enhanced PMF model to predict user interest for web API recommendation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |