CN103870440B - A text data processing method and device - Google Patents

A text data processing method and device

Info

Publication number
CN103870440B
CN103870440B (application CN201210534859.1A)
Authority
CN
China
Prior art keywords
text
question
answer
answer text
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210534859.1A
Other languages
Chinese (zh)
Other versions
CN103870440A (en)
Inventor
凌俊民
刘晓峰
梁耿
李广杰
韦媚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Group Guangxi Co Ltd
Original Assignee
China Mobile Group Guangxi Co Ltd
Filing date
Publication date
Application filed by China Mobile Group Guangxi Co Ltd filed Critical China Mobile Group Guangxi Co Ltd
Priority to CN201210534859.1A priority Critical patent/CN103870440B/en
Publication of CN103870440A publication Critical patent/CN103870440A/en
Application granted granted Critical
Publication of CN103870440B publication Critical patent/CN103870440B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a text data processing method and device, applied in an interactive question answering system in which at least one question text is stored and each question text corresponds to at least one answer text, in order to improve the accuracy of measuring the relevance between answer texts and question texts. The text data processing method includes: receiving a new answer text submitted by a user for any question text; segmenting the new answer text into words to obtain all the words the new answer text contains; and determining, according to all the words contained in the new answer text and a first preset algorithm, the response variable parameter corresponding to the new answer text, where the first preset algorithm is determined according to a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.

Description

A text data processing method and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a text data processing method and device.
Background technology
With the development of network communication technology, obtaining answers to questions over the network has become an effective means of solving problems. For the same question, however, there may be many answers on the network, and determining which answer is more accurate, that is, how to measure the relevance between an answer and its question, has become a research hotspot.
At present, relevance is usually measured by computing the text similarity between a question and its answers. In general, however, questions are brief and contain few words, which leads to a semantic gap between a question and the corresponding answers; as a result, traditional text-similarity measures produce large errors in the measured relevance.
Summary of the invention
Embodiments of the present invention provide a text data processing method to improve the accuracy of measuring the relevance between answer texts and question texts.
An embodiment of the present invention provides a text data processing method, applied in an interactive question answering system in which at least one question text is stored and each question text corresponds to at least one answer text. The method includes:
receiving a new answer text submitted by a user for any question text;
segmenting the new answer text into words to obtain all the words the new answer text contains;
determining, according to all the words contained in the new answer text and a first preset algorithm, the response variable parameter corresponding to the new answer text, where the first preset algorithm is determined according to a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.
An embodiment of the present invention provides a text data processing device, applied in an interactive question answering system in which at least one question text is stored and each question text corresponds to at least one answer text. The device includes:
a receiving unit, configured to receive a new answer text submitted by a user for any question text;
a word segmentation unit, configured to segment the new answer text into words and obtain all the words the new answer text contains;
a determining unit, configured to determine, according to all the words contained in the new answer text and a first preset algorithm, the response variable parameter corresponding to the new answer text, where the first preset algorithm is determined according to a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.
In the text data processing method provided by the embodiments of the present invention, the new answer text to be evaluated is segmented into words to obtain all the words it contains, and a response variable parameter corresponding to the new answer text is then determined from those words according to a preset algorithm. This response variable parameter reflects the degree of matching between the new answer text and the question text: the larger the determined response variable parameter, the better the answer text matches the question text; conversely, the smaller it is, the worse the match.
Other features and advantages of the present invention will be set forth in the following description, will in part become apparent from the description, or may be learned by practicing the present invention. The objects and other advantages of the present invention may be realized and attained by the structure particularly pointed out in the written description, the claims, and the accompanying drawings.
Accompanying drawing explanation
The accompanying drawings described herein are provided for a further understanding of the present invention and constitute a part of it; the schematic embodiments of the present invention and their description are used to explain the present invention and do not constitute an undue limitation on it. In the drawings:
Fig. 1 is a schematic diagram of the document generation process under the LDA model in the prior art;
Fig. 2 is a schematic diagram of the document generation process under the sLDA model in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of the text data processing method in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the text data processing device in an embodiment of the present invention.
Detailed description of the invention
In order to improve the accuracy of the results of measuring answer texts against question texts, embodiments of the present invention provide a text data processing method and device.
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present invention and are not intended to limit it, and that the embodiments of the present invention and the features in the embodiments may be combined with one another provided there is no conflict.
To better understand the embodiments of the present invention, the Latent Dirichlet Allocation (LDA) topic model is introduced first. LDA is a typical directed probabilistic graphical model with a clear hierarchical structure, consisting in turn of the document-collection layer, the document layer, and the word layer. LDA can identify hidden topic information in a large document collection. For a collection of question-answer documents, for example, each question-answer pair may be called one document; each document is represented as a probability distribution over several topics, and each topic is represented as a probability distribution over many words. Each document can therefore be generated by the following process: 1) for the document, draw a topic from its topic distribution; 2) draw a word from the word distribution corresponding to the drawn topic; 3) repeat this process until every word in the document has been generated. More formally, each document corresponds to a multinomial distribution over T topics (T is given in advance, for example by repeated experiments), denoted θ, and each topic in turn corresponds to a multinomial distribution over the V words in the vocabulary, denoted φ, where the vocabulary consists of the distinct words in all documents of the collection. θ and φ have Dirichlet prior distributions with hyperparameters α and β respectively. For each word position in document d, a topic z is drawn from the multinomial distribution θ of the document, and then a word w is drawn from the multinomial distribution φ corresponding to topic z; repeating this process N_d times, where N_d is the total number of words in document d, produces document d.
As shown in Fig. 1, the above generation process can be represented by the graphical model of Fig. 1, where shaded circles denote observed variables, unshaded circles denote latent variables, and boxes denote repeated sampling, with the number of repetitions in the lower right corner of each box.
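The generative process just described can be sketched in Python; the number of topics T, vocabulary size V, hyperparameters, and document length below are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

T, V = 3, 8  # number of topics and vocabulary size (illustrative)
alpha = np.full(T, 0.5)  # Dirichlet hyperparameter for theta
beta = np.full(V, 0.1)   # Dirichlet hyperparameter for each topic's word distribution

phi = rng.dirichlet(beta, size=T)  # one multinomial over words per topic

def generate_document(n_words):
    """Generate one document by the LDA process: draw theta for the
    document, then for each word position draw a topic z and a word w."""
    theta = rng.dirichlet(alpha)  # the document's topic distribution
    words, topics = [], []
    for _ in range(n_words):
        z = int(rng.choice(T, p=theta))   # step 1: draw a topic
        w = int(rng.choice(V, p=phi[z]))  # step 2: draw a word from that topic
        topics.append(z)
        words.append(w)
    return words, topics

words, topics = generate_document(10)
```

The two nested sampling levels correspond to the two repeated boxes of the Fig. 1 graphical model.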
The LDA model introduced above is an unsupervised topic model. Compared with the unsupervised model, the supervised topic model sLDA introduces a response variable parameter; after this parameter is introduced, the accuracy of topic extraction can be improved, which in turn further improves the accuracy of measuring answer texts against question texts.
In the embodiments of the present invention, a question text together with an answer text is referred to as one document, and the response variable parameter described above is a parameter describing whether the answer text is the best answer text for the question text. If the answer text is the best answer for the question text, the answer text and the question text have a high similarity in their topic distributions, and the corresponding response variable parameter is set to 1; otherwise, the answer text and the question text have little co-occurrence in their topic distributions, and the corresponding response variable parameter is set to 0. Based on the relationship between the response variable and the question-answer text pairs, the potential topics in the documents can be better discovered, and the training results on existing documents can be used to determine the response variable parameter for a new question-answer text pair.
Based on this, in the embodiments of the present invention, a computational model for the response variable parameter between an answer text and a question text is determined from the training results on known question-answer text pairs (i.e., existing history answer texts). Then, for a new answer text corresponding to a question text, the response variable parameter between the new answer text and the question text can be determined according to this computational model. The higher the response variable parameter, the more relevant the new answer text is to the question text, and the more likely it is to be the best answer; conversely, the lower it is, the less relevant the answer text is to the question text, and the new answer text may be an unrelated answer text or even a spam answer text.
The following introduces how a document d is produced under the sLDA topic model. Similarly to the LDA model, in the sLDA model it is assumed that document d has a multinomial distribution over the topics z, that the words contained in document d follow a joint distribution over the topics z and the multinomial distribution β, and that the response variable parameter b follows a joint distribution over the topics z and a normal distribution with parameters η and σ. Therefore, in the sLDA model, the generative model of a document can be divided into the following three parts:
1) For document d, its topic distribution θ is sampled from a Dirichlet distribution with parameter α; the Dirichlet distribution is the conjugate prior of the multinomial distribution. If a k-dimensional random vector θ follows a Dirichlet distribution, then the k components θ_1, θ_2, ..., θ_k of θ take continuous non-negative values and θ_1 + θ_2 + ... + θ_k = 1. In a specific implementation, since the same question text may have multiple known answer texts and the question text forms one document with each answer text, multiple documents may exist for the same question; topic sampling is performed for each document, all documents are traversed to determine all topics, and finally the topic probability distribution of each document satisfies θ | α ~ Dir(α). For example, suppose each document is composed of 3 topics and θ represents the probability of each topic occurring, e.g. {1/6, 2/6, 3/6}; θ then differs between documents, and θ can be used to judge the similarity of documents;
2) For each word w contained in document d, the generation process can be divided into the following two steps:
In the first step, the topic z is sampled from the multinomial distribution over θ, i.e. z | θ ~ Mult(θ);
In the second step, the word w is sampled from the joint probability distribution over the topic z and the multinomial distribution β, i.e. w | z, β ~ Mult(β);
3) The response variable parameter b follows, given the topics z, a normal distribution with parameters η and σ, i.e.

b \mid z_{1:N}, \eta, \sigma^2 \sim N(\eta^T \bar{z}, \sigma^2), \qquad \bar{z} = \frac{1}{N}\sum_{n=1}^{N} z_n

where N denotes the number of words contained in the question text and the known history answer text corresponding to that question text.
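The three parts of this generative model can be sketched directly; every dimension, hyperparameter, and coefficient value below is an illustrative assumption, not a value from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
alpha = np.array([1.0, 2.0, 3.0])  # Dirichlet hyperparameter (illustrative)
eta = np.array([2.0, -1.0, 0.5])   # response regression coefficients (made up)
sigma = 0.1

# Part 1: theta is a Dirichlet draw, so its components are non-negative
# and sum to one; the mean of Dir(alpha) is alpha / alpha.sum(), here
# (1/6, 2/6, 3/6), matching the example topic probabilities in the text.
theta = rng.dirichlet(alpha)
assert np.all(theta >= 0) and abs(theta.sum() - 1.0) < 1e-12

# Part 2: each word position gets a topic z ~ Mult(theta)
# (the word draw from beta is omitted here for brevity).
z_assignments = [int(rng.choice(K, p=theta)) for _ in range(6)]

# Part 3: b is normal with mean eta^T z_bar, where z_bar is the empirical
# topic frequency of the document's N words.
N = len(z_assignments)
z_bar = np.bincount(z_assignments, minlength=K) / N
mean_b = float(eta @ z_bar)
b = rng.normal(mean_b, sigma)
```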
As shown in Fig. 2, in the generation steps of the above document, the variables w and b can be obtained directly from known documents, while the other parameters α, β, η, σ2 require parameter estimation. Preferably, in the embodiments of the present invention, the expectation-maximization (EM) algorithm can be used for parameter estimation. The EM algorithm is an iterative algorithm for solving maximum likelihood estimates, or for maximizing a posterior distribution, when the observed data are incomplete; it adds "hidden variables" on top of the observed data, thereby simplifying the computation into a series of simple maximization or simulation steps. The EM algorithm alternates between two steps: the first is the expectation step (E), which uses the current estimates of the hidden variables to compute the expected likelihood; the second is the maximization step (M), which maximizes the likelihood found in the E step to compute new parameter values. The parameter estimates found in the M step are used in the next E step, and this process alternates continuously until it converges to a value. The two steps are introduced below for the embodiments of the present invention:
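The E/M alternation can be illustrated on a toy problem, a two-component Gaussian mixture with known unit variances and equal weights; this is not the patent's sLDA objective, but it shows the same iterate-until-convergence structure.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data: a half-and-half mixture of N(0, 1) and N(5, 1).
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

mu = np.array([1.0, 4.0])  # initial guesses for the two component means
for _ in range(50):
    # E step: responsibility of component 1 for each point under the
    # current mean estimates (the "expected" hidden assignments).
    d0 = np.exp(-0.5 * (data - mu[0]) ** 2)
    d1 = np.exp(-0.5 * (data - mu[1]) ** 2)
    r1 = d1 / (d0 + d1)
    # M step: maximize the expected log-likelihood, i.e. re-estimate each
    # mean as a responsibility-weighted average of the data.
    mu = np.array([
        np.sum((1.0 - r1) * data) / np.sum(1.0 - r1),
        np.sum(r1 * data) / np.sum(r1),
    ])
```

After a few iterations the means settle near 0 and 5, the point at which the E and M updates no longer change each other.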
The first step: expectation estimation
For a given question-answer pair (i.e., one document), the posterior probability of the latent variables θ and z can be determined by formula (1):
p(\theta, z \mid w, b, \alpha, \beta, \eta, \sigma^2) = \frac{p(\theta \mid \alpha)\left(\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\right) p(b \mid z_{1:N}, \eta, \sigma^2)}{\int d\theta\, p(\theta \mid \alpha) \sum_{z}\left(\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\right) p(b \mid z_{1:N}, \eta, \sigma^2)} \quad (1)
In formula (1), the marginal probability of the observable word variables w and the response variable parameter b in the known document is used for normalization. As with the unsupervised topic model LDA, the likelihood of each latent variable in formula (1) is difficult to compute exactly, so in the embodiments of the present invention a variational method is used to approximately estimate the posterior probability of the latent variables; the objective function of this variational analysis is shown in formula (2):
\log p(w, b \mid \alpha, \beta, \eta, \sigma^2) \ge E[\log p(\theta \mid \alpha)] + \sum_{n=1}^{N} E[\log p(z_n \mid \theta)] + \sum_{n=1}^{N} E[\log p(w_n \mid z_n, \beta)] + E[\log p(b \mid z_{1:N}, \eta, \sigma^2)] + H(q) \quad (2)
where the expectations are taken with respect to the variational distribution q, shown in formula (3):
q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n) \quad (3)
where γ is a K-dimensional Dirichlet parameter vector representing a distribution over K elements. The expected log probability of the response variable under given topics can be computed as shown in formula (4):
E[\log p(b \mid z, \eta, \sigma^2)] = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{b^2 - 2b\,\eta^T E[\bar{z}] + \eta^T E[\bar{z}\bar{z}^T]\,\eta}{2\sigma^2} \quad (4)
In formula (4), E[\bar{z}] and E[\bar{z}\bar{z}^T] can be determined according to formulas (5) and (6) respectively:
E[\bar{z}] = \bar{\phi} := \frac{1}{N}\sum_{n=1}^{N}\phi_n \quad (5)
E[\bar{z}\bar{z}^T] = \frac{1}{N^2}\left(\sum_{n=1}^{N}\sum_{m \ne n}\phi_n\phi_m^T + \sum_{n=1}^{N}\mathrm{diag}\{\phi_n\}\right) \quad (6)
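Formulas (5) and (6) compute the first and second moments of the empirical topic frequency from the variational parameters φ_n; a direct numpy transcription, with made-up φ values:

```python
import numpy as np

# phi[n] is the variational topic distribution of word n (N = 4 words, K = 3 topics).
phi = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
    [0.50, 0.25, 0.25],
])
N, K = phi.shape

# Formula (5): E[z_bar] = phi_bar = (1/N) * sum_n phi_n
E_zbar = phi.mean(axis=0)

# Formula (6): E[z_bar z_bar^T]
#   = ( sum_{n != m} phi_n phi_m^T + sum_n diag(phi_n) ) / N^2
s = phi.sum(axis=0)
sum_off_diag = np.outer(s, s) - phi.T @ phi  # removes the n == m terms
E_zzT = (sum_off_diag + np.diag(s)) / N**2
```

Since each φ_n sums to one, the entries of E[z̄] and of E[z̄ z̄ᵀ] both total exactly 1, a quick sanity check on the transcription.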
Through formulas (3) to (6), the variational parameters φ and γ in formula (2) can be determined, and thereby the posterior probability of the latent variables can be approximated.
Second step: parameter maximization
In the parameter maximization step, the parameter values β, η, σ2 in the computational model are determined by maximizing the likelihood over all question-answer text pairs (i.e., all documents); the updates of the parameter values β, η, σ2 are shown in formulas (7), (8) and (9) respectively:
\hat{\beta}_{new} \propto \sum_{d=1}^{D}\sum_{n=1}^{N} 1(w_{d,n} = w)\,\phi_{d,n} \quad (7)
\hat{\eta}_{new} \leftarrow (E[A^T A])^{-1} E[A]^T b \quad (8)
\hat{\sigma}^2_{new} \leftarrow \frac{1}{D}\left\{ b^T b - b^T E[A]\,(E[A^T A])^{-1} E[A]^T b \right\} \quad (9)
where D denotes the number of question-answer pairs in the known documents, N denotes the number of distinct words contained in the training data set, and A is a D×K matrix whose d-th row is \bar{z}_d^T (so that E[A] has rows \bar{\phi}_d^T).
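Formulas (8) and (9) are normal-equation updates of a least-squares regression of the responses b on the expected topic frequencies. A sketch with made-up training quantities; note that here E[AᵀA] is approximated by E[A]ᵀE[A] for simplicity, whereas the exact term would add each document's covariance contribution from formula (6).

```python
import numpy as np

rng = np.random.default_rng(3)

D, K = 20, 3
# E[A]: row d is the expected topic frequency vector phi_bar of document d.
E_A = rng.dirichlet(np.ones(K), size=D)
# Approximation for this sketch: E[A^T A] ~= E[A]^T E[A].
E_AtA = E_A.T @ E_A
b = rng.normal(size=D)  # response labels of the D documents (illustrative)

# Formula (8): eta_new <- (E[A^T A])^{-1} E[A]^T b
eta_new = np.linalg.solve(E_AtA, E_A.T @ b)

# Formula (9): sigma2_new <- (1/D) { b^T b - b^T E[A] (E[A^T A])^{-1} E[A]^T b }
sigma2_new = (b @ b - b @ (E_A @ eta_new)) / D
```

Under this approximation the η update coincides with ordinary least squares, and σ² is the mean squared residual, which is non-negative by construction.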
In the above process, each iteration uses the current parameter estimates in the first step to compute the likelihood function; this is the variational inference process (variational inference in fact approximates the posterior distribution with a simpler function). The second step then maximizes this function to obtain new parameter values. Iterating in this way until convergence yields the final parameter values. Thus, by repeatedly alternating expectation estimation and parameter maximization, the topic distribution of each question-answer pair in the known documents can be learned, and the parameter estimates of each latent variable obtained.
In a specific implementation, an interactive question answering system may store a large number of question texts, and a question text may have more than one answer text; therefore, each question text may correspond to multiple question-answer text pairs. By applying the sLDA model to the stored question-answer text pairs, the latent-variable parameters α, β, η, σ2 can be estimated and concrete parameter values obtained.
Based on this, an embodiment of the present invention provides a text data processing method, applicable to an interactive question answering system in which at least one question text is stored and each question text corresponds to at least one answer text. As shown in Fig. 3, the method may include the following steps:
S301: receive a new answer text submitted by a user for any question text;
S302: segment the new answer text into words to obtain all the words the new answer text contains;
In a specific implementation, the received answer text can be segmented into words, and all the words contained in the answer text are determined according to the segmentation result.
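The patent names the segmentation step without specifying a segmenter; for Chinese text a dedicated word segmenter (e.g. jieba) would typically be used, but a minimal regex tokenizer is enough to illustrate step S302 for space-delimited text.

```python
import re

def tokenize(text):
    """Minimal stand-in for the word segmentation step: lowercase the
    text and extract runs of word characters. A real Q&A system would
    use a proper Chinese word segmenter here."""
    return re.findall(r"\w+", text.lower())

words = tokenize("How do I install a Linux virtual machine under Win7?")
```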
S303: determine, according to all the words contained in the new answer text and a first preset algorithm, the response variable parameter between the answer text and the question text.
Here, the first preset algorithm is determined according to a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.
In a specific implementation, the response variable parameter corresponding to the new answer text can be determined according to formula (10):

E[b \mid w, \alpha, \beta, \eta, \sigma^2] \approx \eta^T \bar{\phi} \quad (10)

where z is the topics included in the question text and the at least one answer text, corresponding to that question text, stored in the interactive question answering system; w is all the words contained in the new answer text; and α, β, η, σ2 are the parameters determined, by a second preset algorithm, according to the question text and the at least one answer text corresponding to that question text stored in the interactive question answering system. The second preset algorithm can be, but is not limited to, the EM (expectation-maximization) algorithm.
According to formula (10), the response variable parameter between a certain new answer text and the question text can be determined: the larger the response variable parameter, the higher the relevance to the question text. Further, the answer text with the highest relevance can be regarded as the best answer corresponding to the question text; conversely, the smaller the response variable parameter, the lower the relevance to the question text, and the answer may be an unrelated answer or even a spam answer.
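Given a trained η and, for each candidate answer, the inferred topic frequencies φ̄ of the question-answer document, ranking candidates by the predicted response is straightforward; every number below is a made-up illustration, not a trained value.

```python
import numpy as np

eta = np.array([1.5, -0.5, 0.2])  # trained regression coefficients (made up)

# Inferred phi_bar for each of three candidate answers to the same question.
candidates = {
    "answer_a": np.array([0.6, 0.3, 0.1]),
    "answer_b": np.array([0.1, 0.8, 0.1]),
    "answer_c": np.array([0.3, 0.3, 0.4]),
}

# Predicted response E[b | w] ~= eta^T phi_bar for each candidate.
scores = {name: float(eta @ phi_bar) for name, phi_bar in candidates.items()}
best = max(scores, key=scores.get)  # highest response = most relevant answer
```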
Preferably, in a specific implementation, after the response variable parameter corresponding to each answer text is determined, the matching parameter corresponding to this response variable parameter can also be looked up in a pre-established mapping between response variable parameters and matching parameters, where the matching parameter represents the value of the new answer text.
For example, in a particular application, the response variable parameter can be mapped to different score values; such a score can reflect the reference value of a certain answer text, or serve as an evaluation score for the user who submitted the new answer text.
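One simple realization of such a response-to-score mapping (the clamping range and scale are illustrative choices, not specified by the patent) is to clamp the predicted response to [0, 1], since the training labels are 0 or 1, and rescale it to a point score:

```python
def response_to_score(b, max_points=10):
    """Map a predicted response variable b to an integer score in
    [0, max_points]. Clamping handles predictions outside [0, 1]."""
    clamped = min(max(b, 0.0), 1.0)
    return round(clamped * max_points)

score = response_to_score(0.77)
```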
The embodiments of the present invention can be applied to the following two scenarios:
Scenario one: training answer relevance evaluation with the supervised topic model sLDA on solved questions
First, several question texts and all the answer texts corresponding to each question text are taken out. Each solved question text includes one best answer text (for example, the answer text labeled "best answer") and several other unrelated answer texts (for example, labeled "other answers"). For each solved question text, multiple question-answer text pairs (Question-Answer Pairs) can be generated; the response variable of the pair containing the best answer text is labeled 1, and the response variables of the other answer texts are labeled 0. The obtained question-answer text pairs are then trained with the supervised topic model sLDA, the parameters α, β, η, σ2 are estimated, and a trained model M is obtained. For a new question-answer pair, for example the question "How do I install a linux virtual machine under the win7 system?" and the answer "To install a virtual machine you need to download vmware, which is virtualization software, and a linux image; after running vmware, use the image file to install the virtual system. If someone else has an installed virtual machine file, you can copy and use it directly (the virtual file, not a hard-disk installation)", the response variable parameter between this answer text and the question text can be computed with formula (10) using the trained model M: the higher the value, the more relevant the answer; conversely, the lower the value, the less relevant the answer.
Scenario two: applying the method to posts and best replies in a forum
For a certain forum, some posts with replies (follow-up posts) in the database are selected as training data, and the replies in each post are labeled: the best reply has its response variable labeled 1, and the other replies have their response variables labeled 0. Finally, all the collected "post-reply" text pairs are trained with the supervised topic model sLDA, the parameters α, β, η, σ2 are estimated, and a trained model M is obtained. A new post and its reply can then be combined into a "post-reply" text pair, and the response variable parameter of this "post-reply" pair can be computed with formula (10) using the trained model M: the larger the value, the better the reply; conversely, the worse the reply.
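Preparing the training data for either scenario amounts to pairing each question (or post) with every answer (or reply) and labeling the best one 1 and the rest 0; a sketch with made-up records:

```python
def build_training_pairs(question, answers, best_index):
    """Return (question-answer document, response label) pairs:
    label 1 for the best answer, 0 for all others."""
    pairs = []
    for i, answer in enumerate(answers):
        label = 1 if i == best_index else 0
        pairs.append((question + " " + answer, label))
    return pairs

pairs = build_training_pairs(
    "how to install a linux virtual machine",
    ["download vmware and a linux image", "no idea", "buy a new pc"],
    best_index=0,
)
```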
Based on the same inventive concept, an embodiment of the present invention also provides a text data processing device. Since the principle by which the device solves the problem is similar to that of the text data processing method above, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.
As shown in Fig. 4, the schematic structural diagram of the text data processing device provided by an embodiment of the present invention includes:
a receiving unit 401, configured to receive a new answer text submitted by a user for any question text;
a word segmentation unit 402, configured to segment the new answer text into words and obtain all the words the new answer text contains;
a determining unit 403, configured to determine, according to all the words contained in the new answer text and a first preset algorithm, the response variable parameter corresponding to the new answer text.
Here, the first preset algorithm is determined according to a question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text.
In a specific implementation, the determining unit 403 may be configured to determine the response variable parameter corresponding to the new answer text according to the following formula: E[b \mid w, \alpha, \beta, \eta, \sigma^2] \approx \eta^T \bar{\phi}, where E[b | w, α, β, η, σ2] is the response variable parameter corresponding to the answer text; z is the topics included in the question text and the at least one answer text, corresponding to that question text, stored in the interactive question answering system; w is all the words contained in the new answer text; and α, β, η, σ2 are the parameters determined according to the question text, the at least one answer text corresponding to that question text stored in the interactive question answering system, and a second preset algorithm.
In a specific implementation, the determining unit 403 may be configured to determine \bar{\phi} = \frac{1}{N}\sum_{n=1}^{N}\phi_n, where N denotes the number of words contained in the question text and the at least one answer text, corresponding to that question text, stored in the interactive question answering system.
In a specific implementation, the text data processing device provided by the embodiment of the present invention may further include:
a lookup unit, configured to determine, according to the response variable parameter determined by the determining unit and the pre-established mapping between response variable parameters and matching parameters, the matching parameter between this answer text and the question text, where the matching parameter represents the value of the new answer text.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk memory, CD-ROM, optical memory, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can guide a computer or other programmable data processing device to work in a specific way, such that the instructions stored in this computer-readable memory produce a manufactured article including an instruction device, which realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing; thus the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, but those skilled in the art once know base This creativeness concept, then can make other change and amendment to these embodiments.So, appended right is wanted Ask and be intended to be construed to include preferred embodiment and fall into all changes and the amendment of the scope of the invention.
Obviously, those skilled in the art can carry out various change and modification without deviating from this to the present invention Bright spirit and scope.So, if the present invention these amendment and modification belong to the claims in the present invention and Within the scope of its equivalent technologies, then the present invention is also intended to comprise these change and modification.

Claims (7)

1. A text data processing method, applied in an interactive question answering system in which at least one question text is stored, each question text corresponding to at least one answer text, characterized in that the method comprises:
receiving a new answer text submitted by a user for any question text;
segmenting the new answer text into words to obtain all the words contained in the new answer text;
determining, according to all the words contained in the new answer text and a first preset algorithm, a response variable parameter corresponding to the new answer text, wherein the first preset algorithm is determined according to the question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text;
the response variable parameter corresponding to the new answer text is determined by the formula E[b|w,α,β,η,σ²] = ηᵀ·E[z̄|w,α,β], wherein:
E[b|w,α,β,η,σ²] is the response variable parameter corresponding to the new answer text;
z denotes the topics contained in the question text and in the at least one answer text, stored in the interactive question answering system, corresponding to that question text, and z̄ denotes the average of the per-word topic assignments;
w denotes all the words contained in the new answer text;
α, β, η and σ² are parameters determined, respectively, according to the question text, the at least one answer text corresponding to that question text stored in the interactive question answering system, and a second preset algorithm.
2. The method of claim 1, characterized in that z̄ is determined according to the formula z̄ = (1/N)·Σₙ₌₁ᴺ zₙ, wherein:
N denotes the number of words included in the question text and the at least one answer text, stored in the interactive question answering system, corresponding to that question text, and zₙ denotes the topic corresponding to the n-th word included in that question text and the at least one answer text corresponding to it.
3. The method of claim 1, characterized in that the second preset algorithm comprises an expectation-maximization (EM) algorithm.
4. The method of claim 1, characterized in that the method further comprises:
looking up, according to the response variable parameter, the match parameter corresponding to the response variable parameter in a pre-established mapping relation between response variable parameters and match parameters, wherein the match parameter represents the value of the new answer text.
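Read together, claims 1-4 describe a prediction pipeline in the style of a supervised topic model: segment the new answer into words, obtain a topic vector for each word, average them into z̄, take η·z̄ as the response variable, and map that value to a match parameter. The following is a minimal illustrative sketch of this pipeline, with toy per-word topic vectors standing in for real topic inference; every name and number here (`segment`, `word_topics`, `eta`, the thresholds) is a hypothetical stand-in, not taken from the patent.

```python
import numpy as np

# Hypothetical pre-trained parameters: in the patent these (alpha, beta, eta,
# sigma^2) would be fitted from the stored question/answer texts by the
# "second preset algorithm".
eta = np.array([0.9, 0.1, -0.5])  # per-topic regression weights

# Toy per-word topic vectors standing in for topic inference over 3 topics.
word_topics = {
    "refund":  np.array([0.8, 0.1, 0.1]),
    "policy":  np.array([0.7, 0.2, 0.1]),
    "weather": np.array([0.1, 0.1, 0.8]),
}

def segment(text):
    """Stand-in for the word segmentation step of claim 1."""
    return text.lower().split()

def response_variable(answer_text):
    """E[b|w] ~= eta . z_bar, with z_bar the mean of per-word topic vectors."""
    words = [w for w in segment(answer_text) if w in word_topics]
    if not words:
        return 0.0
    z_bar = np.mean([word_topics[w] for w in words], axis=0)  # (1/N) * sum z_n
    return float(eta @ z_bar)

def match_parameter(b, thresholds=((0.5, "high"), (0.2, "medium"))):
    """Pre-established mapping from response variable to match parameter (claim 4)."""
    for cut, label in thresholds:
        if b >= cut:
            return label
    return "low"

b = response_variable("refund policy details")  # "details" is out-of-vocabulary
print(b, match_parameter(b))
```

Here the mapping of claim 4 is sketched as simple thresholds; the patent only requires some pre-built relation between response variable values and match parameters.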
5. A text data processing device, applied in an interactive question answering system in which at least one question text is stored, each question text corresponding to at least one answer text, characterized in that the device comprises:
a receiving unit, configured to receive a new answer text submitted by a user for any question text;
a word segmentation unit, configured to segment the new answer text into words to obtain all the words contained in the new answer text;
a determining unit, configured to determine, according to all the words contained in the new answer text and a first preset algorithm, the response variable parameter corresponding to the new answer text, wherein the first preset algorithm is determined according to the question text stored in the interactive question answering system and the at least one answer text corresponding to that question text, and the response variable parameter represents the degree of matching between the new answer text and the question text; the determining unit is specifically configured to determine the response variable parameter corresponding to the new answer text according to the following formula:
E[b|w,α,β,η,σ²] = ηᵀ·E[z̄|w,α,β], wherein:
E[b|w,α,β,η,σ²] is the response variable parameter corresponding to the new answer text;
z denotes the topics contained in the question text and in the at least one answer text, stored in the interactive question answering system, corresponding to that question text, and z̄ denotes the average of the per-word topic assignments;
w denotes all the words contained in the new answer text;
α, β, η and σ² are parameters determined, respectively, according to the question text, the at least one answer text corresponding to that question text stored in the interactive question answering system, and a second preset algorithm.
6. The device of claim 5, characterized in that
the determining unit is specifically configured to determine z̄ according to the formula z̄ = (1/N)·Σₙ₌₁ᴺ zₙ, wherein:
N denotes the number of words included in the question text and the at least one answer text, stored in the interactive question answering system, corresponding to that question text, and zₙ denotes the topic corresponding to the n-th word included in that question text and the at least one answer text corresponding to it.
7. The device of claim 5, characterized in that the device further comprises:
a lookup unit, configured to determine, according to the response variable parameter, the match parameter between the answer text and the question text in a pre-established mapping relation between response variable parameters and match parameters, wherein the match parameter represents the value of the new answer text.
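Claims 3 and 6 point at the training side: α, β, η and σ² are fitted from the stored question and answer texts, with the "second preset algorithm" being EM. In supervised-topic-model treatments, the M-step for the regression parameters reduces to a least-squares fit of the observed responses against the per-document mean topic vectors. Below is a hedged sketch of just that step with made-up data; the EM E-step that would produce `Z_bar` from the stored texts is assumed, not implemented, and none of the numbers come from the patent.

```python
import numpy as np

# Made-up training data: each row of Z_bar is one stored answer's mean topic
# vector (what an EM E-step would produce), and b holds the observed response
# values for those answers.
Z_bar = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
])
b = np.array([0.9, 0.4, 0.1, 0.7])

# M-step for the regression parameters: eta is the least-squares solution of
# Z_bar @ eta ~= b, and sigma^2 is the mean squared residual.
eta, *_ = np.linalg.lstsq(Z_bar, b, rcond=None)
residuals = b - Z_bar @ eta
sigma2 = float(residuals @ residuals) / len(b)

print("eta =", eta.round(3), "sigma^2 =", round(sigma2, 6))
```

The fitted η is then what the prediction formula of claims 1 and 5 multiplies against a new answer's z̄; σ² quantifies how noisy that prediction is on the training answers.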
CN201210534859.1A 2012-12-12 A kind of text data processing method and device Active CN103870440B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210534859.1A CN103870440B (en) 2012-12-12 A kind of text data processing method and device

Publications (2)

Publication Number Publication Date
CN103870440A CN103870440A (en) 2014-06-18
CN103870440B true CN103870440B (en) 2016-11-30

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6282534B1 (en) * 1998-03-13 2001-08-28 Intel Corporation Reverse content indexing
CN1794233A (en) * 2005-12-28 2006-06-28 刘文印 Network user interactive asking answering method and its system
CN101118554A (en) * 2007-09-14 2008-02-06 中兴通讯股份有限公司 Intelligent interactive request-answering system and processing method thereof
CN101369265A (en) * 2008-01-14 2009-02-18 北京百问百答网络技术有限公司 Method and system for automatically generating semantic template of problem
CN102521239A (en) * 2011-11-14 2012-06-27 江苏联著实业有限公司 Question-answering information matching system and method based on OWL (web ontology language) for Internet

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An answer ranking model based on syntactic analysis and statistical methods; Li Bo et al.; Journal of Chinese Information Processing (中文信息学报); 2009-03-31; Vol. 23, No. 2; pp. 23-28 *
An improved answer extraction method based on pattern matching; Zhan Xuegang et al.; Information Studies: Theory & Application (情报理论与实践); 2009-09-30; Vol. 32, No. 9; pp. 105-108 *
Research and implementation of question and answer recommendation mechanisms in Q&A communities; Qu Mingcheng; China Master's Theses Full-text Database, Information Science and Technology; 2010-08-15; Vol. 2010, No. 8; pp. I138-926 *

Similar Documents

Publication Publication Date Title
CN111353037B (en) Topic generation method and device and computer readable storage medium
CN108563703A (en) Charge determination method and device, computer equipment, and storage medium
Sukhija et al. The recent state of educational data mining: A survey and future visions
CN110941723A (en) Method, system and storage medium for constructing knowledge graph
Hromkovič Theoretical computer science: introduction to Automata, computability, complexity, algorithmics, randomization, communication, and cryptography
CN102439597A (en) Parameter inference method, computing device and system based on latent Dirichlet model
CN102831119B (en) Short text clustering apparatus and method
CN106599194A (en) Label determining method and device
Mostaeen et al. Clonecognition: machine learning based code clone validation tool
Nababan et al. Determination feasibility of poor household surgery by using weighted product method
CN112131587A (en) Intelligent contract pseudo-random number security inspection method, system, medium and device
CN107133218A (en) Intelligent trade-name matching method, system and computer-readable storage medium
CN103870440B (en) A kind of text data processing method and device
Yang et al. Prototype-guided pseudo labeling for semi-supervised text classification
DeLaVina Some history of the development of Graffiti
CN103514194B (en) Method and apparatus for determining the relevance of a corpus and an entity, and classifier training method
Li et al. Evaluating indicators of answer quality in social Q&A websites
Ma et al. Selecting test inputs for DNNs using differential testing with subspecialized model instances
Capuano et al. LIA: an Intelligent Advisor for e-Learning
CN107102543A (en) Method and device for forming an anti-interference controller for an energy router
CN106897436A (en) Academic research hot-keyword extraction method based on variational inference
CN106971306A (en) Product problem identification method and system
Labutov et al. Optimally Discriminative Choice Sets in Discrete Choice Models: Application to Data-Driven Test Design
CN104281670B (en) Real-time incremental detection method and system for social network events
Ramathulasi et al. Enhanced PMF model to predict user interest for web API recommendation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant