CN108763476A

CN108763476A - Question-and-answer data cleaning system based on part-of-speech weight calculation

Info

Publication number: CN108763476A
Application number: CN201810533314.6A
Authority: CN
Inventors: 庄永军
Original assignee: Shenzhen Sanbao Innovation And Intelligence Co Ltd
Current assignee: Shenzhen Sanbao Innovation And Intelligence Co Ltd
Priority date: 2018-05-29
Filing date: 2018-05-29
Publication date: 2018-11-06

Abstract

The invention discloses a question-and-answer data cleaning system based on part-of-speech weight calculation, comprising a question segmentation module, a part-of-speech weight calculation module, and a data clearing module. The question segmentation module is connected to the part-of-speech weight calculation module, which is in turn connected to the data clearing module. By segmenting questions into words and calculating the word weights of similar sentences, the invention can effectively remove repeated questions and answers that are not sufficiently brief and accurate. This improves the quality of the question-and-answer data set and also helps ensure that questions from a larger user base receive feedback.

Description

Question-and-answer data cleaning system based on part-of-speech weight calculation

Technical field

The present invention relates to a data cleaning system, and specifically to a question-and-answer data cleaning system based on part-of-speech weight calculation.

Background art

In recent years, question answering systems have been studied extensively. A question answering system is one in which, given a question from a user, the system quickly analyzes it and returns a brief, accurate answer. According to the application purpose of the system and the data from which answers are obtained, question answering systems can be divided into systems based on a fixed database, web-based question answering systems, and single-text question answering systems. A system based on a fixed database usually searches a large, pre-built corpus of real text and returns an answer to the user's question. However, the performance of this type of system currently depends largely on the scale of its database, because the system's reply is the answer found in that knowledge base that matches the user's question.

Summary of the invention

The purpose of the present invention is to provide a question-and-answer data cleaning system based on part-of-speech weight calculation, so as to solve the problems raised in the background art above.

To achieve this purpose, the present invention provides the following technical solution:

A question-and-answer data cleaning system based on part-of-speech weight calculation comprises a question segmentation module, a part-of-speech weight calculation module, and a data clearing module. The question segmentation module is connected to the part-of-speech weight calculation module, and the part-of-speech weight calculation module is further connected to the data clearing module.

A question-and-answer data cleaning method based on part-of-speech weight calculation comprises the following steps:

A. The question-and-answer data set is sorted by question keyword to obtain a series of lists of similar questions;

B. Word segmentation is performed on each of two adjacent questions;

C. Based on the two adjacent segmentation results, the part-of-speech weight of each individual word is calculated;

D. The sum of the part-of-speech weights of each of the two whole sentences and the sum of the part-of-speech weights of their shared words are calculated;

E. The proportion of the shared words' part-of-speech weight sum in the whole-sentence part-of-speech weight sum is calculated; if the proportion exceeds 0.8, the first of the two adjacent questions is deleted.
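Steps C to E can be sketched in Python as follows. This is only an illustration under stated assumptions: the source gives no concrete part-of-speech weights, so the `POS_WEIGHTS` table, the `DEFAULT_WEIGHT` value, and all function names are hypothetical, and the ratio is assumed to be taken against the total weight of the first question (the one that is a candidate for deletion), which the source does not specify.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical part-of-speech weights; the source does not give concrete values.
POS_WEIGHTS: Dict[str, float] = {"noun": 3.0, "verb": 2.0, "adj": 1.5}
DEFAULT_WEIGHT = 1.0  # assumed weight for any tag not listed above


def token_weight(token: Token) -> float:
    """Step C: weight of a single segmented word, looked up by its tag."""
    return POS_WEIGHTS.get(token[1], DEFAULT_WEIGHT)


def shared_weight_ratio(first: List[Token], second: List[Token]) -> float:
    """Steps D and E: shared-word weight sum divided by the first question's total.

    Using the first question's total as the denominator is an assumption.
    """
    total = sum(token_weight(t) for t in first)
    if total == 0:
        return 0.0
    second_words = {word for word, _ in second}
    shared = sum(token_weight(t) for t in first if t[0] in second_words)
    return shared / total


def is_duplicate(first: List[Token], second: List[Token],
                 threshold: float = 0.8) -> bool:
    """Step E: the first question is treated as a repeat above the threshold."""
    return shared_weight_ratio(first, second) > threshold
```

Tokens are passed in as already segmented and tagged pairs, so any segmenter that produces word and tag pairs could feed this sketch.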

Compared with the prior art, the beneficial effects of the present invention are as follows: by segmenting questions into words and calculating the word weights of similar sentences, the invention can effectively remove repeated questions and answers that are not sufficiently brief and accurate. This improves the quality of the question-and-answer data set and also helps ensure that questions from a larger user base receive feedback.

Description of the drawings

Fig. 1 is a structural diagram of a question-and-answer data cleaning system based on part-of-speech weight calculation;

Fig. 2 is a flowchart of a question-and-answer data cleaning method based on part-of-speech weight calculation.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Referring to Figs. 1-2, a question-and-answer data cleaning system based on part-of-speech weight calculation comprises a question segmentation module, a part-of-speech weight calculation module, and a data clearing module.

Question segmentation module: the questions in the question-and-answer data set are sorted by keyword. Because the ordering is keyword-based, similar questions end up next to each other in the list, and word segmentation is then performed on each of two adjacent questions.
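The sorting and pairing behaviour of this module can be sketched as below. The source does not say how a question's keyword is chosen, so the `longest_word` rule is purely a hypothetical stand-in, as are the function names; whitespace splitting is likewise an assumption that would be replaced by a real word segmenter for Chinese text.

```python
from typing import Callable, List, Tuple


def longest_word(question: str) -> str:
    """Hypothetical keyword rule: the longest whitespace-separated word."""
    words = question.lower().split()
    return max(words, key=len) if words else ""


def adjacent_question_pairs(
    questions: List[str],
    keyword_of: Callable[[str], str] = longest_word,
) -> List[Tuple[str, str]]:
    """Sort questions by keyword, then pair each question with its neighbour."""
    ordered = sorted(questions, key=keyword_of)
    # zip stops at the shorter sequence, so the last question has no partner.
    return list(zip(ordered, ordered[1:]))
```

Because `sorted` is stable, questions that share a keyword keep their original relative order, which keeps the choice of "first" question in each pair predictable.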

Part-of-speech weight calculation module: based on the segmentation results of two adjacent questions, the module first calculates the part-of-speech weight of each individual word, then sums the part-of-speech weights over each whole sentence and over the words the two sentences share, and finally calculates the proportion of the shared-word weight sum in the whole-sentence weight sum. For example, if the segmentation results of the adjacent questions are question1 = {W1, W2, …, Wn} and question2 = {W1, W2, …, Wn}, the sums of their part-of-speech weights are

\[ \mathit{totalweight1} = WPW_1 + WPW_2 + \dots + WPW_n \quad\text{and}\quad \mathit{totalweight2} = WPW_1 + WPW_2 + \dots + WPW_n . \]

Taking WPW_i, …, WPW_n as the weights of the words shared by the two adjacent questions, the sum of the part-of-speech weights of the shared words is

\[ \mathit{simWeight} = WPW_i + \dots + WPW_n . \]

Finally, the ratio of the shared-word weight to the total part-of-speech weight is calculated:

\[ \mathit{ratioweight} = \frac{\mathit{simWeight}}{\mathit{totalWeight}} . \]
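As a worked illustration of this ratio, suppose the first question segments into three words with part-of-speech weights 1, 2, and 3, and the last two of these words also appear in the second question. These weights are hypothetical, since the source gives no concrete values, and totalWeight is assumed here to be the first question's total, which the source does not specify.

\[ \mathit{totalWeight} = 1 + 2 + 3 = 6, \qquad \mathit{simWeight} = 2 + 3 = 5, \qquad \mathit{ratioweight} = \frac{5}{6} \approx 0.83 . \]

Since 0.83 exceeds the threshold of 0.8, the pair would be treated as a repeat under these assumed numbers.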

Data clearing module: two adjacent questions whose proportion exceeds 0.8 are regarded as repeats, and the first of the two is deleted. In addition, to handle answers that are too long, any question-and-answer pair whose answer exceeds 25 words is deleted. This completes the question-and-answer data cleaning.
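The two deletion rules can be sketched as a single pass over the sorted data. This is an illustration, not the patented implementation: the function and parameter names are hypothetical, the ratio calculation is passed in as a function, answer length is assumed to be counted in whitespace-separated words, and each question is compared against its original neighbour regardless of whether that neighbour is itself removed.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def clean_qa_pairs(
    pairs: List[QAPair],
    ratio: Callable[[str, str], float],
    ratio_threshold: float = 0.8,
    max_answer_words: int = 25,
) -> List[QAPair]:
    """Drop over-long answers and the first question of each repeated pair.

    `pairs` is assumed to be sorted so that similar questions are adjacent.
    """
    kept: List[QAPair] = []
    for index, (question, answer) in enumerate(pairs):
        # Rule 1: remove pairs whose answer is longer than the limit.
        if len(answer.split()) > max_answer_words:
            continue
        # Rule 2: remove the first question when it repeats its next neighbour.
        if index + 1 < len(pairs):
            next_question = pairs[index + 1][0]
            if ratio(question, next_question) > ratio_threshold:
                continue
        kept.append((question, answer))
    return kept
```

Passing `ratio` as a parameter keeps the clearing step independent of how the part-of-speech weights are defined.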

It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments are therefore to be regarded in every respect as illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all variations that fall within the meaning and scope of equivalents of the claims are intended to be included within the invention. No reference sign in the claims should be construed as limiting the claim concerned.

In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way merely for the sake of clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in the various embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims

1. A question-and-answer data cleaning system based on part-of-speech weight calculation, comprising a question segmentation module, a part-of-speech weight calculation module, and a data clearing module, characterized in that the question segmentation module is connected to the part-of-speech weight calculation module, and the part-of-speech weight calculation module is further connected to the data clearing module.

2. A question-and-answer data cleaning method based on part-of-speech weight calculation, characterized by comprising the following steps: