CN108763476A - Question-answer data cleaning system based on part-of-speech weight calculation - Google Patents
- Publication number
- CN108763476A (application CN201810533314.6A)
- Authority
- CN
- China
- Prior art keywords
- question
- speech weight
- speech
- sentence
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a question-answer data cleaning system based on part-of-speech weight calculation, comprising a question segmentation module, a part-of-speech weight calculation module, and a data clearing module. The question segmentation module is connected to the part-of-speech weight calculation module, which is in turn connected to the data clearing module. By segmenting questions into words and calculating the word weights of similar sentences, the invention can effectively remove duplicate questions and answers that are not sufficiently brief and accurate. This improves the quality of the question-answer dataset and also helps ensure that questions from a large user base receive feedback.
Description
Technical field
The present invention relates to a data cleaning system, specifically a question-answer data cleaning system based on part-of-speech weight calculation.
Background art
In recent years, question answering systems have been studied extensively. A question answering system is one in which, when a user poses a question, the system quickly analyzes it and returns a brief, accurate answer. Depending on the purpose of the system and the data from which answers are obtained, question answering systems can be divided into fixed-database question answering systems, web question answering systems, and single-text question answering systems. A fixed-database question answering system typically searches a pre-built, large-scale corpus of real text and returns an answer to the user's question. However, the performance of this type of system currently depends largely on the scale of its database, because the system's reply is the answer found in the knowledge base that matches the user's question.
Summary of the invention
The purpose of the present invention is to provide a question-answer data cleaning system based on part-of-speech weight calculation, so as to solve the problems raised in the background art above.
To achieve this purpose, the present invention provides the following technical solution:
A question-answer data cleaning system based on part-of-speech weight calculation comprises a question segmentation module, a part-of-speech weight calculation module, and a data clearing module. The question segmentation module is connected to the part-of-speech weight calculation module, and the part-of-speech weight calculation module is also connected to the data clearing module.
A question-answer data cleaning method based on part-of-speech weight calculation comprises the following steps:
A. Sort the question-answer dataset by question keyword to obtain a series of lists of similar questions;
B. Perform word segmentation on each of two adjacent questions;
C. From the segmentation results of the two adjacent questions, calculate the part-of-speech weight of each individual word;
D. Calculate the sum of the part-of-speech weights of each of the two whole sentences and the sum of the part-of-speech weights of their shared words;
E. Calculate the proportion of the shared-word part-of-speech weight sum in the whole-sentence part-of-speech weight sum; if the proportion for two adjacent questions is greater than 0.8, delete the first of the two adjacent questions.
Compared with the prior art, the beneficial effects of the present invention are as follows: by segmenting questions into words and calculating the word weights of similar sentences, the invention can effectively remove duplicate questions and answers that are not sufficiently brief and accurate. This improves the quality of the question-answer dataset and also helps ensure that questions from a large user base receive feedback.
Description of the drawings
Fig. 1 is a structural block diagram of the question-answer data cleaning system based on part-of-speech weight calculation;
Fig. 2 is a flow chart of the question-answer data cleaning method based on part-of-speech weight calculation.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Figs. 1-2, a question-answer data cleaning system based on part-of-speech weight calculation comprises a question segmentation module, a part-of-speech weight calculation module, and a data clearing module.
Question segmentation module: The questions in the question-answer dataset are sorted by keyword. Because the ordering is keyword-based, similar questions are grouped together in the list. Word segmentation is then performed on each of two adjacent questions.
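The sorting and pairing performed by this module can be sketched as follows. This is only a minimal illustration: the patent does not say how a question's keyword is obtained or which segmenter is used, so the `keyword` field, the `QAPair` type, and the whitespace-based `segment` function are all hypothetical stand-ins.

```python
from typing import Iterator, NamedTuple


class QAPair(NamedTuple):
    keyword: str   # hypothetical field: keyword extraction is not specified in the text
    question: str
    answer: str


def sort_by_keyword(dataset: list[QAPair]) -> list[QAPair]:
    """Sort so that questions with the same keyword end up next to each other."""
    # sorted() is stable, so questions sharing a keyword keep their original order.
    return sorted(dataset, key=lambda qa: qa.keyword)


def adjacent_pairs(items: list[QAPair]) -> Iterator[tuple[QAPair, QAPair]]:
    """Yield each question together with its immediate successor."""
    return zip(items, items[1:])


def segment(question: str) -> list[str]:
    """Placeholder segmenter: lowercases and splits on whitespace.

    A real system for Chinese text would use a proper word segmenter here.
    """
    return question.lower().split()
```

Each pair produced by `adjacent_pairs` would then be segmented and handed to the weight calculation stage.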
Part-of-speech weight calculation module: Based on the segmentation results of two adjacent questions, the module first calculates the part-of-speech weight of each individual word, then totals the part-of-speech weights of each whole sentence and of the shared words, and finally calculates the proportion of the shared-word weight sum in the whole-sentence weight sum. For example, if the segmentation results of two adjacent questions are question1 = {W1, W2, …, Wn} and question2 = {W1, W2, …, Wn}, then the sums of their part-of-speech weights are totalWeight1 = WPW1 + WPW2 + … + WPWn and totalWeight2 = WPW1 + WPW2 + … + WPWn. Where WPWi is the weight of a word shared by the two adjacent questions, the sum of the part-of-speech weights of the shared words is simWeight = WPWi + … + WPWn. Finally, the ratio of the shared-word weight to the total part-of-speech weight is calculated: ratioWeight = simWeight / totalWeight.
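The ratio calculation can be illustrated with a short sketch. Several details are assumptions rather than part of the text: the patent gives no weight values, so `POS_WEIGHTS` and `DEFAULT_WEIGHT` are invented for illustration; it does not say whether totalWeight refers to one sentence or both, so this sketch takes totalWeight = totalWeight1 + totalWeight2 and counts each shared word's weight once per sentence in which it appears; and the input is assumed to be already segmented into (word, part-of-speech tag) pairs.

```python
# Hypothetical part-of-speech weights; the patent does not list actual values.
POS_WEIGHTS = {"n": 3.0, "v": 2.0, "adj": 1.5}
DEFAULT_WEIGHT = 1.0  # assumed weight for any tag not listed above

Token = tuple[str, str]  # (word, part-of-speech tag)


def word_weight(token: Token) -> float:
    """Part-of-speech weight of a single segmented word (WPWi)."""
    return POS_WEIGHTS.get(token[1], DEFAULT_WEIGHT)


def total_weight(tokens: list[Token]) -> float:
    """Sum of part-of-speech weights over a whole sentence."""
    return sum(word_weight(t) for t in tokens)


def shared_weight(tokens1: list[Token], tokens2: list[Token]) -> float:
    """simWeight: weights of words that occur in both sentences.

    Assumption: a shared word is counted once for each sentence it occurs in.
    """
    shared = {w for w, _ in tokens1} & {w for w, _ in tokens2}
    return sum(word_weight(t) for t in tokens1 + tokens2 if t[0] in shared)


def ratio_weight(tokens1: list[Token], tokens2: list[Token]) -> float:
    """ratioWeight = simWeight / totalWeight, with totalWeight taken over both sentences."""
    total = total_weight(tokens1) + total_weight(tokens2)
    if total == 0:
        return 0.0  # two empty sentences share nothing
    return shared_weight(tokens1, tokens2) / total
```

Under these assumptions two identical questions give a ratio of 1.0 and two questions with no words in common give 0.0, so the 0.8 threshold used by the data clearing module falls inside the range of possible values.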
Data clearing module: Two adjacent questions whose proportion is greater than 0.8 are regarded as duplicates, and the first of the two adjacent questions is deleted. In addition, to address overly long answers, question-answer pairs whose answer is longer than 25 words are deleted, which completes the question-answer data cleaning.
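A minimal sketch of the two deletion rules follows. The 0.8 and 25 thresholds come from the description; everything else is assumed for illustration: the `clean` function name, the representation of the dataset as (question, answer) tuples already sorted by keyword, the order in which the two rules are applied, comparing neighbours in the original sorted list, and counting answer length by whitespace-separated words. The similarity function is passed in as `ratio_fn` so that any ratio calculation can be plugged in.

```python
from typing import Callable

RATIO_THRESHOLD = 0.8   # from the description: ratios above this mark duplicates
MAX_ANSWER_WORDS = 25   # from the description: longer answers are removed


def clean(
    pairs: list[tuple[str, str]],
    ratio_fn: Callable[[str, str], float],
) -> list[tuple[str, str]]:
    """Apply the duplicate rule, then the answer-length rule.

    `pairs` holds (question, answer) tuples, assumed to be sorted by keyword.
    """
    # Rule 1: when two adjacent questions are near-duplicates, drop the first one.
    drop = set()
    for i in range(len(pairs) - 1):
        if ratio_fn(pairs[i][0], pairs[i + 1][0]) > RATIO_THRESHOLD:
            drop.add(i)
    kept = [qa for i, qa in enumerate(pairs) if i not in drop]

    # Rule 2: drop pairs whose answer exceeds the length limit.
    # Assumption: length is counted in whitespace-separated words.
    return [qa for qa in kept if len(qa[1].split()) <= MAX_ANSWER_WORDS]
```

For Chinese text the answer length would more likely be measured with a segmenter or a character count, so the word-counting line is the part most likely to need replacing.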
A question-answer data cleaning method based on part-of-speech weight calculation comprises the following steps:
A. Sort the question-answer dataset by question keyword to obtain a series of lists of similar questions;
B. Perform word segmentation on each of two adjacent questions;
C. From the segmentation results of the two adjacent questions, calculate the part-of-speech weight of each individual word;
D. Calculate the sum of the part-of-speech weights of each of the two whole sentences and the sum of the part-of-speech weights of their shared words;
E. Calculate the proportion of the shared-word part-of-speech weight sum in the whole-sentence part-of-speech weight sum; if the proportion for two adjacent questions is greater than 0.8, delete the first of the two adjacent questions.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that it can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced by the invention. Any reference sign in a claim should not be construed as limiting the claim concerned.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in the various embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.
Claims (2)
1. A question-answer data cleaning system based on part-of-speech weight calculation, comprising a question segmentation module, a part-of-speech weight calculation module, and a data clearing module, characterized in that the question segmentation module is connected to the part-of-speech weight calculation module, and the part-of-speech weight calculation module is also connected to the data clearing module.
2. A question-answer data cleaning method based on part-of-speech weight calculation, characterized in that it comprises the following steps:
A. Sort the question-answer dataset by question keyword to obtain a series of lists of similar questions;
B. Perform word segmentation on each of two adjacent questions;
C. From the segmentation results of the two adjacent questions, calculate the part-of-speech weight of each individual word;
D. Calculate the sum of the part-of-speech weights of each of the two whole sentences and the sum of the part-of-speech weights of their shared words;
E. Calculate the proportion of the shared-word part-of-speech weight sum in the whole-sentence part-of-speech weight sum; if the proportion for two adjacent questions is greater than 0.8, delete the first of the two adjacent questions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810533314.6A CN108763476A (en) | 2018-05-29 | 2018-05-29 | Question-answer data cleaning system based on part-of-speech weight calculation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108763476A (en) | 2018-11-06 |
Family ID: 64003865
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201810533314.6A CN108763476A (en) | Pending | 2018-05-29 | 2018-05-29 | Question-answer data cleaning system based on part-of-speech weight calculation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108763476A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334331A (en) * | 2019-05-30 | 2019-10-15 | 重庆金融资产交易所有限责任公司 | Method, apparatus and computer equipment based on order models screening table |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120203776A1 (en) * | 2011-02-09 | 2012-08-09 | Maor Nissan | System and method for flexible speech to text search mechanism |
CN103020295A (en) * | 2012-12-28 | 2013-04-03 | 新浪网技术(中国)有限公司 | Problem label marking method and device |
CN103049548A (en) * | 2012-12-27 | 2013-04-17 | 安徽科大讯飞信息科技股份有限公司 | FAQ (frequently asked questions) recognition system and method for electronic channel application |
CN103870457A (en) * | 2012-12-07 | 2014-06-18 | 北京百度网讯科技有限公司 | Method and device for confirming priority of unanswered questions in question-and-answer platform |
CN104572618A (en) * | 2014-12-31 | 2015-04-29 | 哈尔滨工业大学深圳研究生院 | Question-answering system semantic-based similarity analyzing method, system and application |
CN105824798A (en) * | 2016-03-03 | 2016-08-03 | 云南电网有限责任公司教育培训评价中心 | Examination question de-duplicating method of examination question base based on examination question key word likeness |
CN106547734A (en) * | 2016-10-21 | 2017-03-29 | 上海智臻智能网络科技股份有限公司 | A kind of question sentence information processing method and device |
Non-Patent Citations (3)
Title |
---|
CHUNG-HSIEN WU ET AL.: "Semantic Segment Extraction and Matching for Internet FAQ Retrieval", 《IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING》 * |
张志飞 等: "基于LDA主题模型的短文本分类方法", 《计算机应用》 * |
彭月娥 等: "面向中文问答社区的问题去重技术研究", 《苏州科技学院学报(自然科学版)》 * |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181106 |