CN108763476A - Question-and-answer data cleaning system based on part-of-speech weight calculation - Google Patents

Question-and-answer data cleaning system based on part-of-speech weight calculation

Info

Publication number
CN108763476A
Authority
CN
China
Prior art keywords
question
speech weight
speech
sentence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810533314.6A
Other languages
Chinese (zh)
Inventor
庄永军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sanbao Innovation And Intelligence Co Ltd
Original Assignee
Shenzhen Sanbao Innovation And Intelligence Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sanbao Innovation And Intelligence Co Ltd
Priority to CN201810533314.6A
Publication of CN108763476A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a question-and-answer (Q&A) data cleaning system based on part-of-speech weight calculation. The system comprises a question segmentation module, a part-of-speech weight calculation module, and a data removal module; the question segmentation module is connected to the part-of-speech weight calculation module, which is in turn connected to the data removal module. By segmenting each question into words and computing the part-of-speech weights of similar questions, the invention can effectively remove duplicate questions and answers that are not sufficiently brief and accurate. This improves the quality of the Q&A data set and helps ensure that questions from a large user base receive feedback.

Description

Question-and-answer data cleaning system based on part-of-speech weight calculation
Technical field
The present invention relates to a data cleaning system, specifically a question-and-answer data cleaning system based on part-of-speech weight calculation.
Background art
In recent years, question answering systems have been studied extensively. A question answering system is one in which, when a user poses a question, the system quickly analyzes it and returns a brief, accurate answer. Depending on the application purpose and the data from which answers are obtained, question answering systems can be divided into systems based on a fixed database, web-based question answering systems, and single-text question answering systems. A system based on a fixed database usually searches a pre-built, large-scale corpus of real text and returns an answer to the user's question. However, the performance of this type of system currently depends largely on the scale of its database, because the system's reply is the answer found in that knowledge base that matches the user's question.
Summary of the invention
The purpose of the present invention is to provide a question-and-answer data cleaning system based on part-of-speech weight calculation, so as to solve the problems raised in the background section above.
To achieve the above purpose, the present invention provides the following technical solution:
A question-and-answer data cleaning system based on part-of-speech weight calculation comprises a question segmentation module, a part-of-speech weight calculation module, and a data removal module. The question segmentation module is connected to the part-of-speech weight calculation module, and the part-of-speech weight calculation module is also connected to the data removal module.
A question-and-answer data cleaning method based on part-of-speech weight calculation comprises the following steps:
A. Sort the Q&A data set by question keyword to obtain a series of lists of similar questions;
B. Segment each of two adjacent questions into words;
C. From the segmentation results of the two adjacent questions, compute the part-of-speech weight of each individual word;
D. Compute the sum of the part-of-speech weights of each of the two whole questions and the sum of the part-of-speech weights of the words they share;
E. Compute the ratio of the shared words' part-of-speech weight sum to the whole-question part-of-speech weight sum; if the ratio exceeds 0.8, delete the first of the two adjacent questions.
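As a worked illustration of steps C to E, consider two adjacent questions whose words carry hypothetical part-of-speech weights (the source does not list concrete weight values). Suppose the first question consists of words weighted 3, 2 and 0.5, the second of words weighted 3, 2 and 0.5, and the two questions share only the words weighted 3 and 2. Writing totalWeight for a whole-question sum, simWeight for the shared-word sum and ratioWeight for their ratio:

```latex
\begin{align*}
\text{totalWeight}_1 &= 3 + 2 + 0.5 = 5.5 \\
\text{totalWeight}_2 &= 3 + 2 + 0.5 = 5.5 \\
\text{simWeight}     &= 3 + 2 = 5 \\
\text{ratioWeight}   &= \frac{\text{simWeight}}{\text{totalWeight}} = \frac{5}{5.5} \approx 0.909 > 0.8
\end{align*}
```

Because the ratio exceeds 0.8, step E would delete the first of the two questions. The two totals were deliberately chosen to be equal here, so the result does not depend on which question's total is used as the denominator.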
Compared with the prior art, the beneficial effects of the present invention are as follows: by segmenting each question into words and computing the part-of-speech weights of similar questions, the invention can effectively remove duplicate questions and answers that are not sufficiently brief and accurate. This improves the quality of the Q&A data set and helps ensure that questions from a large user base receive feedback.
Description of the drawings
Fig. 1 is a structural block diagram of the question-and-answer data cleaning system based on part-of-speech weight calculation;
Fig. 2 is a flow chart of the question-and-answer data cleaning method based on part-of-speech weight calculation.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.
Referring to Figs. 1-2, a question-and-answer data cleaning system based on part-of-speech weight calculation comprises a question segmentation module, a part-of-speech weight calculation module, and a data removal module.
Question segmentation module: the questions in the Q&A data set are sorted by their keywords. Because the ordering is keyword-based, similar questions end up next to each other in the list; each pair of adjacent questions is then segmented into words.
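The sort-then-segment behaviour of this module can be sketched as follows. This is a minimal illustration, not the patented implementation: the source names neither a keyword extractor nor a word segmenter, so `longest_word` and `whitespace_tokenize` below are hypothetical stand-ins (a real system for Chinese text would plug in a proper segmenter).

```python
from typing import Callable, Iterator, List, Tuple

Tokenized = Tuple[str, List[str]]  # (original question, its tokens)


def whitespace_tokenize(question: str) -> List[str]:
    """Hypothetical stand-in segmenter: lowercase and split on whitespace."""
    return question.lower().split()


def longest_word(question: str) -> str:
    """Hypothetical keyword extractor: pick the longest token."""
    return max(whitespace_tokenize(question), key=len)


def adjacent_tokenized_pairs(
    questions: List[str],
    keyword_of: Callable[[str], str] = longest_word,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> Iterator[Tuple[Tokenized, Tokenized]]:
    """Sort questions by keyword, then yield each adjacent pair, tokenized."""
    ordered = sorted(questions, key=keyword_of)  # similar questions become neighbours
    for first, second in zip(ordered, ordered[1:]):
        yield (first, tokenize(first)), (second, tokenize(second))


if __name__ == "__main__":
    demo = [
        "how to reset password",
        "what is the refund policy",
        "how do i reset my password",
    ]
    for (q1, t1), (q2, t2) in adjacent_tokenized_pairs(demo):
        print(q1, t1, "<->", q2, t2)
```

Only neighbouring questions are compared, so the amount of comparison work grows linearly with the size of the data set once it has been sorted.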
Part-of-speech weight calculation module: based on the segmentation results of two adjacent questions, the module first computes the part-of-speech weight of each individual word, then sums the part-of-speech weights of each whole question and of the words the two questions share, and finally computes the ratio of the shared-word weight sum to the whole-question weight sum. For example, if segmenting two adjacent questions yields question1 = {W1, W2, …, Wn} and question2 = {W1, W2, …, Wn}, the sums of their part-of-speech weights are:
\[ \mathit{totalWeight}_1 = \mathit{WPW}_1 + \mathit{WPW}_2 + \dots + \mathit{WPW}_n \]
\[ \mathit{totalWeight}_2 = \mathit{WPW}_1 + \mathit{WPW}_2 + \dots + \mathit{WPW}_n \]
Writing WPWi for the part-of-speech weight of a word shared by the two adjacent questions, the sum of the part-of-speech weights of the shared words is:
\[ \mathit{simWeight} = \mathit{WPW}_i + \dots + \mathit{WPW}_n \]
Finally, the ratio of the shared-word weight to the total part-of-speech weight is computed:
\[ \mathit{ratioWeight} = \mathit{simWeight} / \mathit{totalWeight} \]
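A minimal sketch of this calculation is given below, under several stated assumptions. The source does not list weight values, so `POS_WEIGHTS` and `DEFAULT_WEIGHT` are invented for illustration; the part-of-speech tags are supplied by hand rather than by a tagger; and because the source does not say which question's total serves as the denominator, this sketch uses the larger of the two totals.

```python
from typing import Dict, List, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical weights: the source gives no concrete values.
POS_WEIGHTS: Dict[str, float] = {"noun": 3.0, "verb": 2.0, "adj": 1.5}
DEFAULT_WEIGHT = 0.5  # assumed weight for any other part of speech


def word_weight(pos: str) -> float:
    """Part-of-speech weight of a single word."""
    return POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)


def total_weight(question: List[Tagged]) -> float:
    """Sum of part-of-speech weights over a whole question."""
    return sum(word_weight(pos) for _, pos in question)


def shared_weight(q1: List[Tagged], q2: List[Tagged]) -> float:
    """Sum of weights of the words in q1 that also occur in q2."""
    words_in_q2 = {word for word, _ in q2}
    return sum(word_weight(pos) for word, pos in q1 if word in words_in_q2)


def ratio_weight(q1: List[Tagged], q2: List[Tagged]) -> float:
    """Shared-word weight divided by the larger whole-question weight.

    Using the larger total is an assumption of this sketch, not a detail
    stated in the source.
    """
    denominator = max(total_weight(q1), total_weight(q2))
    if denominator == 0:
        return 0.0  # two empty questions share nothing
    return shared_weight(q1, q2) / denominator


if __name__ == "__main__":
    a = [("how", "adv"), ("reset", "verb"), ("password", "noun")]
    b = [("how", "adv"), ("do", "verb"), ("i", "pron"),
         ("reset", "verb"), ("my", "pron"), ("password", "noun")]
    print(ratio_weight(a, b))
```

Dividing by the larger total keeps the ratio between 0 and 1 and avoids flagging a short question merely because it is contained in a much longer one; dividing by the first question's total would be an equally defensible reading of the text.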
Data removal module: two adjacent questions whose ratio exceeds 0.8 are treated as duplicates, and the first of the two is deleted. In addition, to handle overly long answers, any question-answer pair whose answer is longer than 25 words is deleted. This completes the Q&A data cleaning.
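The two deletion rules can be sketched as a single pass over the keyword-sorted list. The names `clean` and `word_overlap` are hypothetical, and `word_overlap` is only a stand-in for the part-of-speech weight ratio so that the block runs on its own. Counting answer length with `str.split()` is also an assumption; for Chinese text the limit would more plausibly be counted in characters.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def word_overlap(q1: str, q2: str) -> float:
    """Stand-in similarity: shared distinct words over the larger word set."""
    a, b = set(q1.split()), set(q2.split())
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def clean(
    pairs: List[QAPair],
    ratio_fn: Callable[[str, str], float] = word_overlap,
    threshold: float = 0.8,
    max_answer_words: int = 25,
) -> List[QAPair]:
    """Drop the first question of each near-duplicate adjacent pair and
    every pair whose answer exceeds max_answer_words.

    `pairs` is assumed to be sorted by question keyword already.
    """
    kept: List[QAPair] = []
    for i, (question, answer) in enumerate(pairs):
        has_next = i + 1 < len(pairs)
        # Strictly greater than the threshold, matching "exceeds 0.8".
        duplicate_of_next = has_next and ratio_fn(question, pairs[i + 1][0]) > threshold
        answer_too_long = len(answer.split()) > max_answer_words
        if not duplicate_of_next and not answer_too_long:
            kept.append((question, answer))
    return kept
```

Each question is compared with its original neighbour rather than with the surviving list, so in a run of several near-identical questions only the last one is kept.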
A question-and-answer data cleaning method based on part-of-speech weight calculation comprises the following steps:
A. Sort the Q&A data set by question keyword to obtain a series of lists of similar questions;
B. Segment each of two adjacent questions into words;
C. From the segmentation results of the two adjacent questions, compute the part-of-speech weight of each individual word;
D. Compute the sum of the part-of-speech weights of each of the two whole questions and the sum of the part-of-speech weights of the words they share;
E. Compute the ratio of the shared words' part-of-speech weight sum to the whole-question part-of-speech weight sum; if the ratio exceeds 0.8, delete the first of the two adjacent questions.
It will be apparent to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced by the invention. No reference sign in the claims should be construed as limiting the claim concerned.
Furthermore, although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way merely for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions of the individual embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims (2)

1. A question-and-answer data cleaning system based on part-of-speech weight calculation, comprising a question segmentation module, a part-of-speech weight calculation module, and a data removal module, characterized in that the question segmentation module is connected to the part-of-speech weight calculation module, and the part-of-speech weight calculation module is also connected to the data removal module.
2. A question-and-answer data cleaning method based on part-of-speech weight calculation, characterized by comprising the following steps:
A. Sort the Q&A data set by question keyword to obtain a series of lists of similar questions;
B. Segment each of two adjacent questions into words;
C. From the segmentation results of the two adjacent questions, compute the part-of-speech weight of each individual word;
D. Compute the sum of the part-of-speech weights of each of the two whole questions and the sum of the part-of-speech weights of the words they share;
E. Compute the ratio of the shared words' part-of-speech weight sum to the whole-question part-of-speech weight sum; if the ratio exceeds 0.8, delete the first of the two adjacent questions.
CN201810533314.6A, priority date 2018-05-29, filing date 2018-05-29: Question-and-answer data cleaning system based on part-of-speech weight calculation. Pending. CN108763476A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810533314.6A | 2018-05-29 | 2018-05-29 | Question-and-answer data cleaning system based on part-of-speech weight calculation

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810533314.6A | 2018-05-29 | 2018-05-29 | Question-and-answer data cleaning system based on part-of-speech weight calculation

Publications (1)

Publication Number | Publication Date
CN108763476A (en) | 2018-11-06

Family

ID=64003865

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201810533314.6A (Pending; published as CN108763476A (en)) | Question-and-answer data cleaning system based on part-of-speech weight calculation | 2018-05-29 | 2018-05-29

Country Status (1)

Country | Link
CN (1) | CN108763476A (en)

Legal Events

Code | Title / Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication (application publication date: 20181106)