CN103440287A - Web question-answering retrieval system based on product information structuring - Google Patents

Web question-answering retrieval system based on product information structuring

Publication number
CN103440287A
Authority
CN
China
Prior art keywords
information
product
question sentence
module
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013103548884A
Other languages
Chinese (zh)
Other versions
CN103440287B (en)
Inventor
郝志峰
温雯
蔡瑞初
王鸿飞
张奇
张鑫
刘建明
王宗武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIMING SOFTWARE CO., LTD.
Guangdong University of Technology
Foshan University
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201310354888.4A priority Critical patent/CN103440287B/en
Publication of CN103440287A publication Critical patent/CN103440287A/en
Application granted granted Critical
Publication of CN103440287B publication Critical patent/CN103440287B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a Web question-answering retrieval system based on product information structuring, which comprises a user interface, a product information crawling module, an information extraction module, an inverted index building module, a database interface, an information integration module, a question processing module, and a database. The system can obtain updates to online product information in real time and, by means of the information extraction module and the information integration module, can promptly update the existing structured product data in the database or append new structured product data to it, so that the system adapts to changes in online product information. In addition, the system can collect product information from a plurality of product information websites and use the information extraction module and the information integration module to integrate the information about the same product from different websites, adjudicate conflicting information, and fill in missing information across sources, in order to ensure the completeness and authenticity of the retrieved information. The Web question-answering retrieval system based on product information structuring has high retrieval efficiency.

Description

A Web question-answering retrieval system based on product information structuring
Technical field
The present invention relates to the extraction, modeling, and retrieval of unstructured and semi-structured Internet information, and specifically to a Web question-answering retrieval system and method based on product information structuring. It is an improvement on existing Web question-answering retrieval systems of this kind.
Background art
The 21st century is the information age, and networks have become an indispensable part of people's lives. With the rapid development of the Internet, people's demand for online information grows daily, while the Internet itself holds a massive amount of information. Because of the Internet's inherent characteristics, such as its large capacity and dynamic nature, this information is often fragmented and unorganized, and it includes a large amount of invalid data. This reduces how efficiently people can use these rich information resources. To solve this "information overload" problem, many companies and research institutions have turned to research on automatic question-answering systems.
A question answering (QA) system is an advanced form of information retrieval system. It answers questions that users pose in natural language with accurate, concise natural language. The main driver of research in this area is people's need to obtain information quickly and accurately. Question answering is a research direction that receives much attention and has broad prospects in artificial intelligence and natural language processing.
In terms of knowledge domain, existing question-answering systems fall into two categories: closed-domain and open-domain. Closed-domain systems focus on answering questions in a specific field, and most question-answering systems are of this type. Open-domain systems aim not to restrict the subject of the question, which makes them considerably harder to build.
Existing closed-domain question-answering systems include the invention of Kunming University of Science and Technology, application number 200810233734, titled "Answer extraction method for a tourism question-answering system based on ontology inference". That method concentrates on answer extraction for a tourism question-answering system. First, concepts, attributes, and relations in the tourism domain are defined manually, a tourism-domain ontology knowledge base is built manually, and the consistency of the ontology is checked. Second, the semantic information in the ontology knowledge base is used to disambiguate the user's question. Then semantic rules for the tourism domain are defined manually. Next, based on the question analysis results after disambiguation, answers are extracted from the ontology knowledge base by combining reasoning over the semantic rules with information retrieval. Finally, a corresponding answer extraction algorithm is designed for each question type to improve the system's response speed and recall.
As can be seen, the method of that invention requires a great deal of manual intervention: building the knowledge base, defining concepts and attributes, and formulating semantic rules all require human participation. Excessive manual participation increases labor costs, and staff must be retained to maintain and update the system.
Summary of the invention
In view of the above problems, the object of the present invention is to provide a Web question-answering retrieval system based on product information structuring that ensures the completeness and authenticity of the retrieved information and has high retrieval efficiency.
The technical solution of the present invention is as follows. The Web question-answering retrieval system based on product information structuring comprises a user interface, a product information crawling module, an information extraction module, an inverted index building module, a database interface, an information integration module, a question processing module, and a database, wherein:
The user interface handles all communication between the Web question-answering system and the user. It obtains the product-related natural language question entered by the user and passes it to the question processing module, and it returns the corresponding search results and related web pages to the user.
The product information crawling module crawls web pages at regular time intervals, stores the crawled pages, and passes them to the information extraction module for processing.
The information extraction module processes the unstructured information in the web pages crawled by the product information crawling module, converts it into structured information, connects to the structured product information database through the database interface, and stores the processed structured information in the database.
The inverted index building module extracts the key content from the web pages crawled by the product information crawling module and builds an inverted index over these pages.
The database interface provides a unified interface for database operations on the structured product data, such as access and update, and for access control.
The information integration module integrates the structured information from multiple data sources output by the information extraction module, connects to the database through the database interface, and saves the integrated structured data in the database.
The question processing module converts the natural language question entered by the user into a structured statement. It obtains the user's natural language question through the user interface, connects to the database through the database interface, queries the database with the converted statement, and returns the query result to the user through the user interface.
The question processing module converts the natural language question in two steps: it first classifies the question with a trained naive Bayes classifier, and then uses a skip-chain CRF model to identify and extract the named entities in the question.
The named entities are mobile phone names and mobile phone attributes.
The skip-chain CRF model is developed from the linear conditional random field (Linear CRF) model and is one kind of conditional random field (CRF) model.
Whereas earlier named entity recognition methods ignore the role that conjunctions such as "and" and "or" play in a sentence, the skip-chain CRF model establishes a link between the two words before and after a conjunction, which helps improve the final precision. To obtain the recognition model used to extract named entities from query questions, the skip-chain CRF model is trained on a training set, yielding named entity recognition and decision criteria for product information; the question is then converted into keywords and product attributes that are meaningful for retrieval.
The information integration module first derives an attribute mapping table from the attribute value information in the two tables to be processed; that is, it maps to each other the attribute names in the two tables that have the same meaning but may be named differently, which facilitates the subsequent integration. It then creates a target table from the mapping table and reorders the column names of the two tables accordingly. Whether two corresponding records in the tables can be compared is decided by the primary key, which uniquely identifies a record: records with equal primary keys are considered comparable. For comparable records, the information from the two tables is merged or de-duplicated, the result is inserted into the target table, and the corresponding records in the original tables are marked. Finally, the unmarked records are inserted into the target table one by one, yielding an integrated target table. If there are more than two tables, two tables are processed at a time and the above method is repeated to obtain the final result.
The product information crawling module crawls, at regular time intervals, the web pages that present detailed information on digital products on large digital product websites such as pconline and Paopao (泡泡网), stores the crawled pages, and passes them to the information extraction module for processing.
The question processing module converts the natural language question entered by the user into a structured SQL statement. It obtains the user's natural language question through the user interface, connects to the structured product information database through the database interface, queries the database with the converted SQL statement, and returns the result of the SQL query to the user through the user interface.
The present invention is directed at an analysis system for unstructured and semi-structured product information. It integrates information about a product from multiple sources to ensure that the information is authentic and complete, and it uses a classification algorithm and a named entity recognition algorithm to convert natural language questions into structured database query statements. For the fine-grained sentiment analysis system for product review information, an algorithm based on instance similarity is used to integrate the information about the same product from different sources.
The instance-similarity-based algorithm integrates information in two steps, mapping and merging. In the mapping step, the algorithm computes the similarity between corresponding elements of the two tables; in the merging step, the two tables are merged according to the result of the mapping step. For the fine-grained sentiment analysis system for product review information, the question is first classified, then a recognition model is built to extract the named entities in the question, and finally, based on the results of the first two steps, corresponding rules are applied to convert the natural language question into an SQL statement.
The Web question-answering retrieval system based on product information structuring of the present invention has the following advantages:
1) The invention adapts well to product information that changes on the Internet. The periodic information update and collection technique proposed for the system promptly captures changes in product information on the Internet and obtains the latest online product information in real time. Through the information extraction and integration modules, existing structured product data in the database can be updated in time, or new structured product data added, so that the system adapts to changes in online product information.
2) The product information collected is more complete and more authentic. The invention collects product information from multiple information sources and, through the information extraction and integration modules, integrates the information about the same product from different websites, adjudicates contradictory information, and fills in missing information across sources, thereby ensuring the completeness and authenticity of the information.
3) The invention has high retrieval efficiency. Unlike a traditional information retrieval system, which returns web pages related to keywords, the invention, in addition to providing related web pages, passes the user's natural language question through the question processing module, which applies a series of steps such as question classification and named entity recognition, converts the question into a structured SQL statement, and finally queries the database with that statement to return an accurate and concise result to the user.
The present invention is a convenient and practical Web question-answering retrieval system based on product information structuring. It is an advanced form of information retrieval that answers questions posed by users in natural language with accurate and concise language.
Description of the drawings
Fig. 1 is the architecture diagram of the Web question-answering system of the present invention;
Fig. 2 is a schematic diagram of the implementation of the inverted index building module of the present invention;
Fig. 3 is a schematic diagram of the implementation of the data integration module of the present invention;
Fig. 4 is a schematic diagram of the implementation of the question processing module of the present invention;
Fig. 5 is a schematic diagram of the implementation of question classification in the question processing module of the present invention;
Fig. 6 is a schematic diagram of the implementation of named entity recognition in the question processing module of the present invention;
Fig. 7 is the graph structure of the Linear-CRF model, taking the named entity recognition task as an example.
Detailed description
Embodiment:
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to a specific embodiment and the accompanying drawings.
Fig. 1 shows the architecture of the Web question-answering system based on product information structuring.
Referring to Fig. 1, the Web question-answering system of the present invention comprises a user interface, a question processing module, a database interface, a structured product information database, an information integration module, an information extraction module, a product information crawling module, and an inverted index building module.
The user interface handles all communication between the Web question-answering system and the user. It obtains the product-related natural language question entered by the user and passes it to the question processing module, and it returns the corresponding search results and related web pages to the user.
The product information crawling module crawls, at regular time intervals, the web pages that present detailed information on digital products such as mobile phones and computers on large digital product websites such as pconline and Paopao (泡泡网), stores the crawled pages, and passes them to the information extraction module for processing.
The information extraction module processes the unstructured information, such as a mobile phone's CPU clock frequency and screen size, in the web pages crawled by the product information crawling module. It converts this unstructured information into structured information, connects to the structured product information database through the database interface, and stores the processed structured information in the database.
The inverted index building module extracts the key content from the web pages crawled by the product information crawling module and builds an inverted index over these pages.
The database interface provides a unified interface for database operations on the structured product data, such as access and update, and for access control.
The information integration module integrates the structured information from multiple data sources output by the information extraction module, connects to the database through the database interface, and saves the integrated structured data in the database. The invention first derives an attribute mapping table from information such as the attribute values in the two tables to be processed; that is, it maps to each other the attribute names in the two tables that have the same meaning but may be named differently, which facilitates the subsequent integration. It then creates a target table from the mapping table and reorders the column names of the two tables accordingly. Whether two corresponding records in the tables can be compared is decided by the primary key, which uniquely identifies a record: records with equal primary keys are considered comparable. For comparable records, the information from the two tables is merged, de-duplicated, or otherwise processed, the result is inserted into the target table, and the corresponding records in the original tables are marked. Finally, the unmarked records are inserted into the target table one by one, yielding an integrated target table. If there are more than two tables, two tables are processed at a time and the above method is repeated to obtain the final result.
The question processing module converts the natural language question entered by the user into a structured SQL statement. It obtains the user's natural language question through the user interface, connects to the structured product information database through the database interface, queries the database with the converted SQL statement, and returns the result of the SQL query to the user through the user interface. The invention converts the natural language question in two steps: it first classifies the question with a trained naive Bayes classifier, and then uses a skip-chain CRF model to identify and extract the named entities in the question, such as mobile phone names and mobile phone attributes. The skip-chain CRF model is developed from the Linear CRF (linear conditional random field) model and is one kind of CRF (conditional random field) model. Earlier named entity recognition methods generally ignored the role that conjunctions such as "and" and "or" play in a sentence; the skip-chain CRF model establishes a link between the two words before and after a conjunction, which helps improve the final precision.
The invention uses an algorithm based on similarity computation to integrate the information about the same product from different sources. The product information crawling module crawls information from multiple digital product websites so that the collected product information is as complete and rich as possible. However, different websites may use different names for the same attribute or report different attribute values, so the information about the same product from different sources may be redundant or contradictory. The similarity-based algorithm used by the invention integrates this redundant or contradictory information effectively, which both keeps the data complete and gives it higher correctness.
The invention uses question classification and named entity recognition to convert a natural language question into a structured SQL statement. Classifying the question allows finer-grained processing: different conversion rules are applied to different classes of question, which improves the system's ability to understand natural language questions. Named entity recognition identifies the subject or object of the question; only when the subject and object have been understood correctly can the question be converted using the specific conversion rules. The question conversion algorithm used by the invention can convert natural language questions of many classes with high accuracy.
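The rule-based conversion described above can be illustrated with a minimal Python sketch. It assumes that classification and entity extraction have already produced a class label, a list of product names, and a list of attribute names. The table name `phones`, the column names, and the two class labels are hypothetical and do not come from the patent; the sketch also uses parameter placeholders for product names rather than pasting them into the SQL text.

```python
# Hypothetical schema: a table "phones" with a "name" column plus attribute columns.
ALLOWED_COLUMNS = {"price", "screen_size", "cpu_frequency"}


def build_query(question_class, names, attributes):
    """Map (question class, named entities) to an SQL string and its parameters."""
    # Only whitelisted attribute names may appear as column names in the SQL text.
    columns = [a for a in attributes if a in ALLOWED_COLUMNS]
    if not names or not columns:
        raise ValueError("need at least one product name and one known attribute")
    select_list = ", ".join(["name"] + columns)
    if question_class == "attribute_lookup":
        # Rule for questions about a single product.
        return f"SELECT {select_list} FROM phones WHERE name = ?", [names[0]]
    if question_class == "comparison":
        # Rule for questions that compare several products.
        marks = ", ".join("?" for _ in names)
        return f"SELECT {select_list} FROM phones WHERE name IN ({marks})", list(names)
    raise ValueError(f"unsupported question class: {question_class}")
```

Each question class selects its own rule, which mirrors the idea that different classes of question use different conversion rules. The returned pair can be passed to a database driver that supports `?` placeholders.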
In summary, the main modules of the system are the question processing module, the data integration module, and the inverted index building module. These three modules are described in further detail below with reference to the drawings.
Fig. 2 is a schematic diagram of the implementation of the inverted index building module. Referring to Fig. 2, this module extracts the key content from the web pages crawled by the product information crawling module, builds an inverted index over these pages, and stores it. Index construction can be divided into three stages:
1) Preprocessing stage: Htmlparser is used to extract the key content of each web page and remove noise, which improves the accuracy of later retrieval. The extracted data is used to construct Lucene Document objects and their corresponding Field objects.
2) Analysis stage: the data is passed to Lucene for indexing by calling the addDocument(Document) method of the index writer (IndexWriter). When indexing the data, Lucene first analyzes it to make it more suitable for indexing.
3) Index writing stage: after the input data has been analyzed, the result is written to the index files, and the input data is stored in the inverted index data structure.
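The inverted index data structure that the three stages above produce can be sketched in a few lines of plain Python. This is not the Lucene-based implementation the patent describes; it is a stand-in that assumes the page text has already been extracted, uses a simple regular-expression tokenizer in place of Lucene's analysis step, and keeps the index in memory instead of writing index files.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Stand-in for the analysis stage: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """docs maps a document id to its extracted text; returns term -> set of ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Looking up a term is a single dictionary access, which is why retrieval over an inverted index stays fast as the number of crawled pages grows.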
Fig. 3 is a schematic diagram of the implementation of the data integration module. Referring to Fig. 3, this module integrates the structured information from multiple data sources output by the information extraction module and stores the integrated structured data in the database. The module can be divided into two submodules:
1) An attribute mapping table is derived from information such as the attribute values in the two tables to be processed; that is, attribute names in the two tables that have the same meaning but may be named differently are mapped to each other, which facilitates the subsequent integration.
2) A target table is created from the mapping table obtained in step 1, and the column names of the two tables are reordered accordingly. Whether two corresponding records in the tables can be compared is decided by the primary key, which uniquely identifies a record: records with equal primary keys are considered comparable. For comparable records, the information from the two tables is merged, de-duplicated, or otherwise processed, the result is inserted into the target table, and the corresponding records in the original tables are marked. Finally, the unmarked records are inserted into the target table one by one, yielding an integrated target table.
If there are more than two tables, two tables are processed at a time, and steps 1 and 2 above are repeated to obtain the final result.
The detailed steps for building the mapping table and integrating the information are:
1) Steps for obtaining the mapping table:
1. Obtain the attribute value information of the two tables and store it in result1 and result2 respectively, for example:
$\mathrm{result1} = \mathrm{List}\langle a_1, a_2, a_3, \ldots, a_m \rangle$, $a_i = \langle a_{i1}, a_{i2}, a_{i3}, \ldots, a_{in} \rangle$, $i = 1, 2, 3, \ldots, m$.
Here $m$ is the number of attribute columns in the first table and $n$ is the number of rows in each attribute column; that is, the columns of the first table are stored in $a_1, a_2, a_3, \ldots, a_m$ respectively. Similarly:
$\mathrm{result2} = \mathrm{List}\langle b_1, b_2, b_3, \ldots, b_m \rangle$, $b_i = \langle b_{i1}, b_{i2}, b_{i3}, \ldots, b_{in} \rangle$, $i = 1, 2, 3, \ldots, m$.
2. Use the Chinese Academy of Sciences word segmentation tool imdict-chinese-analyzer to segment $a_1, a_2, a_3, \ldots, a_m$ and $b_1, b_2, b_3, \ldots, b_m$, and store the results respectively in $\mathrm{result1SegmentFilter} = \mathrm{List}\langle a'_1, a'_2, a'_3, \ldots, a'_m \rangle$, $a'_i = \langle a'_{i1}, a'_{i2}, \ldots, a'_{ik} \rangle$, and $\mathrm{result2SegmentFilter} = \mathrm{List}\langle b'_1, b'_2, b'_3, \ldots, b'_m \rangle$, $b'_i = \langle b'_{i1}, b'_{i2}, \ldots, b'_{ik} \rangle$.
3. Convert each of $a'_1, a'_2, a'_3, \ldots, a'_m$ and each of $b'_1, b'_2, b'_3, \ldots, b'_m$ into a set, removing duplicate values, and store the results in $\mathrm{result1Set} = \mathrm{List}\langle a''_1, a''_2, a''_3, \ldots, a''_m \rangle$, $a''_i = \langle a''_{i1}, a''_{i2}, \ldots, a''_{i l_i} \rangle$, where $l_i$ is the number of words in $a''_i$,
and $\mathrm{result2Set} = \mathrm{List}\langle b''_1, b''_2, b''_3, \ldots, b''_m \rangle$, $b''_i = \langle b''_{i1}, b''_{i2}, \ldots, b''_{i l'_i} \rangle$, where $l'_i$ is the number of words in $b''_i$.
4. Compute the pairwise similarity between the elements $a''_i$ of result1Set and $b''_j$ of result2Set:
A) If the numbers of words in $a''_i$ and $b''_j$ differ only slightly, compute their similarity directly, using a similarity formula in which the function same counts the number of words that $a''_i$ and $b''_j$ have in common. Store the result in $M(i, j)$. For each $i$, find the value of $j$ that maximizes $M(i, j)$; this $j$ is the column of the second table most likely to correspond to column $i$ of the first table. If $M(i, j)$ exceeds a given threshold, columns $i$ and $j$ are considered to correspond, and the pair is written to the mapping table.
B) If the numbers of words in $a''_i$ and $b''_j$ differ greatly, the one with more words is preprocessed: count the word frequencies, sort the words by frequency from high to low, and truncate the list at a suitable position to obtain new $a''_i$ and $b''_j$. Then go to step A.
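The mapping step can be sketched as follows. The exact similarity formula is not reproduced in this text, so the normalization used here (the shared-word count divided by the size of the smaller set) and the threshold of 0.5 are assumptions made only for illustration. The sketch also assumes the column values have already been segmented and de-duplicated into sets, and it omits the frequency-based truncation of step B.

```python
def same(a, b):
    """Number of words that the two sets have in common."""
    return len(a & b)


def similarity(a, b):
    # ASSUMPTION: normalize the overlap by the smaller set; the patent's
    # own formula is not available in this text.
    if not a or not b:
        return 0.0
    return same(a, b) / min(len(a), len(b))


def map_columns(table1, table2, threshold=0.5):
    """Each table maps a column name to the set of words found in that column.

    Returns a mapping from table1 column names to table2 column names.
    """
    mapping = {}
    for name1, words1 in table1.items():
        best_name, best_score = None, 0.0
        for name2, words2 in table2.items():
            score = similarity(words1, words2)  # plays the role of M(i, j)
            if score > best_score:
                best_name, best_score = name2, score
        # Keep the best match only if it clears the (hypothetical) threshold.
        if best_name is not None and best_score > threshold:
            mapping[name1] = best_name
    return mapping
```

Matching on the values rather than on the column names is what lets differently named attributes, such as "screen" on one site and "display" on another, be recognized as the same attribute.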
2) Steps for integrating the information in the two tables:
1. Using the primary keys of the two tables (the attribute values that uniquely identify a record), integrate the records that have the same primary key, for example by removing redundancy, completing missing information, and resolving conflicts, and mark the processed records in each of the two tables. Insert the processed records into the target table.
2. After all key values of the first table have been traversed, find the records in the two tables that have not been marked and insert them into the target table. This completes the integration of the two tables. To integrate more than two tables, apply the same method, treating the integrated table obtained so far as an ordinary table to be integrated.
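A minimal sketch of the merge step follows, assuming each table is a dictionary keyed by primary key and that a column mapping from the first table's names to the second table's names is already available. The text does not spell out how conflicting values are resolved, so the rule used here (keep the first table's value and fill only gaps from the second table) is an assumption for illustration.

```python
def integrate(table1, table2, mapping):
    """Merge two tables keyed by primary key into one target table.

    mapping maps table1 column names to the equivalent table2 column names.
    """
    rename = {col2: col1 for col1, col2 in mapping.items()}
    target = {}
    marked = set()  # primary keys of table2 records that have been merged
    for key, record1 in table1.items():
        merged = dict(record1)
        if key in table2:
            for col2, value in table2[key].items():
                col = rename.get(col2, col2)
                # ASSUMPTION: on conflict keep table1's value; fill gaps only.
                if merged.get(col) in (None, ""):
                    merged[col] = value
            marked.add(key)
        target[key] = merged
    # Records that only exist in table2 are inserted unchanged, with renamed columns.
    for key, record2 in table2.items():
        if key not in marked:
            target[key] = {rename.get(c, c): v for c, v in record2.items()}
    return target
```

To integrate a third source, the returned target table can be passed back in as `table1` together with the next table, which corresponds to processing two tables at a time.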
Fig. 4 is a schematic diagram of the implementation of the question processing module. Referring to Fig. 4, this module converts the natural language question entered by the user into a structured SQL statement. In this module, the conversion of a natural language question is divided into three steps: text preprocessing, question classification, and named entity recognition. Text preprocessing mainly performs word segmentation and part-of-speech tagging on the question. Question classification and named entity recognition are described in detail here.
Fig. 5 is a schematic diagram of the implementation of question classification. Referring to Fig. 5, the invention uses the naive Bayes algorithm to classify the natural language question and selects the final class according to the maximum likelihood criterion. Let the set of classes be $C = \{C_1, C_2, \ldots, C_n\}$ and let the segmented input question be $X = \{x_1, x_2, \ldots, x_m\}$, where $x_j$ is a word in the question. The probability that the question belongs to each class is computed from the training data with the following formula:
$$P(C_i \mid X) = \frac{P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_m \mid C_i)}{P(X)}, \quad 1 \le i \le n$$
Because $P(X)$ is fixed for a given natural language question, only $P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_m \mid C_i)$ needs to be computed, and the class with the highest probability is selected as the final class.
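The classification rule above can be sketched in Python as follows. The sketch scores each class by the product of the word likelihoods, as in the formula, and computes it as a sum of logarithms to avoid numerical underflow. The add-one (Laplace) smoothing and the class labels in the usage are assumptions added for illustration; the text does not say how unseen words are handled.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples is a list of (segmented question words, class label) pairs."""
    counts = defaultdict(Counter)  # class label -> word frequency table
    vocabulary = set()
    for words, label in samples:
        counts[label].update(words)
        vocabulary.update(words)
    return counts, vocabulary


def classify(words, counts, vocabulary):
    """Return the class that maximizes P(x_1|C) * P(x_2|C) * ... * P(x_m|C)."""
    best_label, best_score = None, -math.inf
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        # ASSUMPTION: add-one smoothing so unseen words do not zero the product.
        score = sum(
            math.log((word_counts[w] + 1) / (total + len(vocabulary)))
            for w in words
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, after training on a handful of labeled questions, `classify(["which", "better"], counts, vocabulary)` returns whichever class has seen those words most often relative to its size.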
Fig. 6 is a schematic diagram of the implementation of named entity recognition. Referring to Fig. 6, the invention uses a CRF model with a skip-chain structure to recognize the named entities in the question. This model is the key to the invention's conversion of natural language questions, so the structure, principle, and advantages of the skip-chain CRF model are described in detail below.
By observing and analyzing a large number of questions about product information, we found that many questions contain two or more named entities at the same time, for example "Nokia 5230 and Nokia N8, which is better?" or "Nokia 5230 and Nokia 5233, which is better?", and that in such questions the entity names are usually joined by conjunctions such as "and" or "or". This gives rise to the following phenomenon: if the word before a conjunction is judged to be an entity name, the word after the conjunction is very likely an entity name of the same kind. The phenomenon has been mentioned in earlier work, but no good solution was proposed. By constructing a skip-chain CRF model, the invention connects the words before and after a conjunction, so that the information in this phenomenon is taken into account during the decision process, which helps improve the accuracy of named entity recognition. In Fig. 6, the label T1 denotes the front part of an entity word, T2 denotes the rear part of an entity word, and O denotes other words.
Skip-chain CRF is a special kind of CRF model. A CRF is an undirected graphical model that models the conditional probability distribution of a label sequence given a set of features. Taking the most basic Linear-CRF as an example, given an observation sequence X, the conditional probability of the label sequence Y can be formally described as follows:
$$P(Y \mid X) = \frac{1}{Z(X)} \prod_{i=1}^{I} \psi_i(y_i, y_{i-1}, X)$$
where ψ_i is a potential function in the sense of undirected graphical models, and
$$Z(X) = \sum_{Y} \prod_{i=1}^{I} \psi_i(y_i, y_{i-1}, X)$$
is the normalization factor, summed over all possible label sequences of length I. The potential function ψ_i can be decomposed into the following form, where f_k are the defined feature functions:
$$\psi_i(y_i, y_{i-1}, X) = \exp\Big\{ \sum_{k} \lambda_k f_k(y_i, y_{i-1}, X, i) \Big\}$$
The corresponding graphical model structure is shown in Fig. 7. Taking the named entity recognition task as an example, the preprocessed text is taken as input and the corresponding Linear-CRF model is built. Linear-CRF models the conditional probability of the label sequence directly. Unlike directed graphical models such as the HMM (hidden Markov model), it can introduce rich features without making independence assumptions among them. It can also be regarded as a globally normalized MEMM (maximum entropy Markov model), and it avoids the label bias problem of the MEMM. Linear-CRF therefore performs well on sequence labeling problems such as named entity recognition.
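As a concrete illustration of the Linear-CRF definition, the following Python sketch evaluates P(Y | X) on a three-word question by enumerating every label sequence to obtain Z(X). The two feature templates, the dummy start label, and the hand-picked weights are assumptions made for this example only. Exhaustive enumeration is practical only for very short sequences; a real system would use dynamic programming.

```python
import math
from itertools import product

LABELS = ("T1", "T2", "O")   # label set from Fig. 6
START = "<s>"                # assumed dummy label standing in for y_0

def log_potential(y_cur, y_prev, words, i, weights):
    """log psi_i(y_i, y_{i-1}, X) = sum_k lambda_k * f_k(y_i, y_{i-1}, X, i)."""
    active = [
        ("cap", words[i][0].isupper(), y_cur),  # hypothetical observation feature
        ("trans", y_prev, y_cur),               # hypothetical transition feature
    ]
    return sum(weights.get(name, 0.0) for name in active)

def log_score(labels, words, weights):
    """Sum of log potentials along the chain (log of the unnormalized product)."""
    total, prev = 0.0, START
    for i, y in enumerate(labels):
        total += log_potential(y, prev, words, i, weights)
        prev = y
    return total

def conditional_probability(labels, words, weights):
    """P(Y | X): exponentiated score divided by Z(X), enumerated exhaustively."""
    z = sum(math.exp(log_score(c, words, weights))
            for c in product(LABELS, repeat=len(words)))
    return math.exp(log_score(tuple(labels), words, weights)) / z

# Hand-picked weights (lambda_k) for illustration only; real values are learned.
WEIGHTS = {("cap", True, "T1"): 1.0, ("trans", "T1", "T2"): 2.0, ("cap", False, "O"): 1.5}
WORDS = ["Nokia", "N8", "price"]
```

With these weights, `conditional_probability(["T1", "T2", "O"], WORDS, WEIGHTS)` is larger than the probability of labeling every word O, and the probabilities of all 27 label sequences sum to one.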
Skip-chain CRF is a CRF model that improves on Linear-CRF. As the graphical model of the Skip-chain CRF in Fig. 6 shows, in addition to the linear-chain edges between adjacent nodes found in Linear-CRF, its structure introduces skip-chain edges between the two words before and after a conjunction, thereby adding, on top of Linear-CRF, information linking the labels of the words on either side of the conjunction.
The formal description of the Skip-chain CRF is as follows:
$$P(Y \mid X) = \frac{1}{Z(X)} \prod_{i=1}^{I} \Psi_i(y_i, y_{i-1}, X) \prod_{(j, j+2) \in S} \phi_{j,j+2}(y_j, y_{j+2}, X)$$
where Ψ_i is the potential function defined on adjacent label nodes, φ_{j,j+2} is the potential function defined on a skip-chain, and S = {(j, j+2)} is the set of all skip-chains. They are defined as follows:
$$\Psi_i(y_i, y_{i-1}, X) = \exp\Big\{ \sum_{k} \lambda_k f_k(y_i, y_{i-1}, X, i) \Big\}$$
$$\phi_{j,j+2}(y_j, y_{j+2}, X) = \exp\Big\{ \sum_{l} \eta_l f_l(y_j, y_{j+2}, X, j, j+2) \Big\}$$
where f_k(y_i, y_{i-1}, X, i) are the feature functions defined on the linear-chain and f_l(y_j, y_{j+2}, X, j, j+2) are the feature functions defined on the skip-chain.
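The effect of the skip-chain term can be sketched in Python as follows. This is an illustrative toy, not the invention's model: the conjunction list, the single skip feature `both_entity` (which fires when neither endpoint of a skip edge is labeled O), and all weights are assumptions, and normalization is done by brute-force enumeration, which only works for very short inputs.

```python
import math
from itertools import product

LABELS = ("T1", "T2", "O")
CONJUNCTIONS = {"and", "or"}   # assumed English stand-ins for the conjunctions

def skip_edges(words):
    """S = {(j, j+2)}: pairs of positions separated by exactly one conjunction."""
    return [(j, j + 2) for j in range(len(words) - 2)
            if words[j + 1] in CONJUNCTIONS]

def log_score(labels, words, lam, eta):
    """Log of the unnormalized product of linear-chain and skip-chain potentials."""
    total, prev = 0.0, "<s>"
    for i, y in enumerate(labels):
        total += lam.get(("trans", prev, y), 0.0)      # f_k on the linear chain
        total += lam.get(("word", words[i], y), 0.0)   # f_k tied to the observed word
        prev = y
    for j, k in skip_edges(words):
        if labels[j] != "O" and labels[k] != "O":      # f_l on the skip chain
            total += eta.get("both_entity", 0.0)
    return total

def probability(labels, words, lam, eta):
    """P(Y | X) with Z(X) obtained by enumerating all label sequences."""
    z = sum(math.exp(log_score(c, words, lam, eta))
            for c in product(LABELS, repeat=len(words)))
    return math.exp(log_score(tuple(labels), words, lam, eta)) / z

# Illustrative weights: the model knows "5230" but has no evidence about "N8".
WORDS = ["5230", "and", "N8"]
LAM = {("word", "5230", "T1"): 2.0, ("word", "and", "O"): 2.0}
ETA = {"both_entity": 1.5}
```

Comparing `probability(["T1", "O", "T1"], WORDS, LAM, ETA)` with the same call using an empty `eta` shows how a positive skip weight shifts probability mass toward labelings in which the words on both sides of the conjunction are entities.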
During model training, the present invention uses the L-BFGS algorithm to train the skip-chain CRF model, learning the model parameters λ_k and η_l.
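The text names L-BFGS but does not spell out the objective being optimized. Assuming the usual choice for CRFs, namely the conditional log-likelihood over N labeled training questions (the symbols N, n, and 𝓛 are introduced here for illustration only), the objective and the gradients that L-BFGS would consume can be written as:

$$\mathcal{L}(\lambda, \eta) = \sum_{n=1}^{N} \log P\big(Y^{(n)} \mid X^{(n)}\big)$$

$$\frac{\partial \mathcal{L}}{\partial \lambda_k} = \sum_{n=1}^{N} \Big[ \sum_{i} f_k\big(y^{(n)}_i, y^{(n)}_{i-1}, X^{(n)}, i\big) - \mathbb{E}_{Y \sim P(\cdot \mid X^{(n)})} \sum_{i} f_k\big(y_i, y_{i-1}, X^{(n)}, i\big) \Big]$$

$$\frac{\partial \mathcal{L}}{\partial \eta_l} = \sum_{n=1}^{N} \Big[ \sum_{(j, j+2) \in S} f_l\big(y^{(n)}_j, y^{(n)}_{j+2}, X^{(n)}, j, j+2\big) - \mathbb{E}_{Y \sim P(\cdot \mid X^{(n)})} \sum_{(j, j+2) \in S} f_l\big(y_j, y_{j+2}, X^{(n)}, j, j+2\big) \Big]$$

Each gradient is the difference between the observed feature count and the count expected under the current model. A regularization term is often added in practice; whether the invention uses one is not stated.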
The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the foregoing are only specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A Web question-answering retrieval system based on product information structuring, characterized by comprising a user interface, a product information crawling module, an information extraction module, an inverted index building module, a database interface, an information integration module, a question processing module, and a database, wherein:
the user interface handles all communication between the Web question-answering system and the user, including obtaining the product-related natural language question entered by the user and passing it to the question processing module, and returning the corresponding search results and related web pages to the user;
the product information crawling module crawls web pages at a set time interval, stores the crawled pages, and passes them to the information extraction module for processing;
the information extraction module processes the unstructured web page information crawled by the product information crawling module, converts this unstructured information into structured information, connects to the structured product information database through the database interface, and stores the processed structured information in the database;
the inverted index building module extracts key content from the web pages crawled by the product information crawling module and builds an inverted index over these pages;
the database interface implements access to the structured product data and serves as the unified interface for database update operations and access-permission control;
the information integration module integrates the structured information from multiple data sources output by the information extraction module, connects to the database through the database interface, and saves the integrated structured data in the database;
the question processing module converts the natural language question entered by the user into a structured statement; this module connects to the user through the user interface to obtain the natural language question, connects to the database through the database interface, queries the database with the converted statement, and returns the query result to the user through the user interface.
2. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the question processing module converts the natural language question in two steps: first, a trained Naive Bayes classifier classifies the natural language question; then a skip-chain CRF model identifies and extracts the named entities in the natural language question.
3. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the named entities are mobile phone names and mobile phone attributes.
4. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the Skip-chain CRF model is developed on the basis of the linear conditional random field (Linear CRF) model and is one kind of conditional random field (CRF) model.
5. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that, in the named entity recognition method, the role of the conjunctions "and" and "or" in the sentence is disregarded and a connection is established in the Skip-chain CRF model between the two words before and after the conjunction, which helps improve the final precision; for the recognition model used to extract named entities from query questions, the Skip-Chain CRF model is trained on a training set to obtain the named entity recognition and decision criteria for product information, and the question is then converted into keywords and product attributes that are meaningful for retrieval.
6. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the information integration module first obtains an attribute mapping table from the attribute value information in the two tables to be processed, that is, attribute names in the two tables that have the same meaning but possibly different names are mapped to each other to facilitate the subsequent integration; a target table is then created from the resulting mapping table, and the column names of the two tables are each rearranged in order; whether corresponding records in the two tables are comparable is determined by the primary key value that uniquely identifies a record, records with equal key values being considered comparable; if the records are comparable, the information in the two tables is merged or de-duplicated, the result is inserted into the target table, and the corresponding records in the original tables are marked; finally, the unmarked records are also inserted into the target table one by one, yielding an integrated target table; if there are more than two tables, two tables are processed at a time and the above method is repeated to obtain the final result.
7. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the product information crawling module crawls, at a set time interval, web pages presenting detailed information on digital products from large digital-product websites such as pconline and Bubble, stores the crawled pages, and passes them to the information extraction module for processing.
8. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the question processing module converts the natural language question entered by the user into a structured SQL statement; this module connects to the user through the user interface to obtain the natural language question, connects to the structured product information database through the database interface, queries the database with the converted SQL statement, and returns the query result of the SQL statement to the user through the user interface.
9. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that, in the analysis system for unstructured and semi-structured product information, information about the same product from multiple sources is integrated to ensure that the information is accurate and complete; at the same time, a classification algorithm and a named entity recognition algorithm are used to convert the natural language question into a structured database query statement; and, in the fine-grained sentiment analysis system for product review information, an instance-similarity-based algorithm is used to integrate information about the same product from different sources.
10. The Web question-answering retrieval system based on product information structuring according to claim 9, characterized in that the instance-similarity-based algorithm integrates information in two steps, mapping and merging: in the mapping step, the instance-similarity-based algorithm computes the similarity of corresponding elements of the two tables; in the merging step, the two tables are merged according to the result of the previous step; in the fine-grained sentiment analysis system for product review information, the question is first classified, a recognition model is then built to extract the named entities in the question, and finally, based on the results of the first two steps, corresponding rules are applied to convert the natural language question into an SQL statement.
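For illustration only, the mapping and merging steps described in claims 6 and 10 can be sketched in Python as below. This sketch makes several assumptions that the claims do not state: each table is a list of dictionaries, the attribute mapping table is supplied by hand rather than derived by similarity computation, and when two comparable records disagree the value from the first table is kept. The function name, column names, and sample values are hypothetical.

```python
def integrate_tables(table_a, table_b, attribute_map, key):
    """Merge two tables (lists of dict records) into one target table.

    attribute_map maps column names of table_b onto the names used in table_a;
    key is the primary-key column that decides whether two records are comparable.
    """
    # Mapping step: rename table_b columns to the shared attribute names.
    renamed_b = [{attribute_map.get(col, col): val for col, val in rec.items()}
                 for rec in table_b]
    target, marked_b = [], set()
    for rec_a in table_a:
        merged = dict(rec_a)
        for idx, rec_b in enumerate(renamed_b):
            if idx not in marked_b and rec_b.get(key) == rec_a.get(key):
                # Merging step: fill gaps from table_b, drop redundant values.
                for col, val in rec_b.items():
                    if merged.get(col) in (None, ""):
                        merged[col] = val
                marked_b.add(idx)   # mark the record as already integrated
                break
        target.append(merged)
    # Finally, insert the unmarked records of table_b one by one.
    target.extend(rec for idx, rec in enumerate(renamed_b) if idx not in marked_b)
    return target

# Hypothetical sample data; the values are placeholders, not real specifications.
SOURCE_A = [{"name": "Phone X", "screen": "s1"}, {"name": "Phone Y", "screen": ""}]
SOURCE_B = [{"model": "Phone Y", "display": "s2", "os": "o1"},
            {"model": "Phone Z", "display": "s3"}]
MAPPING = {"model": "name", "display": "screen"}
MERGED = integrate_tables(SOURCE_A, SOURCE_B, MAPPING, key="name")
```

To integrate more than two tables, the same function can be applied repeatedly, merging the running result with the next table each time.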
CN201310354888.4A 2013-08-14 2013-08-14 A kind of Web question and answer searching system based on product information structure Active CN103440287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310354888.4A CN103440287B (en) 2013-08-14 2013-08-14 A kind of Web question and answer searching system based on product information structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310354888.4A CN103440287B (en) 2013-08-14 2013-08-14 A kind of Web question and answer searching system based on product information structure

Publications (2)

Publication Number Publication Date
CN103440287A true CN103440287A (en) 2013-12-11
CN103440287B CN103440287B (en) 2016-12-28

Family

ID=49693979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310354888.4A Active CN103440287B (en) 2013-08-14 2013-08-14 A kind of Web question and answer searching system based on product information structure

Country Status (1)

Country Link
CN (1) CN103440287B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102738A (en) * 2014-07-28 2014-10-15 百度在线网络技术(北京)有限公司 Entity library expansion method and device
CN105045909A (en) * 2015-08-11 2015-11-11 北京京东尚科信息技术有限公司 Method and device for recognizing commodity name from text
CN105260590A (en) * 2015-09-16 2016-01-20 西部天使(北京)健康科技有限公司 Method and system for combining multiple follow-up visit plans
CN105302841A (en) * 2014-07-31 2016-02-03 青岛海尔智能家电科技有限公司 Information integration apparatus, server and method
CN105786794A (en) * 2016-02-05 2016-07-20 青岛理工大学 Question-answer pair search method and community question-answer search system
CN106919563A (en) * 2015-12-24 2017-07-04 神州数码信息系统有限公司 A kind of cross-border issue of government affairs machine question answering system is classified, distributes automatically, the method for response
CN107741939A (en) * 2016-10-31 2018-02-27 腾讯科技(深圳)有限公司 A kind of recognition methods of info web and device
US9928269B2 (en) 2015-01-03 2018-03-27 International Business Machines Corporation Apply corrections to an ingested corpus
CN108182295A (en) * 2018-02-09 2018-06-19 重庆誉存大数据科技有限公司 A kind of Company Knowledge collection of illustrative plates attribute extraction method and system
CN109002501A (en) * 2018-06-29 2018-12-14 北京百度网讯科技有限公司 For handling method, apparatus, electronic equipment and the computer readable storage medium of natural language dialogue
CN109271459A (en) * 2018-09-18 2019-01-25 四川长虹电器股份有限公司 Chat robots and its implementation based on Lucene and grammer networks
CN111914087A (en) * 2020-07-30 2020-11-10 广州城市信息研究所有限公司 Public opinion analysis method
CN112507098A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Question processing method, question processing device, electronic equipment, storage medium and program product
CN113987146A (en) * 2021-10-22 2022-01-28 国网江苏省电力有限公司镇江供电分公司 Dedicated novel intelligence of electric power intranet system of asking for answering
CN117132392A (en) * 2023-10-23 2023-11-28 蓝色火焰科技成都有限公司 Vehicle loan fraud risk early warning method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060206481A1 (en) * 2005-03-14 2006-09-14 Fuji Xerox Co., Ltd. Question answering system, data search method, and computer program
CN101373532A (en) * 2008-07-10 2009-02-25 昆明理工大学 FAQ Chinese request-answering system implementing method in tourism field
US20090327234A1 (en) * 2008-06-27 2009-12-31 Google Inc. Updating answers with references in forums
CN102262634A (en) * 2010-05-24 2011-11-30 北京大学深圳研究生院 Automatic questioning and answering method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060206481A1 (en) * 2005-03-14 2006-09-14 Fuji Xerox Co., Ltd. Question answering system, data search method, and computer program
US20090327234A1 (en) * 2008-06-27 2009-12-31 Google Inc. Updating answers with references in forums
CN101373532A (en) * 2008-07-10 2009-02-25 昆明理工大学 FAQ Chinese request-answering system implementing method in tourism field
CN102262634A (en) * 2010-05-24 2011-11-30 北京大学深圳研究生院 Automatic questioning and answering method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DMIR: "基于产品信息结构化的web问答检索系统V1.0", 《HTTP://DMIRLAB.COM/ACHIEVEMENT.PHP?ID=135》 *
黄健斌 等: "基于混合跳链条件随机场的异构Web记录集成方法", 《JOURNAL OF SOFTWARE 软件学报》 *
黄高辉: "汉语情感问题分析和比较类型情感问答方法的研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102738A (en) * 2014-07-28 2014-10-15 百度在线网络技术(北京)有限公司 Entity library expansion method and device
CN105302841A (en) * 2014-07-31 2016-02-03 青岛海尔智能家电科技有限公司 Information integration apparatus, server and method
US10430405B2 (en) 2015-01-03 2019-10-01 International Business Machines Corporation Apply corrections to an ingested corpus
US9928269B2 (en) 2015-01-03 2018-03-27 International Business Machines Corporation Apply corrections to an ingested corpus
CN105045909A (en) * 2015-08-11 2015-11-11 北京京东尚科信息技术有限公司 Method and device for recognizing commodity name from text
CN105045909B (en) * 2015-08-11 2018-04-03 北京京东尚科信息技术有限公司 The method and apparatus that trade name is identified from text
CN105260590A (en) * 2015-09-16 2016-01-20 西部天使(北京)健康科技有限公司 Method and system for combining multiple follow-up visit plans
CN106919563A (en) * 2015-12-24 2017-07-04 神州数码信息系统有限公司 A kind of cross-border issue of government affairs machine question answering system is classified, distributes automatically, the method for response
CN105786794B (en) * 2016-02-05 2018-09-04 青岛理工大学 A kind of question and answer are to search method and community's question and answer searching system
CN105786794A (en) * 2016-02-05 2016-07-20 青岛理工大学 Question-answer pair search method and community question-answer search system
CN107741939A (en) * 2016-10-31 2018-02-27 腾讯科技(深圳)有限公司 A kind of recognition methods of info web and device
CN107741939B (en) * 2016-10-31 2020-05-12 腾讯科技(深圳)有限公司 Webpage information identification method and device
CN108182295A (en) * 2018-02-09 2018-06-19 重庆誉存大数据科技有限公司 A kind of Company Knowledge collection of illustrative plates attribute extraction method and system
CN108182295B (en) * 2018-02-09 2021-09-10 重庆电信系统集成有限公司 Enterprise knowledge graph attribute extraction method and system
CN109002501A (en) * 2018-06-29 2018-12-14 北京百度网讯科技有限公司 For handling method, apparatus, electronic equipment and the computer readable storage medium of natural language dialogue
CN109271459A (en) * 2018-09-18 2019-01-25 四川长虹电器股份有限公司 Chat robots and its implementation based on Lucene and grammer networks
CN111914087A (en) * 2020-07-30 2020-11-10 广州城市信息研究所有限公司 Public opinion analysis method
CN111914087B (en) * 2020-07-30 2023-09-19 广州城市信息研究所有限公司 Public opinion analysis method
CN112507098A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Question processing method, question processing device, electronic equipment, storage medium and program product
CN112507098B (en) * 2020-12-18 2022-01-28 北京百度网讯科技有限公司 Question processing method, question processing device, electronic equipment, storage medium and program product
CN113987146A (en) * 2021-10-22 2022-01-28 国网江苏省电力有限公司镇江供电分公司 Dedicated novel intelligence of electric power intranet system of asking for answering
CN113987146B (en) * 2021-10-22 2023-01-31 国网江苏省电力有限公司镇江供电分公司 Dedicated intelligent question-answering system of electric power intranet
CN117132392A (en) * 2023-10-23 2023-11-28 蓝色火焰科技成都有限公司 Vehicle loan fraud risk early warning method and system
CN117132392B (en) * 2023-10-23 2024-01-30 蓝色火焰科技成都有限公司 Vehicle loan fraud risk early warning method and system

Also Published As

Publication number Publication date
CN103440287B (en) 2016-12-28

Similar Documents

Publication Publication Date Title
CN103440287B (en) A kind of Web question and answer searching system based on product information structure
CN110990590A (en) Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN107220237A (en) A kind of method of business entity's Relation extraction based on convolutional neural networks
CN111651447B (en) Intelligent construction life-span data processing, analyzing and controlling system
WO2020010834A1 (en) Faq question and answer library generalization method, apparatus, and device
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN111626568B (en) Knowledge base construction method and knowledge search method and system in natural disaster field
CN103425740A (en) IOT (Internet Of Things) faced material information retrieval method based on semantic clustering
Zhao et al. Research on information extraction of technical documents and construction of domain knowledge graph
Ahmad et al. A survey of searching and information extraction on a classical text using ontology-based semantics modeling: A case of Quran
CN116127084A (en) Knowledge graph-based micro-grid scheduling strategy intelligent retrieval system and method
CN107330111A (en) The search method and device of domain body based on common version body
CN112966053A (en) Knowledge graph-based marine field expert database construction method and device
CN113946686A (en) Electric power marketing knowledge map construction method and system
CN111710428A (en) Biomedical text representation method for modeling global and local context interaction
CN115840805A (en) Method for constructing intelligent question-answering system based on knowledge graph of computer science
CN114490964A (en) Soil fertility knowledge question-answering method, system, equipment and medium based on knowledge map
CN117094390A (en) Knowledge graph construction and intelligent search method oriented to ocean engineering field
Mohnot et al. Hybrid approach for Part of Speech Tagger for Hindi language
CN107908749A (en) A kind of personage's searching system and method based on search engine
CN115964468A (en) Rural information intelligent question-answering method and device based on multilevel template matching
CN116011564A (en) Entity relationship completion method, system and application for power equipment
CN115730078A (en) Event knowledge graph construction method and device for class case retrieval and electronic equipment
CN115905554A (en) Chinese academic knowledge graph construction method based on multidisciplinary classification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190425

Address after: 528000 No. 18 Jiangwan Road, Chancheng District, Foshan, Guangdong.

Co-patentee after: Guangdong University of Technology

Patentee after: Foshan Science & Technology College

Co-patentee after: BEIMING SOFTWARE CO., LTD.

Address before: 510006 Panyu District, Guangzhou, Guangdong, Panyu District, No. 100, West Ring Road, outside the city.

Patentee before: Guangdong University of Technology