CN103440287A

CN103440287A - Web question-answering retrieval system based on product information structuring

Info

Publication number: CN103440287A
Application number: CN2013103548884A
Authority: CN
Inventors: 郝志峰; 温雯; 蔡瑞初; 王鸿飞; 张奇; 张鑫; 刘建明; 王宗武
Original assignee: Guangdong University of Technology
Current assignee: BEIMING SOFTWARE CO., LTD.; Guangdong University of Technology; Foshan University
Priority date: 2013-08-14
Filing date: 2013-08-14
Publication date: 2013-12-11
Anticipated expiration: 2033-08-14
Also published as: CN103440287B

Abstract

The invention relates to a Web question-answering retrieval system based on product information structuring, which comprises a user interface, a product information crawling module, an information extraction module, an inverted index building module, a database interface, an information integration module, a question processing module and a database. The system can obtain updates of online product information in real time and, by means of the information extraction module and the information integration module, can promptly update the existing structured product data in the database or append new structured product data to it, so that the system adapts to changes in online product information. In addition, the system collects product information from a plurality of product information websites and uses the information extraction module and the information integration module to integrate the information about the same product from different websites, resolve conflicting information and supplement information missing from one source with information from another, in order to ensure the completeness and authenticity of the retrieved information. The system also has high retrieval efficiency.

Description

A Web question-answering retrieval system based on product information structuring

Technical field

The present invention relates to the field of extraction, modeling and retrieval of unstructured and semi-structured Internet information, and in particular to a Web question-answering retrieval system and method based on product information structuring. It is an improvement on existing Web question-answering retrieval systems based on product information structuring.

Background art

The 21st century is the information age, and the network has become an indispensable part of people's lives. With the rapid development of the Internet, people's demand for online information grows day by day, and the Internet holds a massive amount of information. However, because of inherent characteristics of the Internet such as its large capacity and dynamic nature, this information is often fragmented and unorganized and contains a large amount of invalid data, which reduces the efficiency with which people can use these rich information resources. To solve this "information overload" problem, many companies and research institutions have turned to research on automatic question answering systems.

A question answering (QA) system is an advanced form of information retrieval system. It answers questions that users pose in natural language with accurate and concise natural language. The main reason for the rise of this research is people's need to obtain information quickly and accurately. Question answering is a research direction that attracts much attention and has broad development prospects in the current fields of artificial intelligence and natural language processing.

In terms of knowledge domain, existing question answering systems can be divided into two types: closed-domain and open-domain systems. A closed-domain system focuses on answering questions in a specific domain, and most question answering systems are closed-domain systems. An open-domain system aims not to restrict the subject of the question, which is relatively more difficult.

Existing closed-domain question answering systems mainly include the following: application No. 200810233734 of Kunming University of Science and Technology, entitled "Answer extraction method for a tourism question answering system based on ontology reasoning". That method concentrates on answer extraction for a tourism question answering system. First, the concepts, attributes and relations of the tourism domain are defined manually, a tourism domain ontology knowledge base is constructed manually, and the consistency of the ontology is then checked. Second, the semantic information in the ontology knowledge base is used to semantically disambiguate the user's question. Third, semantic rules of the tourism domain are defined manually. Then, based on the question analysis result after semantic disambiguation, answers are extracted from the ontology knowledge base by combining reasoning over the corresponding semantic rules with information retrieval. Finally, a corresponding answer extraction algorithm is designed for each question type, improving the response speed and recall of the system.

It can be seen that the method adopted by that invention requires a large amount of manual intervention: the construction of the knowledge base, the definition of concepts and attributes and the formulation of semantic rules all require human participation. Excessive manual participation increases labor cost, and dedicated personnel must be retained to maintain and update the system.

Summary of the invention

The object of the present invention is to address the above problems and to provide a Web question-answering retrieval system based on product information structuring that ensures the completeness and authenticity of the retrieved information and has high retrieval efficiency.

The technical solution of the present invention is as follows: the Web question-answering retrieval system based on product information structuring comprises a user interface, a product information crawling module, an information extraction module, an inverted index building module, a database interface, an information integration module, a question processing module and a database, wherein:

The user interface handles all communication between the Web question answering system and the user, including obtaining the product-related natural language question entered by the user and passing it to the question processing module, and returning the corresponding search results and related web pages to the user;

The product information crawling module crawls web pages at a fixed time interval, stores the crawled pages and passes them to the information extraction module for processing;

The information extraction module processes the unstructured web page information in the pages crawled by the product information crawling module, converts this unstructured information into structured information, connects to the structured product information database through the database interface and stores the processed structured information in the database;

The inverted index building module extracts the key content from the web pages crawled by the product information crawling module and builds an inverted index over these pages;

The database interface provides a unified interface for database operations such as accessing and updating the structured product data, together with access rights control;

The information integration module integrates the structured information from multiple data sources output by the information extraction module, connects to the database through the database interface and saves the integrated structured data in the database;

The question processing module converts the natural language question entered by the user into a structured statement. This module connects to the user through the user interface to obtain the natural language question, connects to the database through the database interface, uses the statement obtained from the conversion to query the database, and feeds the query result back to the user through the user interface.

The question processing module converts the natural language question in two steps: it first classifies the question using a trained Naive Bayes classifier, and then uses a skip-chain CRF model to recognize and extract the named entities in the question.

The named entities are mobile phone names and mobile phone attributes.

The skip-chain CRF model is developed on the basis of the linear conditional random field (Linear CRF) model and is one kind of conditional random field (CRF) model.

Earlier named entity recognition methods generally ignore the role that conjunctions such as "and" and "or" play in a sentence; the skip-chain CRF model establishes a link between the words immediately before and after a conjunction, which helps improve the final precision. To obtain the recognition model used to extract named entities from query questions, the skip-chain CRF model is trained on a training set, yielding named entity recognition and decision criteria for product information; the question is then converted into the keywords and product attributes that carry the retrieval meaning.

The information integration module first derives an attribute mapping table from the attribute value information in the two tables to be processed; that is, it maps to each other the attribute names in the two tables that have the same meaning but may be named differently, which facilitates the subsequent integration. It then creates a target table according to the mapping table and rearranges the column names of the two tables into a consistent order. Whether two records in the two tables are comparable is determined by the primary key value that uniquely identifies a record: if the key values are equal, the records are considered comparable. Comparable records are merged or de-duplicated, the result is inserted into the target table, and the corresponding records in the original tables are marked. Finally, the unmarked records are also inserted into the target table one by one, yielding an integrated target table. If there are more than two tables, two tables are processed at a time and the above method is repeated until the final result is obtained.

The product information crawling module crawls, at a fixed time interval, the web pages that present detailed digital product information on large digital product websites such as pconline and Paopao ("Bubble"), stores the crawled pages and passes them to the information extraction module for processing.

The question processing module converts the natural language question entered by the user into a structured SQL statement. This module connects to the user through the user interface to obtain the natural language question, connects to the structured product information database through the database interface, uses the SQL statement obtained from the conversion to query the database, and feeds the query result of the SQL statement back to the user through the user interface.

The present invention provides an analysis system for unstructured and semi-structured product information that integrates information about the same product from multiple sources to ensure that the information is authentic and complete. It also uses a classification algorithm and a named entity recognition algorithm to convert natural language questions into structured database query statements. For the fine-grained sentiment analysis system for product review information, an algorithm based on instance similarity is used to integrate information about the same product from different sources.

The instance-similarity-based algorithm integrates information in two steps, mapping and merging. In the mapping step, the algorithm computes the similarity of the corresponding elements of the two tables; in the merging step, the two tables are merged according to the result of the mapping step. For the fine-grained sentiment analysis system for product review information, the question is first classified, a recognition model is then built to extract the named entities in the question, and finally, according to the results of the first two steps, the corresponding rules are applied to convert the natural language question into an SQL statement.

The Web question-answering retrieval system based on product information structuring of the present invention has the following advantages. 1) The invention adapts well to product information that changes on the Internet. The periodic information update and collection technique proposed by the system collects changes in online product information in a timely manner and obtains the latest online product information in real time; through the information extraction and integration modules, the existing structured product data in the database can be updated promptly or new structured product data can be added, so that the system adapts to changes in online product information. 2) The product information collected by the invention is more complete and more authentic. The invention collects product information from multiple information sources and, through the information extraction and integration modules, integrates the information about the same product from different websites, resolves contradictory information and lets different sources complement each other where information is missing, thereby ensuring the completeness and authenticity of the information. 3) The invention has high retrieval efficiency. Unlike a traditional information retrieval system, which returns web pages related to keywords, the invention, while still providing related web page information, applies a series of processing steps such as question classification and named entity recognition to the natural language question entered by the user in the question processing module, converts the question into a structured SQL statement, and finally uses that SQL statement to query the database and return an accurate and concise result to the user. The present invention is a convenient and practical Web question-answering retrieval system based on product information structuring; it is an advanced form of information retrieval that can answer questions posed by users in natural language in accurate and concise language.

Brief description of the drawings

Fig. 1 is an architecture diagram of the Web question answering system of the present invention;

Fig. 2 is an implementation diagram of the inverted index building module of the present invention;

Fig. 3 is an implementation diagram of the data integration module of the present invention;

Fig. 4 is an implementation diagram of the question processing module of the present invention;

Fig. 5 is an implementation diagram of question classification in the question processing module of the present invention;

Fig. 6 is an implementation diagram of named entity recognition in the question processing module of the present invention;

Fig. 7 is the graph structure of the Linear-CRF model, taking the named entity recognition task as an example.

Detailed description

Embodiment:

To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to a specific embodiment and the accompanying drawings.

Fig. 1 shows the architecture of the Web question answering system based on product information structuring according to the present invention.

Referring to Fig. 1, the Web question answering system of the present invention comprises a user interface, a question processing module, a database interface, a structured product information database, an information integration module, an information extraction module, a product information crawling module and an inverted index building module.

The user interface handles all communication between the Web question answering system and the user, including obtaining the product-related natural language question entered by the user and passing it to the question processing module, and returning the corresponding search results and related web pages to the user.

The product information crawling module crawls, at a fixed time interval, the web pages that present detailed information about digital products such as mobile phones and computers on large digital product websites such as pconline and Paopao ("Bubble"), stores the crawled pages and passes them to the information extraction module for processing.

The information extraction module processes the unstructured web page information in the pages crawled by the product information crawling module, such as the clock frequency and screen size of a mobile phone. It converts this unstructured information into structured information, connects to the structured product information database through the database interface and stores the processed structured information in the database.

The inverted index building module extracts the key content from the web pages crawled by the product information crawling module and builds an inverted index over these pages.

The database interface provides a unified interface for database operations such as accessing and updating the structured product data, together with access rights control.

The information integration module integrates the structured information from multiple data sources output by the information extraction module, connects to the database through the database interface and saves the integrated structured data in the database. The invention first derives an attribute mapping table from information such as the attribute values in the two tables to be processed; that is, it maps to each other the attribute names in the two tables that have the same meaning but may be named differently, which facilitates the subsequent integration. It then creates a target table according to the mapping table and rearranges the column names of the two tables into a consistent order. Whether two records in the two tables are comparable is determined by the primary key value that uniquely identifies a record: if the key values are equal, the records are considered comparable. Comparable records are merged, de-duplicated or otherwise processed, the result is inserted into the target table, and the corresponding records in the original tables are marked. Finally, the unmarked records are also inserted into the target table one by one, yielding an integrated target table. If there are more than two tables, two tables are processed at a time and the above method is repeated until the final result is obtained.

The question processing module converts the natural language question entered by the user into a structured SQL statement. This module connects to the user through the user interface to obtain the natural language question, connects to the structured product information database through the database interface, uses the SQL statement obtained from the conversion to query the database, and feeds the query result of the SQL statement back to the user through the user interface. The invention converts the natural language question in two steps: it first classifies the question using a trained Naive Bayes classifier, and then uses a skip-chain CRF model to recognize and extract the named entities in the question, such as mobile phone names and mobile phone attributes. The skip-chain CRF model is developed on the basis of the Linear CRF (linear conditional random field) model and is one kind of CRF (conditional random field) model. Earlier named entity recognition methods generally ignore the role that conjunctions such as "and" and "or" play in a sentence; the skip-chain CRF model establishes a link between the words immediately before and after a conjunction, which helps improve the final precision.

The present invention uses an algorithm based on similarity calculation to integrate information about the same product from different sources. The product information crawling module crawls information from multiple digital product websites so that the collected product information is as complete and rich as possible. However, different websites may use different names for the same attribute or give different attribute values, so information about the same product from different sources may be redundant or contradictory. The similarity-based algorithm adopted by the invention can effectively integrate such redundant or contradictory information, which ensures both the completeness of the data and a higher level of correctness.

The present invention uses question classification and named entity recognition to convert natural language questions into structured SQL statements. Classifying questions allows finer-grained processing: different conversion rules are applied to different classes of question, which improves the system's ability to understand natural language questions. Recognizing the named entities in a question identifies its subject or object; only when the subject and object of a question are correctly understood can the question be converted by applying the specific conversion rules. The question conversion algorithm adopted by the invention can convert natural language questions of many classes with high accuracy.
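The conversion from a classified question and its named entities to SQL can be illustrated with a minimal sketch. The text does not disclose the actual conversion rules or the database schema, so everything here is an assumption made for illustration: the `phone` table, its columns, the class labels `attribute` and `comparison`, the `ATTRIBUTE_COLUMNS` whitelist, the function `to_sql` and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical whitelist: attribute phrases the recognizer may emit -> column names.
ATTRIBUTE_COLUMNS = {"screen size": "screen_size", "cpu frequency": "cpu_frequency"}

def to_sql(question_class, product_names, attributes):
    """Build a parameterized SQL query from the class and named entities of a question."""
    columns = [ATTRIBUTE_COLUMNS[a] for a in attributes]  # KeyError on unknown attribute
    if question_class == "attribute":
        # Rule (assumed): return only the requested attributes of the named products.
        select = ", ".join(["name"] + columns)
    elif question_class == "comparison":
        # Rule (assumed): return all attributes so the products can be compared.
        select = "*"
    else:
        raise ValueError(f"no conversion rule for class {question_class!r}")
    placeholders = ", ".join("?" for _ in product_names)
    return f"SELECT {select} FROM phone WHERE name IN ({placeholders})", list(product_names)

# Illustrative in-memory database with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phone (name TEXT PRIMARY KEY, screen_size TEXT, cpu_frequency TEXT)")
conn.executemany(
    "INSERT INTO phone VALUES (?, ?, ?)",
    [("Nokia N8", "3.5 inch", "680 MHz"), ("Nokia 5230", "3.2 inch", "434 MHz")],
)

sql, params = to_sql("attribute", ["Nokia N8"], ["screen size"])
rows = conn.execute(sql, params).fetchall()
```

Product names are passed as query parameters rather than concatenated into the statement, and column names come only from the whitelist, so text typed by the user never becomes part of the SQL string itself.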

In summary, the main modules of the system are the question processing module, the data integration module and the inverted index building module. These three modules are described in further detail below with reference to the drawings.

Fig. 2 is an implementation diagram of the inverted index building module. Referring to Fig. 2, this module extracts the key content from the web pages crawled by the product information crawling module, builds an inverted index over these pages and stores it. The index construction process can be divided into three parts:

1) Preprocessing stage: Htmlparser is used to extract the key content of each web page and remove the noise information in the page, which improves the accuracy of later retrieval. The extracted data are used to construct Lucene Document objects and their corresponding Field objects.

2) Analysis stage: the data are passed to Lucene for indexing by calling the addDocument(Document) method of the index writer (IndexWriter). When indexing the data, Lucene first analyzes them to make them more suitable for indexing.

3) Index writing: after the input data have been analyzed, the result is written to the index file, and the input data are stored in the inverted index data structure.
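The inverted index data structure produced by these three stages can be sketched in a few lines. This is a simplified stand-in in plain Python, not the Lucene-based implementation described above: the tokenizer, the function names `tokenize`, `build_inverted_index` and `search`, and the sample pages are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in for a real analyzer)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_inverted_index(pages):
    """Map each term to the sorted list of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return the ids of pages that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical key content extracted from three crawled pages.
pages = {
    "p1": "Nokia N8 screen size 3.5 inches",
    "p2": "Nokia 5230 screen size 3.2 inches",
    "p3": "Laptop CPU frequency 2.4 GHz",
}
index = build_inverted_index(pages)
```

With this index, `search(index, "nokia screen")` looks up two posting lists and intersects them instead of scanning every page, which is the property that makes the structure suitable for returning related web pages quickly.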

Fig. 3 is an implementation diagram of the data integration module. Referring to Fig. 3, this module integrates the structured information from multiple data sources output by the information extraction module and stores the integrated structured data in the database. The module can be divided into two submodules:

1) Derive an attribute mapping table from information such as the attribute values in the two tables to be processed; that is, map to each other the attribute names in the two tables that have the same meaning but may be named differently, which facilitates the subsequent integration.

2) Create a target table according to the mapping table obtained in step 1 and rearrange the column names of the two tables into a consistent order. Whether two records in the two tables are comparable is determined by the primary key value that uniquely identifies a record: if the key values are equal, the records are considered comparable. Comparable records are merged, de-duplicated or otherwise processed, the result is inserted into the target table, and the corresponding records in the original tables are marked. Finally, the unmarked records are also inserted into the target table one by one, yielding an integrated target table.

If there are more than two tables, two tables are processed at a time, and steps 1 and 2 above are repeated until the final result is obtained.

The detailed steps for building the mapping table and integrating the information are as follows:

1) Steps for obtaining the mapping table:

1. Obtain the attribute value information of the two tables and store it in, for example, result1 and result2, respectively:

result1 = List<a_1, a_2, a_3, ..., a_m>, a_i = <a_{i1}, a_{i2}, a_{i3}, ..., a_{in}>, i = 1, 2, 3, ..., m,

where m is the number of attribute columns of the first table and n is the number of rows of those columns; that is, the columns of the first table are stored in a_1, a_2, a_3, ..., a_m, respectively. Similarly:

result2 = List<b_1, b_2, b_3, ..., b_m>, b_i = <b_{i1}, b_{i2}, b_{i3}, ..., b_{in}>, i = 1, 2, 3, ..., m.

2. Use the Chinese Academy of Sciences word segmentation tool imdict-chinese-analyzer to segment a_1, a_2, a_3, ..., a_m and b_1, b_2, b_3, ..., b_m, and store the results in result1SegmentFilter and result2SegmentFilter, respectively:

result1SegmentFilter = List<a'_1, a'_2, a'_3, ..., a'_m>, a'_i = <a'_{i1}, a'_{i2}, ..., a'_{ik}>

result2SegmentFilter = List<b'_1, b'_2, b'_3, ..., b'_m>, b'_i = <b'_{i1}, b'_{i2}, ..., b'_{ik}>

3. Convert each of a'_1, a'_2, a'_3, ..., a'_m and each of b'_1, b'_2, b'_3, ..., b'_m into a set, removing repeated values, and store the results in result1Set and result2Set:

result1Set = List<a''_1, a''_2, a''_3, ..., a''_m>, a''_i = <a''_{i1}, a''_{i2}, ..., a''_{il_i}>, where l_i is the number of words in a''_i

result2Set = List<b''_1, b''_2, b''_3, ..., b''_m>, b''_i = <b''_{i1}, b''_{i2}, ..., b''_{il'_i}>, where l'_i is the number of words in b''_i

4. Compute the pairwise similarity of the elements a''_i of result1Set and b''_j of result2Set:

A) If the numbers of words in a''_i and b''_j differ only slightly, compute the similarity of a''_i and b''_j directly using the similarity formula, in which the function same counts the words that a''_i and b''_j have in common. Store the result in M(i, j). For each i, find the value of j that maximizes M(i, j); this j is the column of the second table most likely to correspond to column i of the first table. If M(i, j) exceeds a given threshold, i and j are considered to correspond, and the pair is written to the mapping table.

B) If the numbers of words in a''_i and b''_j differ greatly, the one with more words is preprocessed: its word frequencies are counted, the words are sorted by frequency from high to low, and the list is truncated at a suitable position, giving new a''_i and b''_j. Then go to step A.
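The mapping procedure can be sketched as follows. The similarity formula itself is not reproduced in the text, so the measure used here (number of shared words divided by the size of the smaller word set) is an assumption, as are the threshold of 0.5, the size ratio of 2 that triggers truncation in step B, the function names and the sample tables. Word segmentation is assumed to have been done already, so each column is given as a list of tokens.

```python
from collections import Counter

def truncate_by_frequency(tokens, size):
    """Step B: keep only the `size` most frequent distinct words of a column."""
    return {word for word, _ in Counter(tokens).most_common(size)}

def similarity(a_words, b_words):
    """ASSUMED measure: shared words divided by the size of the smaller set."""
    if not a_words or not b_words:
        return 0.0
    return len(a_words & b_words) / min(len(a_words), len(b_words))

def map_attributes(table1, table2, threshold=0.5, max_ratio=2.0):
    """Return {column of table1: best matching column of table2} above the threshold."""
    mapping = {}
    for name1, tokens1 in table1.items():
        best_name, best_score = None, 0.0
        for name2, tokens2 in table2.items():
            a, b = set(tokens1), set(tokens2)
            # Step B: if the sizes differ a lot, truncate the larger set by word frequency.
            if len(a) > max_ratio * len(b):
                a = truncate_by_frequency(tokens1, len(b))
            elif len(b) > max_ratio * len(a):
                b = truncate_by_frequency(tokens2, len(a))
            score = similarity(a, b)  # plays the role of M(i, j) in step A
            if score > best_score:
                best_name, best_score = name2, score
        if best_name is not None and best_score > threshold:
            mapping[name1] = best_name
    return mapping

# Hypothetical segmented attribute values from two product websites.
table1 = {"screen": ["3.5", "inch", "3.2", "inch"], "cpu": ["680", "MHz", "434", "MHz"]}
table2 = {"display size": ["3.5", "inch", "4.0", "inch"], "processor": ["680", "MHz", "1", "GHz"]}
mapping = map_attributes(table1, table2)
```

The columns are matched on their values rather than on their names, which is why `screen` can be paired with `display size` even though the two names share no words.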

2) Steps for integrating the information in the two tables:

1. Using the primary key of the two tables (the attribute value that uniquely identifies a record), integrate the records whose primary keys are identical, for example by removing redundancy, completing missing information and eliminating conflicts. Mark the processed records in both tables, and insert each processed record into the target table.

2. After the loop over the key values of the first table has finished, find the records in the two tables that have not been marked and insert them into the target table. The integration of the two tables is then complete. To integrate more than two tables, apply the same method, treating the integrated table just obtained as an ordinary table to be integrated.
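The two integration steps can be sketched as below, under several assumptions: records are dictionaries, primary keys are unique within each table, the mapping table is given as a dictionary from the second table's column names to the first table's, and a conflict is resolved by keeping the first table's value. The text does not state how conflicts are judged, so that last rule is only a placeholder. The function name `integrate` and the sample records are hypothetical.

```python
def integrate(table1, table2, mapping, key):
    """Merge two lists of records into one target table, matching on the primary key."""
    # Rename the columns of table2 to the names used in table1.
    renamed = [{mapping.get(col, col): val for col, val in rec.items()} for rec in table2]
    by_key = {rec[key]: rec for rec in renamed}

    target, marked = [], set()
    # Step 1: merge records whose primary keys are identical and mark them.
    for rec in table1:
        other = by_key.get(rec[key])
        if other is None:
            continue
        merged = dict(rec)
        for col, val in other.items():
            if merged.get(col) in (None, ""):
                merged[col] = val  # supplement missing information
            # Otherwise the value from table1 is kept (ASSUMED conflict rule).
        target.append(merged)
        marked.add(rec[key])

    # Step 2: insert the unmarked records of both tables into the target table.
    for rec in table1 + renamed:
        if rec[key] not in marked:
            target.append(dict(rec))
    return target

# Hypothetical records from two sources; the first lacks the CPU value for Nokia N8.
t1 = [
    {"name": "Nokia N8", "screen": "3.5 inch", "cpu": ""},
    {"name": "Nokia 5230", "screen": "3.2 inch", "cpu": "434 MHz"},
]
t2 = [
    {"model": "Nokia N8", "display size": "3.5 inch", "processor": "680 MHz"},
    {"model": "Nokia 5233", "display size": "3.2 inch", "processor": "434 MHz"},
]
mapping = {"model": "name", "display size": "screen", "processor": "cpu"}
result = integrate(t1, t2, mapping, key="name")
```

To integrate a third table, the returned list can be passed back in as `table1`, which mirrors the instruction to treat the integrated table as an ordinary table to be integrated.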

Fig. 4 is an implementation diagram of the question processing module. Referring to Fig. 4, this module converts the natural language question entered by the user into a structured SQL statement. In this module, the conversion of a natural language question is divided into three steps: text preprocessing, question classification and named entity recognition. The text preprocessing step mainly performs word segmentation and part-of-speech tagging of the question. The question classification and named entity recognition steps are described in detail below.

Fig. 5 is an implementation diagram of question classification in the present invention. Referring to Fig. 5, the invention uses the Naive Bayes algorithm to classify natural language questions and selects the final class according to the maximum likelihood criterion. Suppose the set of classes is C = {C_1, C_2, ..., C_n} and the result of segmenting the input natural language question is X = {x_1, x_2, ..., x_m}, where x_j is a word in the question. Using the training data, the probability that the question belongs to each class is calculated with the following formula:

P(C_i | X) = \frac{P(x_1 | C_i) \times P(x_2 | C_i) \times \dots \times P(x_m | C_i)}{P(X)}, \quad 1 \le i \le n

Since P(X) is fixed for a given natural language question, only P(x_1 | C_i) \times P(x_2 | C_i) \times \dots \times P(x_m | C_i) needs to be calculated, and the class with the highest probability is selected as the final class.
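The classification rule can be sketched as follows. The sketch follows the formula as stated, scoring each class by the product of P(x_j | C_i) without a class prior. Two details are assumptions not given in the text: add-one smoothing, so that an unseen word does not force the product to zero, and summing logarithms instead of multiplying probabilities, which selects the same class. The class labels and training questions are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (tokens, label). Returns per-class word counts and the vocabulary."""
    counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        counts[label].update(tokens)
        vocab.update(tokens)
    return counts, vocab

def classify(tokens, counts, vocab):
    """Pick the class C_i that maximizes P(x_1|C_i) * ... * P(x_m|C_i)."""
    best_label, best_score = None, -math.inf
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        # Log of the product, with add-one smoothing (an assumption of this sketch).
        score = sum(
            math.log((word_counts[w] + 1) / (total + len(vocab))) for w in tokens
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical segmented training questions with hypothetical class labels.
samples = [
    ("what is the screen size of nokia n8".split(), "attribute"),
    ("how much memory does nokia 5230 have".split(), "attribute"),
    ("which is better nokia n8 or nokia 5230".split(), "comparison"),
    ("nokia n8 and nokia 5233 which is good".split(), "comparison"),
]
counts, vocab = train(samples)
label = classify("which is better nokia 5230 and nokia 5233".split(), counts, vocab)
```

Because P(X) is the same for every class, it is never computed; only the relative size of the per-class scores matters.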

Fig. 6 is an implementation diagram of named entity recognition in the present invention. Referring to Fig. 6, the invention uses a CRF model with a skip-chain structure to recognize the named entities in a question. This model is the key to the conversion of natural language questions in the present invention; its structure, principle and advantages are therefore described in detail below.

By observing and analyzing a large number of product-related questions, we found that two or more named entity names often appear in the same question, for example "Nokia 5230 and Nokia N8, which is better?" or "Nokia 5230 and Nokia 5233, which is good?", and that in such questions the entity names are usually joined by a conjunction such as "and" or "or". This gives rise to the following phenomenon: if the word before a conjunction is judged to be an entity name, the word after the conjunction is very likely to be an entity name of the same kind. This phenomenon has been mentioned in earlier work, but no good solution has been proposed. The present invention constructs a skip-chain CRF model that connects the words before and after a conjunction, so that this information is taken into account during the decision process, which helps improve the accuracy of named entity recognition. In Fig. 6, the label T1 denotes the front part of an entity word, T2 denotes the rear part of an entity word, and O denotes other words.

Skip-chain CRF is a kind of special CRF model.CRF is a kind of non-directed graph model, and it carries out modeling to the conditional probability distribution of sequence mark on given characteristic set basis.Take the most basic Linear-CRF as example, and under the condition of given observation sequence, the conditional probability of flag sequence can formalized description be following form:

P(Y \mid X) = \frac{1}{Z(X)} \prod_{i=1}^{I} \psi_{i}(y_{i}, y_{i-1}, X)

Here ψ_i is the potential function of the undirected graphical model, and

Z(X) = \sum_{Y'} \prod_{i=1}^{I} \psi_{i}(y'_{i}, y'_{i-1}, X)

is the normalization factor, summed over all possible label sequences of length I. The potential function ψ_i can be decomposed into the following form, where the f_k are the defined feature functions.

\psi_{i}(y_{i}, y_{i-1}, X) = \exp\left\{ \sum_{k} \lambda_{k} f_{k}(y_{i}, y_{i-1}, X, i) \right\}

The corresponding graphical model structure is shown in Fig. 7. Taking the named entity recognition task as an example, the preprocessed text is taken as input and the corresponding Linear-CRF model is built. Linear-CRF models the conditional probability of the label sequence directly. Unlike directed graphical models such as the HMM (hidden Markov model), it requires no independence assumptions among features and can therefore incorporate rich features. It can also be viewed as a globally normalized MEMM (maximum entropy Markov model), which avoids the label bias problem of the MEMM. Linear-CRF therefore performs well on sequence labeling problems such as named entity recognition.
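The Linear-CRF probability can be made concrete with a brute-force Python sketch. The three feature functions and their weights are invented for illustration and are not the features used by the invention. Z(X) is computed by enumerating every label sequence, which is feasible only for very short inputs.

```python
import itertools
import math

LABELS = ["T1", "T2", "O"]


# Hypothetical feature functions f_k(y_i, y_{i-1}, X, i); not taken from the source.
def f_capital_t1(y, y_prev, x, i):
    return 1.0 if x[i][:1].isupper() and y == "T1" else 0.0


def f_digits_after_t1(y, y_prev, x, i):
    return 1.0 if x[i].isdigit() and y == "T2" and y_prev == "T1" else 0.0


def f_lowercase_other(y, y_prev, x, i):
    return 1.0 if x[i].islower() and y == "O" else 0.0


# (lambda_k, f_k) pairs with hand-picked weights instead of trained ones.
WEIGHTED_FEATURES = [
    (2.0, f_capital_t1),
    (2.0, f_digits_after_t1),
    (1.0, f_lowercase_other),
]


def psi(y, y_prev, x, i):
    """Potential psi_i = exp(sum_k lambda_k * f_k(y_i, y_{i-1}, X, i))."""
    return math.exp(sum(w * f(y, y_prev, x, i) for w, f in WEIGHTED_FEATURES))


def unnormalized(y_seq, x):
    """Product of the potentials along the chain; y_prev is None at position 0."""
    total = 1.0
    for i, y in enumerate(y_seq):
        total *= psi(y, y_seq[i - 1] if i > 0 else None, x, i)
    return total


def probability(y_seq, x):
    """P(Y | X), with Z(X) computed by enumerating every label sequence."""
    z = sum(unnormalized(c, x) for c in itertools.product(LABELS, repeat=len(x)))
    return unnormalized(tuple(y_seq), x) / z


x = ["Nokia", "5230", "good"]
best = max(itertools.product(LABELS, repeat=len(x)), key=lambda c: unnormalized(c, x))
print(best, round(probability(best, x), 3))
```

A practical implementation would replace the enumeration with the forward-backward recursion, but the enumeration keeps the correspondence with the formulas easy to check.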

Skip-chain CRF is a CRF model that improves on Linear-CRF. As the graphical model of the skip-chain CRF in Fig. 6 shows, in addition to the linear-chain edges between adjacent nodes of a Linear-CRF, it introduces a skip-chain edge between the two words before and after a conjunction, thereby adding to Linear-CRF a dependency between the labels of the words on either side of the conjunction.

The formalized description of Skip-chain CRF is as follows:

P(Y \mid X) = \frac{1}{Z(X)} \prod_{i=1}^{I} \Psi_{i}(y_{i}, y_{i-1}, X) \prod_{(j, j+2) \in S} \phi_{j, j+2}(y_{j}, y_{j+2}, X)

Here Ψ_i is the potential function defined on adjacent label nodes, φ_{j,j+2} is the potential function defined on a skip-chain, and S = {(j, j+2)} is the set of all skip-chains, one for each pair of words separated by a conjunction. They are defined as follows:

\Psi_{i}(y_{i}, y_{i-1}, X) = \exp\left\{ \sum_{k} \lambda_{k} f_{k}(y_{i}, y_{i-1}, X, i) \right\}

\phi_{j, j+2}(y_{j}, y_{j+2}, X) = \exp\left\{ \sum_{l} \eta_{l} f_{l}(y_{j}, y_{j+2}, X, j, j+2) \right\}

Here f_k(y_i, y_{i-1}, X, i) is a feature function defined on the linear chain, and f_l(y_j, y_{j+2}, X, j, j+2) is a feature function defined on a skip-chain.
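The effect of the skip-chain potential can be illustrated with a small brute-force Python sketch. All features, the weight `eta`, and the example question are assumptions chosen so that the word after the conjunction carries no linear-chain evidence of its own. The sketch compares the marginal probability that this word is an entity part with and without the skip-chain.

```python
import itertools
import math

LABELS = ["T1", "T2", "O"]
ENTITY = {"T1", "T2"}


def linear_potential(y, y_prev, x, i):
    """Psi_i on adjacent labels; the features and weights are illustrative."""
    s = 0.0
    if x[i][:1].isupper() and y == "T1":
        s += 2.0
    if x[i].isdigit() and y == "T2" and y_prev == "T1":
        s += 2.0
    if x[i] in ("and", "or") and y == "O":
        s += 2.0
    return math.exp(s)


def skip_potential(y_j, y_j2, eta=1.5):
    """phi_{j,j+2}: one assumed feature, fired when both labels are entity parts."""
    return math.exp(eta if (y_j in ENTITY and y_j2 in ENTITY) else 0.0)


def unnormalized(y_seq, x, skip_edges):
    """Product of all linear-chain potentials and all skip-chain potentials."""
    total = 1.0
    for i, y in enumerate(y_seq):
        total *= linear_potential(y, y_seq[i - 1] if i > 0 else None, x, i)
    for j, j2 in skip_edges:
        total *= skip_potential(y_seq[j], y_seq[j2])
    return total


def entity_marginal(x, skip_edges, pos):
    """P(y_pos is T1 or T2 | X) by enumerating all label sequences."""
    z = hit = 0.0
    for y_seq in itertools.product(LABELS, repeat=len(x)):
        w = unnormalized(y_seq, x, skip_edges)
        z += w
        if y_seq[pos] in ENTITY:
            hit += w
    return hit / z


x = ["Nokia", "5230", "or", "n8"]  # "n8" triggers none of the linear features
S = [(1, 3)]                        # skip-chain across the conjunction at index 2
p_linear = entity_marginal(x, [], 3)
p_skip = entity_marginal(x, S, 3)
print(round(p_linear, 3), round(p_skip, 3))
```

With these assumed potentials the linear-chain model is indifferent about the last word, while the skip-chain shifts probability toward an entity label because the word before the conjunction is itself likely to be an entity part.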

During model training, the present invention uses the L-BFGS algorithm to train the skip-chain CRF model and learn the model parameters λ_k and η_l.
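The source does not state the training objective. The usual choice for a CRF, assumed here, is the conditional log-likelihood of N labelled training questions (X^{(n)}, Y^{(n)}), which L-BFGS maximizes using its gradient. Any regularization term is omitted because the source does not mention one.

```latex
\mathcal{L}(\lambda, \eta) = \sum_{n=1}^{N} \log P\left(Y^{(n)} \mid X^{(n)}\right)

\frac{\partial \mathcal{L}}{\partial \lambda_{k}}
  = \sum_{n=1}^{N} \left[ \sum_{i} f_{k}\left(y^{(n)}_{i}, y^{(n)}_{i-1}, X^{(n)}, i\right)
  - \mathbb{E}_{Y' \sim P(\cdot \mid X^{(n)})} \sum_{i} f_{k}\left(y'_{i}, y'_{i-1}, X^{(n)}, i\right) \right]

\frac{\partial \mathcal{L}}{\partial \eta_{l}}
  = \sum_{n=1}^{N} \left[ \sum_{(j, j+2) \in S^{(n)}} f_{l}\left(y^{(n)}_{j}, y^{(n)}_{j+2}, X^{(n)}, j, j+2\right)
  - \mathbb{E}_{Y' \sim P(\cdot \mid X^{(n)})} \sum_{(j, j+2) \in S^{(n)}} f_{l}\left(y'_{j}, y'_{j+2}, X^{(n)}, j, j+2\right) \right]
```

Each gradient component is the observed count of a feature minus its expected count under the current model, so the gradient vanishes when the two agree.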

The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A Web question-answering retrieval system based on product information structuring, characterized by comprising a user interface, a product information crawling module, an information extraction module, an inverted index building module, a database interface, an information integration module, a question processing module, and a database, wherein

2. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the question processing module converts the natural-language question in two steps: it first classifies the natural-language question with a trained naive Bayes classifier, and then uses a skip-chain CRF model to recognize and extract the named entities in the natural-language question.

3. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the named entities are mobile phone names and mobile phone attributes.

4. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the skip-chain CRF model is developed on the basis of the linear-chain conditional random field (Linear-CRF) model and is one kind of conditional random field (CRF) model.

5. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that, in the named entity recognition method, the role of the conjunctions "and" and "or" within the sentence is disregarded and a link is established in the skip-chain CRF model between the two words before and after the conjunction, which helps improve the final accuracy; for the recognition model used to extract named entities from query questions, the skip-chain CRF model is trained on a training set to obtain the named entity recognition and judgment criteria for product information, and the question is then converted into keywords and product attributes that are meaningful for retrieval.

6. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the information integration module first derives an attribute mapping table from the attribute value information in the two tables to be processed, that is, it maps to each other the attribute names in the two tables that have the same meaning but may be named differently, to facilitate the subsequent integration; it then creates a target table according to the mapping table, reorders the column names of the two tables accordingly, and uses the primary key value that uniquely identifies a record to decide whether corresponding records in the two tables are comparable, the records being considered comparable if their primary key values are equal; if they are comparable, the information in the two records is merged or de-duplicated, the result is inserted into the target table, and the corresponding records in the original tables are marked; finally, the unmarked records are inserted into the target table one by one, yielding an integrated target table; if there are more than two tables, two tables are processed at a time and the above method is repeated to obtain the final result.

7. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the product information crawling module is used to crawl, at fixed time intervals, the web pages presenting digital product details on large digital-product websites such as pconline and PCPOP, to store the crawled pages, and to pass them to the information extraction module for processing.

8. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the question processing module is used to convert the natural-language question entered by the user into a structured SQL statement; the module is connected to the user through the user interface to obtain the natural-language question entered by the user, and is connected to the structured product information database through the database interface; it queries the database with the SQL statement obtained from the conversion and returns the query result to the user through the user interface.

9. The Web question-answering retrieval system based on product information structuring according to claim 1, characterized in that the analysis system for unstructured and semi-structured product information integrates information on the same product from multiple sources to ensure that the information is true and complete; at the same time, a classification algorithm and a named entity recognition algorithm are used to convert the natural-language question into a structured database query statement; for the fine-grained sentiment analysis system for product review information, an algorithm based on instance similarity is used to integrate information on the same product from different sources.

10. The Web question-answering retrieval system based on product information structuring according to claim 9, characterized in that the algorithm based on instance similarity integrates information in two steps, mapping and merging: in the mapping step, the algorithm based on instance similarity computes the similarity between corresponding elements of the two tables; in the merging step, the two tables are merged according to the result of the previous step; for the fine-grained sentiment analysis system for product review information, the question is first classified, a recognition model is then built to extract the named entities in the question, and finally, according to the results of the first two steps, corresponding rules are applied to convert the natural-language question into an SQL statement.
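The two-table integration procedure recited in claims 6 and 10 (mapping, then merging on the primary key) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: tables are lists of dictionaries, attribute similarity is approximated by counting shared values, table A wins on conflicting values, and the function names and sample rows are invented for illustration.

```python
def build_attribute_map(table_a, table_b):
    """Map each column of table_b to the table_a column whose values overlap most.

    Value overlap is a stand-in similarity measure chosen for this sketch.
    """
    cols_a = sorted({c for row in table_a for c in row})
    cols_b = sorted({c for row in table_b for c in row})
    mapping = {}
    for cb in cols_b:
        vals_b = {row[cb] for row in table_b if cb in row}
        best, best_overlap = cb, 0  # unmatched columns keep their own name
        for ca in cols_a:
            overlap = len(vals_b & {row[ca] for row in table_a if ca in row})
            if overlap > best_overlap:
                best, best_overlap = ca, overlap
        mapping[cb] = best
    return mapping


def integrate(table_a, table_b, key):
    """Merge two tables (lists of dicts) on a primary key named as in table_a."""
    mapping = build_attribute_map(table_a, table_b)
    renamed_b = [{mapping[c]: v for c, v in row.items()} for row in table_b]
    marked = set()  # indices of table_b records already merged
    target = []
    for row_a in table_a:
        merged = dict(row_a)
        for idx, row_b in enumerate(renamed_b):
            if idx not in marked and key in row_b and row_b[key] == row_a.get(key):
                for col, val in row_b.items():
                    merged.setdefault(col, val)  # assumed rule: table_a wins on conflict
                marked.add(idx)
        target.append(merged)
    # Finally insert the unmarked records one by one.
    target.extend(row for idx, row in enumerate(renamed_b) if idx not in marked)
    return target


# Illustrative rows standing in for the same products listed on two websites.
site_a = [{"model": "Nokia 5230", "screen": "3.2 inch"},
          {"model": "Nokia N8", "screen": "3.5 inch"}]
site_b = [{"name": "Nokia 5230", "os": "Symbian"},
          {"name": "Nokia X6", "os": "Symbian"}]

for record in integrate(site_a, site_b, key="model"):
    print(record)
```

For more than two tables, the same function can be applied repeatedly, feeding the integrated result and the next table back in, which mirrors the pairwise processing recited in claim 6.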