CN103886034B

CN103886034B - Method and apparatus for establishing an index and matching a user's query input information

Info

Publication number: CN103886034B
Application number: CN201410079818.7A
Authority: CN
Inventors: 方高林
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2014-03-05
Filing date: 2014-03-05
Publication date: 2019-03-19
Anticipated expiration: 2034-03-05
Also published as: CN103886034A

Abstract

The object of the present invention is to provide a method and apparatus for establishing an index and matching a user's query input information. Structured information is determined from text information, and a topic word is extracted from it; label words corresponding to the topic of the topic word are determined; and an index is established for the topic word and the label words. Further, the query input information entered by a user is analyzed to obtain topic words and label words, which are matched against the established index to obtain candidate text information; the target text information matching the query input information is then determined according to the semantic matching degree between the candidate text information and the query input information. Compared with the prior art, the invention extracts topics and labels from encyclopedia-type or other Internet resource knowledge to form an effective description of its content, which makes semantic search over such knowledge more efficient, satisfies complex descriptive search needs that users cannot express accurately with keywords, and improves the user experience.

Description

Method and apparatus for establishing an index and matching a user's query input information

Technical field

The present invention relates to the field of computer technology, and in particular to a technique for establishing an index and matching a user's query input information.

Background art

When using a search engine, people often do not know which keywords would express what they have in mind, and instead enter a string of descriptive words and phrases, for example: 1) Vomiting on getting up in the morning, frequent palpitation and shortness of breath, weakness of the limbs: what disease are these symptoms of? 2) Songs that express longing for a lover? 3) A song that contains "saying that forgets wealth and rank"; 4) Which film is "eating hot pot and singing songs" from, and who said it? 5) Verses that describe studying hard; 6) Who said "it is hard to be a person, and hard to be a woman", and what is the complete saying? Some users may also enter queries with complex sentence structure, for example about categories of people: "Which emperors and presidents came from Anhui?" or "Introduction to the members of the current Politburo Standing Committee from Shanxi". In such cases it is difficult for a search engine to return suitable results.

The reason is that the search engines in general use today mainly index titles. Although these engines generally also index content, factors such as weight tuning make it difficult for some high-quality descriptive knowledge to surface. For example, for resources such as songs and films, existing search engines usually index only the song title and the film name; when a user cannot recall the title but remembers only some of the lyrics, a line of dialogue, the synopsis or a fragment of a description, existing search engines cannot perform an effective search. The same applies to other categories of resources such as novels, poems, couplets, blessings, people, TV series, sentences, idioms and diseases.

Encyclopedia-type resource knowledge is usually indexed around the entry word, so that in a general-purpose ranking algorithm, content matched by keywords that do not appear in the title is difficult to rank near the top. In fact, because encyclopedia-type resource knowledge is authoritative, ranking such data near the top would satisfy users' needs well. For example, if the symptoms of the diseases in an encyclopedia are labeled and indexed, the corresponding resource knowledge can readily be provided to a user on the basis of the symptoms the user describes.

Therefore, how to make effective use of existing resource knowledge, establish an index for it, and match and obtain the target text information corresponding to a user's query input information has become one of the problems that those skilled in the art most urgently need to solve.

Summary of the invention

The object of the present invention is to provide a method and apparatus for establishing an index and matching a user's query input information.

According to one aspect of the invention, there is provided a method for establishing an index based on text information, wherein the method comprises the following steps:

a. determining structured information from the text information;

b. extracting a topic word from the structured information;

c. determining, from the text information, label words corresponding to the topic to which the topic word corresponds;

d. establishing an index for the topic word and the label words.

According to another aspect of the invention, there is also provided a method for matching a user's query input information against the index established as described above, wherein the method comprises the following steps:

a. obtaining the query input information entered by the user;

b. performing topic and label analysis on the query input information to obtain the topic words and label words corresponding to the query input information;

c. performing a matching query in the established index according to the topic words and label words, to obtain candidate text information matching the query input information;

d. determining the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information.

According to another aspect of the invention, there is also provided an index establishing apparatus for establishing an index based on text information, wherein the apparatus comprises:

an information determination device, for determining structured information from the text information;

a topic extraction device, for extracting a topic word from the structured information;

a label determination device, for determining, from the text information, label words corresponding to the topic to which the topic word corresponds;

an index establishing device, for establishing an index for the topic word and the label words.

According to a further aspect of the invention, there is also provided a matching apparatus for matching a user's query input information against the index established as described above, wherein the apparatus comprises:

a query acquisition device, for obtaining the query input information entered by the user;

an information analysis device, for performing topic and label analysis on the query input information to obtain the topic words and label words corresponding to the query input information;

a matching query device, for performing a matching query in the established index according to the topic words and label words, to obtain candidate text information matching the query input information;

a text determination device, for determining the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information.

According to a further aspect of the invention, there is also provided a system for establishing an index and matching a user's query input information, comprising the index establishing apparatus and the matching apparatus described above.

Compared with the prior art, the invention determines structured information from text information; extracts a topic word from the structured information; determines, from the text information, label words corresponding to the topic to which the topic word corresponds; and establishes an index for the topic word and the label words. Further, the invention obtains the query input information entered by a user; performs topic and label analysis on it to obtain the corresponding topic words and label words; performs a matching query in the established index according to those topic words and label words to obtain candidate text information matching the query input information; and determines the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information.

On the basis of encyclopedia-type resource knowledge or other resource knowledge mined from the web, the invention extracts topics and labels from it to form an effective description of its content, so that such high-quality resource knowledge is presented better, semantic search over it becomes more efficient, complex descriptive search needs that users cannot express accurately with keywords are satisfied, and the user experience is improved.

Brief description of the drawings

Other features, objects and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

Fig. 1 is a schematic diagram of an apparatus for establishing an index based on text information according to one aspect of the invention;

Fig. 2 is a schematic diagram of an apparatus for establishing an index based on text information according to a preferred embodiment of the invention;

Fig. 3 is a schematic diagram of an apparatus for matching a user's query input information according to another aspect of the invention;

Fig. 4 is a flowchart of a method for establishing an index based on text information according to another aspect of the invention;

Fig. 5 is a flowchart of a method for establishing an index based on text information according to another aspect of the invention.

The same or similar reference numerals in the drawings denote the same or similar components.

Detailed description of the embodiments

The invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of an apparatus for establishing an index based on text information according to one aspect of the invention. The index establishing apparatus 1 includes an information determination device 101, a topic extraction device 102, a label determination device 103 and an index establishing device 104.

The information determination device 101 determines structured information from text information. Specifically, the information determination device 101 obtains text information, for example by interacting with a data source such as encyclopedia data, and then structures that text information, for example by analyzing the directory and subdirectory information it contains, so as to determine the structured information.

For example, the information determination device 101 interacts with encyclopedia data such as Baidu Baike (Baidupedia) or Hudong Baike (Interactive Encyclopedia) to obtain encyclopedia-type resource knowledge as the text information. It then structures the text information, for example by analyzing the directories and subdirectories of each item of resource knowledge: for resource knowledge about a "disease", it identifies the directory or subdirectory corresponding to the symptoms, the directory or subdirectory corresponding to the treatment methods, and so on.

For another example, the information determination device 101 mines resource knowledge from the Internet by data mining, uses it as the text information, and then structures the text information to determine the structured information. For example, by mining vertical resource websites, the information determination device 101 obtains diseases together with information such as their symptom descriptions, treatment methods and specialist hospitals. Each resource is organized with the disease as its ID. To do so, some candidate seed words are first supplied for each category (for diseases: coronary heart disease, myocarditis, gastritis and so on); the URLs of commonly top-ranked websites are obtained from the search results for these seed words, and the structure of each website is analyzed to extract coronary heart disease, its symptoms, its treatment methods and its specialist hospitals. This information is integrated under the "disease" coronary heart disease, organized into a card for coronary heart disease, and stored. "Coronary heart disease" then serves as the final text information, and the corresponding information, such as "symptoms of coronary heart disease", "treatment methods for coronary heart disease" and "specialist hospitals for coronary heart disease", serves as the structured information of that text information.
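The organization of mined information under one entry can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the class `EntryCard`, the helper `build_card`, the section names and the placeholder texts are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class EntryCard:
    """One resource item (e.g. a disease) with its structured sections."""
    entry_id: str   # hypothetical identifier such as "ID1"
    title: str      # the text information, e.g. "coronary heart disease"
    sections: dict = field(default_factory=dict)  # section name -> list of texts


def build_card(entry_id, title, mined_pairs):
    """Group mined (section_name, text) pairs under a single entry."""
    card = EntryCard(entry_id=entry_id, title=title)
    for name, text in mined_pairs:
        # Several pages may contribute to the same section; keep all of them.
        card.sections.setdefault(name, []).append(text)
    return card


card = build_card("ID1", "coronary heart disease", [
    ("symptoms", "palpitation and shortness of breath"),
    ("treatment", "(placeholder treatment text)"),
    ("symptoms", "weakness of the limbs"),
])
```

Here the entry title plays the role of the text information and `card.sections` plays the role of the structured information.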

Those skilled in the art will understand that the above ways of determining structured information are merely examples; other existing or future ways of determining structured information, where applicable to the invention, also fall within its scope of protection and are incorporated herein by reference.

The topic extraction device 102 extracts a topic word from the structured information. Specifically, on the basis of the structured information determined by the information determination device 101, the topic extraction device 102 extracts a topic word from the structured information, for example by means of a topic classifier or another predetermined way of extracting topic words.

Here, the purpose of extracting a topic word is to extract from the text information the topic that represents it, in support of building the semantic index and the subsequent semantic matching computation.

Preferably, the index establishing apparatus 1 further includes a topic training device (not shown), which obtains a training corpus corresponding to a predetermined topic system and trains a topic classifier on that corpus; the topic extraction device 102 then extracts the topic word from the structured information using the topic classifier.

Specifically, the topic training device determines the predetermined topic system. For example, it determines the common search needs of web search users from statistics over a large number of query sequences entered by those users and, combining these with classification systems in common use today, such as the existing systems of encyclopedia and "Zhidao" (Q&A) products, determines a topic classification system for which there is real demand and takes it as the predetermined topic system. The topic training device then obtains a training corpus corresponding to the predetermined topic system; for example, if an article carries the location identifier "medical & health, internal medicine", the data is treated as training corpus for the disease category. Finally, the topic training device trains the topic classifier on the training corpus, for example by training an SVM classification model to serve as the topic classifier.

The topic extraction device 102 then extracts the topic word from the structured information using the topic classifier trained by the topic training device. For example, the topic extraction device 102 feeds the entry "coronary heart disease" together with structured information such as its symptoms and treatment methods into the topic classifier and obtains the topic "disease". Likewise, for a new encyclopedia card, the topic extraction device 102 feeds it into the topic classifier, such as an SVM classifier, to obtain the topic corresponding to the category of that card.

Preferably, the topic extraction device 102 can also expand the extracted topic with synonymous expressions, for example expanding the topic "disease" by adding a synonymous topic such as "illness".

Those skilled in the art will understand that the above ways of extracting topic words are merely examples; other existing or future ways of extracting topic words, where applicable to the invention, also fall within its scope of protection and are incorporated herein by reference.

The label determination device 103 determines, from the text information, label words corresponding to the topic to which the topic word corresponds. Specifically, on the basis of the topic word extracted by the topic extraction device 102 and the topic to which it corresponds, the label determination device 103 determines from the text information the label words corresponding to that topic. For example, for text information whose topic is disease, the label determination device 103 determines label words such as: palpitation and shortness of breath, chest tightness, diarrhea, vomiting, weakness of the limbs and so on.

Preferably, the label determination device 103 includes a candidate determination unit (not shown), a center word determination unit (not shown) and a label determination unit (not shown). Specifically, the candidate determination unit determines, from the text information, at least one candidate label word corresponding to the topic to which the topic word corresponds. For example, the candidate determination unit computes unigram, bigram and trigram statistics over all page data organized by entry, and extracts the words that appear in more than a certain number of pages as candidate label words.
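The n-gram counting described above might be sketched as follows. This is an illustration only: it assumes that pages have already been split into word tokens (Chinese text would first need word segmentation), and the names `ngrams`, `candidate_labels` and `min_pages`, the toy pages, and the use of "at least `min_pages` pages" as the threshold are assumptions rather than details from the patent.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield all unigrams, bigrams and trigrams of a token list."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def candidate_labels(pages, min_pages=2, max_n=3):
    """Return the n-grams that occur in at least `min_pages` pages."""
    page_freq = Counter()
    for tokens in pages:
        # Use a set so that each page counts an n-gram at most once.
        page_freq.update(set(ngrams(tokens, max_n)))
    return {gram for gram, count in page_freq.items() if count >= min_pages}


pages = [
    ["palpitation", "chest", "tightness"],
    ["chest", "tightness", "vomiting"],
    ["vomiting", "diarrhea"],
]
labels = candidate_labels(pages, min_pages=2)
```

With these toy pages, only "chest", "tightness", "vomiting" and the bigram "chest tightness" reach the page threshold.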

The center word determination unit then determines the corresponding center word from the at least one candidate label word. Next, the label determination unit determines the label words corresponding to the topic according to the distance between each of the at least one candidate label word and the center word.

For example, the center word determination unit merges all candidate label words from the label data counted earlier and computes offline statistics over them. The statistics are computed as follows: over a large-scale text collection, such as whole-web data, the frequency with which candidate label words co-occur in the same document is counted. For any two candidate label words, the similarity between them is calculated according to the following formula:

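[Similarity formula not reproduced in the source text.]

Here, PMI(w', w₁) denotes the mutual information score between w' and w₁, defined in its standard form as

$$\mathrm{PMI}(w', w_1) = \log \frac{P(w', w_1)}{P(w')\,P(w_1)},$$

where P(w) denotes the probability of word w estimated from the statistics and P(w', w₁) the probability that the two words co-occur.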
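As a rough illustration of the co-occurrence statistics described above, the sketch below estimates word and pair probabilities from document frequencies and computes pointwise mutual information in its standard form, log(P(a, b) / (P(a) P(b))). It is not the similarity formula of the invention; the function name `pmi_table` and the toy documents are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """PMI for every pair of words that co-occur in at least one document."""
    n = len(docs)
    word_df = Counter()   # number of documents containing each word
    pair_df = Counter()   # number of documents containing both words of a pair
    for doc in docs:
        words = sorted(set(doc))
        word_df.update(words)
        pair_df.update(combinations(words, 2))  # pairs in alphabetical order
    table = {}
    for (a, b), count in pair_df.items():
        p_ab = count / n
        table[(a, b)] = math.log(p_ab / ((word_df[a] / n) * (word_df[b] / n)))
    return table


docs = [
    {"vomiting", "nausea"},
    {"vomiting", "nausea"},
    {"vomiting", "diarrhea"},
    {"palpitation"},
]
pmi = pmi_table(docs)
```

Pairs that never co-occur are simply absent from the table; a real system would need a smoothing or default policy for them.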

Next, according to the topic, the center word determination unit determines which fields of the text information need to be analyzed, for example the symptom section for a disease, the poem itself and its annotations for a poem, or the description section for a person. It then extracts from those fields all words that appear among the candidate label words, together with their synonyms, and groups these words into a center, which serves as the center word corresponding to the at least one candidate label word.

The label determination unit then calculates the distance between each of the at least one candidate label word and the center word. For example, letting T denote the center word, the distance between a candidate label word and the center word can be obtained from the following formula:

[Distance formula not reproduced in the source text.]

Here, Num(T) denotes the number of words contained in the center word.

The label determination unit then determines the label words corresponding to the topic according to the distance between each of the at least one candidate label word and the center word, for example by taking the candidate label words whose distance from the center word is less than a predetermined threshold as the label words corresponding to the topic.

Preferably, as shown in Fig. 3, the label determination unit arranges the candidate label words as a sequence of rank against distance from the center word; if the slope of the change between consecutive ranks exceeds a predetermined slope threshold, the subsequent nodes are truncated, as happens between the 5th and 6th ranked points in Fig. 3.

Here, the slope threshold is set empirically, for example from statistics on the overall distribution of the scores.
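The slope-based truncation can be sketched as below. This is a minimal sketch under assumptions: the slope is taken as the difference in distance between consecutive ranks, the candidates are already sorted by ascending distance, and the function name `truncate_by_slope`, the labels `w1` to `w7` and all numeric values are invented for illustration.

```python
def truncate_by_slope(ranked, slope_threshold):
    """Keep ranked (label, distance) pairs up to the first steep jump.

    `ranked` must be sorted by ascending distance from the center word.
    """
    kept = ranked[:1]
    for previous, current in zip(ranked, ranked[1:]):
        # Slope between consecutive ranks: change in distance per rank step.
        if current[1] - previous[1] > slope_threshold:
            break  # everything from `current` onward is cut off
        kept.append(current)
    return kept


ranked = [
    ("w1", 0.10), ("w2", 0.12), ("w3", 0.15), ("w4", 0.17),
    ("w5", 0.20), ("w6", 0.55), ("w7", 0.60),
]
kept = truncate_by_slope(ranked, slope_threshold=0.2)
```

In this toy sequence the jump between the 5th and 6th points exceeds the threshold, so only the first five candidates are retained.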

Those skilled in the art will understand that the above ways of determining label words are merely examples; other existing or future ways of determining label words, where applicable to the invention, also fall within its scope of protection and are incorporated herein by reference.

More preferably, the center word determination unit filters the at least one candidate label word according to a predetermined filtering rule to obtain at least one filtered candidate label word, and determines the center word from the at least one filtered candidate label word; the predetermined filtering rule is determined on the basis of at least any one of the following:

the part of speech of the at least one candidate label word;

the word-formation rules of the at least one candidate label word;

the co-occurrence ratio of the at least one candidate label word with the topic.

Specifically, noise may be introduced while the candidate label words are being counted, so the candidate label words need to be filtered. The center word determination unit filters the at least one candidate label word according to a predetermined filtering rule to obtain at least one filtered candidate label word.

For example, the center word determination unit filters the at least one candidate label word according to part of speech, e.g., by filtering on the part of speech of the first and last words of each candidate label word.

For another example, the center word determination unit filters the at least one candidate label word according to word-formation rules, e.g., the first character of a candidate label word is unlikely to be a word such as " ", "doing", "quilt" or "ratio", and its last character is unlikely to be a word such as "when", "arriving" or "obtaining".

For another example, the center word determination unit filters the at least one candidate label word according to its co-occurrence ratio with the topic, e.g., the center word determination unit counts, in search logs and in whole-web titles, the co-occurrence ratio of each candidate label word with the topic, and either retains only the candidates that have co-occurred with the topic or retains the candidates whose co-occurrence ratio with the topic is greater than a predetermined threshold.

Preferably, the center word determination unit filters the at least one candidate label word using any two of the above predetermined filtering rules in combination, or all three together.

The center word determination unit then determines the center word from the at least one filtered candidate label word.
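Two of the filtering rules above (word-formation and co-occurrence ratio) can be combined as in the following sketch; the part-of-speech rule is left out because it would need a tagger. Everything here is assumed for illustration: the function names, the English stand-in word lists, and in particular the definition of the co-occurrence ratio as the share of titles containing the candidate that also contain the topic word, which the text does not spell out.

```python
def passes_word_rule(label, bad_first, bad_last):
    """Reject candidates that start or end with a disallowed word."""
    words = label.split()
    return words[0] not in bad_first and words[-1] not in bad_last


def cooccurrence_ratio(label, topic, titles):
    """Share of titles containing `label` that also contain `topic` (assumed definition)."""
    with_label = [title for title in titles if label in title]
    if not with_label:
        return 0.0
    return sum(topic in title for title in with_label) / len(with_label)


def filter_candidates(candidates, topic, titles, bad_first, bad_last, min_ratio):
    """Apply the word-formation rule and the co-occurrence rule together."""
    return [
        c for c in candidates
        if passes_word_rule(c, bad_first, bad_last)
        and cooccurrence_ratio(c, topic, titles) > min_ratio
    ]


titles = [
    "vomiting disease symptoms",
    "vomiting disease",
    "chest tightness disease",
    "weather today",
]
filtered = filter_candidates(
    ["vomiting", "chest tightness", "of vomiting", "today"],
    topic="disease",
    titles=titles,
    bad_first={"of"},
    bad_last={"when"},
    min_ratio=0.5,
)
```

The candidate "of vomiting" is dropped by the word-formation rule, and "today" is dropped because it never co-occurs with the topic.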

Those skilled in the art will understand that the above predetermined filtering rules are merely examples; other existing or future predetermined filtering rules, where applicable to the invention, also fall within its scope of protection and are incorporated herein by reference.

The index establishing device 104 establishes an index for the topic word and the label words. Specifically, the index establishing device 104 establishes an index for the topic word extracted by the topic extraction device 102 and the label words determined by the label determination device 103.

For example, assume that the document corresponding to coronary heart disease is ID1 and that the importance of a term x in that document is WC1(x), where x may be "disease", "palpitation and shortness of breath" and so on; the document corresponding to myocarditis is ID2, the document corresponding to gastritis is ID3, and the document corresponding to stroke is ID4. The index establishing device 104 establishes a unified inverted index over the topic words and label words as follows:

disease: ID1 (WC1(x)), ID2 (WC2(x)), ID3 (WC3(x)), ID4 (WC4(x))

palpitation and shortness of breath: ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))

shortness of breath and palpitation: ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))

vomiting: ID3 (WC3(x)), ID4 (WC4(x))

spitting: ID3 (WC3(x)), ID4 (WC4(x))
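A unified inverted index of this shape can be built with a few lines of code. The sketch below is only an illustration: the function name `build_index`, the dictionary layout and all numeric weights are assumptions standing in for the importance values WC(x), whose computation is not described here.

```python
from collections import defaultdict


def build_index(docs):
    """Turn {doc_id: {term: weight}} into {term: [(doc_id, weight), ...]}."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, weight in terms.items():
            index[term].append((doc_id, weight))
    for postings in index.values():
        postings.sort(key=lambda posting: posting[0])  # order postings by document ID
    return dict(index)


# Topic word and label words per document; the weights are made-up values.
docs = {
    "ID1": {"disease": 1.0, "palpitation and shortness of breath": 0.8},
    "ID2": {"disease": 1.0, "palpitation and shortness of breath": 0.6},
    "ID3": {"disease": 1.0, "vomiting": 0.7},
    "ID4": {"disease": 1.0, "palpitation and shortness of breath": 0.3, "vomiting": 0.5},
}
index = build_index(docs)
```

Topic words and label words share one index, so a lookup does not need to know which kind of term it is handling.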

Preferably, the index establishing apparatus 1 further includes a normalization device (not shown). If the label words include several label words with the same meaning, the normalization device determines a normalized result for those label words; the index establishing device 104 then establishes an index for the topic word, the label words and the normalized result.

Specifically, the label words corresponding to the topic word "disease" may include several label words with the same meaning. For example, "spitting" and "nausea and vomiting" have the same meaning, so the normalization device determines that the normalized result of the two label words is "vomiting". The index establishing device 104 then establishes an index for the topic word "disease", the label words "spitting" and "nausea and vomiting", and the normalized result "vomiting".
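One way to picture the normalization step is a lookup table from label words to a canonical form, with both the original label words and the canonical form being sent to the index. The sketch below assumes such a table already exists; the function name `terms_to_index` and the table `canonical` are hypothetical, and how the table would be derived is not covered here.

```python
def terms_to_index(label_words, canonical):
    """Return the label words plus their normalized forms, without duplicates."""
    terms = []
    for label in label_words:
        # Index the surface form and, if known, its normalized form.
        for term in (label, canonical.get(label, label)):
            if term not in terms:
                terms.append(term)
    return terms


canonical = {"spitting": "vomiting", "nausea and vomiting": "vomiting"}
terms = terms_to_index(["spitting", "nausea and vomiting"], canonical)
```

Indexing the normalized form alongside the surface forms lets a query phrased either way reach the same documents.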

Those skilled in the art will understand that the above ways of establishing an index are merely examples; other existing or future ways of establishing an index, where applicable to the invention, also fall within its scope of protection and are incorporated herein by reference.

Indexes are generally established for keywords only. Here, the index establishing apparatus 1 additionally establishes an index for the topic words, the label words and their normalized results, so that a user's query input information can be matched better against the resource knowledge.

Preferably, the devices of the index establishing apparatus 1 operate continuously with one another. Specifically, the information determination device 101 determines structured information from text information; the topic extraction device 102 extracts a topic word from the structured information; the label determination device 103 determines, from the text information, label words corresponding to the topic to which the topic word corresponds; and the index establishing device 104 establishes an index for the topic word and the label words. Those skilled in the art will understand that "continuously" means that the devices of the index establishing apparatus 1 carry out the determination of structured information, the extraction of topic words, the determination of label words and the establishment of the index according to a set or dynamically adjusted operating mode, until the index establishing apparatus 1 stops determining structured information for an extended period.

Here, the index establishing apparatus 1 determines structured information from text information; extracts a topic word from the structured information; determines, from the text information, label words corresponding to the topic to which the topic word corresponds; and establishes an index for the topic word and the label words. On the basis of encyclopedia-type resource knowledge or other resource knowledge mined from the web, the index establishing apparatus extracts topics and labels from it to form an effective description of its content, so that such high-quality resource knowledge is presented better, subsequent semantic search over it becomes more efficient, complex descriptive search needs that users cannot express accurately with keywords are satisfied, and the user experience is improved.

Fig. 2 is a schematic diagram of an apparatus for matching a user's query input information according to another aspect of the invention. The matching apparatus 2 includes a query acquisition device 201, an information analysis device 202, a matching query device 203 and a text determination device 204.

The query acquisition device 201 obtains the query input information entered by a user. Specifically, the user enters query input information by interacting with a user device, and the query acquisition device 201 obtains that query input information by calling an application programming interface (API) provided by the user device, by calling dynamic web page technologies such as JSP, ASP or PHP, or by another agreed means of communication.

Here, the query input information includes, but is not limited to, query input information submitted by the user through different input modes such as text input, voice input and image input.

Those skilled in the art will understand that the above ways of obtaining query input information are merely examples; other existing or future ways of obtaining query input information, where applicable to the invention, also fall within its scope of protection and are incorporated herein by reference.

The information analysis device 202 performs topic and label analysis on the query input information to obtain the topic words and label words corresponding to it. Specifically, the information analysis device 202 performs topic and label analysis on the query input information obtained by the query acquisition device 201. For example, it feeds the query input information into the topic classifier obtained by the training described above to obtain the topic word corresponding to the query input information, and it performs label analysis on the query input information entered by the user to obtain the corresponding label words. The way in which the information analysis device 202 performs label analysis on the query input information is the same as or similar to the way in which the label determination device 103 determines the label words of text information; it is therefore not described again here and is incorporated herein by reference.

According to the descriptor and label word, the matching inquiry device 203 performs a matching query in the index established by the aforementioned index establishing device 104 to obtain candidate text information that matches the query input information. Specifically, according to the query input information entered by the user and obtained by the inquiry acquisition device 201, the matching inquiry device 203 performs a matching query in that index, for example by full matching or partial matching, and takes the text information that hits the descriptor corresponding to the query input information, or that hits the label word corresponding to it, as the candidate text information matching the query input information.

For example, suppose the user enters the query input information "palpitation and short breath". The inquiry acquisition device 201 obtains this query input information; the information analysis apparatus 202 performs label analysis on it and obtains the label word "palpitation and short breath"; and the index that the aforementioned index establishing device 104 established for the label word "palpitation and short breath" is as follows:

palpitation and short breath: ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))

Here, ID1, ID2 and ID4 are the ID numbers of the text information that contains the label word "palpitation and short breath", and WC1(x), WC2(x) and WC4(x) are the importance of that label word in each of these pieces of text information, respectively.

The matching inquiry device 203 then performs a matching query in the index established by the index establishing device 104, according to the label word "palpitation and short breath" corresponding to the user's query input information. For example, from the above index it obtains the candidate text information corresponding to the query input information "palpitation and short breath": text information ID1, ID2 and ID4.
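The lookup in this example can be sketched as a posting-list query. This is only an illustrative sketch: the dictionary layout, the function name `match_query` and the numeric importance values standing in for WC1(x), WC2(x) and WC4(x) are assumptions, not part of the described apparatus.

```python
# Toy inverted index: label word -> list of (text ID, importance of the word in that text).
# The importance numbers are made-up placeholders for WC1(x), WC2(x), WC4(x).
INDEX = {
    "palpitation and short breath": [("ID1", 0.9), ("ID2", 0.7), ("ID4", 0.4)],
    "vomiting": [("ID2", 0.6)],
}


def match_query(index, words):
    """Return the IDs of all texts hit by at least one of the given words."""
    candidates = []
    for word in words:
        for text_id, _importance in index.get(word, []):
            if text_id not in candidates:  # keep first-seen order, no duplicates
                candidates.append(text_id)
    return candidates


candidates = match_query(INDEX, ["palpitation and short breath"])
print(candidates)  # ['ID1', 'ID2', 'ID4']
```

A word that has no posting list simply contributes no candidates, which corresponds to the partial-matching case.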

Those skilled in the art will understand that the above manner of matching query is only an example; other existing or future manners of matching query, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

The text determining device 204 determines the target text information that matches the query input information according to the semantic matching degree between the candidate text information and the query input information.

Specifically, there is a certain semantic matching degree between the candidate text information and the query input information. It can be obtained by calculation or, further, by calculating the matching degree between the index word set corresponding to the candidate text information and the matching word set corresponding to the query input information. The text determining device 204 determines the target text information according to this semantic matching degree, for example by taking the candidate text information with the highest semantic matching degree as the target text information that matches the query input information, or by taking the candidate text information whose semantic matching degree is greater than a predetermined matching degree threshold as the target text information.

Here, the predetermined matching degree threshold is the semantic matching degree value used to judge whether candidate text information matches the query input information; its value may be fixed in advance or adjusted according to the actual situation.
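The two selection rules just described (highest semantic matching degree, or matching degree above a predetermined threshold) can be sketched as follows. The function name `select_targets` and the example scores are hypothetical; how the scores are computed is left open here.

```python
def select_targets(scores, threshold=None):
    """Pick target texts from a dict {candidate ID: semantic matching degree}.

    With threshold=None the best-scoring candidate(s) are returned;
    otherwise every candidate whose score is strictly above the threshold.
    """
    if not scores:
        return []
    if threshold is None:
        best = max(scores.values())
        return [cid for cid, score in scores.items() if score == best]
    return [cid for cid, score in scores.items() if score > threshold]


# Made-up matching degrees for the three candidates of the running example.
scores = {"ID1": 0.40, "ID2": 0.80, "ID4": 0.55}
print(select_targets(scores))                 # ['ID2']
print(select_targets(scores, threshold=0.5))  # ['ID2', 'ID4']
```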

Preferably, the text determining device 204 further includes a matching calculation unit (not shown) and a text determination unit (not shown). The matching calculation unit calculates the semantic matching degree between the candidate text information and the query input information; the text determination unit determines, according to the semantic matching degree and in combination with the predetermined matching degree threshold, the target text information that matches the query input information.

For example, the matching calculation unit calculates the semantic matching degree between the candidate text information and the user's query input information using an existing matching degree calculation method; when the semantic matching degree is greater than the predetermined matching degree threshold, the text determination unit takes the candidate text information as the target text information that matches the query input information.

Preferably, the text determining device 204 also determines the target text information corresponding to the query input information according to the index word set corresponding to the candidate text information and the matching word set corresponding to the query input information. Specifically, each piece of candidate text information has a corresponding index word set. Suppose, as in the example above, that the theme corresponding to candidate text information ID1 is "coronary heart disease" and that its index words include "disease", "palpitation and short breath" and so on; the set formed by these index words is then the index word set corresponding to candidate text information ID1. The user's query input information likewise has a corresponding matching word set. For example, matching words are obtained by performing word segmentation on the query input information, and the set formed by these matching words is the matching word set corresponding to the query input information. Suppose the query input information entered by the user is "palpitation and short breath vomiting"; after matching unit 1 performs word segmentation on it, the matching words "palpitation and short breath" and "vomiting" are obtained, and the set formed by these two matching words is the matching word set corresponding to the query input information. The text determining device 204 determines, according to the index word sets and the matching word set, the target text information that matches the user's query input information. For example, it takes the text information whose index word set hits the most matching words in the matching word set as the target text information, or it takes the text information whose index word set hits a number of matching words greater than a predetermined quantity threshold as the target text information that matches the query input information.

For example, for candidate text information ID1, ID2 and ID4 in the example above, the index word set of ID1 includes the index words "disease" and "palpitation and short breath"; the index word set of ID2 includes "palpitation and short breath", "vomiting" and "disease"; and the index word set of ID4 includes "palpitation and short breath". For the query input information "palpitation and short breath vomiting" entered by the user, the matching words are "palpitation and short breath" and "vomiting". The index word set of ID2 hits the most matching words in the matching word set, so candidate text information ID2 is taken as the target text information that best matches the query input information. Alternatively, suppose the predetermined quantity threshold is 0; the index word sets of ID1, ID2 and ID4 each hit more matching words than this threshold, so ID1, ID2 and ID4 are all taken as target text information that matches the query input information. When the matching unit 2 provides them to the user, they may be sorted by the importance of the corresponding index words in each candidate text information.
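The hit counting in this example can be reproduced with plain set intersections. The sketch below uses the index word sets and matching words stated above; the function names are hypothetical.

```python
# Index word sets of the three candidates and the matching word set of the query,
# exactly as given in the example.
INDEX_WORD_SETS = {
    "ID1": {"disease", "palpitation and short breath"},
    "ID2": {"palpitation and short breath", "vomiting", "disease"},
    "ID4": {"palpitation and short breath"},
}
MATCHING_WORDS = {"palpitation and short breath", "vomiting"}


def hit_counts(index_word_sets, matching_words):
    """Number of matching words hit by each candidate's index word set."""
    return {tid: len(words & matching_words) for tid, words in index_word_sets.items()}


def most_hits(counts):
    """Candidates whose index word set hits the most matching words."""
    best = max(counts.values())
    return sorted(tid for tid, n in counts.items() if n == best)


def above_quantity_threshold(counts, threshold):
    """Candidates whose hit count is strictly greater than the threshold."""
    return sorted(tid for tid, n in counts.items() if n > threshold)


counts = hit_counts(INDEX_WORD_SETS, MATCHING_WORDS)
print(counts)                               # {'ID1': 1, 'ID2': 2, 'ID4': 1}
print(most_hits(counts))                    # ['ID2']
print(above_quantity_threshold(counts, 0))  # ['ID1', 'ID2', 'ID4']
```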

Those skilled in the art will understand that the above manner of determining target text information is only an example; other existing or future manners of determining target text information, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

Preferably, the devices of the matching unit 2 work continuously. Specifically, the inquiry acquisition device 201 obtains the query input information entered by the user; the information analysis apparatus 202 performs theme and label analysis on the query input information to obtain the corresponding descriptor and label word; the matching inquiry device 203 performs a matching query, according to the descriptor and label word, in the index established by the aforementioned index establishing device 104 to obtain candidate text information that matches the query input information; and the text determining device 204 determines the target text information that matches the query input information according to the semantic matching degree between the candidate text information and the query input information. Here, those skilled in the art should understand that "continuously" means that the devices of the matching unit 2 respectively perform the acquisition of query input information, the theme and label analysis, the matching query for candidate text information and the determination of target text information, according to a set or real-time adjusted operating mode, until the matching unit 2 stops obtaining query input information from users for a long time.

Here, the index establishing equipment 1 and the devices of the matching unit 2 cooperate to match the query input information entered by the user to the corresponding target text information. Themes and titles are extracted from encyclopedia-type resource knowledge, or from other resource knowledge mined from the web, to form an effective description of the content of the resource knowledge and to present this high-quality resource knowledge better. This makes semantic search over such resource knowledge more efficient, meets complex descriptive search needs that users cannot express accurately with keywords, and improves the user experience.

Preferably, the descriptor and the label word can also be regarded as two different domains, corresponding to a subject domain and a label domain respectively. According to the descriptor and label word, the matching inquiry device 203 performs matching queries in the aforementioned indexes corresponding to the subject domain and the label domain respectively, to obtain candidate text information that matches the query input information.

Specifically, using the descriptor and label word that the information analysis apparatus 202 obtained by analyzing the user's query input information, the matching inquiry device 203 performs domain-by-domain matching, querying the indexes corresponding to the subject domain and the label domain respectively, to obtain candidate text information.

Here, the subject domain and label domain can be obtained by analyzing the query input information; for example, the query input information entered by the user is analyzed with the aforementioned subject classifier to obtain the subject category.

Here, the index corresponding to the subject domain and label domain is the index established by the aforementioned index establishing device 104. Label words are extracted from the user's query input information according to the previously established labels; for example, a word that appears both in the query input information and in the label set is extracted. Then, the label words and the subject category are used to pull inverted-index candidates from the unified index of themes and labels, and documents that contain the subject category or a label are taken as the candidate text information corresponding to the query input information and take part in subsequent calculation.

Preferably, the matching inquiry device 203 may also take into account the weights corresponding to the subject domain and the label domain: it performs the matching query in the corresponding indexes and combines the weights of the two domains to obtain the final candidate text information.
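One way to picture domain-by-domain matching with per-domain weights is sketched below. The text does not specify how the weights are combined, so the additive scoring, the weight values, the toy indexes and the name `domain_match` are all assumptions made for illustration.

```python
# Toy per-domain inverted indexes (hypothetical content).
SUBJECT_INDEX = {"disease": ["ID1", "ID2", "ID3"]}
LABEL_INDEX = {
    "palpitation and short breath": ["ID1", "ID2", "ID4"],
    "vomiting": ["ID2"],
}


def domain_match(descriptor, label_words, subject_weight=1.0, label_weight=2.0):
    """Query both domains and rank candidates by the summed domain weights.

    Assumption: each hit in a domain adds that domain's weight to the candidate.
    """
    scores = {}
    for tid in SUBJECT_INDEX.get(descriptor, []):
        scores[tid] = scores.get(tid, 0.0) + subject_weight
    for word in label_words:
        for tid in LABEL_INDEX.get(word, []):
            scores[tid] = scores.get(tid, 0.0) + label_weight
    # Highest score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = domain_match("disease", ["palpitation and short breath", "vomiting"])
print(ranked)  # [('ID2', 5.0), ('ID1', 3.0), ('ID4', 2.0), ('ID3', 1.0)]
```

A document that matches only the subject category (ID3 here) still becomes a candidate, in line with taking documents that contain the subject category or a label.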

Preferably, the text determining device 204 determines, according to the matching words included in the matching word set, a target index word set among the index word sets corresponding to the candidate text information, where the target index word set hits the most matching words in the matching word set. If the similarity between the target index word set and the matching word set is greater than a predetermined threshold, the text information corresponding to the target index word set is taken as the target text information that matches the query input information.

Specifically, the text determining device 204 counts how many matching words of the matching word set are hit by the index word set corresponding to each piece of candidate text information, and takes the index word set with the most hits as the target index word set. It then calculates the similarity between the target index word set and the matching word set, for example by calculating the similarity between each hit index word and its corresponding matching word and then combining these by simple addition, weighted averaging or the like. When the similarity is greater than the predetermined threshold, the text determining device 204 takes the text information corresponding to the target index word set as the target text information that matches the query input information.

Here, the predetermined threshold is the similarity threshold used to judge, from the similarity between the target index word set and the matching word set, whether the text information corresponding to the target index word set is taken as target text information; its value may be fixed or adjusted according to the actual situation.

Preferably, the matching unit 2 further includes a word set determining device (not shown). The word set determining device performs word segmentation on the query input information to obtain segmented words, and merges the segmented words with the descriptor and label word obtained by the information analysis apparatus 202 to obtain the matching word set corresponding to the query input information, where the words included in the matching word set serve as matching words. The matching calculation unit then calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.

Specifically, the word set determining device performs word segmentation on the query input information obtained by the inquiry acquisition device 201 to obtain segmented words. Preferably, the word set determining device may also filter the segmented words, for example by removing stop words, to obtain the final segmented words. The word set determining device then merges the resulting segmented words with the descriptor and label word obtained by the aforementioned information analysis apparatus 202, removes redundancy and so on, to finally obtain the matching word set corresponding to the query input information, and takes the words included in the matching word set as the matching words corresponding to the query input information.
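The steps above (segmentation, stop-word filtering, merging with the descriptor and label words, redundancy removal) can be sketched as a small pipeline. Whitespace splitting stands in for a real word segmenter, and the stop-word list and the function names are hypothetical.

```python
# Hypothetical stop-word list; a real system would use a language-specific one.
STOP_WORDS = {"and", "with", "the", "a"}


def segment(query):
    """Stand-in for word segmentation: lowercase and split on whitespace."""
    return query.lower().split()


def build_matching_word_set(query, descriptor, label_words):
    """Segment, drop stop words, merge with descriptor/label words, remove duplicates."""
    tokens = [t for t in segment(query) if t not in STOP_WORDS]
    merged = []
    for word in tokens + [descriptor] + list(label_words):
        if word and word not in merged:  # de-redundancy, keeping first-seen order
            merged.append(word)
    return merged


words = build_matching_word_set(
    "chest tightness with vomiting", "disease", ["vomiting", "chest tightness"]
)
print(words)  # ['chest', 'tightness', 'vomiting', 'disease', 'chest tightness']
```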

Then, the matching calculation unit calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.

More preferably, the matching unit 2 further includes a post-processing device (not shown). The post-processing device performs subsequent processing on the matching words to update the matching word set, where the subsequent processing includes at least any one of the following:

Determining the mutually synonymous matching words included in the matching words, and merging the mutually synonymous matching words into a subset of the matching word set.

Performing synonym expansion on the matching words, and determining the synonyms obtained from the expansion, together with the matching words, as a subset of the matching word set.

Specifically, the post-processing device performs subsequent processing on the matching words in the matching word set determined by the word set determining device, to update the matching word set. For example, the post-processing device determines the mutually synonymous matching words included in the matching words and merges them into a subset of the matching word set. Since the matching words may include mutually synonymous matching words, such as "vomiting" and "spitting", the post-processing device merges these mutually synonymous matching words into a subset of the matching word set.

For example, suppose the query input information entered by the user is Q. After the word set determining device performs word segmentation on the query input information and applies filtering such as stop-word removal, the matching word set in the label domain is represented as Q = {a, b, c, d, e}, where a, b, c, d and e are the matching words included in the matching word set. Suppose matching words a and b are mutually synonymous; the post-processing device then merges a and b into a subset of the matching word set, and the updated matching word set is represented as Q = {{a, b}, c, d, e}. Subsequent devices, such as the matching inquiry device 203, then carry out the subsequent matching query operation.

For another example, the post-processing device also performs synonym expansion on the matching words and determines the synonyms obtained from the expansion, together with the matching words, as a subset of the matching word set. Specifically, the post-processing device may perform synonym expansion on the matching words in the matching word set corresponding to the query input information, for example expanding "shortness of breath and palpitation" to its synonym "palpitation and short breath", and then determines the synonyms obtained from the expansion, together with the matching words, as a subset of the matching word set.
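Both post-processing operations, merging mutually synonymous matching words and expanding matching words with synonyms, can be sketched with nested lists standing in for the subsets. The synonym groups below (including the extra word `c2`) and the function names are hypothetical.

```python
# Hypothetical synonym dictionary: each set is one group of mutual synonyms.
SYNONYM_GROUPS = [{"a", "b"}, {"c", "c2"}]


def merge_synonyms(words, synonym_groups):
    """Group mutually synonymous matching words into subsets, keeping query order."""
    result, used = [], set()
    for word in words:
        if word in used:
            continue
        group = next((g for g in synonym_groups if word in g), {word})
        subset = [w for w in words if w in group]  # only words present in the query
        used.update(subset)
        result.append(subset)
    return result


def expand_synonyms(subsets, synonym_groups):
    """Add dictionary synonyms that are not yet present to each subset."""
    expanded = []
    for subset in subsets:
        extra = {s for w in subset for g in synonym_groups if w in g for s in g}
        expanded.append(subset + sorted(extra - set(subset)))
    return expanded


merged = merge_synonyms(["a", "b", "c", "d", "e"], SYNONYM_GROUPS)
print(merged)                                   # [['a', 'b'], ['c'], ['d'], ['e']]
print(expand_synonyms(merged, SYNONYM_GROUPS))  # [['a', 'b'], ['c', 'c2'], ['d'], ['e']]
```

The first result mirrors Q = {{a, b}, c, d, e} from the merging example; the second shows one round of expansion under the assumed dictionary.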

Continuing the example, for the matching word set Q = {{a, b}, c, d, e} obtained after synonym merging, the post-processing device may also perform synonym expansion on the matching word set, obtaining synonyms of the matching words a, b, c, d and e, and determine the synonyms obtained from the expansion, together with the matching words, as subsets of the matching word set. For example, after repeated synonym expansion, the matching word set Q is expressed as follows:

Then, the matching inquiry device 203 performs a matching query, according to the matching word set, in the index established by the index establishing device 104, for example using the inverted index to obtain the candidate text information that contains the matching words.

Suppose the index word set that hits the most matching words in the matching word set is denoted C; then C is:

where C denotes the set of words obtained by semantically mapping each w_1i at its corresponding position under the maximum synonymous hit.

The matching calculation unit then calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.

The semantic matching degree between Q and C can be calculated by the following formula:

Here, the weight of a word is expressed as (log(TF)+1) * log(N/DF), and Match(T_Q, T_C) indicates whether the themes of the index word set and the matching word set match.

The value of Match(T_Q, T_C) can be defined as needed; for example, if the themes of the index word set and the matching word set match, Match(T_Q, T_C) takes the value 1, and otherwise 0.5.
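The two ingredients that are stated in the text, the word weight (log(TF)+1) * log(N/DF) and the theme factor Match(T_Q, T_C) with values 1 and 0.5, can be sketched as follows. The way they are combined into a single score here (theme factor times the weighted share of hit matching words) is only an assumption for illustration and is not the formula of the description; the natural logarithm, the made-up TF/DF statistics and all function names are likewise assumptions.

```python
import math


def term_weight(tf, df, n_docs):
    """Word weight (log(TF) + 1) * log(N / DF); natural logarithm assumed."""
    return (math.log(tf) + 1.0) * math.log(n_docs / df)


def theme_match(theme_q, theme_c):
    """Match(T_Q, T_C): 1 if the themes match, otherwise 0.5."""
    return 1.0 if theme_q == theme_c else 0.5


def semantic_match(matching_words, index_words, stats, n_docs, theme_q, theme_c):
    """ASSUMED combination: theme factor * (weight of hit words / weight of all words).

    stats maps each matching word to a (TF, DF) pair.
    """
    weights = {w: term_weight(stats[w][0], stats[w][1], n_docs) for w in matching_words}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    hit = sum(wt for w, wt in weights.items() if w in index_words)
    return theme_match(theme_q, theme_c) * hit / total


# Made-up (TF, DF) statistics and collection size N.
STATS = {"palpitation and short breath": (3, 200), "vomiting": (1, 50)}
QUERY = ["palpitation and short breath", "vomiting"]
full = semantic_match(QUERY, {"palpitation and short breath", "vomiting", "disease"},
                      STATS, 10000, "disease", "disease")
partial = semantic_match(QUERY, {"palpitation and short breath"},
                         STATS, 10000, "disease", "disease")
print(full, partial)  # full is 1.0; partial lies strictly between 0 and 1
```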

Then, if the calculated semantic matching degree is greater than the predetermined threshold, the text determination unit takes the text information corresponding to the index word set as the target text information that matches the query input information.

Fig. 4 shows a flow chart of a method for establishing an index based on text information according to another aspect of the present invention.

In step S401, the index establishing equipment 1 determines structured information from text information. Specifically, in step S401, the index establishing equipment 1 obtains text information, for example by interacting with a data source such as encyclopedia data, and then structures the text information, for example by analyzing the directory information, subdirectory information and so on contained in it, to determine the structured information.

For example, in step S401, the index establishing equipment 1 obtains encyclopedia-type resource knowledge as text information by interacting with encyclopedia data sources such as Baidu Baike and Hudong Baike. The index establishing equipment 1 then structures the text information, for example by analyzing the directories and subdirectories corresponding to each piece of resource knowledge; for resource knowledge of the "disease" type, it analyzes the directory or subdirectory corresponding to its symptoms, the directory or subdirectory corresponding to its treatment methods, and so on.

For another example, in step S401, the index establishing equipment 1 mines resource knowledge from the internet by data mining and uses it as text information, and then structures the text information to determine structured information. For example, in step S401, the index establishing equipment 1 mines vertical resource websites to obtain information such as diseases, descriptions of their symptoms, treatment methods and specialist hospitals. Each resource is organized with the disease as its ID. For instance, some candidate seed words are first given by category (for diseases: coronary heart disease, myocarditis, gastritis and so on); the URLs of websites that commonly rank near the top of the search results are obtained; the structure of these websites is analyzed; and information on coronary heart disease, its symptoms, its treatment methods and its specialist hospitals is extracted from them. This information is integrated under coronary heart disease as an instance of the "disease" category, and coronary heart disease is organized into an entry card and stored. "Coronary heart disease" can then serve as the final text information, and the corresponding information such as "symptoms of coronary heart disease", "treatment methods of coronary heart disease" and "specialist hospitals for coronary heart disease" can serve as the structured information corresponding to the text information.

In step S402, the index establishing equipment 1 extracts a descriptor from the structured information. Specifically, in step S402, the index establishing equipment 1 extracts the descriptor from the structured information determined in step S401, for example by means of a subject classifier or another predetermined manner of extracting descriptors.

Preferably, the method further includes step S405 (not shown). In step S405, the index establishing equipment 1 obtains, according to a predetermined theme system, a training corpus corresponding to the predetermined theme system, and trains a subject classifier according to the training corpus. In step S402, the index establishing equipment 1 then extracts the descriptor from the structured information according to the subject classifier.

Specifically, in step S405, the index establishing equipment 1 determines the predetermined theme system. For example, it determines the common search needs of web search users from statistics on the query sequences entered by a large number of web search users and, in combination with classification systems currently in use, such as the existing systems of Baike (encyclopedia) and Zhidao (Q&A), determines a subject classification system for which there is a certain demand, and uses it as the predetermined theme system. Then, in step S405, the index establishing equipment 1 obtains the training corpus corresponding to the predetermined theme system; for example, if an article carries the site location identifier "medical treatment & health internal medicine", its data is regarded as training corpus for the disease category. Finally, in step S405, the index establishing equipment 1 trains the subject classifier on the training corpus, for example by training an SVM classification model on the training corpus to serve as the subject classifier.
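The corpus collection part of step S405 can be sketched as labeling articles by their site location identifier. The mapping table, the article fields and the function name below are hypothetical; the resulting (text, theme) pairs would then be passed to whatever SVM training routine is used, which is not shown here.

```python
# Hypothetical mapping from a site location identifier to a theme category.
LOCATION_TO_THEME = {
    "medical treatment & health internal medicine": "disease",
}


def build_training_corpus(articles):
    """Keep articles whose location identifier maps to a theme; return (text, theme) pairs."""
    corpus = []
    for article in articles:
        theme = LOCATION_TO_THEME.get(article["location"])
        if theme is not None:
            corpus.append((article["text"], theme))
    return corpus


articles = [
    {"location": "medical treatment & health internal medicine",
     "text": "coronary heart disease: symptoms and treatment methods"},
    {"location": "unmapped section", "text": "some other article"},
]
print(build_training_corpus(articles))
# [('coronary heart disease: symptoms and treatment methods', 'disease')]
```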

Then, in step S402, the index establishing equipment 1 extracts the descriptor from the structured information according to the subject classifier trained in step S405. For example, in step S402, the index establishing equipment 1 inputs the word "coronary heart disease" and structured information such as its symptoms and treatment methods into the subject classifier, and obtains the theme "disease". For another example, for a newly added encyclopedia entry card, in step S402 the index establishing equipment 1 inputs it into the subject classifier, such as the SVM classifier, to obtain the theme corresponding to the category of that entry card.

Preferably, in step S402, the index establishing equipment 1 may also perform synonymous expression expansion on the extracted theme; for example, it expands the theme "disease" by adding a synonymous theme such as "illness".

In step S403, the index establishing equipment 1 determines, from the text information, the label words corresponding to the theme that corresponds to the descriptor. Specifically, in step S403, the index establishing equipment 1 determines the label words corresponding to the theme from the text information, according to the descriptor extracted in step S402 and the theme corresponding to that descriptor. For example, for text information whose theme is disease, the index establishing equipment 1 determines in step S403 the following label words corresponding to the theme: palpitation and short breath, chest tightness, diarrhea, vomiting, weakness of limbs, and so on.

Preferably, step S403 further includes sub-step S403a (not shown), sub-step S403b (not shown) and sub-step S403c (not shown). Specifically, in sub-step S403a, the index establishing equipment 1 determines, from the text information, at least one candidate label word corresponding to the theme that corresponds to the descriptor. For example, in sub-step S403a, the index establishing equipment 1 performs unigram, bigram and trigram statistics over all page data organized by entry, and extracts the words that appear in more than a certain number of pages as candidate label words.
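The n-gram statistics of sub-step S403a can be sketched as a document-frequency count over unigrams, bigrams and trigrams. The pages are assumed to be already segmented into tokens, and the page threshold and function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_label_words(pages, min_pages):
    """Return the 1-, 2- and 3-grams that appear in more than min_pages pages."""
    page_freq = Counter()
    for tokens in pages:
        seen = set()
        for n in (1, 2, 3):
            seen.update(ngrams(tokens, n))
        page_freq.update(seen)  # count each n-gram once per page
    return sorted(w for w, count in page_freq.items() if count > min_pages)


pages = [
    ["chest", "pain", "vomiting"],
    ["chest", "pain"],
    ["vomiting", "diarrhea"],
]
print(candidate_label_words(pages, min_pages=1))
# ['chest', 'chest pain', 'pain', 'vomiting']
```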

Then, in sub-step S403b, index establishes equipment 1 according at least one described candidate label word, determination pair The centre word answered.Then, in sub-step S403c, index establishes equipment 1 according at least one described candidate label word and institute The distance of centre word is stated, determines label word corresponding with the theme.

For example, in sub-step S403b, the index establishing equipment 1 merges all candidate label words according to the label data counted earlier, and computes offline statistics over these candidate label words as follows: the co-occurrence frequency within documents is counted over a large body of text, such as whole-web data. For any two candidate label words, the similarity between them is calculated according to the following formula:

Here, PMI(w', w₁) denotes the mutual information score between w' and w₁, which is defined in terms of P(w), the probability of word w obtained from the statistics.
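Since the similarity formula itself is not reproduced in this text, the sketch below only illustrates the mutual information score, assuming the standard pointwise mutual information PMI(x, y) = log(P(x, y) / (P(x) P(y))) with probabilities estimated from document-level co-occurrence. The function name, the toy documents and the choice of the natural logarithm are assumptions.

```python
import math


def pmi(docs, w1, w2):
    """Standard pointwise mutual information from document co-occurrence counts.

    docs is a list of word sets; P(w) is the share of documents containing w.
    Returns -inf when the two words never co-occur.
    """
    n = len(docs)
    c1 = sum(1 for d in docs if w1 in d)
    c2 = sum(1 for d in docs if w2 in d)
    c12 = sum(1 for d in docs if w1 in d and w2 in d)
    if c12 == 0:
        return float("-inf")
    return math.log((c12 / n) / ((c1 / n) * (c2 / n)))


docs = [
    {"chest tightness", "palpitation"},
    {"chest tightness", "palpitation"},
    {"diarrhea"},
    {"vomiting"},
]
print(pmi(docs, "chest tightness", "palpitation"))  # log(2), about 0.693
```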

Then, in sub-step S403b, the index establishing equipment 1 determines, according to the theme, which fields of the text information need to be analyzed, for example the symptom category of a disease, the body and annotation parts of a poem, or the description part of a person. It then extracts from these fields all words that occur among the candidate label words, together with their synonyms, and forms these words into a center, which serves as the center word corresponding to the at least one candidate label word.

Then, in sub-step S403c, the index establishing equipment 1 calculates the distance between each of the at least one candidate label word and the center word. For example, letting T denote the center word, the distance between a candidate label word and the center word can be calculated by the following formula:

Here, Num(T) denotes the number of words contained in the center word.

Then, in sub-step S403c, the index establishing equipment 1 determines the label words corresponding to the theme according to the distance between the at least one candidate label word and the center word, for example by taking the candidate label words whose distance to the center word is less than a predetermined threshold as the label words corresponding to the theme.

Preferably, as shown in Fig. 3, in sub-step S403c the index establishing equipment 1 forms a series from the candidate label ranking and the distance to the center word; if the slope of the change between successive ranks is greater than a predetermined slope threshold, the subsequent nodes are cut off, as between the 5th and 6th ranked points in Fig. 3.
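The slope-based cut-off can be sketched as follows, assuming candidates are ranked by ascending distance to the center word and that the slope between neighboring ranks is simply the difference of their distances. The distances, the threshold value and the function name are made up for illustration.

```python
def cut_by_slope(distances, slope_threshold):
    """Keep ranked candidates up to the first jump in distance above the threshold.

    distances is a list of (label word, distance to the center word) pairs.
    Ranks are one step apart, so the slope is the distance difference.
    """
    ranked = sorted(distances, key=lambda pair: pair[1])
    kept = ranked[:1]
    for prev, cur in zip(ranked, ranked[1:]):
        if cur[1] - prev[1] > slope_threshold:
            break  # cut off this node and everything ranked after it
        kept.append(cur)
    return [label for label, _distance in kept]


distances = [
    ("w1", 0.10), ("w2", 0.12), ("w3", 0.15), ("w4", 0.18),
    ("w5", 0.20), ("w6", 0.60), ("w7", 0.62),
]
print(cut_by_slope(distances, slope_threshold=0.2))
# ['w1', 'w2', 'w3', 'w4', 'w5']
```

With these made-up values the jump occurs between the 5th and 6th ranked points, so the 6th point and everything after it are dropped.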

More preferably, in sub-step S403b, the index establishing equipment 1 filters the at least one candidate label word according to a predetermined filtering rule to obtain at least one filtered candidate label word, and determines the center word according to the at least one filtered candidate label word, where the predetermined filtering rule is determined based on at least any one of the following:

The part of speech of the at least one candidate label word;

The word rule of the at least one candidate label word;

The co-occurrence ratio of the at least one candidate label word and the theme.

Specifically, noise may be introduced while the candidate label words are being collected, so the candidate label words need to be filtered. In sub-step S403b, the index establishment equipment 1 filters the at least one candidate label word according to the predetermined filtering rule, to obtain at least one filtered candidate label word.

For example, in sub-step S403b, the index establishment equipment 1 filters the at least one candidate label word according to its part of speech, e.g., by filtering on the first word and the last word of each candidate label word.

As another example, in sub-step S403b, the index establishment equipment 1 filters the at least one candidate label word according to its word rules; e.g., the first character of a candidate label word cannot be a character such as " ", "doing", "quilt" or "ratio", and the last character cannot be a character such as "when", "arriving" or "obtaining".

As another example, in sub-step S403b, the index establishment equipment 1 filters the at least one candidate label word according to its co-occurrence ratio with the theme; e.g., the index establishment equipment 1 counts, in search logs and in titles across the whole web, the co-occurrence ratio of the at least one candidate label word and the theme, and retains only the candidate label words that have co-occurred with the theme, or retains the candidate label words whose co-occurrence ratio with the theme is greater than a predetermined threshold.

Preferably, in sub-step S403b, the index establishment equipment 1 filters the at least one candidate label word by combining any two of the above predetermined filtering rules, or by considering all three together.
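
The combination of a word rule and a co-occurrence rule might look like the sketch below. The forbidden first and last words are hypothetical English stand-ins, and the co-occurrence ratio is assumed here to be the fraction of titles containing the candidate that also contain the theme; the text above does not fix that definition. The part-of-speech rule is left out because it would need a tagger.

```python
BAD_FIRST = {"of", "by", "than"}  # hypothetical forbidden first words
BAD_LAST = {"when", "to"}         # hypothetical forbidden last words


def passes_word_rule(candidate):
    """Reject candidates that start or end with a forbidden word."""
    tokens = candidate.split()
    return bool(tokens) and tokens[0] not in BAD_FIRST and tokens[-1] not in BAD_LAST


def cooccurrence_ratio(candidate, theme, titles):
    """Assumed definition: share of titles with the candidate that also mention the theme."""
    with_candidate = [t for t in titles if candidate in t]
    if not with_candidate:
        return 0.0
    return sum(theme in t for t in with_candidate) / len(with_candidate)


def filter_candidates(candidates, theme, titles, min_ratio):
    """Apply the word rule and the co-occurrence rule together."""
    return [c for c in candidates
            if passes_word_rule(c) and cooccurrence_ratio(c, theme, titles) > min_ratio]


titles = [
    "coronary heart disease palpitation",
    "coronary heart disease chest pain",
    "palpitation after coffee",
    "of the heart",
]
kept = filter_candidates(
    ["palpitation", "chest pain", "of the heart", "coffee"],
    theme="coronary heart disease", titles=titles, min_ratio=0.4,
)
```

In this toy data, "of the heart" fails the word rule and "coffee" never co-occurs with the theme, so only the two symptom candidates survive.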

Then, in sub-step S403b, the index establishment equipment 1 determines the centre word according to the at least one filtered candidate label word.

In step S404, the index establishment equipment 1 establishes an index for the descriptor and the label word. Specifically, in step S404, the index establishment equipment 1 establishes the index for the descriptor and the label word according to the descriptor extracted in step S402 and the label word determined in step S403.

For example, assume that the document corresponding to coronary heart disease is ID1 and that the importance of a word x in that document is WC1(x), where x may be "disease", "palpitation and short breath", etc.; the document corresponding to myocarditis is ID2, the document corresponding to gastritis is ID3, and the document corresponding to apoplexy is ID4. In step S404, the index establishment equipment 1 establishes a unified inverted index for the descriptors and label words in the following manner:

Disease - ID1 (WC1(x)), ID2 (WC2(x)), ID3 (WC3(x)), ID4 (WC4(x))

Palpitation and short breath - ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))

Shortness of breath and palpitation - ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))

Vomiting - ID3 (WC3(x)), ID4 (WC4(x))

Spit - ID3 (WC3(x)), ID4 (WC4(x))
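
A unified inverted index of this shape can be sketched with a plain dictionary, assuming each document is given as a mapping from its descriptors and label words to an importance weight. The numeric weights below are hypothetical placeholders for WC1(x) to WC4(x).

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    index = defaultdict(list)
    for doc_id, terms in documents.items():
        for term, weight in terms.items():
            index[term].append((doc_id, weight))
    return dict(index)


# Hypothetical weights standing in for WC1(x) ... WC4(x).
documents = {
    "ID1": {"disease": 1.0, "palpitation and short breath": 0.8},
    "ID2": {"disease": 1.0, "palpitation and short breath": 0.6},
    "ID3": {"disease": 1.0, "vomiting": 0.7},
    "ID4": {"disease": 1.0, "palpitation and short breath": 0.5, "vomiting": 0.4},
}
index = build_inverted_index(documents)
```

Each posting list keeps the document identifier together with the weight, so a later matching step can rank documents without reopening them.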

Preferably, the method further includes step S406 (not shown). In step S406, if the label words include multiple semantically consistent label words, the index establishment equipment 1 determines a normalization result for the multiple semantically consistent label words; in step S404, the index establishment equipment 1 then establishes the index for the descriptor, the label words and the normalization result.

Specifically, the label words corresponding to the descriptor "disease" may include multiple semantically consistent label words; for example, "spitting" and "nausea and vomiting" are semantically consistent. In step S406, the index establishment equipment 1 determines that the normalization result of these two label words is "vomiting"; then, in step S404, the index establishment equipment 1 establishes the index for the descriptor "disease", the label words "spitting" and "nausea and vomiting", and the normalization result "vomiting".
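
As a rough illustration of the normalization step, the sketch below assumes a hand-written lookup table from label words to their normalized form; how such a table is obtained is not specified in the text, and the names are hypothetical.

```python
# Hypothetical normalization table for semantically consistent label words.
NORMALIZED = {"spitting": "vomiting", "nausea and vomiting": "vomiting"}


def index_terms_for(descriptor, label_words):
    """Collect the descriptor, the label words and their normalization results."""
    terms = [descriptor]
    for word in label_words:
        terms.append(word)
        normal = NORMALIZED.get(word)
        if normal is not None:
            terms.append(normal)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(terms))


terms = index_terms_for("disease", ["spitting", "nausea and vomiting"])
```

The resulting list is what would be handed to the index-building step, so a query using either surface form or the normalized form reaches the same document.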

Preferably, the steps of the index establishment equipment 1 operate continuously. Specifically, in step S401, the index establishment equipment 1 determines structured information from the text information; in step S402, the index establishment equipment 1 extracts a descriptor from the structured information; in step S403, the index establishment equipment 1 determines, according to the theme corresponding to the descriptor, the label word corresponding to the theme from the text information; in step S404, the index establishment equipment 1 establishes an index for the descriptor and the label word. Here, those skilled in the art will understand that "continuously" means that the steps of the index establishment equipment 1 respectively perform the determination of structured information, the extraction of descriptors, the determination of label words and the establishment of the index according to a set or real-time adjusted operating mode, until the index establishment equipment 1 stops determining structured information for a long time.

In step S501, the matching unit 2 obtains the query input information entered by the user. Specifically, the user enters the query input information by interacting with a user equipment, and in step S501 the matching unit 2 obtains the query input information entered by the user by calling an application programming interface (API) provided by the user equipment, by calling a dynamic page technology such as JSP, ASP or PHP, or through a communication mode of another agreed protocol.

In step S502, the matching unit 2 performs theme and label analysis on the query input information, to obtain the descriptor and label word corresponding to the query input information. Specifically, in step S502, the matching unit 2 performs theme and label analysis on the query input information obtained in step S501; for example, by inputting the query input information into the previously trained subject classifier, it obtains the descriptor corresponding to the query input information, and by performing label analysis on the query input information entered by the user, it obtains the corresponding label word. Here, the way in which the matching unit 2 performs label analysis on the query input information in step S502 is the same as or similar to the way in which the index establishment equipment 1 determines the label words of the text information in step S403, so it is not repeated here and is incorporated herein by reference.

In step S503, the matching unit 2 performs a matching query, according to the descriptor and label word, in the index established by the index establishment equipment 1, to obtain candidate text information matching the query input information. Specifically, in step S503, the matching unit 2 performs the matching query in the index established by the index establishment equipment 1 according to the query input information obtained in step S501, for example by full matching or partial matching, and takes the text information that hits the descriptor corresponding to the query input information, or the text information that hits the label word corresponding to the query input information, as the candidate text information matching the query input information.

For example, assume that the query input information entered by the user is "palpitation and short breath". In step S501, the matching unit 2 obtains the query input information "palpitation and short breath" entered by the user; in step S502, the matching unit 2 performs label analysis on the query input information and obtains the label word "palpitation and short breath". The index established by the index establishment equipment 1 for the label word "palpitation and short breath" is as follows:

Palpitation and short breath - ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))

Then, in step S503, the matching unit 2 performs a matching query in the index established by the index establishment equipment 1 according to the label word "palpitation and short breath" corresponding to the query input information entered by the user; for example, according to the above index, it obtains the candidate text information corresponding to the query input information "palpitation and short breath": text information ID1, ID2 and ID4.
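
The lookup in this example can be sketched as a union over posting lists, which corresponds to the partial-matching case. The index contents and weights are hypothetical and the function name is an assumption.

```python
# Hypothetical inverted index with (document id, weight) postings.
index = {
    "palpitation and short breath": [("ID1", 0.8), ("ID2", 0.6), ("ID4", 0.5)],
    "vomiting": [("ID3", 0.7), ("ID4", 0.4)],
}


def candidate_documents(index, descriptors, label_words):
    """Return the documents hit by any descriptor or label word, in first-hit order."""
    hits = []
    for term in list(descriptors) + list(label_words):
        for doc_id, _weight in index.get(term, []):
            if doc_id not in hits:
                hits.append(doc_id)
    return hits


candidates = candidate_documents(index, [], ["palpitation and short breath"])
```

For the query label word used above, the candidates are the three documents in its posting list.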

In step S504, the matching unit 2 determines, according to the semantic matching degree between the candidate text information and the query input information, the target text information matching the query input information.

Specifically, there is a certain semantic matching degree between the candidate text information and the query input information. This semantic matching degree can be obtained by calculation, or further by calculating the matching degree between the index word set corresponding to the candidate text information and the matching word set corresponding to the query input information. In step S504, the matching unit 2 determines the target text information matching the query input information according to the semantic matching degree between the candidate text information and the user's query input information; for example, it takes the candidate text information with the highest semantic matching degree as the target text information, or takes the candidate text information whose semantic matching degree is greater than a predetermined matching degree threshold as the target text information.

Preferably, step S504 further includes sub-step S504a (not shown) and sub-step S504b (not shown). In sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the query input information; in sub-step S504b, the matching unit 2 determines, according to the semantic matching degree and in combination with a predetermined matching degree threshold, the target text information matching the query input information.

For example, in sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the user's query input information according to an existing matching degree calculation method; when the semantic matching degree is greater than the predetermined matching degree threshold, then in sub-step S504b the matching unit 2 takes the candidate text information as the target text information matching the query input information.
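
The two selection policies described above (highest score, or score above a threshold) can be sketched as follows. The semantic matching degrees are taken as given, hypothetical inputs, since the calculation method is left open.

```python
def select_targets(scores, threshold=None):
    """Pick target documents from {doc_id: semantic matching degree}.

    Without a threshold, the best-scoring document(s) are returned;
    with a threshold, every document scoring above it is returned.
    """
    if not scores:
        return []
    if threshold is None:
        best = max(scores.values())
        return [doc for doc, s in scores.items() if s == best]
    return [doc for doc, s in scores.items() if s > threshold]


# Hypothetical semantic matching degrees for three candidates.
scores = {"ID1": 0.82, "ID2": 0.64, "ID4": 0.35}
top = select_targets(scores)
above = select_targets(scores, threshold=0.5)
```

Returning a list in both cases keeps the two policies interchangeable for the caller.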

Preferably, in step S504, the matching unit 2 also determines the target text information corresponding to the query input information according to the index word set corresponding to the candidate text information and the matching word set corresponding to the query input information. Specifically, each candidate text information has a corresponding index word set; continuing the above example, assume that the theme corresponding to the candidate text information ID1 is "coronary heart disease" and that its index words include "disease", "palpitation and short breath", etc.; the set composed of these index words is then the index word set corresponding to the candidate text information ID1. The user's query input information likewise has a corresponding matching word set; for example, matching words are obtained by performing word segmentation on the query input information, and the set composed of these matching words is taken as the matching word set corresponding to the query input information. If the query input information entered by the user is "palpitation and short breath vomiting", the matching unit 2 performs word segmentation on it and obtains the matching words "palpitation and short breath" and "vomiting"; the set composed of these two matching words is the matching word set corresponding to the query input information. In step S504, the matching unit 2 determines the target text information matching the user's query input information according to the index word set and the matching word set; for example, it takes the text information corresponding to the index word set that hits the most matching words in the matching word set as the target text information matching the query input information, or takes the text information corresponding to each index word set whose number of hit matching words is greater than a predetermined quantity threshold as the target text information matching the query input information.

Preferably, the steps of the matching unit 2 operate continuously. Specifically, in step S501, the matching unit 2 obtains the query input information entered by the user; in step S502, the matching unit 2 performs theme and label analysis on the query input information, to obtain the descriptor and label word corresponding to the query input information; in step S503, the matching unit 2 performs a matching query, according to the descriptor and label word, in the index established by the index establishment equipment 1, to obtain candidate text information matching the query input information; in step S504, the matching unit 2 determines, according to the semantic matching degree between the candidate text information and the query input information, the target text information matching the query input information. Here, those skilled in the art will understand that "continuously" means that the steps of the matching unit 2 respectively perform the acquisition of query input information, the theme and label analysis, the matching query for candidate text information and the determination of target text information according to a set or real-time adjusted operating mode, until the matching unit 2 stops obtaining query input information from users for a long time.

Here, the steps of the index establishment equipment 1 and of the matching unit 2 cooperate with each other so that, based on the query input information entered by the user, the corresponding target text information is obtained by matching. Themes and labels are extracted from encyclopedia-type resource knowledge, or from other resource knowledge mined from the web, to form an effective description of the content of that resource knowledge and to present such high-quality resource knowledge better. This makes semantic search over such resource knowledge more efficient, satisfies complex descriptive search needs that users cannot express accurately with keywords, and improves the user experience.

Preferably, the descriptor and the label word may also be regarded as belonging to two different fields, namely a subject field and a label field. In step S503, the matching unit 2 performs a matching query, according to the descriptor and the label word, in the aforementioned indexes corresponding to the subject field and the label field respectively, to obtain candidate text information matching the query input information.

Specifically, in step S503, the matching unit 2 uses field-by-field matching: according to the descriptor and label word obtained in step S502 by analyzing the query input information entered by the user, it performs a matching query in the indexes corresponding to the subject field and the label field respectively, to obtain the candidate text information.

Here, the indexes corresponding to the subject field and the label field are the indexes established by the index establishment equipment 1. Label words are extracted from the query input information entered by the user according to the previously established labels; for example, a word is extracted if it appears in the query input information and is also in the label set. Then, the label words and theme categories are used to pull candidates from the inverted entries of the unified theme-and-label index, and the documents containing those theme categories or labels are taken as the candidate text information corresponding to the query input information and participate in subsequent calculation.

Preferably, in step S503, the matching unit 2 may also take into account the weights corresponding to the subject field and the label field when performing the matching query in the corresponding indexes, and finally obtains the candidate text information by considering the weights of both fields together.
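
One possible reading of the field weighting is an additive score per document, sketched below. The additive combination, the weight values and the function name are all assumptions; the text only states that the weights of both fields are considered together.

```python
def weighted_candidates(subject_hits, label_hits, subject_weight, label_weight):
    """Rank candidate documents by the summed weights of the fields they were hit in."""
    scores = {}
    for doc in subject_hits:
        scores[doc] = scores.get(doc, 0.0) + subject_weight
    for doc in label_hits:
        scores[doc] = scores.get(doc, 0.0) + label_weight
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical hits from the subject-field index and the label-field index.
ranked = weighted_candidates(
    subject_hits=["ID1", "ID2"], label_hits=["ID2", "ID4"],
    subject_weight=0.6, label_weight=0.4,
)
```

A document hit in both fields outranks one hit in only the more heavily weighted field, which in turn outranks one hit only in the lighter field.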

Preferably, in step S504, the matching unit 2 determines, according to the matching words included in the matching word set, a target index word set among the index word sets corresponding to the candidate text information, wherein the target index word set hits the most matching words in the matching word set; if the similarity between the target index word set and the matching word set is greater than a predetermined threshold, the text information corresponding to the target index word set is taken as the target text information matching the query input information.

Specifically, in step S504, the matching unit 2 counts, for the index word set corresponding to each candidate text information, the number of matching words in the matching word set that it hits, and takes the index word set that hits the most matching words as the target index word set. Then, in step S504, the matching unit 2 calculates the similarity between the target index word set and the matching word set; for example, it calculates the similarity between each hit index word in the target index word set and the corresponding matching word in the matching word set, and then obtains the similarity between the target index word set and the matching word set by simple summation, weighted averaging or the like. When the similarity is greater than the predetermined threshold, in step S504 the matching unit 2 takes the text information corresponding to the target index word set as the target text information matching the query input information.
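
A minimal sketch of this two-stage decision, assuming that the word-level similarity is 1.0 for an exact hit and 0.0 otherwise and that the set-level similarity is a weighted average over the matching words. Both choices, as well as the weights and threshold, are illustrative assumptions.

```python
def target_index_word_set(matching_words, index_word_sets):
    """Return the id of the index word set that hits the most matching words."""
    matching = set(matching_words)
    return max(index_word_sets, key=lambda doc: len(matching & set(index_word_sets[doc])))


def set_similarity(matching_words, index_words, word_weights=None):
    """Weighted average over matching words of a 0/1 exact-match word similarity."""
    weights = word_weights or {w: 1.0 for w in matching_words}
    total = sum(weights[w] for w in matching_words)
    hit = sum(weights[w] for w in matching_words if w in index_words)
    return hit / total if total else 0.0


def match_target(matching_words, index_word_sets, threshold):
    """Return the target document id, or None if the similarity is too low."""
    doc = target_index_word_set(matching_words, index_word_sets)
    sim = set_similarity(matching_words, index_word_sets[doc])
    return doc if sim > threshold else None


index_word_sets = {
    "ID1": {"disease", "palpitation and short breath"},
    "ID3": {"disease", "vomiting"},
    "ID4": {"disease", "palpitation and short breath", "vomiting"},
}
matching_words = ["palpitation and short breath", "vomiting"]
target = match_target(matching_words, index_word_sets, threshold=0.6)
```

A graded word similarity (for example one that scores synonyms between 0 and 1) could replace the exact-match test without changing the surrounding logic.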

Preferably, the method further includes step S505 (not shown). In step S505, the matching unit 2 performs word segmentation on the query input information to obtain the segmented words, and merges the segmented words with the descriptor and label word obtained by the matching unit 2 in step S502, to obtain the matching word set corresponding to the query input information, wherein the words included in the matching word set serve as matching words. Then, in sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.

Specifically, in step S505, the matching unit 2 performs word segmentation on the query input information obtained in step S501, to obtain the segmented words. Preferably, in step S505, the matching unit 2 may also apply filtering such as stop-word removal to the segmented words, to obtain the final segmented words. Then, in step S505, the matching unit 2 merges the obtained segmented words with the descriptor and label word obtained in step S502, removes redundancy and so on, to finally obtain the matching word set corresponding to the query input information, and takes the words included in the matching word set as the matching words corresponding to the query input information.
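
The construction of the matching word set can be sketched as below. Whitespace splitting stands in for a real word segmenter, and the stop-word list is a hypothetical English placeholder; both are assumptions for illustration.

```python
STOP_WORDS = {"and", "with", "the", "a"}  # hypothetical stop-word list


def build_matching_word_set(query, descriptors, label_words):
    """Segment the query, drop stop words, merge with descriptors and label words, de-duplicate."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    merged = tokens + list(descriptors) + list(label_words)
    return list(dict.fromkeys(merged))  # remove redundancy, keep first-seen order


matching_word_set = build_matching_word_set(
    "palpitation with vomiting", descriptors=["disease"], label_words=["vomiting"]
)
```

The label word "vomiting" already appears among the segmented words, so it is kept only once.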

Then, in sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.

More preferably, the method further includes step S506 (not shown). In step S506, the matching unit 2 performs subsequent processing on the matching words to update the matching word set, wherein the subsequent processing includes at least any one of the following: merging mutually synonymous matching words into a subset of the matching word set; performing synonym expansion on the matching words.

Specifically, in step S506, the matching unit 2 performs subsequent processing on the matching words in the matching word set determined in step S505, to update the matching word set. For example, in step S506, the matching unit 2 determines the mutually synonymous matching words included in the matching words and merges them into a subset of the matching word set. Since the matching words may include mutually synonymous matching words, such as "vomiting" and "spitting", in step S506 the matching unit 2 merges these mutually synonymous matching words into a subset of the matching word set.

For example, assume that the query input information entered by the user is Q. In step S505, the matching unit 2 performs word segmentation on the query input information and, after filtering such as stop-word removal, the matching word set in the label field is expressed as Q = {a, b, c, d, e}, where a, b, c, d and e are the matching words included in the matching word set. Assuming that the matching words a and b are mutually synonymous, in step S506 the matching unit 2 merges the matching words a and b into a subset of the matching word set, and the updated matching word set is expressed as Q = {{a, b}, c, d, e}. Subsequent steps, such as the matching query of step S503, are then carried out.
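
The synonym merge in this example can be sketched with a single greedy pass, assuming synonymy is given as a set of word pairs. The function name is hypothetical, and each non-synonymous word is represented as a singleton group so that the result is a uniform list of sets.

```python
def merge_synonyms(words, synonym_pairs):
    """Group mutually synonymous words; every other word forms a singleton group."""
    groups = []
    for word in words:
        for group in groups:
            # Join the first existing group that contains a synonym of this word.
            if any((word, g) in synonym_pairs or (g, word) in synonym_pairs for g in group):
                group.add(word)
                break
        else:
            groups.append({word})
    return groups


Q = merge_synonyms(["a", "b", "c", "d", "e"], synonym_pairs={("a", "b")})
```

The group `{"a", "b"}` plays the role of the subset {a, b} in Q = {{a, b}, c, d, e}.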

As another example, in step S506, the matching unit 2 also performs synonym expansion on the matching words, and determines the synonyms obtained by the expansion, together with the matching words, as a subset of the matching word set. Specifically, in step S506, the matching unit 2 may also perform synonym expansion on the matching words in the matching word set corresponding to the query input information, e.g., expanding "shortness of breath and palpitation" to its synonym "palpitation and short breath"; then, in step S506, the matching unit 2 determines the synonym obtained by the expansion and the matching word as a subset of the matching word set.

Continuing the above example, for the matching word set Q = {{a, b}, c, d, e} obtained after synonym merging, in step S506 the matching unit 2 may also perform synonym expansion on the matching word set, obtaining the synonyms of the matching words a, b, c, d and e, and determine the synonyms obtained by the expansion, together with the matching words, as subsets of the matching word set. For example, after multiple synonym expansions, the matching word set Q is expressed as follows:

Then, in step S503, the matching unit 2 performs a matching query in the index established by the index establishment equipment 1 according to the matching word set, for example, obtaining through the inverted index the candidate text information that contains the matching words of the matching word set.
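
Synonym expansion followed by retrieval can be sketched as follows, assuming a hypothetical thesaurus mapping each word to its synonyms and an inverted index mapping each term to document ids.

```python
def expand(groups, thesaurus):
    """Add the known synonyms of every word to the group that word belongs to."""
    return [set(g) | {s for w in g for s in thesaurus.get(w, ())} for g in groups]


def retrieve(groups, index):
    """Collect the documents whose postings are hit by any word of any group."""
    docs = set()
    for group in groups:
        for word in group:
            docs.update(index.get(word, ()))
    return sorted(docs)


# Hypothetical thesaurus and inverted index.
thesaurus = {"shortness of breath and palpitation": ["palpitation and short breath"]}
index = {"palpitation and short breath": ["ID1", "ID2", "ID4"]}

groups = expand([{"shortness of breath and palpitation"}], thesaurus)
docs = retrieve(groups, index)
```

Although only the expanded synonym is indexed in this toy data, the original wording of the query still reaches the three documents.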

Then, in sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.

Then, assuming that the calculated semantic matching degree is greater than the predetermined threshold, in sub-step S504b the matching unit 2 takes the text information corresponding to the index word set as the target text information matching the query input information.

It should be noted that the present invention can be implemented in software and/or in a combination of software and hardware; for example, it can be implemented using an application-specific integrated circuit (ASIC), a general-purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention can be executed by a processor to implement the steps or functions described above. Similarly, the software program of the present invention (including related data structures) can be stored in a computer-readable recording medium, for example, a RAM memory, a magnetic or optical drive, a floppy disk or a similar device. In addition, some steps or functions of the present invention can be implemented in hardware, for example, as a circuit that cooperates with a processor to execute each step or function.

In addition, a part of the present invention can be applied as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the present invention includes an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to run the method and/or technical solution based on the aforementioned embodiments of the present invention.

It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be regarded in every respect as illustrative and not restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be embraced by the present invention. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and that the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims

1. A method for establishing an index based on text information, wherein the method comprises the following steps:

A. analyzing directory information and/or subdirectory information included in the text information, structuring the acquired text information, and determining structured information therefrom;

B. extracting a descriptor from the structured information;

C1. determining, according to the theme corresponding to the descriptor, at least one candidate label word corresponding to the theme from the text information;

C2. determining a corresponding centre word according to the at least one candidate label word;

C3. determining a label word corresponding to the theme according to the distance between the at least one candidate label word and the centre word;

D. establishing an index for the descriptor and the label word.

2. The method according to claim 1, wherein the method further comprises:

obtaining, according to a predetermined theme system, a training corpus corresponding to the predetermined theme system;

training a subject classifier according to the training corpus;

wherein step B comprises:

extracting the descriptor from the structured information according to the subject classifier.

3. The method according to claim 1, wherein step C2 comprises:

filtering the at least one candidate label word according to a predetermined filtering rule, to obtain at least one filtered candidate label word;

determining the centre word according to the at least one filtered candidate label word;

wherein the predetermined filtering rule is determined based on at least any one of the following:

the part of speech of the at least one candidate label word;

the word rules of the at least one candidate label word;

the co-occurrence ratio of the at least one candidate label word and the theme.

4. The method according to any one of claims 1 to 3, wherein the method further comprises:

if the label words include multiple semantically consistent label words, determining a normalization result for the multiple semantically consistent label words;

wherein step D comprises:

establishing an index for the descriptor, the label words and the normalization result.

5. the method for the inquiry input information for the index matching user that one kind is established according to claim 1, wherein this method packet Include following steps:

The inquiry that a obtains user's input inputs information；

B carries out theme to inquiry input information and label is analyzed, to obtain theme corresponding to the inquiry input information Word and label word；

C carries out matching inquiry according to the descriptor and label word in the index that such as claim 1 is established, with obtain with The candidate text information that the inquiry input information matches；

D inputs the semantic matching degree of information according to the candidate text information and the inquiry, determining to believe with inquiry input The matched target text information of manner of breathing.

6. The method according to claim 5, wherein step d comprises:

d1. calculating the semantic matching degree between the candidate text information and the query input information;

d2. determining, according to the semantic matching degree and in combination with a predetermined matching degree threshold, the target text information matching the query input information.

7. The method according to claim 6, wherein the method further comprises:

performing word segmentation on the query input information, to obtain the segmented words;

merging the segmented words with the descriptor and label word obtained in step b, to obtain a matching word set corresponding to the query input information, wherein the words included in the matching word set serve as matching words;

wherein step d1 comprises:

calculating the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.

8. The method according to claim 7, wherein the method further comprises:

performing subsequent processing on the matching words, to update the matching word set;

wherein the subsequent processing comprises at least any one of the following:

determining matching words that are synonymous with one another among the matching words, and merging the mutually synonymous matching words into a subset of the matching word set;

performing synonym expansion on the matching words, and determining the synonyms obtained by the expansion, together with the matching words, as a subset of the matching word set.
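Both kinds of subsequent processing can be sketched together. The synonym dictionary (a mapping from a word to its synonyms) and the function name group_and_expand are assumptions for illustration; the claim does not state where synonyms come from.

```python
from typing import Dict, List, Set


def group_and_expand(
    matching_words: Set[str], synonyms: Dict[str, Set[str]]
) -> List[Set[str]]:
    """Return subsets of the matching word set, one per group of synonyms."""
    groups: List[Set[str]] = []
    for word in sorted(matching_words):
        # Synonym expansion: the word plus its synonyms from the
        # (hypothetical) dictionary form one tentative subset.
        expanded = {word} | synonyms.get(word, set())
        # Merging: fold in every existing subset that shares a word with it.
        overlapping = [g for g in groups if g & expanded]
        for g in overlapping:
            expanded |= g
            groups.remove(g)
        groups.append(expanded)
    return groups
```

Each returned subset can then be treated as a single unit during matching, so that a candidate text containing any member of the subset counts as containing the matching word.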

9. An index establishing equipment for establishing an index based on text information, wherein the equipment comprises:

an information determining device, for analyzing directory information and/or subdirectory information included in the text information, structuring the acquired text information, and determining structured information therefrom;

a label determining device, comprising:

a candidate determination unit, for determining from the text information, according to the theme corresponding to the descriptor, at least one candidate label word corresponding to the theme;

a centre word determination unit, for determining a corresponding centre word according to the at least one candidate label word;

a tag determination unit, for determining the label word corresponding to the theme according to the distance between the at least one candidate label word and the centre word;
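The centre word determination unit and the tag determination unit can be illustrated with a minimal sketch. It assumes that the centre word is the candidate with the highest co-occurrence ratio with the theme, and that a caller-supplied distance function and a cutoff max_distance decide which candidates become label words. These choices, and both function names, are hypothetical and are not stated in the claim.

```python
from typing import Callable, Dict, Iterable, List


def pick_centre_word(cooccurrence: Dict[str, float]) -> str:
    """Assumed rule: the candidate co-occurring most with the theme is the centre."""
    # Sorting first makes ties resolve to the alphabetically first candidate.
    return max(sorted(cooccurrence), key=lambda w: cooccurrence[w])


def select_label_words(
    candidates: Iterable[str],
    centre: str,
    distance: Callable[[str, str], float],
    max_distance: float,
) -> List[str]:
    """Keep candidates whose distance to the centre word is within the cutoff."""
    return [w for w in candidates if distance(w, centre) <= max_distance]
```

Any distance measure can be plugged in, for example one based on word vectors or on shared tokens; the sketch does not depend on the choice.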

10. The index establishing equipment according to claim 9, wherein the equipment further comprises a theme training device, for:

training a theme classifier according to the training corpus;

wherein the theme extraction device is for:

11. The index establishing equipment according to claim 9, wherein the centre word determination unit is for:

the part of speech of the at least one candidate label word;

the word rule of the at least one candidate label word;

the co-occurrence ratio of the at least one candidate label word and the theme.

12. The index establishing equipment according to any one of claims 9 to 11, wherein the equipment further comprises:

a normalization device, for determining, if the label words include a plurality of semantically consistent label words, a normalization result of the plurality of semantically consistent label words;

wherein the index establishing device is for:

13. A matching equipment for matching inquiry input information of a user against an index established according to claim 9, wherein the equipment comprises:

a matching inquiry device, for performing a matching inquiry, according to the descriptor and label word, in the index established according to claim 10, to obtain candidate text information that matches the inquiry input information;

a text determining device, for determining, according to the semantic matching degree between the candidate text information and the inquiry input information, target text information that matches the inquiry input information.

14. The equipment according to claim 13, wherein the text determining device comprises:

a matching calculation unit, for calculating the semantic matching degree between the candidate text information and the inquiry input information;

a text determination unit, for determining, according to the semantic matching degree and in combination with a predetermined matching degree threshold, the target text information that matches the inquiry input information.

15. The equipment according to claim 14, wherein the equipment further comprises a word set determining device, for:

merging the segmented words with the descriptor and label word obtained by the information analysis device, to obtain a matching word set corresponding to the inquiry input information, wherein the words included in the matching word set serve as matching words;

wherein the matching calculation unit is for:

16. The equipment according to claim 15, wherein the equipment further comprises a subsequent processing device, for:

wherein the subsequent processing comprises at least any one of the following:

17. A system for establishing an index and matching inquiry input information of a user, comprising the index establishing equipment according to any one of claims 9 to 12 and the matching equipment according to any one of claims 13 to 16.