CN103886034A - Method and equipment for building indexes and matching inquiry input information of user - Google Patents

Info

Publication number
CN103886034A
Authority
CN
China
Prior art keywords
word
label
candidate
index
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410079818.7A
Other languages
Chinese (zh)
Other versions
CN103886034B (en)
Inventor
方高林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201410079818.7A priority Critical patent/CN103886034B/en
Publication of CN103886034A publication Critical patent/CN103886034A/en
Application granted granted Critical
Publication of CN103886034B publication Critical patent/CN103886034B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and equipment for building indexes and matching a user's query input information. Structured information is determined from text information and subject words are extracted from it; label words corresponding to the subject of the subject words are determined; and indexes are built from the subject words and the label words. Moreover, subject words and label words are obtained by analyzing the query input information entered by the user, a matching query is carried out in the built indexes to obtain candidate text information, and the target text information matching the query input information is determined according to the degree of semantic matching between the candidate text information and the query input information. Compared with the prior art, subjects and titles are extracted on the basis of encyclopedia or other network resource knowledge to form an effective description of the resource knowledge content, so that semantic search of the resource knowledge is more efficient, the need to search with complex descriptions that users cannot express accurately in keywords is met, and the user experience is improved.

Description

A method and equipment for building an index and matching a user's query input information
Technical Field
The present invention relates to the field of computer technology, and in particular to a technique for building an index and matching a user's query input information.
Background
When using a search engine, people often do not know which keywords will express what they have in mind, and may instead enter a string of descriptive phrases, for example: 1) "Vomiting after getting up in the morning, frequent palpitations and shortness of breath, weak limbs: what disease are these symptoms of?" 2) "Songs that express longing for a lover"; 3) "The song that contains 'say what forget wealth and rank'"; 4) "Which film is 'eating hot pot and singing songs' from, and who says it?" 5) "Verses about studying hard"; 6) "'It is hard to be a person, and harder to be a woman': who said it, and what is the complete saying?" Some users also enter structurally complex expressions. For example, for the person category, a user may ask "Which emperors and presidents came from Anhui?" or "Introduce the Politburo Standing Committee members from Shanxi in the current government". In such cases a search engine has difficulty finding suitable results.
The reason is that current general-purpose search engines mainly build their indexes on titles. Although these engines usually also index content, factors such as weight adjustment make it hard for some high-quality descriptive knowledge to surface. For example, for resources such as songs and films, existing search engines usually index only the song title and film title. When a user does not remember the title but only some lyrics, lines, a synopsis, or a fragment of a description, existing search engines cannot search effectively. The same happens for resources in categories such as novels, poems, couplets, greetings, people, TV series, sentences, idioms, and diseases.
Encyclopedia-type resource knowledge is usually indexed around the entry word. As a result, general ranking algorithms have difficulty ranking an entry highly when the query terms do not appear in its title. In fact, because encyclopedia-type resource knowledge is authoritative, ranking such data near the top would satisfy users' needs well. For example, if the symptoms of the diseases in an encyclopedia are labeled and indexed, the corresponding resource knowledge can be offered to users according to the symptoms they describe.
Therefore, how to make effective use of existing resource knowledge, build an index for it, and match the target text information corresponding to a user's query input information has become one of the problems that those skilled in the art urgently need to solve.
Summary of the Invention
The object of the present invention is to provide a method and equipment for building an index and matching a user's query input information.
According to one aspect of the invention, a method for building an index based on text information is provided, the method comprising the following steps:
A. determining structured information from the text information;
B. extracting a descriptor (subject word) from the structured information;
C. determining, in the text information, the label words corresponding to the theme to which the descriptor corresponds;
D. building an index for the descriptor and the label words.
According to another aspect of the invention, a method for matching a user's query input information against the index built as described above is also provided, the method comprising the following steps:
A. obtaining the query input information entered by the user;
B. performing theme and label analysis on the query input information to obtain the descriptor and label words corresponding to it;
C. performing a matching query in the aforementioned index according to the descriptor and label words, to obtain candidate text information matching the query input information;
D. determining, according to the degree of semantic matching between the candidate text information and the query input information, the target text information that matches the query input information.
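Steps C and D above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the index layout (`term -> {doc_id: weight}`), the sample weights, and the function name `match_query` are all assumptions, and the summed weight is only a stand-in for the degree of semantic matching, which the text does not define.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> {document ID: importance weight}.
INDEX = {
    "disease": {"ID1": 1.0, "ID2": 1.0, "ID3": 1.0, "ID4": 1.0},
    "palpitation": {"ID1": 0.75, "ID2": 0.5, "ID4": 0.25},
    "vomiting": {"ID3": 0.5, "ID4": 0.25},
}

def match_query(descriptors, label_words, index):
    """Collect candidate documents for the query terms and rank them.

    The score is the sum of the stored weights of every matched term,
    an assumed proxy for the semantic matching degree.
    """
    scores = defaultdict(float)
    for term in list(descriptors) + list(label_words):
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

ranked = match_query(["disease"], ["palpitation", "vomiting"], INDEX)
target_doc = ranked[0][0]  # best-scoring candidate under this toy score
```

Every document that shares at least one term with the query becomes a candidate; only the ranking step decides which one is returned as the target.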
According to yet another aspect of the invention, index building equipment for building an index based on text information is also provided, the equipment comprising:
an information determining device, for determining structured information from text information;
a theme extraction device, for extracting a descriptor from the structured information;
a label determining device, for determining, in the text information, the label words corresponding to the theme to which the descriptor corresponds;
an index building device, for building an index for the descriptor and the label words.
According to a further aspect of the invention, matching equipment for matching a user's query input information against the index built as described above is also provided, the equipment comprising:
a query acquisition device, for obtaining the query input information entered by the user;
an information analysis device, for performing theme and label analysis on the query input information to obtain the descriptor and label words corresponding to it;
a matching query device, for performing a matching query in the index built as described above according to the descriptor and label words, to obtain candidate text information matching the query input information;
a text determining device, for determining, according to the degree of semantic matching between the candidate text information and the query input information, the target text information that matches the query input information.
According to a further aspect of the invention, a system for building an index and matching a user's query input information is also provided, comprising the index building equipment described above and the matching equipment described above.
Compared with the prior art, the present invention determines structured information from text information; extracts a descriptor from the structured information; determines, in the text information, the label words corresponding to the theme to which the descriptor corresponds; and builds an index for the descriptor and the label words. Further, the present invention obtains the query input information entered by a user; performs theme and label analysis on it to obtain the corresponding descriptor and label words; performs a matching query in the aforementioned index according to the descriptor and label words to obtain candidate text information matching the query input information; and determines the target text information matching the query input information according to the degree of semantic matching between the candidate text information and the query input information.
The present invention works on encyclopedia-type resource knowledge, or on other resource knowledge obtained by web mining, and extracts themes and titles from it to form an effective description of the resource knowledge content. This presents such high-quality resource knowledge better, makes semantic search of it more efficient, meets the need to search with complex descriptions that users cannot express accurately in keywords, and improves the user experience.
Brief Description of the Drawings
Other features, objects and advantages of the present invention will become more apparent on reading the detailed description of non-limiting embodiments given with reference to the following drawings:
Fig. 1 shows a schematic diagram of equipment for building an index based on text information according to one aspect of the invention;
Fig. 2 shows a schematic diagram of equipment for building an index based on text information according to a preferred embodiment of the invention;
Fig. 3 shows a schematic diagram of equipment for matching a user's query input information according to another aspect of the invention;
Fig. 4 shows a flowchart of a method for building an index based on text information according to another aspect of the invention;
Fig. 5 shows a flowchart of a method for building an index based on text information according to another aspect of the invention.
In the drawings, the same or similar reference numerals denote the same or similar components.
Detailed Description
The present invention is described in further detail below with reference to the drawings.
Fig. 1 shows a schematic diagram of equipment for building an index based on text information according to one aspect of the invention. The index building equipment 1 comprises an information determining device 101, a theme extraction device 102, a label determining device 103 and an index building device 104.
The information determining device 101 determines structured information from text information. Specifically, the information determining device 101 obtains text information, for example by interacting with a data source such as encyclopedia data, and then structures the text information, for example by analyzing the directory and subdirectory information it contains, to determine the structured information.
For example, the information determining device 101 interacts with encyclopedia data such as Baidu Baike and Hudong Baike to obtain encyclopedia-type resource knowledge as text information. It then structures the text information, for example by analyzing the directories and subdirectories of each piece of resource knowledge; for "disease" resource knowledge, it analyzes the directory or subdirectory corresponding to symptoms, the directory or subdirectory corresponding to treatment methods, and so on.
As another example, the information determining device 101 mines resource knowledge from the internet as text information and then structures it to determine the structured information. For instance, by mining vertical resource websites, it obtains diseases together with their symptom descriptions, treatment methods, specialist hospitals and similar information, with each resource organized under the disease as its ID. Concretely, some candidate seed words are first given per category; for diseases, these might be coronary heart disease, myocarditis, gastritis and so on. The URLs of generally top-ranked websites are obtained from search results, the structure of each website is analyzed, and the symptoms, treatment methods and specialist hospitals of coronary heart disease are extracted from it. This information is merged into the "disease" entry for coronary heart disease, which is organized as an entry card and stored. "Coronary heart disease" then serves as the final text information, and the corresponding "symptoms of coronary heart disease", "treatment methods for coronary heart disease", "specialist hospitals for coronary heart disease" and so on serve as the structured information of that text information.
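The structuring step described above can be sketched as follows. This is only an illustration under stated assumptions: the `== Heading ==` section markup, the function name `to_structured`, and the sample entry text are hypothetical, since the text does not specify how directories and subdirectories are encoded in the source data.

```python
def to_structured(title, raw_text):
    """Split an entry into named sections keyed by its directory headings.

    Assumes (hypothetically) that headings look like '== Symptoms ==';
    text before the first heading is stored under 'summary'.
    """
    sections = {}
    current = "summary"
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("== ") and line.endswith(" =="):
            current = line[3:-3].strip().lower()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return {
        "title": title,
        "sections": {name: " ".join(parts) for name, parts in sections.items()},
    }

raw = "\n".join([
    "Coronary heart disease is a heart condition.",
    "== Symptoms ==",
    "Chest pain.",
    "Shortness of breath.",
    "== Treatment ==",
    "Medication.",
])
entry = to_structured("coronary heart disease", raw)
```

The resulting dictionary plays the role of the entry card: the title is the text information's identifier and each section is one piece of structured information.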
Those skilled in the art will understand that the above ways of determining structured information are only examples; other existing or future ways of determining structured information, where applicable to the present invention, should also fall within its scope of protection and are incorporated herein by reference.
The theme extraction device 102 extracts a descriptor from the structured information. Specifically, the theme extraction device 102 extracts the descriptor from the structured information determined by the information determining device 101, for example by means of a theme classifier or another predetermined way of extracting descriptors.
Here, the purpose of extracting the descriptor is to obtain from the text information a theme that represents it, which serves the building of the semantic index and the subsequent semantic matching calculation.
Preferably, the index building equipment 1 also comprises a theme training device (not shown). The theme training device obtains, according to a predetermined theme system, a training corpus corresponding to that theme system, and trains a theme classifier on the training corpus; the theme extraction device 102 then extracts the descriptor from the structured information using the theme classifier.
Specifically, the theme training device first determines the predetermined theme system. For example, from statistics on the large number of query sequences entered by web search users, it determines the search needs that users commonly have, and, in combination with commonly used classification systems such as those of existing encyclopedia and question-and-answer (Zhidao) services, determines a theme classification system that reflects those needs and takes it as the predetermined theme system. The theme training device then obtains the training corpus corresponding to this theme system; for example, if an article carries the site-location identifier "medical and health, internal medicine", that data is taken as training corpus for the disease category. Finally, the theme training device trains the theme classifier on the training corpus, for example by training an SVM classification model to serve as the theme classifier.
The theme extraction device 102 then extracts the descriptor from the structured information using the theme classifier trained by the theme training device. For example, it feeds the word "coronary heart disease" together with structured information such as its symptoms and treatment methods into the theme classifier and obtains the theme "disease". Likewise, for a new encyclopedia entry card, the theme extraction device 102 feeds it into the theme classifier, such as the SVM classifier, and obtains the theme corresponding to the category of that entry card.
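The train-then-classify flow above can be illustrated with a small stand-in. Note that this sketch does not implement an SVM: it swaps in a nearest-centroid bag-of-words classifier with cosine similarity so that it runs on the standard library alone. The function names, the tiny corpus and the theme labels are all hypothetical.

```python
import math
from collections import Counter

def train_centroids(corpus):
    """corpus: list of (tokens, theme). Returns theme -> summed term counts."""
    centroids = {}
    for tokens, theme in corpus:
        centroids.setdefault(theme, Counter()).update(tokens)
    return centroids

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(tokens, centroids):
    """Return the theme whose centroid is most similar to the input tokens."""
    vec = Counter(tokens)
    # Iterating over sorted themes makes tie-breaking deterministic.
    return max(sorted(centroids), key=lambda theme: cosine(vec, centroids[theme]))

corpus = [
    (["symptom", "treatment", "hospital"], "disease"),
    (["lyrics", "singer", "album"], "song"),
]
centroids = train_centroids(corpus)
theme = classify(["coronary", "symptom", "treatment"], centroids)
```

In a real system the labeled corpus would come from the predetermined theme system, and a trained SVM model would take the place of `classify`.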
Preferably, the theme extraction device 102 can also expand the extracted theme with synonymous expressions; for example, the theme "disease" is expanded by adding a synonymous theme such as "illness".
Those skilled in the art will understand that the above ways of extracting descriptors are only examples; other existing or future ways of extracting descriptors, where applicable to the present invention, should also fall within its scope of protection and are incorporated herein by reference.
The label determining device 103 determines, in the text information, the label words corresponding to the theme to which the descriptor corresponds. Specifically, given the descriptor extracted by the theme extraction device 102 and its corresponding theme, the label determining device 103 determines the label words for that theme in the text information. For example, for text information whose theme is disease, the label determining device 103 determines label words such as: palpitations and shortness of breath, chest tightness, diarrhea, vomiting, weak limbs, and so on.
Preferably, the label determining device 103 comprises a candidate determining unit (not shown), a center word determining unit (not shown) and a label determining unit (not shown). The candidate determining unit determines, in the text information, at least one candidate label word corresponding to the theme to which the descriptor corresponds. For example, the candidate determining unit counts all unigrams, bigrams and trigrams in the page data organized by entry word, and extracts as candidate label words those that occur in more than a certain number of pages.
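The n-gram statistics performed by the candidate determining unit can be sketched as follows. This is a minimal sketch under assumptions: pages are taken to be already tokenized, and the names `candidate_label_words` and `page_threshold` are illustrative rather than taken from the patent.

```python
from collections import Counter

def candidate_label_words(pages, page_threshold=1, max_n=3):
    """Return n-grams (n = 1..max_n) that occur in more than page_threshold pages.

    pages: list of token lists, one per page. Each n-gram is counted at
    most once per page, so the count is a page frequency.
    """
    page_freq = Counter()
    for tokens in pages:
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        page_freq.update(seen)
    return {gram for gram, count in page_freq.items() if count > page_threshold}

pages = [
    ["chest", "pain", "fatigue"],
    ["chest", "pain", "nausea"],
    ["fatigue", "nausea"],
]
candidates = candidate_label_words(pages, page_threshold=1)
```

Counting pages rather than raw occurrences keeps a single long page from promoting its own phrasing to a candidate.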
Next, the center word determining unit determines the corresponding center word from the at least one candidate label word. The label determining unit then determines the label words corresponding to the theme according to the distance between each candidate label word and the center word.
For example, the center word determining unit merges all candidate label words from the label data counted above and performs offline statistics on them as follows: over a large-scale text collection, such as web-wide data, it counts how often words co-occur within a document. For any two candidate label words, the similarity between them is calculated with the following formula:
$$\mathrm{Sim}(w_1, w_2) = \frac{\sum_{w'} \mathrm{PMI}(w', w_1)\,\mathrm{PMI}(w', w_2)}{\sum_{w'} \mathrm{PMI}(w', w_1)^2 \; \sum_{w'} \mathrm{PMI}(w', w_2)^2}$$
Here, $\mathrm{PMI}(w', w_1)$ denotes the mutual information score between $w'$ and $w_1$, defined by the formula in image BDA0000473323290000072 (not reproduced in this text), where $p(w)$ denotes the statistically estimated probability of the word $w$.
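The similarity computation can be sketched as below, with two loudly stated assumptions. First, the PMI definition is lost with the formula image, so the sketch uses the standard pointwise mutual information, log(p(a, b) / (p(a) p(b))), estimated from document co-occurrence, and returns 0 for pairs that never co-occur. Second, the extracted similarity formula shows no radical sign; the sketch assumes the usual cosine normalization (a square root over each sum in the denominator), so that identical context vectors score 1. Removing the `math.sqrt` calls follows the extracted form literally. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def build_stats(docs):
    """docs: iterable of word collections. Returns (word_df, pair_df, n_docs)."""
    word_df, pair_df = Counter(), Counter()
    n_docs = 0
    for doc in docs:
        words = sorted(set(doc))
        word_df.update(words)
        pair_df.update(combinations(words, 2))  # keys are (a, b) with a < b
        n_docs += 1
    return word_df, pair_df, n_docs

def pmi(a, b, stats):
    """Assumed standard PMI from document frequencies; 0.0 if never co-occurring."""
    word_df, pair_df, n_docs = stats
    joint = pair_df.get((a, b) if a < b else (b, a), 0)
    if joint == 0:
        return 0.0
    return math.log(joint * n_docs / (word_df[a] * word_df[b]))

def sim(w1, w2, stats):
    """Cosine of the PMI context vectors of w1 and w2 (assumed normalization)."""
    context = [w for w in stats[0] if w not in (w1, w2)]
    v1 = [pmi(w, w1, stats) for w in context]
    v2 = [pmi(w, w2, stats) for w in context]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return dot / norm if norm else 0.0

docs = [
    {"vomiting", "stomach"},
    {"retching", "stomach"},
    {"cough", "lung"},
    {"wheeze", "lung"},
]
stats = build_stats(docs)
```

In this toy corpus "vomiting" and "retching" never co-occur directly, yet they share the context word "stomach", which is exactly the situation the context-vector similarity is meant to capture.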
Next, the center word determining unit determines, according to the theme, which fields of the text information need to be analyzed, such as the symptom category of a disease, the body and annotations of a poem, or the descriptive part of a person entry. From those fields it extracts all words that appear among the candidate label words, together with their synonyms, and groups these words into one center, which serves as the center word for the at least one candidate label word.
The label determining unit then calculates the distance between each of the candidate label words and the center word. For example, letting T denote the center word, the distance between a candidate label word x and the center word can be obtained with the following formula:
$$\mathrm{Dis}(x) = \frac{\sum_{w \in T} \mathrm{Sim}(w, x)}{\mathrm{Num}(T)}$$
Here, Num(T) denotes the number of words contained in the center word.
The label determining unit then determines the label words corresponding to the theme according to the distance between each candidate label word and the center word, for example by taking as label words the candidate label words whose distance to the center word is less than a predetermined threshold.
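A minimal sketch of the distance computation and threshold selection follows. The similarity table, the threshold value and the function names are hypothetical. Because Dis(x) as defined is an average similarity, it is not obvious from the text alone whether "less than the threshold" or "greater than the threshold" selects the closer words, so the direction of the comparison is exposed as a parameter; the default follows the text literally.

```python
def dis(x, center, sim):
    """Dis(x): sum of Sim(w, x) over the words w in the center T, divided by Num(T)."""
    return sum(sim(w, x) for w in center) / len(center)

def select_labels(candidates, center, sim, threshold, keep_below=True):
    """Keep candidates whose Dis value is below (or above) the threshold."""
    kept = []
    for x in candidates:
        d = dis(x, center, sim)
        if (d < threshold) if keep_below else (d > threshold):
            kept.append(x)
    return kept

# Hypothetical pairwise similarities; unlisted pairs count as 0.0.
TOY_SIM = {
    frozenset(("vomiting", "retching")): 0.75,
    frozenset(("nausea", "retching")): 0.25,
    frozenset(("nausea", "weather")): 0.25,
}

def toy_sim(a, b):
    return TOY_SIM.get(frozenset((a, b)), 0.0)

center = ["vomiting", "nausea"]
candidates = ["retching", "weather"]
below = select_labels(candidates, center, toy_sim, threshold=0.25)
above = select_labels(candidates, center, toy_sim, threshold=0.25, keep_below=False)
```

Here "retching" scores 0.5 and "weather" scores 0.125 against the center, so the two comparison directions select different words, which is why the choice is left explicit.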
Preferably, as shown in Fig. 3, the label determining unit arranges the candidate label words as a series of rank against distance to the center word; if the slope between consecutive ranks exceeds a predetermined slope threshold, the subsequent nodes are cut off, as between the 5th and 6th ranked points in Fig. 3.
Here, the slope threshold is set empirically, for example from statistics on the overall distribution of the scores.
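The slope-based cutoff can be sketched as follows, assuming the scores are already ordered by rank and that the slope between consecutive ranks is simply the absolute difference of their scores (a rank step of 1). The function name and the sample values are illustrative only.

```python
def truncate_at_slope(ranked_scores, slope_threshold):
    """Keep scores up to the first rank step whose change exceeds the threshold."""
    for i in range(1, len(ranked_scores)):
        if abs(ranked_scores[i] - ranked_scores[i - 1]) > slope_threshold:
            return list(ranked_scores[:i])
    return list(ranked_scores)

# Seven ranked points with a sharp change between the 5th and 6th.
scores = [0.875, 0.75, 0.75, 0.625, 0.5, 0.125, 0.0]
kept = truncate_at_slope(scores, slope_threshold=0.25)
```

With these values the first five points are kept and the 6th and 7th are cut off, mirroring the 5th-to-6th-point example in the text.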
Those skilled in the art will understand that the above ways of determining label words are only examples; other existing or future ways of determining label words, where applicable to the present invention, should also fall within its scope of protection and are incorporated herein by reference.
More preferably, the center word determining unit filters the at least one candidate label word according to predetermined filtering rules to obtain at least one filtered candidate label word, and determines the center word from the filtered candidate label words; the predetermined filtering rules are determined on the basis of at least one of the following:
- the part of speech of the at least one candidate label word;
- the word rules of the at least one candidate label word;
- the co-occurrence ratio of the at least one candidate label word with the theme.
Specifically, noise may be introduced while the candidate label words are being counted, so the candidate label words need to be filtered. The center word determining unit filters the at least one candidate label word according to the predetermined filtering rules to obtain at least one filtered candidate label word.
For example, the center word determining unit filters the at least one candidate label word according to part of speech, such as by filtering on the part of speech of its first word and last word.
As another example, the center word determining unit filters the at least one candidate label word according to word rules: the first word of a candidate label word must not be a function word such as "do", "by" (the passive marker) or "than", and the last word must not be a word such as "when", "to" or "get".
As a further example, the center word determining unit filters the at least one candidate label word according to its co-occurrence ratio with the theme: it counts, in search logs and in web-wide page titles, the co-occurrence ratio between each candidate label word and the theme, and either retains only the candidates that have co-occurred with the theme or retains those whose co-occurrence ratio with the theme is greater than a predetermined threshold.
Preferably, the center word determining unit filters the at least one candidate label word by combining any two of the above predetermined filtering rules, or by considering all three together.
The center word determining unit then determines the center word from the at least one filtered candidate label word.
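The three filtering rules can be combined as in the following sketch. Everything concrete here is an assumption made for illustration: the banned word lists, the part-of-speech tag names, the pre-computed `cooccur_ratio` field and the threshold are not specified by the text, and a real system would obtain the tags from a tagger and the ratios from search logs and page titles.

```python
# Illustrative rule data; not taken from the patent.
BANNED_FIRST = {"do", "by", "than"}
BANNED_LAST = {"when", "to", "get"}
BANNED_EDGE_POS = {"PREP", "PART"}

def keep_candidate(cand, min_ratio):
    """Apply the part-of-speech, word-rule and co-occurrence filters in turn."""
    tokens, tags = cand["tokens"], cand["tags"]
    # Rule 1: part of speech of the first and last words.
    if tags[0] in BANNED_EDGE_POS or tags[-1] in BANNED_EDGE_POS:
        return False
    # Rule 2: word rules on the first and last words.
    if tokens[0] in BANNED_FIRST or tokens[-1] in BANNED_LAST:
        return False
    # Rule 3: co-occurrence ratio with the theme must exceed the threshold.
    return cand["cooccur_ratio"] > min_ratio

def filter_candidates(cands, min_ratio=0.1):
    return [c for c in cands if keep_candidate(c, min_ratio)]

cands = [
    {"tokens": ["chest", "pain"], "tags": ["NOUN", "NOUN"], "cooccur_ratio": 0.5},
    {"tokens": ["by", "night"], "tags": ["PREP", "NOUN"], "cooccur_ratio": 0.5},
    {"tokens": ["blue", "sky"], "tags": ["ADJ", "NOUN"], "cooccur_ratio": 0.0},
    {"tokens": ["pain", "when"], "tags": ["NOUN", "ADV"], "cooccur_ratio": 0.5},
]
filtered = [" ".join(c["tokens"]) for c in filter_candidates(cands)]
```

Only "chest pain" survives: the other three candidates fail the part-of-speech rule, the co-occurrence rule and the word rule respectively.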
Those skilled in the art will understand that the above predetermined filtering rules are only examples; other existing or future predetermined filtering rules, where applicable to the present invention, should also fall within its scope of protection and are incorporated herein by reference.
The index building device 104 builds an index for the descriptor and the label words. Specifically, the index building device 104 builds the index from the descriptor extracted by the theme extraction device 102 and the label words determined by the label determining device 103.
For example, suppose the document for coronary heart disease is ID1 and the importance of a term x in that document is WC1(x), where x may be "disease", "palpitations and shortness of breath" and so on; the document for myocarditis is ID2, the document for gastritis is ID3, and the document for apoplexy is ID4. The index building device 104 builds a unified inverted index over descriptors and label words as follows:
Disease - ID1(WC1(x)), ID2(WC2(x)), ID3(WC3(x)), ID4(WC4(x))
Palpitations and shortness of breath - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Palpitations - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Vomiting - ID3(WC3(x)), ID4(WC4(x))
Throwing up - ID3(WC3(x)), ID4(WC4(x))
Preferably, the index building equipment 1 also comprises a normalization device (not shown). If the label words include several label words with the same meaning, the normalization device determines a normalized result for those label words; the index building device 104 then builds the index for the descriptor, the label words and the normalized result.
Specifically, the label words corresponding to the descriptor "disease" may include several with the same meaning. If "throwing up" and "n and V" have the same meaning, the normalization device determines that the normalized result of these two label words is "vomiting"; the index building device 104 then builds the index for the descriptor "disease", the label words "throwing up" and "n and V", and the normalized result "vomiting".
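The unified inverted index with normalization can be sketched as follows. The data layout, the weights, the surface forms in `NORMALIZE` and the function name `build_index` are assumptions made for illustration; the weights stand in for the importance values WC(x), which the text does not define further.

```python
from collections import defaultdict

# Hypothetical mapping from surface forms to a normalized label word.
NORMALIZE = {"throwing up": "vomiting", "n and V": "vomiting"}

def build_index(docs, normalize=None):
    """Build term -> {doc_id: weight} over descriptors and label words.

    docs: {doc_id: {term: weight}}. Each term is indexed under its own form
    and, when a normalized form exists, under that form as well.
    """
    normalize = normalize or {}
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, weight in terms.items():
            for key in {term, normalize.get(term, term)}:
                # If several forms map to one key, keep the largest weight.
                index[key][doc_id] = max(weight, index[key].get(doc_id, 0.0))
    return dict(index)

docs = {
    "ID3": {"disease": 1.0, "throwing up": 0.5},
    "ID4": {"disease": 1.0, "vomiting": 0.75},
}
index = build_index(docs, NORMALIZE)
```

Indexing both the original and the normalized form means a query using either wording reaches the same documents, which is the point of the normalization step.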
Those skilled in the art will understand that the above ways of building an index are only examples; other existing or future ways of building an index, where applicable to the present invention, should also fall within its scope of protection and are incorporated herein by reference.
Conventionally, an index is built only on keywords. Here, the index building equipment 1 also builds the index on descriptors, label words and their normalized results, so that a user's query input information can be matched better against the resource knowledge.
Preferably, the devices of the index building equipment 1 operate continuously. Specifically, the information determining device 101 determines structured information from text information; the theme extraction device 102 extracts a descriptor from the structured information; the label determining device 103 determines, in the text information, the label words corresponding to the theme to which the descriptor corresponds; and the index building device 104 builds an index for the descriptor and the label words. Those skilled in the art will understand that "continuously" here means that the devices of the index building equipment 1 each carry out the determination of structured information, the extraction of descriptors, the determination of label words and the building of the index according to a set operating mode or one adjusted in real time, until the index building equipment 1 stops determining structured information for an extended period.
Here, the index building equipment 1 determines structured information from text information; extracts a descriptor from the structured information; determines, in the text information, the label words corresponding to the theme to which the descriptor corresponds; and builds an index for the descriptor and the label words. The index building equipment works on encyclopedia-type resource knowledge, or on other resource knowledge obtained by web mining, and extracts themes and titles from it to form an effective description of the resource knowledge content. This presents such high-quality resource knowledge better, makes subsequent semantic search of it more efficient, meets the need to search with complex descriptions that users cannot express accurately in keywords, and improves the user experience.
Fig. 2 illustrates a schematic diagram of the equipment for matching a user's inquiry input message according to a further aspect of the present invention. Matching unit 2 comprises inquiry acquisition device 201, information analysis apparatus 202, matching inquiry device 203, and text determining device 204.
Inquiry acquisition device 201 obtains the inquiry input message input by the user. Specifically, the user inputs the inquiry input message by interacting with a user equipment, and inquiry acquisition device 201 obtains it by calling an application programming interface (API) provided by the user equipment, by calling a dynamic page technology such as JSP, ASP, or PHP, or through another agreed communication mode.
Here, the inquiry input message includes but is not limited to inquiry input messages that the user submits through different input modes such as text input, voice input, and image input.
Those skilled in the art will understand that the above manner of obtaining the inquiry input message is only an example; other existing or future manners of obtaining the inquiry input message, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
Information analysis apparatus 202 performs theme and label analysis on the inquiry input message to obtain the descriptor and label words corresponding to it. Specifically, information analysis apparatus 202 analyzes the inquiry input message obtained by inquiry acquisition device 201; for example, it feeds the inquiry input message into the subject classification device obtained by the aforementioned training, and thereby obtains the corresponding descriptor. Information analysis apparatus 202 also performs label analysis on the inquiry input message to obtain the corresponding label words. The way information analysis apparatus 202 analyzes the labels of the inquiry input message is the same as or similar to the way the aforementioned label determining device 103 determines the label words of a text message; it is therefore not repeated here and is incorporated by reference.
Matching inquiry device 203 carries out a matching inquiry, according to the descriptor and label words, in the index set up by the aforementioned index apparatus for establishing 104, to obtain the candidate text messages that match the inquiry input message. Specifically, for the inquiry input message obtained by inquiry acquisition device 201, matching inquiry device 203 queries that index, for example by full matching or partial matching, and obtains the text messages that hit the descriptor of the inquiry input message, or that hit its label words, as the candidate text messages matching the inquiry input message.
For example, suppose the user inputs the inquiry input message "palpitation and short breath". Inquiry acquisition device 201 obtains this inquiry input message; information analysis apparatus 202 performs label analysis on it and obtains the label word "palpitation and short breath"; and the index that the aforementioned index apparatus for establishing 104 has set up for this label word is as follows:
palpitation and short breath: ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))
Here, ID1, ID2, and ID4 are the ID numbers of the text messages that contain the label word "palpitation and short breath", and WC1(x), WC2(x), and WC4(x) are the importance degrees of that label word in the respective text messages.
Matching inquiry device 203 carries out a matching inquiry for the label word "palpitation and short breath" in the index set up by index apparatus for establishing 104 and, according to the above index, obtains the candidate text messages corresponding to the inquiry input message "palpitation and short breath": text messages ID1, ID2, and ID4.
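The lookup in this example can be sketched as follows. This is only an illustrative sketch: the dictionary layout, the function name `lookup_candidates`, and the numeric importance values are assumptions made here, not details given in the description.

```python
# Hypothetical posting lists: label word -> {text ID: importance of the word in that text}.
# The importance numbers are invented purely for illustration.
label_index = {
    "palpitation and short breath": {"ID1": 0.8, "ID2": 0.6, "ID4": 0.3},
    "vomiting": {"ID2": 0.5},
}


def lookup_candidates(index, label_words):
    """Return the IDs of all texts hit by at least one of the label words."""
    hits = set()
    for word in label_words:
        # Unknown label words simply contribute no candidates.
        hits.update(index.get(word, {}))
    return sorted(hits)


candidates = lookup_candidates(label_index, ["palpitation and short breath"])
```

With the toy index above, the single label word yields the three candidates ID1, ID2, and ID4 named in the example.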
Those skilled in the art will understand that the above manner of matching inquiry is only an example; other existing or future manners of matching inquiry, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
Text determining device 204 determines the target text message that matches the inquiry input message according to the semantic matching degree between the candidate text messages and the inquiry input message.
Specifically, a candidate text message and the inquiry input message have a certain semantic matching degree. This degree can be obtained by direct calculation, or by calculating the matching degree between the index word set of the candidate text message and the coupling word set of the inquiry input message. Text determining device 204 then determines the target text message according to this semantic matching degree, for example by taking the candidate text message with the highest semantic matching degree as the target text message, or by taking every candidate text message whose semantic matching degree exceeds a predetermined matching degree threshold as a target text message.
Here, the predetermined matching degree threshold is the semantic matching degree used to judge whether a candidate text message matches the inquiry input message; its value can be fixed in advance or adjusted according to actual conditions.
Preferably, text determining device 204 also comprises a coupling computing unit (not shown) and a text determining unit (not shown). The coupling computing unit calculates the semantic matching degree between the candidate text message and the inquiry input message; the text determining unit determines the target text message that matches the inquiry input message according to this semantic matching degree, in combination with the predetermined matching degree threshold.
For example, the coupling computing unit calculates the semantic matching degree between a candidate text message and the user's inquiry input message using an existing matching degree calculation method; when this semantic matching degree is greater than the predetermined matching degree threshold, the text determining unit takes the candidate text message as a target text message matching the inquiry input message.
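The two selection rules described above (highest degree, or degree above a threshold) can be sketched as below. The function name `select_targets` and the score values are hypothetical; how the scores themselves are computed is left open here.

```python
def select_targets(scores, threshold=None):
    """Pick target texts from {text ID: semantic matching degree}.

    With no threshold, keep the text(s) with the highest degree;
    otherwise keep every text whose degree is greater than the threshold.
    """
    if not scores:
        return []
    if threshold is None:
        best = max(scores.values())
        return sorted(tid for tid, s in scores.items() if s == best)
    return sorted(tid for tid, s in scores.items() if s > threshold)


# Invented scores, for illustration only.
example_scores = {"ID1": 0.4, "ID2": 0.9, "ID4": 0.6}
top = select_targets(example_scores)
above = select_targets(example_scores, threshold=0.5)
```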
Preferably, text determining device 204 can also determine the target text message according to the index word set of each candidate text message and the coupling word set of the inquiry input message. Specifically, each candidate text message has a corresponding index word set. Suppose, as in the example above, that candidate text message ID1 has the theme "coronary heart disease" and that its index terms include "disease", "palpitation and short breath", and so on; the set formed by these index terms is the index word set of candidate text message ID1. The user's inquiry input message likewise has a corresponding coupling word set, for example the set of coupling words obtained by performing word segmentation on the inquiry input message. Suppose the user inputs "palpitation and short breath vomiting": after matching unit 2 performs word segmentation on it, the coupling words "palpitation and short breath" and "vomiting" are obtained, and the set formed by these two coupling words is the coupling word set of this inquiry input message. Text determining device 204 determines the target text message from the index word sets and the coupling word set, for example by taking the text message whose index word set hits the most coupling words in the coupling word set as the target text message, or by taking every text message whose index word set hits more coupling words than a predetermined quantity threshold as a target text message.
For example, for candidate text messages ID1, ID2, and ID4 above, the index word set of ID1 comprises the index terms "disease" and "palpitation and short breath"; the index word set of ID2 comprises "palpitation and short breath", "vomiting", and "disease"; and the index word set of ID4 comprises "palpitation and short breath". For the inquiry input message "palpitation and short breath vomiting", the coupling words are "palpitation and short breath" and "vomiting". The index word set of ID2 hits the most coupling words, so candidate text message ID2 is taken as the target text message that best matches the inquiry input message. Alternatively, suppose the predetermined quantity threshold is 0: the index word sets of ID1, ID2, and ID4 each hit more coupling words than this threshold, so ID1, ID2, and ID4 are all taken as target text messages. When matching unit 2 provides them to the user, they can be sorted by the importance degree of the corresponding index terms in each candidate text message.
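The hit counting in this example can be reproduced with plain set intersection. This is a minimal sketch; the function name `hit_counts` and the variable names are assumptions made for illustration.

```python
# Index word sets of the three candidate texts, as given in the example.
index_word_sets = {
    "ID1": {"disease", "palpitation and short breath"},
    "ID2": {"palpitation and short breath", "vomiting", "disease"},
    "ID4": {"palpitation and short breath"},
}
# Coupling words obtained from the query "palpitation and short breath vomiting".
coupling_words = {"palpitation and short breath", "vomiting"}


def hit_counts(word_sets, query_words):
    """Count, per text, how many coupling words its index word set hits."""
    return {tid: len(words & query_words) for tid, words in word_sets.items()}


counts = hit_counts(index_word_sets, coupling_words)
best = max(counts, key=counts.get)                         # rule 1: most hits
above_threshold = sorted(t for t, c in counts.items() if c > 0)  # rule 2: threshold 0
```

Under the first rule only ID2 is selected; under the second rule with threshold 0, all three candidates qualify, matching the example.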
Those skilled in the art will understand that the above manner of determining the target text message is only an example; other existing or future manners of determining the target text message, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
Preferably, the devices of matching unit 2 work continuously. Specifically, inquiry acquisition device 201 obtains the inquiry input message input by the user; information analysis apparatus 202 performs theme and label analysis on the inquiry input message to obtain the corresponding descriptor and label words; matching inquiry device 203 carries out a matching inquiry, according to the descriptor and label words, in the index set up by the aforementioned index apparatus for establishing 104, to obtain the candidate text messages matching the inquiry input message; and text determining device 204 determines the target text message according to the semantic matching degree between the candidate text messages and the inquiry input message. Here, those skilled in the art will understand that "continuously" means that the devices of matching unit 2 respectively carry out the obtaining of the inquiry input message, the theme and label analysis, the matching inquiry for candidate text messages, and the determination of the target text message, according to a set or real-time adjusted mode of operation, until matching unit 2 stops obtaining inquiry input messages from users for a long period.
Here, index apparatus for establishing 1 and the devices of matching unit 2 cooperate so that, based on the inquiry input message input by the user, the corresponding target text message is obtained by matching. Working from encyclopaedia-type resources and knowledge, or from other resources and knowledge obtained by web mining, themes and titles are extracted to form an effective description of the content. This represents such high-quality resources and knowledge better, makes semantic search over them more efficient, serves complex descriptive search needs that users cannot express precisely with keywords, and improves the user experience.
Preferably, the descriptor and the label words can also be regarded as two different fields, corresponding to a subject area and a label field respectively. Matching inquiry device 203 then carries out the matching inquiry, according to the descriptor and the label words, separately in the aforementioned indexes corresponding to the subject area and the label field, to obtain the candidate text messages matching the inquiry input message.
Specifically, using the descriptor and label words that information analysis apparatus 202 obtained by analyzing the user's inquiry input message, matching inquiry device 203 adopts per-field matching: it queries the index of the subject area and the index of the label field separately to obtain the candidate text messages.
Here, the subject area and label field values can be obtained by analyzing the inquiry input message; for example, the aforementioned subject classification device is applied to the inquiry input message input by the user to obtain its subject category.
Here, the indexes corresponding to the subject area and the label field are the indexes set up by the aforementioned index apparatus for establishing 104. Based on the labels set up earlier, label words are extracted from the user's inquiry input message: any word that appears in the inquiry input message and also belongs to the label set is extracted. The label words and the subject category are then used together to pull the inverted-index entries of the corresponding theme and labels from the index, and the documents containing that subject category or those labels are taken as the candidate text messages corresponding to the inquiry input message and take part in the subsequent calculation.
Preferably, matching inquiry device 203 can also take into account the weights assigned to the subject area and the label field when carrying out the matching inquiry in the corresponding indexes, and obtains the final candidate text messages accordingly.
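Per-field matching with field weights might look like the sketch below. The description does not say how the two weights are combined, so the additive scoring used here, the function name `match_by_field`, and the toy indexes are all assumptions.

```python
def match_by_field(theme_index, label_index, descriptor, label_words,
                   theme_weight=1.0, label_weight=1.0):
    """Query the subject-area index and the label-field index separately.

    Returns {text ID: accumulated field weight}. Adding the weights is an
    assumed combination rule, not one stated in the description.
    """
    scores = {}
    for tid in theme_index.get(descriptor, ()):
        scores[tid] = scores.get(tid, 0.0) + theme_weight
    for word in label_words:
        for tid in label_index.get(word, ()):
            scores[tid] = scores.get(tid, 0.0) + label_weight
    return scores


# Toy inverted indexes, invented for illustration.
theme_index = {"disease": ["ID1", "ID2"]}
label_index = {"vomiting": ["ID2", "ID3"]}
field_scores = match_by_field(theme_index, label_index, "disease", ["vomiting"],
                              theme_weight=2.0, label_weight=1.0)
```

Every text ID that appears in `field_scores` is a candidate; the accumulated weight could be used to rank or prune the candidates before the semantic matching step.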
Preferably, text determining device 204 determines a target index word set among the index word sets of the candidate text messages according to the coupling words in the coupling word set, where the target index word set is the one that hits the most coupling words in the coupling word set. If the similarity between the target index word set and the coupling word set is greater than a predetermined threshold, the text message corresponding to the target index word set is taken as the target text message matching the inquiry input message.
Specifically, text determining device 204 counts how many coupling words of the coupling word set each candidate's index word set hits, and takes the index word set with the most hits as the target index word set. Text determining device 204 then calculates the similarity between the target index word set and the coupling word set, for example by calculating the similarity between each hit index term and its corresponding coupling word and then combining these values by simple addition, weighted averaging, or a similar method. When this similarity is greater than the predetermined threshold, text determining device 204 takes the text message corresponding to the target index word set as the target text message matching the inquiry input message.
Here, the predetermined threshold is the similarity threshold used to judge, from the similarity between the target index word set and the coupling word set, whether the text message corresponding to the target index word set is taken as the target text message; its value can be fixed or adjusted according to actual conditions.
Preferably, matching unit 2 also comprises a word set determining device (not shown). The word set determining device performs word segmentation on the inquiry input message to obtain the resulting participles, and merges these participles with the descriptor and label words obtained by information analysis apparatus 202 to obtain the coupling word set corresponding to the inquiry input message; the words in the coupling word set serve as coupling words. Subsequently, the coupling computing unit calculates the semantic matching degree between the candidate text message and the inquiry input message according to the coupling word set and the index word set of the candidate text message.
Specifically, the word set determining device performs word segmentation on the inquiry input message obtained by inquiry acquisition device 201 to obtain the participles. Preferably, it can also apply filtering such as stop-word removal to these participles to obtain the final participles. The word set determining device then merges the participles with the descriptor and label words obtained by the aforementioned information analysis apparatus 202, removes redundancy, and so on, to obtain the coupling word set corresponding to the inquiry input message, and takes the words in this coupling word set as the coupling words of the inquiry input message.
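The construction of the coupling word set can be sketched as follows. Whitespace splitting stands in for a real word segmenter, and the stop-word list and the function name `build_coupling_word_set` are hypothetical; none of these specifics come from the description.

```python
# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "and", "of"}


def build_coupling_word_set(query, descriptors, label_words, segment=str.split):
    """Segment the query, drop stop words, merge in descriptors and label
    words, and remove duplicates while keeping first-seen order."""
    participles = [t for t in segment(query) if t not in STOP_WORDS]
    merged = []
    for word in participles + list(descriptors) + list(label_words):
        if word not in merged:  # redundancy removal
            merged.append(word)
    return merged


coupling_word_set = build_coupling_word_set(
    "the vomiting and diarrhoea", ["disease"], ["vomiting"])
```

A different segmenter can be supplied through the `segment` parameter, which only needs to map a string to a list of tokens.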
Subsequently, the coupling computing unit calculates the semantic matching degree between the candidate text message and the inquiry input message according to the coupling word set and the index word set of the candidate text message.
More preferably, matching unit 2 also comprises an aftertreatment device (not shown). The aftertreatment device performs subsequent processing on the coupling words to update the coupling word set, where the subsequent processing comprises at least any one of the following:
- determining the coupling words that are synonyms of one another, and merging these mutually synonymous coupling words into a subset of the coupling word set;
- performing synonym expansion on the coupling words, and determining the synonyms obtained by the expansion, together with the coupling words, as a subset of the coupling word set.
Specifically, the aftertreatment device performs subsequent processing on the coupling words in the coupling word set determined by the word set determining device, to update this coupling word set. For example, the aftertreatment device determines which coupling words are synonyms of one another and merges them into a subset of the coupling word set. Because the coupling words may include mutual synonyms, such as "vomiting" and "throwing up", the aftertreatment device merges such coupling words into a subset of the coupling word set.
For example, suppose the inquiry input message input by the user is Q. After the word set determining device performs word segmentation on it and applies filtering such as stop-word removal, the coupling word set in the label field is expressed as Q={a, b, c, d, e}, where a, b, c, d, and e are the coupling words in this coupling word set. Suppose coupling words a and b are synonyms of each other; the aftertreatment device merges a and b into a subset of the coupling word set, and the updated coupling word set is expressed as Q={{a, b}, c, d, e}. Subsequently, downstream devices such as matching inquiry device 203 carry out the follow-up matching inquiry.
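The synonym merge in this example can be sketched as below. The set of synonym pairs and the function name `merge_synonyms` are assumptions; the grouping is a simple greedy pass, chosen here for brevity.

```python
def merge_synonyms(coupling_words, synonym_pairs):
    """Group mutually synonymous coupling words.

    A word joins the first existing group that contains one of its synonyms;
    otherwise it starts a new group of its own.
    """
    groups = []
    for word in coupling_words:
        for group in groups:
            if any((word, g) in synonym_pairs or (g, word) in synonym_pairs
                   for g in group):
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


# Q = {a, b, c, d, e} with a and b synonymous, as in the example.
merged_q = merge_synonyms(["a", "b", "c", "d", "e"], {("a", "b")})
```

Here every coupling word ends up in a group, so the unmerged words c, d, and e appear as one-element groups next to the merged group {a, b}.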
As another example, the aftertreatment device also performs synonym expansion on the coupling words, and determines the synonyms obtained by the expansion, together with the coupling words, as a subset of the coupling word set. Specifically, the aftertreatment device can expand the coupling words in the coupling word set of the inquiry input message with synonyms, for example expanding "palpitation" with the synonym "palpitation and short breath"; it then determines the synonym obtained by the expansion, together with the original coupling word, as a subset of the coupling word set.
Continuing the example, for the coupling word set Q={{a, b}, c, d, e} obtained after synonym merging, the aftertreatment device can also perform synonym expansion to obtain synonyms of the coupling words a, b, c, d, and e, and determine each group of synonyms, together with its coupling word, as a subset of the coupling word set. For example, after repeated synonym expansion, the coupling word set Q takes the following form:
$$Q = \{(w_{11}^{1}, w_{11}^{2}, \ldots, w_{11}^{k}),\ (w_{12}^{1}, w_{12}^{2}, \ldots, w_{12}^{k}),\ \ldots,\ (w_{1m}^{1}, w_{1m}^{2}, \ldots, w_{1m}^{k})\}$$
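Synonym expansion into the grouped form above can be sketched as follows. The thesaurus dictionary and the function name `expand_synonyms` are hypothetical; each output tuple plays the role of one parenthesized group of Q.

```python
def expand_synonyms(groups, thesaurus):
    """Extend each group of coupling words with synonyms from a thesaurus.

    `thesaurus` maps a word to a list of its synonyms (an assumed structure).
    """
    expanded = []
    for group in groups:
        out = list(group)
        for word in group:
            for synonym in thesaurus.get(word, []):
                if synonym not in out:
                    out.append(synonym)
        expanded.append(tuple(out))
    return expanded


toy_thesaurus = {"palpitation": ["palpitation and short breath"]}
expanded_q = expand_synonyms([["palpitation"], ["vomiting"]], toy_thesaurus)
```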
Subsequently, matching inquiry device 203 carries out a matching inquiry, according to this coupling word set, in the index set up by index apparatus for establishing 104; for example, through the inverted index, it obtains the candidate text messages that contain [formula image BDA0000473323290000182].
Suppose the index word set that hits the most coupling words in the coupling word set is denoted C, where C is:
$$C = \{(w_{21}^{1}, w_{21}^{2}, \ldots, w_{21}^{k}),\ (w_{22}^{1}, w_{22}^{2}, \ldots, w_{22}^{k}),\ \ldots,\ (w_{2n}^{1}, w_{2n}^{2}, \ldots, w_{2n}^{k})\}$$
Here, C denotes the set with the most synonym hits [formula image BDA0000473323290000184], that is, the set of words semantically mapped to the corresponding positions $w_{1i}$ [formula image BDA0000473323290000185].
The coupling computing unit calculates the semantic matching degree between the candidate text message and the inquiry input message according to the coupling word set and the index word set of the candidate text message.
The semantic matching degree between Q and C can be calculated by the following formula:
$$R(Q, C) = \frac{\sum_{w_{1k}^{i} = w_{2k}^{j}} \left( W_Q(w_{1k}^{i}) \cdot W_C(w_{2k}^{j}) \right)}{\sqrt{\sum_{t=1 \ldots m} Wgt(w_{1k}^{t})^{2}}\ \sqrt{\sum_{j=1 \ldots n} Wgt(w_{2k}^{j})^{2}}} \cdot Match(T_Q, T_C)$$
Here, the weight of a word is computed as $(\log(TF)+1) \cdot \log(N/DF)$, and $Match(T_Q, T_C)$ indicates whether the themes of the coupling word set and the index word set match.
The value of $Match(T_Q, T_C)$ can be defined as needed; for example, if the themes of the index word set and the coupling word set match, $Match(T_Q, T_C)$ takes the value 1, and otherwise 0.5.
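A simplified reading of this matching degree is sketched below. It assumes a cosine-style normalization (square roots in the denominator), flattens each synonym group to individual words, and takes per-word weights as plain dictionaries; the function names are hypothetical, and TF, DF, and N are taken to be term frequency, document frequency, and document count.

```python
import math


def tf_idf_weight(tf, df, n_docs):
    """Word weight (log(TF) + 1) * log(N / DF), as stated in the text."""
    return (math.log(tf) + 1.0) * math.log(n_docs / df)


def semantic_match(q_weights, c_weights, theme_q, theme_c):
    """Cosine-style overlap of two weighted word sets, scaled by theme match.

    q_weights / c_weights: {word: weight} for the query and the candidate.
    """
    shared = q_weights.keys() & c_weights.keys()
    numerator = sum(q_weights[w] * c_weights[w] for w in shared)
    norm_q = math.sqrt(sum(v * v for v in q_weights.values()))
    norm_c = math.sqrt(sum(v * v for v in c_weights.values()))
    if norm_q == 0.0 or norm_c == 0.0:
        return 0.0
    match = 1.0 if theme_q == theme_c else 0.5  # Match(T_Q, T_C)
    return numerator / (norm_q * norm_c) * match


same_theme = semantic_match({"a": 1.0, "b": 1.0}, {"a": 1.0}, "disease", "disease")
other_theme = semantic_match({"a": 1.0, "b": 1.0}, {"a": 1.0}, "disease", "poem")
```

With these toy weights, a theme mismatch halves the score, which reflects the 1 versus 0.5 values of the theme factor.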
Subsequently, if the calculated semantic matching degree is greater than the predetermined threshold, the text determining unit takes the text message corresponding to this index word set as the target text message matching the inquiry input message.
Fig. 4 illustrates a flow diagram of the method for setting up an index based on text messages according to another aspect of the present invention.
In step S401, index apparatus for establishing 1 determines structured message from the text message. Specifically, in step S401, index apparatus for establishing 1 obtains the text message, for example by interacting with a data source such as encyclopaedia data, and then structures it, for example by analyzing the directory information and sub-directory information contained in the text message, to determine the structured message.
For example, in step S401, index apparatus for establishing 1 interacts with encyclopaedia data sources such as Baidupedia and Interactive Encyclopaedia to obtain encyclopaedia-type resources and knowledge as text messages. It then structures each text message, for example by analyzing the directories and sub-directories of each resource; for a "disease" resource, it analyzes the directory or sub-directory corresponding to the symptoms, the directory or sub-directory corresponding to the methods of treatment, and so on.
As another example, in step S401, index apparatus for establishing 1 mines resources and knowledge from the internet by data mining and uses them as text messages, which it then structures to determine the structured message. For example, by mining vertical resource websites, index apparatus for establishing 1 obtains diseases together with information such as symptom descriptions, methods of treatment, and specialist hospitals. Each resource is organized with the disease as its ID. First, some candidate seed words are given per category; for the disease category, for example, coronary heart disease, myocarditis, gastritis, and so on. The URLs of the highly ranked websites are obtained from search results for these seed words, and the structure of each website is analyzed to extract the symptoms of coronary heart disease, its methods of treatment, and the specialist hospitals for it. This information is integrated under coronary heart disease in the "disease" category, organized into a business card for coronary heart disease, and stored. "Coronary heart disease" then serves as the final text message, and its associated information, such as "symptoms of coronary heart disease", "methods of treatment of coronary heart disease", and "specialist hospitals for coronary heart disease", serves as the structured message corresponding to this text message.
Those skilled in the art will understand that the above manner of determining the structured message is only an example; other existing or future manners of determining the structured message, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
In step S402, index apparatus for establishing 1 extracts a descriptor from the structured message. Specifically, in step S402, index apparatus for establishing 1 extracts the descriptor from the structured message determined in step S401, for example by means of a subject classification device or another predetermined manner of extracting descriptors.
Here, the purpose of extracting the descriptor is to obtain from the text message the theme that represents it, so as to serve the building of the semantic index and the subsequent semantic matching calculation.
Preferably, the method also comprises step S405 (not shown). In step S405, index apparatus for establishing 1 obtains, according to a predetermined theme system, the corpus corresponding to that theme system, and trains a subject classification device on the corpus. In step S402, index apparatus for establishing 1 then extracts the descriptor from the structured message using this subject classification device.
Specifically, in step S405, index apparatus for establishing 1 first determines the predetermined theme system. For example, it determines the common search needs of web search users from statistics over the search sequences input by a large number of such users and, in combination with currently common classification systems, such as the existing systems used by encyclopaedia and Q&A ("know") sites, determines a subject classification system that reflects these needs and takes it as the predetermined theme system. Index apparatus for establishing 1 then obtains the corpus corresponding to this predetermined theme system; for example, if an article carries the location marker "medical treatment & health internal medicine", its data are regarded as corpus for the disease category. Finally, index apparatus for establishing 1 trains the subject classification device on this corpus, for example by training an SVM classification model to serve as the subject classification device.
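The corpus-collection part of step S405 can be sketched as below: articles are labeled with a theme according to their location marker. The mapping table, the function name `collect_corpus`, and the sample articles are hypothetical; training the SVM itself is not shown.

```python
# Hypothetical mapping from an article's location marker to a theme of the
# predetermined theme system. Only the first entry comes from the text.
MARKER_TO_THEME = {
    "medical treatment & health internal medicine": "disease",
}


def collect_corpus(articles):
    """Turn (location marker, text) pairs into (text, theme) training pairs.

    Articles whose marker is not covered by the theme system are skipped.
    """
    corpus = []
    for marker, text in articles:
        theme = MARKER_TO_THEME.get(marker)
        if theme is not None:
            corpus.append((text, theme))
    return corpus


# Invented sample articles, for illustration only.
sample_articles = [
    ("medical treatment & health internal medicine", "coronary heart disease symptoms"),
    ("sports news", "match report"),
]
training_pairs = collect_corpus(sample_articles)
```

The resulting (text, theme) pairs are the kind of labeled data a classifier trainer would consume.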
Then, in step S402, index apparatus for establishing 1 extracts the descriptor from the structured message using the subject classification device trained in step S405. For example, index apparatus for establishing 1 feeds the word "coronary heart disease", together with structured message such as its symptoms and methods of treatment, into the subject classification device and obtains the theme "disease". As another example, for a new encyclopaedia business card, index apparatus for establishing 1 feeds it into the subject classification device, such as the SVM classifier, and obtains the theme corresponding to the category of that business card.
Preferably, in step S402, index apparatus for establishing 1 can also expand the extracted theme with synonymous expressions, for example expanding the theme "disease" by adding a synonymous theme "disease".
Those skilled in the art will understand that the above manner of extracting the descriptor is only an example; other existing or future manners of extracting the descriptor, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
In step S403, index apparatus for establishing 1 determines, in the text message, the label words corresponding to the theme of the descriptor. Specifically, using the descriptor extracted in step S402 and its theme, index apparatus for establishing 1 determines the label words in the text message that correspond to this theme. For example, for a text message whose theme is disease, index apparatus for establishing 1 determines label words such as: palpitation and short breath, chest tightness, diarrhoea, vomiting, weakness of limbs, and so on.
Preferably, step S403 also comprises sub-step S403a (not shown), sub-step S403b (not shown), and sub-step S403c (not shown). Specifically, in sub-step S403a, index apparatus for establishing 1 determines, in the text message, at least one candidate label word corresponding to the theme of the descriptor. For example, index apparatus for establishing 1 computes statistics of all unigrams, bigrams, and trigrams over the page data organized by entry, and extracts the words that appear in more than a certain number of pages as candidate label words.
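Sub-step S403a can be sketched as an n-gram document-frequency count. Pre-tokenized pages, the function name `candidate_label_words`, and the `min_pages` parameter are assumptions made for this illustration.

```python
def candidate_label_words(pages, min_pages):
    """Return unigrams, bigrams, and trigrams found in more than `min_pages` pages.

    `pages` is a list of token lists, one list per page.
    """
    doc_freq = {}
    for tokens in pages:
        seen = set()  # count each n-gram at most once per page
        for n in (1, 2, 3):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        for gram in seen:
            doc_freq[gram] = doc_freq.get(gram, 0) + 1
    return sorted(g for g, c in doc_freq.items() if c > min_pages)


# Toy pages, invented for illustration.
toy_pages = [["chest", "pain"], ["chest", "pain", "vomiting"], ["vomiting"]]
candidates_s403a = candidate_label_words(toy_pages, min_pages=1)
```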
Subsequently, in sub-step S403b, index apparatus for establishing 1 determines the corresponding centre word according to the at least one candidate label word. Then, in sub-step S403c, index apparatus for establishing 1 determines the label word corresponding to the theme according to the distance between the at least one candidate label word and the centre word.
For example, in sub-step S403b, index apparatus for establishing 1 merges all candidate label words according to the label data counted above and computes offline statistics for them. The statistics are gathered over large-scale text, for example data from the whole web, by counting how often words co-occur in documents. For any two candidate label words, the similarity between them is calculated according to the following formula:
$$\mathrm{Sim}(w_1, w_2) = \frac{\sum_{w'} \mathrm{PMI}(w', w_1)\,\mathrm{PMI}(w', w_2)}{\sqrt{\sum_{w'} \mathrm{PMI}(w', w_1)^2}\,\sqrt{\sum_{w'} \mathrm{PMI}(w', w_2)^2}}$$
Here, PMI(w', w_1) denotes the mutual information score between w' and w_1, defined as
[formula image BDA0000473323290000212]
where p(w) denotes the statistically estimated probability of word w.
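As a rough sketch of the similarity computation, the code below builds document-level PMI scores and compares two words by the cosine of their PMI vectors. The PMI definition in the source is a formula image that is not reproduced here, so this sketch assumes the common form PMI(a, b) = log(p(a, b) / (p(a) p(b))); the function names `pmi_table` and `sim` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(docs):
    """Document-level PMI for every co-occurring word pair.

    Assumption: PMI(a, b) = log(p(a, b) / (p(a) * p(b))), with probabilities
    estimated as document frequencies divided by the number of documents.
    """
    n = len(docs)
    df = Counter()      # number of documents containing each word
    co = Counter()      # number of documents containing each unordered pair
    for doc in docs:
        words = sorted(set(doc))
        df.update(words)
        co.update(combinations(words, 2))
    pmi = {}
    for (a, b), c in co.items():
        score = math.log((c / n) / ((df[a] / n) * (df[b] / n)))
        pmi[(a, b)] = pmi[(b, a)] = score
    return pmi

def sim(w1, w2, pmi, vocab):
    """Cosine similarity between the PMI vectors of w1 and w2 over vocab."""
    v1 = [pmi.get((w, w1), 0.0) for w in vocab]
    v2 = [pmi.get((w, w2), 0.0) for w in vocab]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return dot / norm if norm else 0.0
```

Pairs that never co-occur are treated as having a score of 0 here, which is a simplification chosen for the sketch.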
Subsequently, in sub-step S403b, index apparatus for establishing 1 determines, according to the theme, which fields of the text message need to be analyzed, such as the symptom category for a disease, the poem itself and its annotation for a poem, or the description part for a person. It then extracts from those fields all words that appear among the candidate label words, together with their synonyms, and forms a centre from these words, which serves as the centre word corresponding to the at least one candidate label word.
Then, in sub-step S403c, index apparatus for establishing 1 calculates the distance between each of the at least one candidate label word and the centre word. For example, letting T denote the centre word, the distance between a candidate label word x and the centre word can be calculated by the following formula:
$$\mathrm{Dis}(x) = \frac{\sum_{w \in T} \mathrm{Sim}(w, x)}{\mathrm{Num}(T)}$$
Here, Num(T) denotes the number of words contained in the centre word.
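The distance formula amounts to the average similarity between a candidate and the words of the centre. A minimal sketch, in which the function name `dis`, the helper `lookup` and the similarity values are hypothetical:

```python
def dis(x, centre, sim):
    """Dis(x): sum of Sim(w, x) over w in the centre T, divided by Num(T)."""
    if not centre:
        raise ValueError("centre word set T must not be empty")
    return sum(sim(w, x) for w in centre) / len(centre)

# Hypothetical similarity scores, for illustration only.
toy_sim = {("chest tightness", "palpitation"): 0.8,
           ("vomiting", "palpitation"): 0.2}

def lookup(w, x):
    return toy_sim.get((w, x), 0.0)

score = dis("palpitation", ["chest tightness", "vomiting"], lookup)
```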
Subsequently, in sub-step S403c, index apparatus for establishing 1 determines the label word corresponding to the theme according to the distance between the at least one candidate label word and the centre word, for example, by taking the candidate label words whose distance to the centre word is less than a predetermined threshold as the label words corresponding to the theme.
Preferably, as shown in Fig. 3, in sub-step S403c, index apparatus for establishing 1 arranges the candidate label words as a series ordered by rank, each with its distance to the centre word; if the slope between two successive ranks is greater than a predetermined slope threshold, the subsequent nodes are cut off, as happens between the 5th and 6th points in Fig. 3.
Here, the slope threshold is set, for example, from the overall distribution of the score statistics or from experience.
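The rank-based cut-off can be sketched as below. The source does not state the sort direction or how the slope is measured, so this sketch assumes candidates are ranked from highest to lowest Dis value and takes the slope to be the absolute difference between neighbouring values; `truncate_by_slope` is a hypothetical name.

```python
def truncate_by_slope(scores, slope_threshold):
    """Keep the leading candidates up to the first steep change in Dis.

    scores: list of (candidate, dis_value) pairs.
    Assumption: candidates are ranked from highest to lowest Dis, and the
    slope between neighbours is the absolute difference of their Dis values.
    """
    ranked = sorted(scores, key=lambda item: item[1], reverse=True)
    kept = ranked[:1]
    for prev, cur in zip(ranked, ranked[1:]):
        if abs(prev[1] - cur[1]) > slope_threshold:
            break       # steep change: cut off this node and all later ones
        kept.append(cur)
    return [word for word, _ in kept]
```

With seven candidates whose values drop sharply between the 5th and 6th ranks, only the first five survive, mirroring the situation described for Fig. 3.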
Those skilled in the art will understand that the above manner of determining label words is only an example; other existing or future manners of determining label words, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
More preferably, in sub-step S403b, index apparatus for establishing 1 filters the at least one candidate label word according to a predetermined filtering rule to obtain at least one filtered candidate label word, and determines the centre word according to the at least one filtered candidate label word, wherein the predetermined filtering rule is determined based on at least any one of the following:
- the part of speech of the at least one candidate label word;
- the word rule of the at least one candidate label word;
- the co-occurrence ratio of the at least one candidate label word with the theme.
Specifically, noise may be introduced while the candidate label words are being counted, so the candidate label words need to be filtered. In sub-step S403b, index apparatus for establishing 1 filters the at least one candidate label word according to the predetermined filtering rule to obtain at least one filtered candidate label word.
For example, in sub-step S403b, index apparatus for establishing 1 filters the at least one candidate label word according to its part of speech, e.g., by filtering on the first word and the last word of each candidate label word.
As another example, in sub-step S403b, index apparatus for establishing 1 filters the at least one candidate label word according to its word rule; e.g., the first character of a candidate label word must not be a word such as " ", "doing", "quilt" or "ratio", and its last character must not be a word such as "when", "arriving" or "obtaining".
As yet another example, in sub-step S403b, index apparatus for establishing 1 filters the at least one candidate label word according to its co-occurrence ratio with the theme; e.g., index apparatus for establishing 1 counts, in search logs and in web-wide titles, the co-occurrence ratio of each candidate label word with the theme, and retains only the candidate label words that have co-occurred with the theme, or retains the candidate label words whose co-occurrence ratio with the theme is greater than a predetermined threshold.
Preferably, in sub-step S403b, index apparatus for establishing 1 filters the at least one candidate label word by combining any two of the above predetermined filtering rules or by considering all three.
Subsequently, in sub-step S403b, index apparatus for establishing 1 determines the centre word according to the at least one filtered candidate label word.
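Two of the filtering rules above, the word rule and the co-occurrence rule, can be combined in a short sketch. Everything named here is hypothetical: `filter_candidates`, the banned-character sets and the precomputed `cooccurrence_ratio` mapping are assumptions for illustration, and part-of-speech filtering is left out because it would need a tagger.

```python
def filter_candidates(candidates, banned_first, banned_last,
                      cooccurrence_ratio, min_ratio):
    """Apply the word rule and the co-occurrence rule to candidate label words.

    banned_first / banned_last: characters a candidate may not start / end with.
    cooccurrence_ratio: mapping candidate -> co-occurrence ratio with the theme
    (hypothetical, assumed to be precomputed from search logs and titles).
    """
    kept = []
    for word in candidates:
        if not word:
            continue
        if word[0] in banned_first or word[-1] in banned_last:
            continue    # violates the word rule
        if cooccurrence_ratio.get(word, 0.0) <= min_ratio:
            continue    # co-occurs too rarely with the theme
        kept.append(word)
    return kept
```

Setting `min_ratio` to 0 reproduces the weaker variant in which any candidate that has co-occurred with the theme at all is retained.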
Those skilled in the art will understand that the above predetermined filtering rules are only examples; other existing or future predetermined filtering rules, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
In step S404, index apparatus for establishing 1 sets up an index for the descriptor and the label word. Specifically, in step S404, index apparatus for establishing 1 sets up an index for the descriptor extracted in step S402 and the label word determined in step S403.
For example, suppose the document corresponding to coronary heart disease is ID1 and the importance of a word x in that document is WC1(x), where x may be, e.g., "disease" or "palpitation and short breath"; the document corresponding to myocarditis is ID2, the document corresponding to gastritis is ID3, and the document corresponding to apoplexy is ID4. In step S404, index apparatus for establishing 1 sets up a unified inverted index for the descriptors and label words in the following manner:
Disease - ID1(WC1(x)), ID2(WC2(x)), ID3(WC3(x)), ID4(WC4(x))
Palpitation and short breath - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Palpitation - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Vomiting - ID3(WC3(x)), ID4(WC4(x))
Tell - ID3(WC3(x)), ID4(WC4(x))
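The unified inverted index above can be sketched as a mapping from each descriptor or label word to a posting list of (document id, importance) pairs. The function name `build_inverted_index` and all importance values are hypothetical placeholders for WC1(x) to WC4(x).

```python
from collections import defaultdict

def build_inverted_index(documents):
    """documents: doc_id -> {index word: importance of that word in the doc}.

    Returns index word -> list of (doc_id, importance), i.e. one posting list
    per descriptor or label word, as in the ID(WC(x)) entries above.
    """
    index = defaultdict(list)
    for doc_id, words in documents.items():
        for word, importance in words.items():
            index[word].append((doc_id, importance))
    return dict(index)

# Importance values are hypothetical placeholders for WC1(x) .. WC4(x).
documents = {
    "ID1": {"disease": 1.0, "palpitation and short breath": 0.8, "palpitation": 0.6},
    "ID2": {"disease": 1.0, "palpitation and short breath": 0.7, "palpitation": 0.5},
    "ID3": {"disease": 1.0, "vomiting": 0.9, "tell": 0.4},
    "ID4": {"disease": 1.0, "palpitation and short breath": 0.3, "palpitation": 0.3,
            "vomiting": 0.6, "tell": 0.2},
}
index = build_inverted_index(documents)
```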
Preferably, the method also comprises step S406 (not shown). In step S406, if the label words include multiple label words with the same meaning, index apparatus for establishing 1 determines the normalization result of those semantically identical label words; then, in step S404, index apparatus for establishing 1 sets up an index for the descriptor, the label words and the normalization result.
Specifically, the label words corresponding to the descriptor "disease" may include multiple label words with the same meaning. If, for instance, "tell" and "n and V" are semantically identical, then in step S406 index apparatus for establishing 1 determines that the normalization result of these two label words is "vomiting"; subsequently, in step S404, index apparatus for establishing 1 sets up an index for the descriptor "disease", the label words "tell" and "n and V", and the normalization result "vomiting".
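One way to picture step S406 is a lookup that adds the normalized form of each label word to the terms to be indexed. This is only a sketch: the function `add_normalized_forms` and the `normal_form` mapping are hypothetical, and how the normalization result is actually derived is not specified in the text.

```python
def add_normalized_forms(label_words, normal_form):
    """Return the label words plus the normalized form of each one.

    normal_form: mapping label word -> normalization result (hypothetical,
    assumed to come from a synonym resource). Order is preserved and
    duplicates are dropped.
    """
    result = []
    for word in label_words:
        for term in (word, normal_form.get(word, word)):
            if term not in result:
                result.append(term)
    return result

normal_form = {"tell": "vomiting", "n and V": "vomiting"}
terms = add_normalized_forms(["tell", "n and V"], normal_form)
```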
Those skilled in the art will understand that the above manner of setting up an index is only an example; other existing or future manners of setting up an index, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
Conventionally, an index is set up only for keywords; here, index apparatus for establishing 1 also sets up an index for descriptors, label words and their normalization results, so that a user's inquiry input message can be better matched against resources and knowledge.
Preferably, the steps of index apparatus for establishing 1 operate continuously. Specifically, in step S401, index apparatus for establishing 1 determines structured information from the text message; in step S402, index apparatus for establishing 1 extracts a descriptor from the structured information; in step S403, index apparatus for establishing 1 determines, according to the theme corresponding to the descriptor, the label word corresponding to the theme in the text message; in step S404, index apparatus for establishing 1 sets up an index for the descriptor and the label word. Here, those skilled in the art will understand that "continuously" means that the steps of index apparatus for establishing 1 respectively perform the determination of structured information, the extraction of descriptors, the determination of label words and the setting up of the index according to a set or real-time-adjusted working mode, until index apparatus for establishing 1 stops determining structured information for a long time.
Here, index apparatus for establishing 1 determines structured information from the text message, extracts a descriptor from the structured information, determines, according to the theme corresponding to the descriptor, the label word corresponding to the theme in the text message, and sets up an index for the descriptor and the label word. Based on encyclopaedia-type resources and knowledge, or other resources and knowledge obtained by web mining, the index apparatus for establishing extracts themes and titles to form an effective description of the content of those resources and knowledge. This represents such high-quality resources and knowledge better, makes subsequent semantic search over them more efficient, satisfies complex descriptive search needs that users cannot express accurately with keywords, and improves the user experience.
Fig. 5 shows a flow diagram of a method, according to another aspect of the present invention, for setting up an index based on a text message.
In step S501, matching unit 2 obtains the inquiry input message input by a user. Specifically, the user inputs the inquiry input message by interacting with a user device; in step S501, matching unit 2 obtains the inquiry input message by calling an application programming interface (API) provided by the user device, by calling a dynamic page technology such as JSP, ASP or PHP, or by communication over another agreed protocol.
Here, the inquiry input message includes but is not limited to inquiry input messages submitted by the user through different input modes such as text input, voice input and image input.
Those skilled in the art will understand that the above manner of obtaining the inquiry input message is only an example; other existing or future manners of obtaining the inquiry input message, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
In step S502, matching unit 2 performs theme and label analysis on the inquiry input message to obtain the descriptor and label word corresponding to the inquiry input message. Specifically, in step S502, matching unit 2 performs theme and label analysis on the inquiry input message obtained in step S501; for example, it inputs the inquiry input message into the previously trained subject classifier to obtain the corresponding descriptor, and it performs label analysis on the inquiry input message to obtain the corresponding label word. Here, the manner in which matching unit 2 analyzes the labels of the inquiry input message in step S502 is identical or similar to the manner in which index apparatus for establishing 1 determines the label words of a text message in step S403, so it is not repeated here and is incorporated herein by reference.
In step S503, matching unit 2 performs matching inquiry, according to the descriptor and label word, in the index set up by the aforementioned index apparatus for establishing 1, to obtain the candidate text messages matching the inquiry input message. Specifically, in step S503, matching unit 2 performs matching inquiry in the index set up by index apparatus for establishing 1 according to the inquiry input message obtained in step S501; for example, by full matching or partial matching, it obtains the text messages that hit the descriptor corresponding to the inquiry input message, or the text messages that hit the label word corresponding to the inquiry input message, as the candidate text messages matching the inquiry input message.
For example, suppose the inquiry input message input by the user is "palpitation and short breath". In step S501, matching unit 2 obtains the inquiry input message "palpitation and short breath"; in step S502, matching unit 2 performs label analysis on the inquiry input message and obtains the label word "palpitation and short breath". The index that index apparatus for establishing 1 has set up for this label word is as follows:
Palpitation and short breath - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Here, ID1, ID2 and ID4 denote the identifiers of the text messages that contain the label word "palpitation and short breath", and WC1(x), WC2(x) and WC4(x) denote the importance of the label word in each of these text messages.
In step S503, matching unit 2 performs matching inquiry in the index set up by index apparatus for establishing 1 according to the label word "palpitation and short breath" corresponding to the user's inquiry input message and, according to the above index, obtains the candidate text messages corresponding to the inquiry input message "palpitation and short breath": text messages ID1, ID2 and ID4.
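The lookup in this example can be sketched as a walk over the posting lists of the query's descriptors and label words. The function name `candidate_documents` and the importance values are hypothetical, and exact term matching is assumed although the text also allows partial matching.

```python
def candidate_documents(index, descriptors, label_words):
    """Return ids of documents hit by any descriptor or label word.

    index: index word -> list of (doc_id, importance). Exact term matching is
    an assumption of this sketch; the text also allows partial matching.
    """
    hits = []
    for term in list(descriptors) + list(label_words):
        for doc_id, _importance in index.get(term, []):
            if doc_id not in hits:
                hits.append(doc_id)
    return hits

# Posting list from the example; importance values are hypothetical.
index = {"palpitation and short breath": [("ID1", 0.8), ("ID2", 0.7), ("ID4", 0.3)]}
candidates = candidate_documents(index, [], ["palpitation and short breath"])
```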
Those skilled in the art will understand that the above manner of matching inquiry is only an example; other existing or future manners of matching inquiry, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
In step S504, matching unit 2 determines the target text information matching the inquiry input message according to the semantic matching degree between the candidate text messages and the inquiry input message.
Specifically, there is a certain semantic matching degree between a candidate text message and the inquiry input message. This semantic matching degree can be obtained by calculation, or further by calculating the matching degree between the index word set corresponding to the candidate text message and the coupling word set corresponding to the inquiry input message. In step S504, matching unit 2 determines the target text information matching the inquiry input message according to the semantic matching degree between the candidate text messages and the user's inquiry input message, for example, by taking the candidate text message with the highest semantic matching degree as the target text information, or by taking the candidate text messages whose semantic matching degree is greater than a predetermined matching degree threshold as the target text information.
Here, the predetermined matching degree threshold is the semantic matching degree used to judge whether a candidate text message matches the inquiry input message; its value may be preset and fixed, or adjusted according to actual conditions.
Preferably, step S504 also comprises sub-step S504a (not shown) and sub-step S504b (not shown). In sub-step S504a, matching unit 2 calculates the semantic matching degree between the candidate text message and the inquiry input message; in sub-step S504b, matching unit 2 determines the target text information matching the inquiry input message according to the semantic matching degree in combination with the predetermined matching degree threshold.
For example, in sub-step S504a, matching unit 2 calculates the semantic matching degree between the candidate text message and the user's inquiry input message using an existing matching degree calculation method; when the semantic matching degree is greater than the predetermined matching degree threshold, in sub-step S504b, matching unit 2 takes the candidate text message as the target text information matching the inquiry input message.
Preferably, in step S504, matching unit 2 may also determine the target text information corresponding to the inquiry input message according to the index word set corresponding to the candidate text message and the coupling word set corresponding to the inquiry input message. Specifically, each candidate text message has a corresponding index word set. Suppose, as in the example above, that the theme of candidate text message ID1 is "coronary heart disease" and that its index terms include "disease", "palpitation and short breath", etc.; the set formed by these index terms is the index word set corresponding to candidate text message ID1. The user's inquiry input message likewise has a corresponding coupling word set: for example, coupling words are obtained by performing word segmentation on the inquiry input message, and the set formed by these coupling words is the coupling word set corresponding to the inquiry input message. Suppose the inquiry input message input by the user is "palpitation and short breath vomiting"; after matching unit 2 performs word segmentation on it, the coupling words "palpitation and short breath" and "vomiting" are obtained, and the set formed by these two coupling words is the coupling word set corresponding to the inquiry input message. In step S504, matching unit 2 determines the target text information matching the user's inquiry input message according to the index word set and the coupling word set, for example, by taking the text message whose index word set hits the most coupling words in the coupling word set as the target text information, or by taking the text messages whose index word sets hit more coupling words than a predetermined quantity threshold as the target text information.
For example, for candidate text messages ID1, ID2 and ID4 in the example above, the index word set corresponding to ID1 comprises the index terms "disease" and "palpitation and short breath"; the index word set corresponding to ID2 comprises "palpitation and short breath", "vomiting" and "disease"; and the index word set corresponding to ID4 comprises "palpitation and short breath". For the inquiry input message "palpitation and short breath vomiting" input by the user, the coupling words are "palpitation and short breath" and "vomiting". The index word set corresponding to ID2 hits the most coupling words in the coupling word set, so candidate text message ID2 is taken as the target text information that best matches the inquiry input message. Alternatively, suppose the predetermined quantity threshold is 0: the index word sets of candidate text messages ID1, ID2 and ID4 each hit more coupling words than this threshold, so ID1, ID2 and ID4 are all taken as target text information matching the inquiry input message. When matching unit 2 provides them to the user, they may be sorted by the importance of the corresponding index terms in each candidate text message.
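The hit counting in this example can be reproduced with plain set intersection. The function name `count_hits` is hypothetical, and a hit is assumed to be an exact match between an index term and a coupling word.

```python
def count_hits(index_word_sets, match_words):
    """Number of coupling (match) words hit by each candidate's index word set."""
    wanted = set(match_words)
    return {doc_id: len(wanted & set(words))
            for doc_id, words in index_word_sets.items()}

index_word_sets = {
    "ID1": {"disease", "palpitation and short breath"},
    "ID2": {"palpitation and short breath", "vomiting", "disease"},
    "ID4": {"palpitation and short breath"},
}
hits = count_hits(index_word_sets, ["palpitation and short breath", "vomiting"])

best = max(hits, key=hits.get)                       # candidate with the most hits
above = [doc for doc, n in hits.items() if n > 0]    # quantity threshold of 0
```

Both selection rules from the text appear here: `best` corresponds to picking the candidate that hits the most coupling words, and `above` to keeping every candidate whose hit count exceeds the quantity threshold.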
Those skilled in the art will understand that the above manner of determining target text information is only an example; other existing or future manners of determining target text information, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
Preferably, the steps of matching unit 2 operate continuously. Specifically, in step S501, matching unit 2 obtains the inquiry input message input by the user; in step S502, matching unit 2 performs theme and label analysis on the inquiry input message to obtain the corresponding descriptor and label word; in step S503, matching unit 2 performs matching inquiry, according to the descriptor and label word, in the index set up by the aforementioned index apparatus for establishing 1, to obtain the candidate text messages matching the inquiry input message; in step S504, matching unit 2 determines the target text information matching the inquiry input message according to the semantic matching degree between the candidate text messages and the inquiry input message. Here, those skilled in the art will understand that "continuously" means that the steps of matching unit 2 respectively perform the obtaining of the inquiry input message, the theme and label analysis, the matching inquiry for candidate text messages and the determination of target text information according to a set or real-time-adjusted working mode, until matching unit 2 stops obtaining inquiry input messages from users for a long time.
Here, the steps of index apparatus for establishing 1 and matching unit 2 cooperate with one another, so that target text information corresponding to the inquiry input message input by the user is obtained by matching. Based on encyclopaedia-type resources and knowledge, or other resources and knowledge obtained by web mining, themes and titles are extracted to form an effective description of the content of those resources and knowledge. This represents such high-quality resources and knowledge better, makes semantic search over them more efficient, satisfies complex descriptive search needs that users cannot express accurately with keywords, and improves the user experience.
Preferably, the descriptor and the label word may also be regarded as two different domains, corresponding to a subject area and a label field respectively. In step S503, matching unit 2 performs matching inquiry, according to the descriptor and the label word, in the aforementioned indexes corresponding to the subject area and the label field respectively, to obtain the candidate text messages matching the inquiry input message.
Specifically, in step S503, matching unit 2 takes the descriptor and label word obtained in step S502 by analyzing the inquiry input message input by the user and, using per-domain matching, performs matching inquiry in the indexes corresponding to the subject area and the label field respectively, to obtain the candidate text messages.
Here, the subject area and the label field can be obtained by analyzing the inquiry input message; for example, the aforementioned subject classifier is used to analyze the inquiry input message input by the user and obtain the subject category.
Here, the index corresponding to the subject area and the label field is the index set up by the aforementioned index apparatus for establishing 1. According to the labels set up previously, label words are extracted from the inquiry input message input by the user: any word that is contained in the inquiry input message and is also in the label set is extracted. Then, the label words and the subject category are used to pull, from the unified index, the candidate inverted entries for the corresponding theme and labels, and the documents that contain the subject category or a label are taken as the candidate text messages corresponding to the inquiry input message and take part in subsequent calculation.
Preferably, in step S503, matching unit 2 may also take into account the weights corresponding to the subject area and the label field when performing matching inquiry in the corresponding indexes, and finally obtain the candidate text messages.
Preferably, in step S504, matching unit 2 determines a target index word set among the index word sets corresponding to the candidate text messages according to the coupling words included in the coupling word set, wherein the target index word set hits the most coupling words in the coupling word set; if the similarity between the target index word set and the coupling word set is greater than a predetermined threshold, the text message corresponding to the target index word set is taken as the target text information matching the inquiry input message.
Specifically, in step S504, matching unit 2 takes the index word set that hits the largest number of coupling words in the coupling word set as the target index word set. Subsequently, in step S504, matching unit 2 calculates the similarity between the target index word set and the coupling word set, for example, by calculating the similarity between each hit index term and its corresponding coupling word and then combining these similarities by simple addition, weighted mean or the like. When this similarity is greater than the predetermined threshold, in step S504, matching unit 2 takes the text message corresponding to the target index word set as the target text information matching the inquiry input message.
Here, the predetermined threshold is the similarity threshold used to judge, according to the similarity between the target index word set and the coupling word set, whether the text message corresponding to the target index word set is taken as the target text information; its value may be fixed, or adjusted according to actual conditions.
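The selection described above can be sketched as follows. The source leaves the word-level similarity and the combination rule open, so this sketch makes several assumptions that should be read as such: `word_sim` is a hypothetical word-level similarity function, a coupling word counts as hit when its best similarity to any index term is positive, and the set-level similarity is the mean of those best similarities over all coupling words.

```python
def best_pairs(index_words, match_words, word_sim):
    """For each match word, the similarity of its most similar index word (0 if none)."""
    return {m: max((word_sim(i, m) for i in index_words), default=0.0)
            for m in match_words}

def select_target(index_word_sets, match_words, word_sim, threshold):
    """Pick the index word set with the most hits, then apply the similarity check.

    Assumptions of this sketch: a match word is "hit" when its best similarity
    is positive, and the set similarity is the mean of the best similarities.
    """
    best_id, best_hits, best_scores = None, 0, {}
    for doc_id, index_words in index_word_sets.items():
        scores = best_pairs(index_words, match_words, word_sim)
        hits = sum(1 for s in scores.values() if s > 0)
        if hits > best_hits:
            best_id, best_hits, best_scores = doc_id, hits, scores
    if best_id is None:
        return None
    similarity = sum(best_scores.values()) / len(match_words)
    return best_id if similarity > threshold else None
```

Returning `None` when the similarity does not exceed the threshold reflects the text's condition that the target index word set only yields target text information when it is similar enough to the coupling word set.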
Preferably, the method also comprises step S505 (not shown). In step S505, matching unit 2 performs word segmentation on the inquiry input message to obtain the participles resulting from the segmentation, and merges the participles with the descriptor and label word obtained by matching unit 2 in step S502 to obtain the coupling word set corresponding to the inquiry input message, wherein the words included in the coupling word set serve as coupling words. Subsequently, in sub-step S504a, matching unit 2 calculates the semantic matching degree between the candidate text message and the inquiry input message according to the coupling word set and the index word set corresponding to the candidate text message.
Specifically, in step S505, matching unit 2 performs word segmentation on the inquiry input message obtained in step S501 to obtain the participles. Preferably, in step S505, matching unit 2 may also apply filtering, such as stop-word removal, to the participles obtained from the segmentation, and thereby obtain the final participles. Subsequently, in step S505, matching unit 2 merges the obtained participles with the descriptor and label word obtained in step S502, removes redundancy and so on, to finally obtain the coupling word set corresponding to the inquiry input message, and takes the words included in the coupling word set as the coupling words corresponding to the inquiry input message.
Subsequently, in sub-step S504a, matching unit 2 calculates the semantic matching degree between the candidate text message and the inquiry input message according to the coupling word set and the index word set corresponding to the candidate text message.
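Step S505 can be sketched as segmentation, stop-word filtering, merging and de-duplication. The function name `build_match_word_set` and the `segment` callable are hypothetical; a real system would plug in an actual word segmenter, which this sketch does not attempt to provide.

```python
def build_match_word_set(query, segment, stop_words, descriptors, label_words):
    """Merge query participles with descriptors and label words, dropping duplicates.

    segment: callable str -> list of words. It stands in for a real word
    segmenter and is an assumption of this sketch.
    """
    participles = [w for w in segment(query) if w not in stop_words]
    merged = []
    for word in participles + list(descriptors) + list(label_words):
        if word not in merged:      # de-redundancy
            merged.append(word)
    return merged
```

For whitespace-delimited text, `str.split` can serve as a trivial stand-in for `segment` when trying the function out.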
More preferably, the method also comprises step S506 (not shown). In step S506, matching unit 2 performs subsequent processing on the coupling words to update the coupling word set, wherein the subsequent processing comprises at least any one of the following:
- determining the mutually synonymous coupling words included in the coupling words, and merging the mutually synonymous coupling words into a subset of the coupling word set;
- performing synonym expansion on the coupling words, and determining the synonyms obtained from the synonym expansion, together with the coupling words, as a subset of the coupling word set.
Particularly, in step S506, matching unit 2 carries out subsequent treatment to the coupling word in determined coupling word set in step S505, to upgrade this coupling word set.For example, in step S506, matching unit 2 is determined the coupling word of mutual synonym included in described coupling word, the coupling word of described mutual synonym is merged into the subset of described coupling word set.Owing to may comprising the coupling word of mutual synonym in coupling word, as " vomiting " and " telling ", in step S506, the subset of this coupling word set merged into the coupling word of these mutual synonyms by matching unit 2.
For example, suppose that the inquiry input message of user's input is Q, in step S505, matching unit 2 carries out word segmentation processing to this inquiry input message, and after removing the filtration treatment such as stop words, the coupling word set in label field is expressed as Q={a, b, c, d, e}, wherein, a, b, c, d, e is respectively coupling word included in this coupling word set; Suppose that coupling word a and b are wherein the coupling words of mutual synonym,, in step S506, the subset of this coupling word set merged into this coupling word a and b by matching unit 2, and this coupling word set updating form is shown Q={{a, b}, c, d, e}.Subsequently, subsequent step operates as step S503 carries out follow-up matching inquiry.
As another example, in step S506, matching unit 2 also performs synonym expansion on the matching words, and takes the synonyms obtained by the expansion together with the matching words as a subset of the matching word set. Specifically, in step S506, matching unit 2 may also perform synonym expansion on the matching words in the matching word set corresponding to the query input information, for example expanding "palpitation" to "palpitation and shortness of breath"; matching unit 2 then takes the synonym obtained by the expansion, together with the matching word, as a subset of the matching word set.
Continuing the example, for the matching word set Q = {{a, b}, c, d, e} obtained after synonym merging, matching unit 2 may also perform synonym expansion on this set in step S506, obtaining synonyms of the matching words a, b, c, d and e, and take the synonyms obtained by the expansion together with the corresponding matching words as subsets of the matching word set. For example, after repeated synonym expansion, the matching word set Q is expressed as follows:
$$Q = \{ (w_{11}^{1}, w_{11}^{2}, \ldots, w_{11}^{k}),\ (w_{12}^{1}, w_{12}^{2}, \ldots, w_{12}^{k}),\ \ldots,\ (w_{1m}^{1}, w_{1m}^{2}, \ldots, w_{1m}^{k}) \}$$
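A possible sketch of the synonym expansion step is shown below. The expansion dictionary `EXPANSIONS`, the function name `expand_groups` and the word "fever" are hypothetical, chosen only for illustration; the source does not say where the synonyms come from.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical expansion dictionary (word -> additional synonyms); not from the source.
EXPANSIONS: Dict[str, Set[str]] = {
    "palpitation": {"palpitation and shortness of breath"},
}


def expand_groups(
    groups: List[FrozenSet[str]], expansions: Dict[str, Set[str]]
) -> List[FrozenSet[str]]:
    """Add the known synonyms of every word to the group that contains it."""
    expanded = []
    for group in groups:
        members = set(group)
        for word in group:
            members |= expansions.get(word, set())
        expanded.append(frozenset(members))
    return expanded


# Each resulting group plays the role of one tuple (w_1i^1, ..., w_1i^k) of Q.
q = expand_groups([frozenset({"palpitation"}), frozenset({"fever"})], EXPANSIONS)
```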
Subsequently, in step S503, matching unit 2 performs a matching query, according to this matching word set, in the index built by index building apparatus 1, for example through an inverted index, to obtain candidate text information containing [Figure BDA0000473323290000322].
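The inverted-index lookup mentioned here might look like the following sketch. The function names, the text ids `t1` and `t2` and the sample words are hypothetical; the source only states that an inverted index may be used to obtain candidate text information.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set


def build_inverted_index(docs: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each index word to the ids of the texts that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return index


def candidates(index: Dict[str, Set[str]], groups: List[FrozenSet[str]]) -> Dict[str, int]:
    """Count, for each text, how many synonym groups of the query it hits."""
    hits: Dict[str, int] = defaultdict(int)
    for group in groups:
        matched: Set[str] = set()
        for word in group:
            # A group is hit by a text if any synonym in the group occurs in it.
            matched |= index.get(word, set())
        for doc_id in matched:
            hits[doc_id] += 1
    return dict(hits)


index = build_inverted_index({"t1": ["vomit", "fever"], "t2": ["cough"]})
query = [frozenset({"vomit", "throw up"}), frozenset({"fever"})]
result = candidates(index, query)
```

Counting hits per synonym group rather than per word means a text containing both "vomit" and "throw up" is not counted twice for the same query concept.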
Suppose the index word set that hits the largest number of matching words in the matching word set is denoted C, where C is:
$$C = \{ (w_{21}^{1}, w_{21}^{2}, \ldots, w_{21}^{k}),\ (w_{22}^{1}, w_{22}^{2}, \ldots, w_{22}^{k}),\ \ldots,\ (w_{2n}^{1}, w_{2n}^{2}, \ldots, w_{2n}^{k}) \}$$
Here, C denotes the set of words [Figure BDA0000473323290000332] that, under the maximum synonym hit, semantically map to the positions corresponding to w_{1i} [Figure BDA0000473323290000333]. Then, in sub-step S504a, matching unit 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
The semantic matching degree between Q and C can be calculated by the following formula:
$$R(Q, C) = \frac{\sum_{w_{1k}^{i} = w_{2k}^{j}} \left( W_Q(w_{1k}^{i}) \cdot W_C(w_{2k}^{j}) \right)}{\sqrt{\sum_{t=1}^{m} \mathrm{Wgt}(w_{1k}^{t})^{2}} \; \sqrt{\sum_{j=1}^{n} \mathrm{Wgt}(w_{2k}^{j})^{2}}} \cdot \mathrm{Match}(T_Q, T_C)$$
Here, [Figure BDA0000473323290000335] denotes the weight of the word [Figure BDA0000473323290000336], expressed here as (log(TF) + 1) * log(N/DF); Match(T_Q, T_C) indicates whether the topic of the index word set matches the topic of the matching word set.
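The weight expression (log(TF) + 1) * log(N/DF) can be written as a small helper. This sketch assumes the conventional reading of the symbols, which this passage does not define: TF is the frequency of the word in the text, DF is the number of texts containing the word, and N is the total number of texts. The natural logarithm is also an assumption, as the base is not stated.

```python
import math


def term_weight(tf: int, df: int, n_docs: int) -> float:
    """Return (log(TF) + 1) * log(N / DF), using natural logarithms (assumed)."""
    if tf <= 0 or df <= 0 or n_docs <= 0:
        raise ValueError("tf, df and n_docs must all be positive")
    return (math.log(tf) + 1.0) * math.log(n_docs / df)


# A word that occurs once in a text and appears in 10 of 100 texts.
w = term_weight(tf=1, df=10, n_docs=100)
```

Under this reading, a word that occurs in every text gets weight 0, so it contributes nothing to the matching degree.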
Here, the value of Match(T_Q, T_C) can be defined as needed; for example, if the topic of the index word set matches the topic of the matching word set, Match(T_Q, T_C) takes the value 1, and otherwise 0.5.
Subsequently, if the calculated semantic matching degree is greater than a predetermined threshold, then in sub-step S504b matching unit 2 takes the text information corresponding to this index word set as the target text information matching the query input information.
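The calculation and threshold test can be sketched as follows, under several stated assumptions: the denominator of R(Q, C) is read as a cosine-style normalization (square roots of the summed squared weights); a query group and an index group count as equal when they share at least one word; each query group contributes to the numerator at most once; and the threshold 0.8 is purely illustrative. The function name `semantic_match` and the sample weights are hypothetical.

```python
import math
from typing import FrozenSet, List, Tuple

# A synonym group paired with its weight.
WeightedGroup = Tuple[FrozenSet[str], float]


def semantic_match(
    q_groups: List[WeightedGroup], c_groups: List[WeightedGroup], topics_match: bool
) -> float:
    """Cosine-style matching degree scaled by Match(T_Q, T_C) (1 or 0.5)."""
    numerator = 0.0
    for q_words, q_weight in q_groups:
        for c_words, c_weight in c_groups:
            if q_words & c_words:  # synonym hit: the groups share a word
                numerator += q_weight * c_weight
                break  # assumption: count each query group at most once
    q_norm = math.sqrt(sum(weight * weight for _, weight in q_groups))
    c_norm = math.sqrt(sum(weight * weight for _, weight in c_groups))
    if q_norm == 0.0 or c_norm == 0.0:
        return 0.0
    match_factor = 1.0 if topics_match else 0.5
    return numerator / (q_norm * c_norm) * match_factor


q = [(frozenset({"a", "b"}), 1.0), (frozenset({"c"}), 1.0)]
c = [(frozenset({"b"}), 1.0), (frozenset({"c"}), 1.0)]
score = semantic_match(q, c, topics_match=True)

THRESHOLD = 0.8  # hypothetical predetermined threshold
is_target = score > THRESHOLD
```

With identical weights and every group hit, the score is approximately 1 when the topics match and approximately 0.5 when they do not, so the topic factor alone can push a candidate below the threshold.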
It should be noted that the present invention may be implemented in software and/or a combination of software and hardware; for example, it may be implemented using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present invention (including related data structures) may be stored in a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk, or similar devices. In addition, some steps or functions of the present invention may be implemented in hardware, for example as circuitry that cooperates with a processor to perform each step or function.
In addition, part of the present invention may be embodied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in broadcast or other signal-bearing media, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the present invention comprises an apparatus that includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to run the methods and/or technical solutions based on the foregoing embodiments of the present invention.
It is evident to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments above, and that the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced by the present invention. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims (19)

1. A method for building an index based on text information, wherein the method comprises the following steps:
A. determining structured information from the text information;
B. extracting a topic word from the structured information;
C. determining, according to the topic corresponding to the topic word, a label word corresponding to the topic in the text information;
D. building an index for the topic word and the label word.
2. The method according to claim 1, wherein the method further comprises:
- obtaining, according to a predetermined topic system, a training corpus corresponding to the predetermined topic system;
- training a topic classifier according to the training corpus;
wherein step B comprises:
- extracting the topic word from the structured information according to the topic classifier.
3. The method according to claim 1, wherein step C comprises:
C1. determining, according to the topic corresponding to the topic word, at least one candidate label word corresponding to the topic in the text information;
C2. determining a corresponding center word according to the at least one candidate label word;
C3. determining the label word corresponding to the topic according to the distance between the at least one candidate label word and the center word.
4. The method according to claim 3, wherein step C2 comprises:
- filtering the at least one candidate label word according to a predetermined filtering rule, to obtain at least one filtered candidate label word;
- determining the center word according to the at least one filtered candidate label word;
wherein the predetermined filtering rule is determined based on at least any one of the following:
- the part of speech of the at least one candidate label word;
- the word rule of the at least one candidate label word;
- the co-occurrence ratio of the at least one candidate label word and the topic.
5. The method according to any one of claims 1 to 4, wherein the method further comprises:
- if the label words include multiple semantically identical label words, determining a normalized result for the multiple semantically identical label words;
wherein step D comprises:
- building an index for the topic word, the label word and the normalized result.
6. A method for matching a user's query input information against an index built according to claim 1, wherein the method comprises the following steps:
a. obtaining the query input information entered by the user;
b. performing topic and label analysis on the query input information, to obtain the topic word and label word corresponding to the query input information;
c. performing a matching query, according to the topic word and label word, in the index built according to claim 1, to obtain candidate text information matching the query input information;
d. determining, according to the semantic matching degree between the candidate text information and the query input information, target text information matching the query input information.
7. The method according to claim 6, wherein step d comprises:
d1. calculating the semantic matching degree between the candidate text information and the query input information;
d2. determining, according to the semantic matching degree in combination with a predetermined matching degree threshold, the target text information matching the query input information.
8. The method according to claim 7, wherein the method further comprises:
- performing word segmentation on the query input information, to obtain the segmented words;
- merging the segmented words with the topic word and label word obtained in step b, to obtain a matching word set corresponding to the query input information, wherein the words included in the matching word set serve as matching words;
wherein step d1 comprises:
- calculating the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
9. The method according to claim 8, wherein the method further comprises:
- performing subsequent processing on the matching words, to update the matching word set;
wherein the subsequent processing comprises at least any one of the following:
- determining the mutually synonymous matching words among the matching words, and merging the mutually synonymous matching words into a subset of the matching word set;
- performing synonym expansion on the matching words, and taking the synonyms obtained by the expansion together with the matching words as a subset of the matching word set.
10. An index building apparatus for building an index based on text information, wherein the apparatus comprises:
an information determining device, configured to determine structured information from the text information;
a topic extraction device, configured to extract a topic word from the structured information;
a label determining device, configured to determine, according to the topic corresponding to the topic word, a label word corresponding to the topic in the text information;
an index building device, configured to build an index for the topic word and the label word.
11. The index building apparatus according to claim 10, wherein the apparatus further comprises a topic training device, configured to:
- obtain, according to a predetermined topic system, a training corpus corresponding to the predetermined topic system;
- train a topic classifier according to the training corpus;
wherein the topic extraction device is configured to:
- extract the topic word from the structured information according to the topic classifier.
12. The index building apparatus according to claim 10, wherein the label determining device comprises:
a candidate determining unit, configured to determine, according to the topic corresponding to the topic word, at least one candidate label word corresponding to the topic in the text information;
a center word determining unit, configured to determine a corresponding center word according to the at least one candidate label word;
a label determining unit, configured to determine the label word corresponding to the topic according to the distance between the at least one candidate label word and the center word.
13. The index building apparatus according to claim 12, wherein the center word determining unit is configured to:
- filter the at least one candidate label word according to a predetermined filtering rule, to obtain at least one filtered candidate label word;
- determine the center word according to the at least one filtered candidate label word;
wherein the predetermined filtering rule is determined based on at least any one of the following:
- the part of speech of the at least one candidate label word;
- the word rule of the at least one candidate label word;
- the co-occurrence ratio of the at least one candidate label word and the topic.
14. The index building apparatus according to any one of claims 10 to 13, wherein the apparatus further comprises:
a normalization device, configured to determine, if the label words include multiple semantically identical label words, a normalized result for the multiple semantically identical label words;
wherein the index building device is configured to:
- build an index for the topic word, the label word and the normalized result.
15. A matching unit for matching a user's query input information against an index built according to claim 10, wherein the matching unit comprises:
a query obtaining device, configured to obtain the query input information entered by the user;
an information analysis device, configured to perform topic and label analysis on the query input information, to obtain the topic word and label word corresponding to the query input information;
a matching query device, configured to perform a matching query, according to the topic word and label word, in the index built according to claim 10, to obtain candidate text information matching the query input information;
a text determining device, configured to determine, according to the semantic matching degree between the candidate text information and the query input information, target text information matching the query input information.
16. The matching unit according to claim 15, wherein the text determining device comprises:
a matching calculation unit, configured to calculate the semantic matching degree between the candidate text information and the query input information;
a text determining unit, configured to determine, according to the semantic matching degree in combination with a predetermined matching degree threshold, the target text information matching the query input information.
17. The matching unit according to claim 16, wherein the matching unit further comprises a word set determining device, configured to:
- perform word segmentation on the query input information, to obtain the segmented words;
- merge the segmented words with the topic word and label word obtained by the information analysis device, to obtain a matching word set corresponding to the query input information, wherein the words included in the matching word set serve as matching words;
wherein the matching calculation unit is configured to:
- calculate the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
18. The matching unit according to claim 17, wherein the matching unit further comprises a subsequent processing device, configured to:
- perform subsequent processing on the matching words, to update the matching word set;
wherein the subsequent processing comprises at least any one of the following:
- determining the mutually synonymous matching words among the matching words, and merging the mutually synonymous matching words into a subset of the matching word set;
- performing synonym expansion on the matching words, and taking the synonyms obtained by the expansion together with the matching words as a subset of the matching word set.
19. A system for building an index and matching a user's query input information, comprising the index building apparatus according to any one of claims 10 to 14 and the matching unit according to any one of claims 15 to 18.
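The index-building steps recited in claims 1, 3 and 5 can be illustrated with the rough sketch below. Everything concrete in it is an assumption made for illustration and is not specified by the claims: the 2-D word vectors, the Euclidean distance, the choice of the center word as the medoid of the candidates, the distance cutoff of 1.0, the normalization table and the sample words.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Vector = Tuple[float, float]


def pick_center(vectors: Dict[str, Vector]) -> str:
    """Assumed center word: the candidate with the smallest total distance to all others."""
    return min(
        vectors,
        key=lambda w: sum(math.dist(vectors[w], v) for v in vectors.values()),
    )


def select_labels(vectors: Dict[str, Vector], max_distance: float) -> List[str]:
    """Keep the candidate label words that lie within max_distance of the center word."""
    center = vectors[pick_center(vectors)]
    return [w for w, v in vectors.items() if math.dist(v, center) <= max_distance]


def add_to_index(
    index: Dict[str, Set[str]],
    text_id: str,
    topic_word: str,
    labels: List[str],
    normalize: Dict[str, str],
) -> None:
    """Index the topic word, the label words and their normalized forms for one text."""
    for word in [topic_word, *labels]:
        index[word].add(text_id)
        canonical = normalize.get(word)
        if canonical is not None:
            index[canonical].add(text_id)


# Hypothetical candidate label words with made-up vectors.
candidate_vectors = {
    "vomiting": (0.0, 0.0),
    "throwing up": (0.1, 0.0),
    "nausea": (0.2, 0.1),
    "weather": (5.0, 5.0),
}
labels = select_labels(candidate_vectors, max_distance=1.0)

index: Dict[str, Set[str]] = defaultdict(set)
add_to_index(index, "t1", "symptom", labels, normalize={"throwing up": "vomiting"})
```

In this sketch the outlier "weather" is far from the center word and is dropped, while "throwing up" is indexed both as written and under its normalized form "vomiting".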
CN201410079818.7A 2014-03-05 2014-03-05 A kind of method and apparatus of inquiry input information that establishing index and matching user Active CN103886034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410079818.7A CN103886034B (en) 2014-03-05 2014-03-05 A kind of method and apparatus of inquiry input information that establishing index and matching user

Publications (2)

Publication Number Publication Date
CN103886034A true CN103886034A (en) 2014-06-25
CN103886034B CN103886034B (en) 2019-03-19

Family

ID=50954926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410079818.7A Active CN103886034B (en) 2014-03-05 2014-03-05 A kind of method and apparatus of inquiry input information that establishing index and matching user

Country Status (1)

Country Link
CN (1) CN103886034B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5694523A (en) * 1995-05-31 1997-12-02 Oracle Corporation Content processing system for discourse
US20050246320A1 (en) * 2004-04-29 2005-11-03 International Business Machines Corporation Contextual flyout for search results
US20120166414A1 (en) * 2008-08-11 2012-06-28 Ultra Unilimited Corporation (dba Publish) Systems and methods for relevance scoring
CN103177036A (en) * 2011-12-23 2013-06-26 盛乐信息技术(上海)有限公司 Method and system for label automatic extraction
CN103294780A (en) * 2013-05-13 2013-09-11 百度在线网络技术(北京)有限公司 Directory mapping relationship mining device and directory mapping relationship mining device

Also Published As

Publication number Publication date
CN103886034B (en) 2019-03-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant