CN103886034B - Method and apparatus for building an index and matching a user's query input information
- Publication number: CN103886034B (application CN201410079818.7A)
- Authority
- CN
- China
- Prior art keywords
- word
- matching
- information
- label
- index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a method and apparatus for building an index and matching a user's query input information. Structured information is determined from text information and descriptors are extracted from it; label words are determined according to the theme corresponding to each descriptor; and an index is built for the descriptors and label words. Further, the query input information entered by a user is analyzed to obtain its descriptors and label words, which are matched against the previously built index to obtain candidate text information; the target text information matching the query input information is then determined according to the semantic matching degree between the candidate text information and the query input information. Compared with the prior art, the invention extracts themes and titles from encyclopedia-type or other web resource knowledge and forms an effective description of the content of that knowledge. Semantic search over such knowledge becomes more efficient, complex descriptive search needs that users cannot express precisely with keywords are met, and the user experience is improved.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a technique for building an index and matching a user's query input information.
Background art
When using a search engine, people often do not know which keywords express what they have in mind, and may instead enter a string of descriptive phrases, such as: 1) Vomiting on getting up in the morning, frequent palpitation and shortness of breath, weak limbs: what disease has these symptoms? 2) Songs that express longing for a lover? 3) A song containing the words "saying that forgets wealth and rank"? 4) "Eating hot pot and singing songs": which film is it from, and who said it? 5) Verses describing hard study? 6) "It is hard to be a person, and harder to be a woman": who said it, and what is the complete saying? Users may also enter syntactically complex expressions, for example about categories of people: "Which emperors and presidents came from Anhui?" or "Introduction to the current Politburo Standing Committee members from Shanxi". Search engines have difficulty finding suitable results in these cases.
The reason is that the search engines in common use today index mainly on titles. Although they generally also index content, factors such as weighting mean that some high-quality descriptive knowledge is rarely surfaced. For example, for resources such as songs and films, existing search engines usually index only the song title and film title. When a user does not remember the title but remembers only some lyrics, lines, a synopsis or a fragment of description, existing search engines cannot search effectively. The same happens for other categories of resources such as novels, poems, couplets, greetings, people, TV series, sentences, idioms and diseases.
Encyclopedia-type resource knowledge is usually indexed around the entry word. As a result, in a general-purpose ranking algorithm it is difficult to rank a page highly for words that do not appear in its title. Yet because encyclopedia-type resource knowledge is authoritative, ranking it near the top would serve users well. For example, if the symptoms of a disease in an encyclopedia are labeled and indexed, the corresponding resource knowledge can readily be provided to a user based on the symptoms the user describes.
Therefore, how to make effective use of existing resource knowledge, build an index for it, and match it to obtain the target text information corresponding to the query input information entered by a user has become one of the most pressing problems for those skilled in the art.
Summary of the invention
The object of the present invention is to provide a method and apparatus for building an index and matching a user's query input information.
According to one aspect of the invention, a method for building an index based on text information is provided, wherein the method includes the following steps:
a. determining structured information from the text information;
b. extracting a descriptor from the structured information;
c. determining, from the text information, label words corresponding to the theme that corresponds to the descriptor;
d. building an index for the descriptor and the label words.
According to another aspect of the invention, a method for matching a user's query input information against the index built as described above is also provided, wherein the method includes the following steps:
a. obtaining the query input information entered by a user;
b. performing theme and label analysis on the query input information to obtain the descriptor and label words corresponding to the query input information;
c. performing a matching query in the previously built index according to the descriptor and label words, to obtain candidate text information matching the query input information;
d. determining the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information.
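Step c of the matching method can be pictured with a small sketch. The function name `retrieve_candidates`, the toy index and the ranking by number of matched label words are all assumptions made for illustration; the semantic matching degree of step d is not specified in this part of the text and is not implemented here.

```python
def retrieve_candidates(index, descriptors, label_words):
    """Return ids of documents under a matching descriptor, ranked by label hits."""
    by_theme = set()
    for descriptor in descriptors:
        by_theme.update(index.get(descriptor, []))
    hits = {}
    for word in label_words:
        for doc_id in index.get(word, []):
            if doc_id in by_theme:
                hits[doc_id] = hits.get(doc_id, 0) + 1
    # More matched label words first; ties broken by id for determinism.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


# Hypothetical index: term -> list of document ids.
index = {
    "disease": ["ID1", "ID2", "ID3", "ID4"],
    "palpitation and shortness of breath": ["ID1", "ID2", "ID4"],
    "vomiting": ["ID3", "ID4"],
}
candidates = retrieve_candidates(
    index, ["disease"], ["vomiting", "palpitation and shortness of breath"]
)
```

The hit count is only a stand-in for a first-pass candidate ordering; a real system would pass these candidates on to the semantic matching of step d.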
According to another aspect of the invention, an index building apparatus for building an index based on text information is also provided, wherein the apparatus includes:
an information determining device, for determining structured information from the text information;
a theme extraction device, for extracting a descriptor from the structured information;
a label determining device, for determining, from the text information, label words corresponding to the theme that corresponds to the descriptor;
an index building device, for building an index for the descriptor and the label words.
According to a further aspect of the invention, a matching apparatus for matching a user's query input information against the index built as described above is also provided, wherein the apparatus includes:
a query acquisition device, for obtaining the query input information entered by a user;
an information analysis device, for performing theme and label analysis on the query input information to obtain the descriptor and label words corresponding to the query input information;
a matching query device, for performing a matching query in the previously built index according to the descriptor and label words, to obtain candidate text information matching the query input information;
a text determining device, for determining the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information.
According to a further aspect of the invention, a system for building an index and matching a user's query input information is also provided, including the index building apparatus and the matching apparatus described above.
Compared with the prior art, the present invention determines structured information from text information; extracts a descriptor from the structured information; determines, from the text information, label words corresponding to the theme that corresponds to the descriptor; and builds an index for the descriptor and the label words. Further, the invention obtains the query input information entered by a user; performs theme and label analysis on it to obtain the corresponding descriptor and label words; performs a matching query in the previously built index according to the descriptor and label words to obtain candidate text information matching the query input information; and determines the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information.
Based on encyclopedia-type resource knowledge or other resource knowledge mined from the web, the invention extracts themes and titles from it and forms an effective description of its content. This surfaces such high-quality resource knowledge better, makes semantic search over it more efficient, meets complex descriptive search needs that users cannot express precisely with keywords, and improves the user experience.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 shows a schematic diagram of an apparatus for building an index based on text information according to one aspect of the present invention;
Fig. 2 shows a schematic diagram of an apparatus for building an index based on text information according to a preferred embodiment of the present invention;
Fig. 3 shows a schematic diagram of an apparatus for matching a user's query input information according to another aspect of the present invention;
Fig. 4 shows a flowchart of a method for building an index based on text information according to another aspect of the present invention;
Fig. 5 shows a flowchart of a method for building an index based on text information according to another aspect of the present invention.
The same or similar reference numerals denote the same or similar components in the drawings.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of an apparatus for building an index based on text information according to one aspect of the present invention. The index building apparatus 1 includes an information determining device 101, a theme extraction device 102, a label determining device 103 and an index building device 104.
The information determining device 101 determines structured information from text information. Specifically, the information determining device 101 obtains text information, for example by interacting with a data source such as encyclopedia data, and then structures that text information, for example by analyzing the directory information and subdirectory information it contains, to determine the structured information.
For example, the information determining device 101 interacts with encyclopedia data sources such as Baidu Baike and Hudong Baike (Interactive Encyclopedia) to obtain their encyclopedia-type resource knowledge as text information. The information determining device 101 then structures the text information, for example by analyzing the directories and subdirectories of each piece of resource knowledge; for resource knowledge about a "disease", it analyzes the directory or subdirectory for its symptoms, the directory or subdirectory for its treatment methods, and so on.
As another example, the information determining device 101 mines resource knowledge from the Internet by data mining, uses it as text information, and then structures the text information to determine the structured information. For example, by mining vertical resource websites, the information determining device 101 obtains diseases together with their symptom descriptions, treatment methods, specialist hospitals and similar information. Each resource is organized with the disease as its ID. First, some candidate seed words are provided for each category; for diseases, these might be coronary heart disease, myocarditis, gastritis and so on. The top-ranked common website URLs are obtained from the search results and the structure of each website is analyzed. From them, coronary heart disease, its symptoms, its treatment methods and its specialist hospitals are extracted, integrated under the "disease" category, and organized into an entry card for coronary heart disease, which is stored. "Coronary heart disease" then serves as the final text information, and the corresponding "symptoms of coronary heart disease", "treatment methods for coronary heart disease" and "specialist hospitals for coronary heart disease" serve as the structured information of that text information.
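The organization of mined fields into one entry per disease can be sketched as follows. This is only an illustration under assumed names: `EntryCard`, `build_card` and the field names are hypothetical, and the field texts are placeholders rather than data from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class EntryCard:
    """One organized resource entry, keyed by its entry name."""
    entry_id: str                               # e.g. the disease name used as ID
    category: str                               # hypothetical category label
    fields: dict = field(default_factory=dict)  # field name -> text


def build_card(entry_id, category, sections):
    """Group (field name, text) pairs mined for one entry into a single card."""
    card = EntryCard(entry_id=entry_id, category=category)
    for name, text in sections:
        # Concatenate repeated fields instead of overwriting them.
        card.fields[name] = (card.fields.get(name, "") + " " + text).strip()
    return card


card = build_card(
    "coronary heart disease",
    "disease",
    [
        ("symptom", "palpitation and shortness of breath"),
        ("treatment", "placeholder treatment text"),
        ("symptom", "chest tightness"),
    ],
)
```

Here the entry name plays the role of the text information and the `fields` mapping plays the role of its structured information.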
Those skilled in the art will understand that the above manner of determining structured information is merely an example; other existing or future manners of determining structured information, if applicable to the present invention, are also intended to fall within its scope of protection and are incorporated herein by reference.
The theme extraction device 102 extracts a descriptor from the structured information. Specifically, the theme extraction device 102 takes the structured information determined by the information determining device 101 and extracts a descriptor from it, for example by means of a theme classifier or another predetermined descriptor extraction method.
Here, the purpose of extracting a descriptor is to extract from the text information a theme that represents it, which serves both the building of the semantic index and the subsequent semantic matching computation.
Preferably, the index building apparatus 1 further includes a theme training device (not shown). The theme training device obtains a training corpus corresponding to a predetermined theme system and trains a theme classifier on that corpus; the theme extraction device 102 then extracts the descriptor from the structured information using the theme classifier.
Specifically, the theme training device determines the predetermined theme system. For example, from statistics on the search sequences entered by a large number of web search users, it determines the common search needs of web search users, combines them with classification systems in common use at present, such as the existing systems of encyclopedia and Q&A ("Zhidao") sites, and determines a theme classification system for which there is real demand, which it uses as the predetermined theme system. The theme training device then obtains a training corpus corresponding to the predetermined theme system. For example, if an article carries the site location identifier "medical health / internal medicine", the data is treated as training corpus for the disease category. Finally, the theme training device trains a theme classifier on the training corpus, for example by training an SVM classification model to serve as the theme classifier.
The theme extraction device 102 then extracts the descriptor from the structured information using the theme classifier trained by the theme training device. For example, the theme extraction device 102 feeds the word "coronary heart disease" and its structured information, such as symptoms and treatment methods, into the theme classifier and obtains the theme "disease". As another example, for a new encyclopedia entry card, the theme extraction device 102 feeds it into the theme classifier, such as an SVM classifier, to obtain the theme corresponding to the category of that entry card.
Preferably, the theme extraction device 102 can also expand the extracted theme with synonymous expressions, for example expanding the theme "disease" by adding a synonymous theme such as "illness".
Those skilled in the art will understand that the above manner of extracting descriptors is merely an example; other existing or future manners of extracting descriptors, if applicable to the present invention, are also intended to fall within its scope of protection and are incorporated herein by reference.
The label determining device 103 determines, from the text information, the label words corresponding to the theme that corresponds to the descriptor. Specifically, based on the descriptor extracted by the theme extraction device 102 and the theme corresponding to that descriptor, the label determining device 103 determines the label words corresponding to the theme from the text information. For example, for text information whose theme is disease, the label determining device 103 determines label words such as: palpitation and shortness of breath, chest tightness, diarrhea, vomiting, weak limbs, and so on.
Preferably, the label determining device 103 includes a candidate determination unit (not shown), a center word determination unit (not shown) and a label determination unit (not shown). Specifically, the candidate determination unit determines, from the text information, at least one candidate label word corresponding to the theme that corresponds to the descriptor. For example, the candidate determination unit computes unigram, bigram and trigram statistics over all page data organized by entry, and extracts the words that appear in more than a certain number of pages as candidate label words.
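A minimal sketch of this candidate extraction step follows. It assumes the pages are already tokenized into word lists, and the names `candidate_labels` and `page_threshold` are hypothetical; the patent does not state the threshold value or the tokenization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(pages, page_threshold=1, max_n=3):
    """Return n-grams (n = 1..max_n) that occur in more than page_threshold pages."""
    page_freq = Counter()
    for tokens in pages:
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        page_freq.update(seen)  # count each n-gram at most once per page
    return {gram for gram, count in page_freq.items() if count > page_threshold}


# Toy page data, one token list per page.
pages = [
    ["chest", "tightness", "vomiting"],
    ["vomiting", "chest", "tightness"],
    ["diarrhea", "vomiting"],
]
labels = candidate_labels(pages, page_threshold=1)
```

Counting page frequency rather than raw term frequency matches the wording "appear in more than a certain number of pages": a word repeated many times on a single page does not qualify on that basis alone.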
Next, the center word determination unit determines the corresponding center word from the at least one candidate label word. The label determination unit then determines the label words corresponding to the theme according to the distance between each candidate label word and the center word.
For example, the center word determination unit merges all the candidate label words from the label data counted above and computes offline statistics for them as follows: over large-scale text, such as whole-web data, it counts how often the words co-occur in the same document. For any two candidate label words, the similarity between them is computed by the following formula:
[formula not reproduced in the source text]
Here, PMI(w', w1) denotes the mutual information score between w' and w1, defined as [formula not reproduced in the source text], and P(w) denotes the probability of word w obtained from the statistics.
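The similarity and PMI formulas are not legible in this text. As a point of reference only, the sketch below uses the standard definition of pointwise mutual information, PMI(a, b) = log(P(a, b) / (P(a) P(b))), with probabilities estimated from document-level co-occurrence. That the patent uses exactly this definition is an assumption, and `pmi_table` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """Compute PMI for every co-occurring word pair from document frequencies."""
    n_docs = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = sorted(set(doc))
        word_df.update(words)
        pair_df.update(combinations(words, 2))  # pairs are alphabetically ordered
    table = {}
    for (a, b), joint in pair_df.items():
        p_ab = joint / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        table[(a, b)] = math.log(p_ab / (p_a * p_b))
    return table


# Toy documents, each a list of candidate label words found in it.
docs = [
    ["vomiting", "nausea"],
    ["vomiting", "nausea"],
    ["vomiting", "cough"],
    ["cough", "fever"],
]
pmi = pmi_table(docs)
```

Pairs that co-occur more often than chance get a positive score and pairs that co-occur less often get a negative one, which is what makes the score usable as an ingredient of a similarity measure.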
Next, the center word determination unit determines, according to the theme, which fields of the text information need to be analyzed, for example the symptom category for a disease, the poem itself and its annotation for a poem, or the description section for a person. From those fields it extracts all the words that occur among the candidate label words, together with their synonyms, and forms these words into a center, which serves as the center word corresponding to the at least one candidate label word.
The label determination unit then computes the distance between each candidate label word and the center word. For example, let T denote the center word; the distance between a candidate label word and the center word can then be computed by the following formula:
[formula not reproduced in the source text]
Here, Num(T) denotes the number of words contained in the center word.
The label determination unit then determines the label words corresponding to the theme according to the distance between each candidate label word and the center word, for example by taking the candidate label words whose distance to the center word is less than a predetermined threshold as the label words corresponding to the theme.
Preferably, as shown in Fig. 3, when the label determination unit ranks the candidate label words by their distance to the center word, if the slope of the change between adjacent ranks exceeds a predetermined slope threshold, the subsequent nodes are truncated, for example between the 5th and 6th ranked points in Fig. 3.
Here, the slope threshold is set empirically, for example from statistics on the overall distribution of scores.
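The truncation rule can be sketched as below. The text does not say how the slope is computed, so this sketch assumes it is the difference between the distances at adjacent ranks; `truncate_by_slope` and the sample distances are hypothetical.

```python
def truncate_by_slope(distances, slope_threshold):
    """Sort distances ascending and cut the list before the first steep jump."""
    ranked = sorted(distances)
    for i in range(1, len(ranked)):
        # Adjacent ranks are one step apart, so the slope is the difference.
        if ranked[i] - ranked[i - 1] > slope_threshold:
            return ranked[:i]
    return ranked


# Seven hypothetical distances with a jump between the 5th and 6th rank.
kept = truncate_by_slope([0.10, 0.12, 0.15, 0.17, 0.20, 0.60, 0.65], 0.2)
```

Compared with a fixed distance cutoff, a slope cutoff adapts to the score distribution of each theme: it keeps the tight cluster nearest the center word regardless of the absolute distances involved.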
Those skilled in the art will understand that the above manner of determining label words is merely an example; other existing or future manners of determining label words, if applicable to the present invention, are also intended to fall within its scope of protection and are incorporated herein by reference.
More preferably, the center word determination unit filters the at least one candidate label word according to a predetermined filtering rule to obtain at least one filtered candidate label word, and determines the center word from the at least one filtered candidate label word; wherein the predetermined filtering rule is determined based on at least any one of the following:
the part of speech of the at least one candidate label word;
the word formation rule of the at least one candidate label word;
the co-occurrence ratio of the at least one candidate label word and the theme.
Specifically, noise may be introduced while the candidate label words are being counted, so the candidate label words need to be filtered. The center word determination unit filters the at least one candidate label word according to the predetermined filtering rule to obtain at least one filtered candidate label word.
For example, the center word determination unit filters the at least one candidate label word according to its part of speech, for instance by filtering on the first word and the last word of each candidate label word.
As another example, the center word determination unit filters the at least one candidate label word according to its word formation rule; for instance, the first character of a candidate label word cannot be a word such as " ", "doing", "quilt" or "ratio", and its last character cannot be a word such as "when", "arriving" or "obtaining".
As another example, the center word determination unit filters the at least one candidate label word according to its co-occurrence ratio with the theme. For instance, the center word determination unit counts the co-occurrence ratio of each candidate label word and the theme in search logs and in titles across the whole web, and retains only the candidates that have co-occurred with the theme, or only the candidates whose co-occurrence ratio with the theme is greater than a predetermined threshold.
Preferably, the center word determination unit filters the at least one candidate label word by combining any two of the above predetermined filtering rules, or by considering all three together.
The center word determination unit then determines the center word from the at least one filtered candidate label word.
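Two of the three filtering rules can be combined in a short sketch. The forbidden first and last tokens, the co-occurrence ratios and the threshold below are invented for illustration, and the part-of-speech rule is left out because it would need a tagger; `filter_candidates` is a hypothetical name.

```python
def filter_candidates(candidates, bad_first, bad_last, cooccur_ratio, min_ratio):
    """Keep candidates that pass the word rule and the co-occurrence rule."""
    kept = []
    for cand in candidates:
        tokens = cand.split()
        if not tokens:
            continue  # ignore empty candidates
        if tokens[0] in bad_first or tokens[-1] in bad_last:
            continue  # word rule: forbidden first or last token
        if cooccur_ratio.get(cand, 0.0) <= min_ratio:
            continue  # co-occurrence rule: too rarely seen with the theme
        kept.append(cand)
    return kept


# Hypothetical inputs; the ratios stand for co-occurrence with the theme "disease".
kept = filter_candidates(
    ["chest tightness", "of vomiting", "vomiting when", "diarrhea"],
    bad_first={"of"},
    bad_last={"when"},
    cooccur_ratio={
        "chest tightness": 0.4,
        "of vomiting": 0.5,
        "vomiting when": 0.3,
        "diarrhea": 0.01,
    },
    min_ratio=0.05,
)
```

Applying the cheap word rule before the co-occurrence lookup is a design choice of this sketch, not something the text prescribes.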
Those skilled in the art will understand that the above predetermined filtering rules are merely examples; other existing or future predetermined filtering rules, if applicable to the present invention, are also intended to fall within its scope of protection and are incorporated herein by reference.
The index building device 104 builds an index for the descriptor and the label words. Specifically, the index building device 104 builds an index for the descriptor extracted by the theme extraction device 102 and the label words determined by the label determining device 103.
For example, suppose the document corresponding to coronary heart disease is ID1 and the importance of a term x in that document is WC1(x), where x may be, for example, "disease" or "palpitation and shortness of breath"; the document corresponding to myocarditis is ID2, the document corresponding to gastritis is ID3, and the document corresponding to apoplexy is ID4. The index building device 104 builds a unified inverted index over descriptors and label words as follows:
Disease - ID1(WC1(x)), ID2(WC2(x)), ID3(WC3(x)), ID4(WC4(x))
Palpitation and shortness of breath - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Shortness of breath and palpitation - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Vomiting - ID3(WC3(x)), ID4(WC4(x))
Spit - ID3(WC3(x)), ID4(WC4(x))
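A unified inverted index of this shape can be sketched as follows. The document ids mirror the example above, but the importance values are arbitrary placeholders and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs maps doc id -> {term: importance}; returns term -> posting list."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, weight in docs[doc_id].items():
            index[term].append((doc_id, weight))
    return dict(index)


# Descriptors and label words share one index; the weights are placeholders.
docs = {
    "ID1": {"disease": 1.0, "palpitation and shortness of breath": 0.8},
    "ID2": {"disease": 1.0, "palpitation and shortness of breath": 0.6},
    "ID3": {"disease": 1.0, "vomiting": 0.7},
    "ID4": {"disease": 1.0, "palpitation and shortness of breath": 0.5, "vomiting": 0.4},
}
index = build_inverted_index(docs)
```

Each posting pairs a document id with the importance of the term in that document, which corresponds to the ID(WC(x)) notation in the example.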
Preferably, the index building apparatus 1 further includes a normalization device (not shown). If the label words include multiple label words with the same meaning, the normalization device determines the normalized form of those label words; the index building device 104 then builds an index for the descriptor, the label words and the normalized form.
Specifically, the label words corresponding to the descriptor "disease" may include multiple label words with the same meaning. For example, "spitting" and "nausea and vomiting" have the same meaning, so the normalization device determines that the normalized form of the two label words is "vomiting". The index building device 104 then builds an index for the descriptor "disease", the label words "spitting" and "nausea and vomiting", and the normalized form "vomiting".
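One way to picture indexing the normalized form alongside its variants is sketched below. The function `add_normalized_terms`, the synonym table and the document ids are assumptions for illustration; how the normalization device decides that two label words share a meaning is not described here.

```python
def add_normalized_terms(index, synonym_groups):
    """Add one posting list per normalized form, merging those of its variants."""
    out = {term: list(postings) for term, postings in index.items()}
    for normalized, variants in synonym_groups.items():
        merged = set(out.get(normalized, []))
        for variant in variants:
            merged.update(index.get(variant, []))
        out[normalized] = sorted(merged)
    return out


# Hypothetical postings for two label words with the same meaning.
index = {"spitting": ["ID3"], "nausea and vomiting": ["ID4"]}
full = add_normalized_terms(index, {"vomiting": ["spitting", "nausea and vomiting"]})
```

The original label words stay in the index, so a query that uses a variant verbatim still matches, while a query normalized to "vomiting" reaches the documents of both variants.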
Those skilled in the art will understand that the above manner of building an index is merely an example; other existing or future manners of building an index, if applicable to the present invention, are also intended to fall within its scope of protection and are incorporated herein by reference.
In general, indexes are built only for keywords. Here, the index building apparatus 1 also builds an index for descriptors, label words and their normalized forms, so that a user's query input information can be matched better with resource knowledge.
Preferably, the devices of the index building apparatus 1 work continuously. Specifically, the information determining device 101 determines structured information from text information; the theme extraction device 102 extracts a descriptor from the structured information; the label determining device 103 determines, from the text information, the label words corresponding to the theme that corresponds to the descriptor; and the index building device 104 builds an index for the descriptor and the label words. Here, those skilled in the art will understand that "continuously" means that each device of the index building apparatus 1 performs the determination of structured information, the extraction of descriptors, the determination of label words and the building of the index according to its configured or dynamically adjusted operating mode, until the index building apparatus 1 stops determining structured information for an extended period.
Here, the index building apparatus 1 determines structured information from text information; extracts a descriptor from the structured information; determines, from the text information, the label words corresponding to the theme that corresponds to the descriptor; and builds an index for the descriptor and the label words. Based on encyclopedia-type resource knowledge or other resource knowledge mined from the web, the index building apparatus extracts themes and titles from it and forms an effective description of its content. This surfaces such high-quality resource knowledge better, makes subsequent semantic search over it more efficient, meets complex descriptive search needs that users cannot express precisely with keywords, and improves the user experience.
Fig. 2 shows a schematic diagram of an apparatus for matching a user's query input information according to another aspect of the present invention. The matching apparatus 2 includes a query acquisition device 201, an information analysis device 202, a matching query device 203 and a text determining device 204.
The query acquisition device 201 obtains the query input information entered by a user. Specifically, the user enters query input information by interacting with a user device, and the query acquisition device 201 obtains that query input information by calling an application programming interface (API) provided by the user device, by using dynamic page technologies such as JSP, ASP or PHP, or by another agreed means of communication.
Here, the inquiry input information includes, but is not limited to, inquiry input information that the user submits through different input modes such as text input, voice input, and image input.
Those skilled in the art will understand that the above manner of obtaining inquiry input information is only an example. Other existing or future manners of obtaining inquiry input information, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
The information analysis device 202 performs theme and label analysis on the inquiry input information to obtain the descriptor and label word corresponding to it. Specifically, the information analysis device 202 performs theme and label analysis on the inquiry input information obtained by the inquiry acquisition device 201. For example, it feeds the inquiry input information into the previously trained subject classifier to obtain the corresponding descriptor, and it performs label analysis on the inquiry input information to obtain the corresponding label word. Here, the manner in which the information analysis device 202 performs label analysis on the inquiry input information is the same as or similar to the manner in which the label determining device 103 determines the label words of text information, so it is not repeated here and is incorporated herein by reference.
The matching inquiry device 203 carries out a matching inquiry, according to the descriptor and label word, in the index established by the index establishing device 104, to obtain candidate text information matching the inquiry input information. Specifically, for the inquiry input information obtained by the inquiry acquisition device 201, the matching inquiry device 203 carries out a matching inquiry in the index established by the index establishing device 104, for example by full matching or partial matching. It obtains the text information that hits the descriptor corresponding to the inquiry input information, or that hits the label word corresponding to the inquiry input information, and takes it as the candidate text information matching the inquiry input information.
For example, suppose the inquiry input information entered by the user is "palpitation and short breath". The inquiry acquisition device 201 obtains this inquiry input information, and the information analysis device 202 performs label analysis on it and obtains the label word "palpitation and short breath". The index that the index establishing device 104 has established for the label word "palpitation and short breath" is as follows:
palpitation and short breath: ID1 (WC1(x)), ID2 (WC2(x)), ID4 (WC4(x))
Here, ID1, ID2, and ID4 are the ID numbers of the text information that includes the label word "palpitation and short breath", and WC1(x), WC2(x), and WC4(x) denote the importance of the label word "palpitation and short breath" in each of these text information items.
The matching inquiry device 203 then carries out a matching inquiry, according to the label word "palpitation and short breath" corresponding to the user's inquiry input information, in the index established by the index establishing device 104. According to the index above, it obtains the candidate text information corresponding to the inquiry input information "palpitation and short breath": text information ID1, ID2, and ID4.
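The lookup in this example can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation of the invention: the dictionary layout, the function name lookup_candidates, and the numeric values standing in for WC1(x), WC2(x), and WC4(x) are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# (text information ID, importance of the label word in that text)
Posting = Tuple[str, float]

# Hypothetical inverted index; the importance values are invented placeholders.
index: Dict[str, List[Posting]] = {
    "palpitation and short breath": [("ID1", 0.8), ("ID2", 0.6), ("ID4", 0.3)],
}


def lookup_candidates(label_words: List[str],
                      inverted_index: Dict[str, List[Posting]]) -> List[str]:
    """Return the IDs of all texts hit by at least one label word, in index order."""
    candidates: List[str] = []
    for word in label_words:
        for doc_id, _importance in inverted_index.get(word, []):
            if doc_id not in candidates:
                candidates.append(doc_id)
    return candidates


candidates = lookup_candidates(["palpitation and short breath"], index)
```

A label word that is absent from the index simply contributes no candidates.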
Those skilled in the art will understand that the above manner of matching inquiry is only an example. Other existing or future manners of matching inquiry, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
The text determining device 204 determines the target text information that matches the inquiry input information according to the semantic matching degree between the candidate text information and the inquiry input information.
Specifically, there is a certain semantic matching degree between each candidate text information and the inquiry input information. This semantic matching degree can be obtained by calculation, or more specifically by calculating the matching degree between the index word set corresponding to the candidate text information and the matching word set corresponding to the inquiry input information. The text determining device 204 determines the target text information that matches the inquiry input information according to the semantic matching degree between the candidate text information and the user's inquiry input information. For example, it takes the candidate text information with the highest semantic matching degree as the target text information, or it takes every candidate text information whose semantic matching degree is greater than a predetermined matching degree threshold as the target text information.
Here, the predetermined matching degree threshold is the semantic matching degree threshold used to judge whether candidate text information matches the inquiry input information. Its value may be fixed in advance or adjusted according to actual conditions.
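The two selection strategies just described (highest semantic matching degree, or every candidate above the predetermined threshold) can be sketched as follows. The function names and the score values are hypothetical, and how the scores themselves are computed is left open here.

```python
from typing import Dict, List


def select_best(scores: Dict[str, float]) -> List[str]:
    """Strategy 1: keep the candidate(s) with the highest semantic matching degree."""
    if not scores:
        return []
    best = max(scores.values())
    return [doc_id for doc_id, score in scores.items() if score == best]


def select_above(scores: Dict[str, float], threshold: float) -> List[str]:
    """Strategy 2: keep every candidate whose degree is greater than the threshold."""
    return [doc_id for doc_id, score in scores.items() if score > threshold]


# Illustrative semantic matching degrees only; not taken from the source.
scores = {"ID1": 0.42, "ID2": 0.91, "ID4": 0.65}
best_targets = select_best(scores)
threshold_targets = select_above(scores, 0.5)
```

The threshold strategy can return several targets or none, whereas the first strategy returns at least one target whenever any candidate exists.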
Preferably, the text determining device further includes a matching calculation unit (not shown) and a text determination unit (not shown). The matching calculation unit calculates the semantic matching degree between the candidate text information and the inquiry input information. The text determination unit determines, according to the semantic matching degree and a predetermined matching degree threshold, the target text information that matches the inquiry input information.
For example, the matching calculation unit calculates the semantic matching degree between the candidate text information and the user's inquiry input information using an existing matching degree calculation method. When the semantic matching degree is greater than the predetermined matching degree threshold, the text determination unit takes the candidate text information as the target text information that matches the inquiry input information.
Preferably, the text determining device also determines the target text information corresponding to the inquiry input information according to the index word set corresponding to the candidate text information and the matching word set corresponding to the inquiry input information. Specifically, each candidate text information has a corresponding index word set. Suppose, in the example above, that the theme of candidate text information ID1 is "coronary heart disease" and that its index terms include "disease", "palpitation and short breath", and so on; the set formed by these index terms is the index word set of candidate text information ID1. The user's inquiry input information likewise has a corresponding matching word set. For example, matching words are obtained by performing word segmentation on the inquiry input information, and the set formed by these matching words is the matching word set of the inquiry input information. Suppose the inquiry input information entered by the user is "palpitation and short breath vomiting". After the matching unit 2 performs word segmentation on it, the matching words "palpitation and short breath" and "vomiting" are obtained, and the set formed by these two matching words is the matching word set of the inquiry input information. The text determining device 204 determines, according to the index word set and the matching word set, the target text information that matches the user's inquiry input information. For example, it takes the text information whose index word set hits the most matching words in the matching word set as the target text information; or it takes every text information whose index word set hits more matching words than a predetermined quantity threshold as the target text information.
For example, for candidate text information ID1, ID2, and ID4 in the example above, the index word set of ID1 includes the index terms "disease" and "palpitation and short breath"; the index word set of ID2 includes the index terms "palpitation and short breath", "vomiting", and "disease"; and the index word set of ID4 includes the index term "palpitation and short breath". For the inquiry input information "palpitation and short breath vomiting", the matching words are "palpitation and short breath" and "vomiting". The index word set of ID2 hits the most matching words in the matching word set, so candidate text information ID2 is taken as the target text information that best matches the inquiry input information. Alternatively, suppose the predetermined quantity threshold is 0. The index word sets of ID1, ID2, and ID4 each hit more matching words than this threshold, so ID1, ID2, and ID4 are all taken as target text information matching the inquiry input information. When providing them to the user, the matching unit 2 may rank them by the importance of the corresponding index terms in each candidate text information.
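The hit counting in this example can be reproduced with plain set intersection. The sketch below uses the index word sets and matching words given above; the function names are assumptions, and the ranking by importance mentioned at the end of the example is not modeled.

```python
from typing import Dict, List, Set

# Index word sets and matching words taken from the worked example above.
index_word_sets: Dict[str, Set[str]] = {
    "ID1": {"disease", "palpitation and short breath"},
    "ID2": {"palpitation and short breath", "vomiting", "disease"},
    "ID4": {"palpitation and short breath"},
}
matching_words: Set[str] = {"palpitation and short breath", "vomiting"}


def hit_counts(word_sets: Dict[str, Set[str]], query_words: Set[str]) -> Dict[str, int]:
    """Count, for each text, how many matching words its index word set hits."""
    return {doc_id: len(words & query_words) for doc_id, words in word_sets.items()}


def most_hits(counts: Dict[str, int]) -> List[str]:
    """Texts whose index word set hits the largest number of matching words."""
    if not counts:
        return []
    top = max(counts.values())
    return [doc_id for doc_id, count in counts.items() if count == top]


def above_quantity_threshold(counts: Dict[str, int], threshold: int) -> List[str]:
    """Texts whose hit count is greater than the predetermined quantity threshold."""
    return [doc_id for doc_id, count in counts.items() if count > threshold]


counts = hit_counts(index_word_sets, matching_words)
```

With these inputs, ID2 hits two matching words and ID1 and ID4 hit one each, so the first strategy selects ID2 and a threshold of 0 selects all three.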
Those skilled in the art will understand that the above manner of determining target text information is only an example. Other existing or future manners of determining target text information, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, the devices of the matching unit 2 work continuously with one another. Specifically, the inquiry acquisition device 201 obtains the inquiry input information entered by the user; the information analysis device 202 performs theme and label analysis on the inquiry input information to obtain the corresponding descriptor and label word; the matching inquiry device 203 carries out a matching inquiry, according to the descriptor and label word, in the index established by the index establishing device 104, to obtain candidate text information matching the inquiry input information; and the text determining device 204 determines the target text information matching the inquiry input information according to the semantic matching degree between the candidate text information and the inquiry input information. Here, those skilled in the art should understand that "continuously" means that the devices of the matching unit 2 each carry out the acquisition of inquiry input information, the theme and label analysis, the matching inquiry for candidate text information, and the determination of target text information according to a preset or real-time adjusted working mode, until the matching unit 2 stops obtaining inquiry input information from users for an extended period.
Here, the devices of the index establishing equipment 1 and the matching unit 2 cooperate so that, based on the inquiry input information entered by the user, the corresponding target text information is obtained by matching. Working from encyclopedia-type resource knowledge or other resource knowledge mined from the web, themes and titles are extracted to form an effective description of the content of that resource knowledge and to present such high-quality resource knowledge better. Semantic search over this kind of resource knowledge thereby becomes more efficient, complex descriptive search needs that users cannot express accurately with keywords are met, and the user experience is improved.
Preferably, the descriptor and the label word may also be regarded as belonging to two different domains, a subject domain and a label domain respectively. The matching inquiry device 203 carries out a matching inquiry, according to the descriptor and the label word, in the indexes corresponding to the subject domain and the label domain respectively, to obtain the candidate text information matching the inquiry input information.
Specifically, using the descriptor and label word that the information analysis device 202 obtained by analyzing the user's inquiry input information, the matching inquiry device 203 adopts domain-by-domain matching and carries out a matching inquiry in the indexes corresponding to the subject domain and the label domain respectively, to obtain the candidate text information.
Here, the subject domain and the label domain can be obtained by analyzing the inquiry input information. For example, the inquiry input information entered by the user is analyzed with the subject classifier described above to obtain its subject category.
Here, the indexes corresponding to the subject domain and the label domain are the indexes established by the index establishing device 104. Based on the labels established earlier, label words are extracted from the inquiry input information entered by the user; for example, a word that appears in the inquiry input information and is also in the label set is extracted. The label words and the subject category are then used to retrieve the posting lists from the unified index of themes and labels, and the documents containing the subject category or a label word are taken as the candidate text information corresponding to the inquiry input information and take part in the subsequent calculation.
Preferably, the matching inquiry device 203 may also take into account the weights of the subject domain and the label domain. It carries out the matching inquiry in the corresponding indexes, considers the weights of the subject domain and the label domain together, and finally obtains the candidate text information.
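One way to picture domain-by-domain matching with weights is the sketch below. The source does not say how the two domain weights are combined, so the weighted sum of hits used here is an assumption, as are the function name domain_match, the weight values, and the extra text ID3 in the subject-domain index.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical per-domain inverted indexes: domain -> term -> text IDs.
domain_indexes: Dict[str, Dict[str, List[str]]] = {
    "subject": {"disease": ["ID1", "ID2", "ID3"]},
    "label": {"palpitation and short breath": ["ID1", "ID2", "ID4"]},
}


def domain_match(query_terms: Dict[str, List[str]],
                 indexes: Dict[str, Dict[str, List[str]]],
                 weights: Dict[str, float]) -> Dict[str, float]:
    """Look up each term in its own domain index and add that domain's weight per hit."""
    scores: Dict[str, float] = defaultdict(float)
    for domain, terms in query_terms.items():
        postings = indexes.get(domain, {})
        weight = weights.get(domain, 1.0)
        for term in terms:
            for doc_id in postings.get(term, []):
                scores[doc_id] += weight
    return dict(scores)


query_terms = {"subject": ["disease"], "label": ["palpitation and short breath"]}
scores = domain_match(query_terms, domain_indexes, {"subject": 1.0, "label": 2.0})
```

Texts hit in both domains accumulate both weights, so a cut-off on the combined score can favor candidates that agree with the inquiry on theme and on label.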
Preferably, the text determining device 204 determines, according to the matching words included in the matching word set, a target index word set among the index word sets corresponding to the candidate text information, where the target index word set is the one that hits the most matching words in the matching word set. If the similarity between the target index word set and the matching word set is greater than a predetermined threshold, the text information corresponding to the target index word set is taken as the target text information matching the inquiry input information.
Specifically, the text determining device 204 counts how many matching words of the matching word set are hit by the index word set of each candidate text information, and takes the index word set with the most hits as the target index word set. The text determining device 204 then calculates the similarity between the target index word set and the matching word set. For example, it calculates the similarity between each hit index term in the target index word set and the corresponding matching word in the matching word set, and then combines these values by simple addition, weighted averaging, or a similar method to obtain the similarity between the target index word set and the matching word set. When this similarity is greater than the predetermined threshold, the text determining device takes the text information corresponding to the target index word set as the target text information matching the inquiry input information.
Here, the predetermined threshold is the similarity threshold used to judge, from the similarity between the target index word set and the matching word set, whether the text information corresponding to the target index word set is taken as the target text information. Its value may be fixed or adjusted according to actual conditions.
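The selection of a target index word set followed by a similarity check can be sketched as below. The source leaves the word-level similarity and the aggregation open, so the exact-match similarity, the plain average over matching words, and the names exact_similarity and pick_target are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, Optional, Set


def exact_similarity(a: str, b: str) -> float:
    """Placeholder word similarity (assumption): 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


def pick_target(word_sets: Dict[str, Set[str]],
                matching_words: Set[str],
                threshold: float,
                word_sim: Callable[[str, str], float] = exact_similarity) -> Optional[str]:
    """Return the ID of the target text, or None if the similarity is too low."""
    if not word_sets or not matching_words:
        return None
    # Step 1: the target index word set is the one hitting the most matching words.
    target_id = max(word_sets, key=lambda d: len(word_sets[d] & matching_words))
    target_words = word_sets[target_id]
    if not target_words:
        return None
    # Step 2: score each matching word by its best index term, then average.
    per_word = [max(word_sim(m, t) for t in target_words) for m in matching_words]
    similarity = sum(per_word) / len(per_word)
    # Step 3: accept the target only if the similarity exceeds the threshold.
    return target_id if similarity > threshold else None


index_word_sets = {
    "ID1": {"disease", "palpitation and short breath"},
    "ID2": {"palpitation and short breath", "vomiting", "disease"},
    "ID4": {"palpitation and short breath"},
}
```

Passing a different word_sim function, for instance one that returns a high score for synonyms, changes the similarity without changing the selection logic.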
Preferably, the matching unit 2 further includes a word set determining device (not shown). The word set determining device performs word segmentation on the inquiry input information to obtain the resulting segmented words, and merges the segmented words with the descriptor and label word obtained by the information analysis device 202 to obtain the matching word set corresponding to the inquiry input information, where the words included in the matching word set serve as matching words. The matching calculation unit then calculates the semantic matching degree between the candidate text information and the inquiry input information according to the matching word set and the index word set corresponding to the candidate text information.
Specifically, the word set determining device performs word segmentation on the inquiry input information obtained by the inquiry acquisition device 201 to obtain segmented words. Preferably, the word set determining device may also apply filtering, such as stop word removal, to the segmented words to obtain the final segmented words. The word set determining device then merges the segmented words with the descriptor and label word obtained by the information analysis device 202, removes redundancy, and so on, to obtain the final matching word set corresponding to the inquiry input information. The words included in the matching word set serve as the matching words corresponding to the inquiry input information.
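The segmentation, filtering, merging, and redundancy removal steps can be sketched as follows. Whitespace splitting stands in for a real word segmenter, and the stop word list and the function name build_matching_word_set are hypothetical; a production system for Chinese text would need a proper segmentation component.

```python
from typing import Iterable, List

# Hypothetical stop word list used only for this illustration.
STOP_WORDS = {"and", "the", "of", "a"}


def build_matching_word_set(query: str,
                            descriptors: Iterable[str],
                            label_words: Iterable[str]) -> List[str]:
    """Segment the inquiry, drop stop words, merge with descriptors and label
    words, and remove duplicates while keeping first-seen order."""
    # Whitespace splitting is a stand-in for real word segmentation (assumption).
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    merged: List[str] = []
    for word in [*tokens, *descriptors, *label_words]:
        if word not in merged:
            merged.append(word)
    return merged


matching_word_set = build_matching_word_set(
    "vomiting and chest pain", descriptors=["disease"], label_words=["vomiting"]
)
```

Here the label word "vomiting" duplicates a segmented word and is therefore kept only once in the resulting matching word set.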
The matching calculation unit then calculates the semantic matching degree between the candidate text information and the inquiry input information according to the matching word set and the index word set corresponding to the candidate text information.
More preferably, the matching unit 2 further includes a post-processing device (not shown). The post-processing device performs post-processing on the matching words to update the matching word set, where the post-processing includes at least one of the following:
determining the mutually synonymous matching words included in the matching words, and merging the mutually synonymous matching words into a subset of the matching word set;
performing synonymous extension on the matching words, and determining the synonyms obtained by the synonymous extension, together with the matching words, as subsets of the matching word set.
Specifically, the post-processing device performs post-processing on the matching words in the matching word set determined by the word set determining device, to update the matching word set. For example, the post-processing device determines the mutually synonymous matching words included in the matching words and merges them into a subset of the matching word set. Because the matching words may include mutually synonymous words, such as "vomiting" and "throwing up", the post-processing device merges these mutually synonymous matching words into a subset of the matching word set.
For example, suppose the inquiry input information entered by the user is Q. After the word set determining device performs word segmentation on the inquiry input information and applies filtering such as stop word removal, the matching word set in the label domain is expressed as Q = {a, b, c, d, e}, where a, b, c, d, and e are the matching words included in the matching word set. Suppose matching words a and b are mutually synonymous. The post-processing device then merges a and b into a subset of the matching word set, and the updated matching word set is expressed as Q = {{a, b}, c, d, e}. Subsequent devices, such as the matching inquiry device 203, then carry out the subsequent matching inquiry operations.
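The merge from Q = {a, b, c, d, e} to Q = {{a, b}, c, d, e} can be sketched as grouping words by known synonym pairs. Representing every element as a frozenset (single words become one-element groups) and supplying synonym pairs explicitly are assumptions of this sketch; the source does not describe how synonymy is detected.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple


def merge_synonyms(words: List[str],
                   synonym_pairs: Iterable[Tuple[str, str]]) -> List[FrozenSet[str]]:
    """Group mutually synonymous matching words into subsets of the word set."""
    groups: List[Set[str]] = [{w} for w in words]
    for left, right in synonym_pairs:
        g_left = next((g for g in groups if left in g), None)
        g_right = next((g for g in groups if right in g), None)
        if g_left is None or g_right is None or g_left is g_right:
            continue  # unknown word, or the two words are already grouped
        g_left |= g_right
        groups = [g for g in groups if g is not g_right]
    return [frozenset(g) for g in groups]


merged_q = merge_synonyms(["a", "b", "c", "d", "e"], [("a", "b")])
```

The result keeps the original order of the matching words, with a and b sharing one subset and c, d, and e each in a subset of their own.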
As another example, the post-processing device also performs synonymous extension on the matching words and determines the synonyms obtained by the extension, together with the matching words, as subsets of the matching word set. Specifically, the post-processing device may perform synonymous extension on the matching words in the matching word set corresponding to the inquiry input information, for example extending "shortness of breath and palpitation" to its synonym "palpitation and short breath". The post-processing device then determines the synonyms obtained by the synonymous extension, together with the matching words, as subsets of the matching word set.
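Synonymous extension can be sketched as adding thesaurus entries to each subset of the matching word set. The thesaurus dictionary and the function name expand_synonyms are hypothetical; where the synonyms come from is not specified in the text.

```python
from typing import Dict, FrozenSet, List

# Hypothetical thesaurus mapping a word to its known synonyms.
thesaurus: Dict[str, List[str]] = {
    "shortness of breath and palpitation": ["palpitation and short breath"],
}


def expand_synonyms(groups: List[FrozenSet[str]],
                    synonyms: Dict[str, List[str]]) -> List[FrozenSet[str]]:
    """Add every known synonym of a matching word to that word's subset."""
    expanded: List[FrozenSet[str]] = []
    for group in groups:
        extra = {syn for word in group for syn in synonyms.get(word, [])}
        expanded.append(frozenset(group | extra))
    return expanded


expanded_q = expand_synonyms(
    [frozenset({"shortness of breath and palpitation"}), frozenset({"vomiting"})],
    thesaurus,
)
```

A word without a thesaurus entry keeps its subset unchanged, so the extension only ever grows the matching word set.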
Continuing the example, for the matching word set Q = {{a, b}, c, d, e} obtained after synonym merging, the post-processing device may also perform synonymous extension on the matching word set to obtain synonyms of the matching words a, b, c, d, and e, and determine the synonyms obtained, together with the matching words, as subsets of the matching word set. For example, after multiple synonymous extensions, the matching word set Q is expressed as follows:
The matching inquiry device 203 then carries out a matching inquiry, according to the matching word set, in the index established by the index establishing device 104, for example using the inverted index to obtain the candidate text information that contains words of the matching word set.
Suppose the index word set that hits the most matching words in the matching word set is denoted C; then C is:
Here, C denotes the set of words, semantically mapped to the corresponding positions, that gives the maximum synonymous hit on w1i.
The matching calculation unit then calculates the semantic matching degree between the candidate text information and the inquiry input information according to the matching word set and the index word set corresponding to the candidate text information.
The semantic matching degree between Q and C can be calculated by the following formula:
Here, the weight of each word is expressed as (log(TF) + 1) * log(N/DF), and Match(TQ, TC) indicates whether the themes of the index word set and the matching word set match.
Here, the value of Match(TQ, TC) can be defined as needed. For example, if the themes of the index word set and the matching word set match, Match(TQ, TC) takes the value 1; otherwise it takes the value 0.5.
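The two ingredients named above, the word weight (log(TF) + 1) * log(N/DF) and the theme factor Match(TQ, TC), can be written out directly. The sketch assumes that TF, DF, and N are the term frequency, the document frequency, and the total number of documents, and that the logarithm is the natural logarithm; neither is stated in the text. It does not attempt to reproduce how the full semantic matching degree combines these values.

```python
import math


def term_weight(tf: int, df: int, n_docs: int) -> float:
    """Word weight (log(TF) + 1) * log(N / DF), natural logarithm assumed."""
    if tf <= 0 or df <= 0 or n_docs <= 0:
        return 0.0  # undefined inputs are treated as zero weight (assumption)
    return (math.log(tf) + 1.0) * math.log(n_docs / df)


def theme_factor(query_theme: str, candidate_theme: str) -> float:
    """Match(TQ, TC): 1 when the themes match, otherwise 0.5."""
    return 1.0 if query_theme == candidate_theme else 0.5


# Illustrative counts only: a word occurring 3 times, in 10 of 1000 documents.
example_weight = term_weight(tf=3, df=10, n_docs=1000)
```

A word that occurs in every document receives weight 0 because log(N/DF) is 0, which is the usual effect of an inverse document frequency term.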
Then, if the calculated semantic matching degree is greater than the predetermined threshold, the text determination unit takes the text information corresponding to the index word set as the target text information matching the inquiry input information.
Fig. 4 shows a flowchart of a method for establishing an index based on text information according to another aspect of the present invention.
In step S401, the index establishing equipment 1 determines structured information from the text information. Specifically, in step S401, the index establishing equipment 1 obtains text information, for example by interacting with a data source such as encyclopedia data, and then structures the text information, for example by analyzing the directory information and subdirectory information it contains, to determine the structured information.
For example, in step S401, the index establishing equipment 1 interacts with encyclopedia data such as Baidu Baike and Hudong Baike to obtain encyclopedia-type resource knowledge as the text information. The index establishing equipment 1 then structures the text information, for example by analyzing the directories and subdirectories of each item of resource knowledge. For resource knowledge about a "disease", it analyzes the directory or subdirectory for its symptoms, the directory or subdirectory for its treatment methods, and so on.
As another example, in step S401, the index establishing equipment 1 mines resource knowledge from the Internet by data mining and uses it as the text information, and then structures the text information to determine the structured information. For example, in step S401, the index establishing equipment 1 mines vertical resource websites to obtain information such as diseases, descriptions of their symptoms, their treatment methods, and the hospitals that specialize in them. Each resource is organized with the disease as its ID. For instance, candidate seed words are first given by category (for diseases: coronary heart disease, myocarditis, gastritis, and so on), and the URLs of the top-ranked common websites are obtained from the search results. The structure of each website is analyzed, and information on coronary heart disease, its symptoms, its treatment methods, and the hospitals specializing in it is extracted. This information is integrated under coronary heart disease in the "disease" category, organized into a card for coronary heart disease, and stored. "Coronary heart disease" can then serve as the final text information, and the corresponding "symptoms of coronary heart disease", "treatment methods for coronary heart disease", "hospitals specializing in coronary heart disease", and similar information can serve as the structured information corresponding to that text information.
Those skilled in the art will understand that the above manner of determining structured information is only an example. Other existing or future manners of determining structured information, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S402, the index establishing equipment 1 extracts a descriptor from the structured information. Specifically, in step S402, the index establishing equipment 1 extracts the descriptor from the structured information determined in step S401, for example by means of a subject classifier or another predetermined manner of extracting descriptors.
Here, the purpose of extracting the descriptor is to extract from the text information the theme that represents it, in support of establishing the semantic index and of the subsequent semantic matching calculation.
Preferably, the method further includes step S405 (not shown). In step S405, the index establishing equipment 1 obtains, according to a predetermined theme system, a training corpus corresponding to the predetermined theme system, and trains a subject classifier on the training corpus. In step S402, the index establishing equipment 1 then extracts the descriptor from the structured information using the subject classifier.
Specifically, in step S405, the index establishing equipment 1 determines the predetermined theme system. For example, from statistics on the query sequences entered by a large number of web search users, it determines the common search needs of web search users and, in combination with classification systems currently in use, such as those of existing encyclopedia and Zhidao (question-and-answer) products, determines a subject classification system for which there is real demand and takes it as the predetermined theme system. The index establishing equipment 1 then obtains a training corpus corresponding to the predetermined theme system. For example, if an article carries the location identifier "medical treatment & health, internal medicine", its data is regarded as training corpus for the disease category. The index establishing equipment 1 then trains the subject classifier on the training corpus, for example by training an SVM classification model on the training corpus to serve as the subject classifier.
Then, in step S402, the index establishing equipment 1 extracts the descriptor from the structured information using the subject classifier trained in step S405. For example, the index establishing equipment 1 feeds the word "coronary heart disease" and its structured information, such as symptoms and treatment methods, into the subject classifier and obtains the theme "disease". As another example, for a newly added encyclopedia card, the index establishing equipment 1 feeds it into the subject classifier, such as the SVM classifier, to obtain the theme corresponding to the category of that encyclopedia card.
Preferably, in step S402, the index establishing equipment 1 may also perform synonymous expression extension on the extracted theme. For example, synonymous expression extension is applied to the theme "disease" to add a synonymous theme such as "illness".
Those skilled in the art will understand that the above manner of extracting descriptors is only an example. Other existing or future manners of extracting descriptors, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S403, the index establishing equipment 1 determines, according to the theme corresponding to the descriptor, the label word corresponding to the theme from the text information. Specifically, in step S403, the index establishing equipment 1 determines the label word corresponding to the theme from the text information according to the descriptor extracted in step S402 and the theme corresponding to that descriptor. For example, for text information whose theme is disease, the index establishing equipment 1 determines the following label words corresponding to the theme: palpitation and short breath, chest tightness, diarrhea, vomiting, weakness of limbs, and so on.
Preferably, step S403 further includes sub-step S403a (not shown), sub-step S403b (not shown), and sub-step S403c (not shown). Specifically, in sub-step S403a, the index establishing equipment 1 determines, according to the theme corresponding to the descriptor, at least one candidate label word corresponding to the theme from the text information. For example, in sub-step S403a, the index establishing equipment 1 computes unigram, bigram, and trigram word statistics over all page data organized by entry, and extracts the words that appear in more than a certain number of pages as candidate label words.
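The unigram, bigram, and trigram statistics of sub-step S403a can be sketched as counting, for every n-gram, the number of pages in which it appears. The pre-tokenized pages, the function name candidate_label_words, and the threshold value are assumptions made for the example.

```python
from collections import Counter
from typing import List, Set


def candidate_label_words(pages: List[List[str]], min_pages: int,
                          max_n: int = 3) -> List[str]:
    """Return the n-grams (n = 1..max_n) that appear in more than min_pages pages."""
    page_frequency: Counter = Counter()
    for tokens in pages:
        grams: Set[str] = set()  # count each n-gram at most once per page
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                grams.add(" ".join(tokens[i:i + n]))
        page_frequency.update(grams)
    return sorted(g for g, count in page_frequency.items() if count > min_pages)


# Hypothetical, already segmented page data.
pages = [
    ["chest", "tightness", "vomiting"],
    ["vomiting", "diarrhea"],
    ["chest", "tightness"],
]
candidates = candidate_label_words(pages, min_pages=1)
```

Counting pages rather than raw occurrences keeps a word that is repeated many times on a single page from being promoted to a candidate label word.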
Then, in sub-step S403b, the index establishing equipment 1 determines the corresponding centre word according to the at least one candidate label word. In sub-step S403c, the index establishing equipment 1 determines the label word corresponding to the theme according to the distance between the at least one candidate label word and the centre word.
For example, in sub-step S403b, the index establishing equipment 1 merges all candidate label words according to the label data counted above and computes offline statistics for them. The statistics are obtained from large-scale text, such as data from the whole web, by counting how often the words co-occur in the same document. For any two candidate label words, the similarity between them is calculated according to the following formula:
[formula image not reproduced]
Here, PMI(w', w1) denotes the mutual information score between w' and w1, which is defined as
[formula image not reproduced]
where P(w) denotes the statistically estimated probability of the word w.
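Because the formulas are not reproduced in this text, the following sketch assumes the standard definition of pointwise mutual information, PMI(x, y) = log(P(x, y) / (P(x) P(y))), with probabilities estimated from document-level co-occurrence. The function name `pmi_table` is hypothetical, and the patent's own similarity formula may combine PMI scores differently.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(documents):
    """Map each co-occurring word pair to its PMI score.

    documents: list of token lists. Probabilities are document frequencies
    divided by the number of documents (an assumed estimator).
    """
    n_docs = len(documents)
    word_df = Counter()
    pair_df = Counter()
    for doc in documents:
        words = set(doc)
        word_df.update(words)
        pair_df.update(frozenset(p) for p in combinations(sorted(words), 2))
    table = {}
    for pair, joint in pair_df.items():
        a, b = tuple(pair)
        p_joint = joint / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        table[pair] = math.log(p_joint / (p_a * p_b))
    return table
```

Pairs that never co-occur are simply absent from the table, which avoids taking the logarithm of zero.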
Then, in sub-step S403b, the index establishing equipment 1 determines, according to the theme, which fields of the text information need to be analysed, for example the symptom section for a disease, the poem itself and its annotations for a poem, or the descriptive section for a person. From those fields it extracts all words that appear among the candidate label words, together with their synonyms, and groups these words into a centre, which serves as the centre word corresponding to the at least one candidate label word.
Then, in sub-step S403c, the index establishing equipment 1 calculates the distance between each of the at least one candidate label word and the centre word. For example, letting T denote the centre word, the distance between a candidate label word and the centre word can be calculated by the following formula:
[formula image not reproduced]
Here, Num(T) denotes the number of words contained in the centre word.
The index establishing equipment 1 then determines the label words corresponding to the theme according to these distances, for example by taking the candidate label words whose distance to the centre word is less than a predetermined threshold as the label words corresponding to the theme.
Preferably, as shown in Fig. 3, in sub-step S403c the index establishing equipment 1 arranges the candidate label words as a sequence ordered by rank, with each candidate's distance to the centre word plotted against its rank. If the slope between adjacent ranks exceeds a predetermined slope threshold, the subsequent nodes are cut off, as between the 5th and 6th ranked points in Fig. 3.
Here, the slope threshold is set empirically, for example from statistics on the overall distribution of scores.
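The slope-based cut-off can be sketched as below. This is a minimal illustration that takes the distances as given inputs, since the distance formula is not reproduced in this text. The function name `cut_by_slope` and the use of the difference between adjacent distances as the slope (rank step of 1) are assumptions.

```python
def cut_by_slope(ranked_distances, slope_threshold):
    """Keep candidates until the distance jumps by more than slope_threshold.

    ranked_distances: list of (word, distance) pairs sorted by ascending distance.
    The slope between adjacent ranks is the difference of their distances.
    """
    kept = ranked_distances[:1]
    for previous, current in zip(ranked_distances, ranked_distances[1:]):
        if current[1] - previous[1] > slope_threshold:
            break  # everything after the jump is cut off
        kept.append(current)
    return [word for word, _ in kept]
```

With seven ranked candidates whose distance jumps sharply between the 5th and 6th, only the first five are retained, mirroring the situation described for Fig. 3.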
Those skilled in the art will understand that the above manner of determining label words is only an example; other existing or future manners of determining label words, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
More preferably, in sub-step S403b, the index establishing equipment 1 filters the at least one candidate label word according to a predetermined filtering rule to obtain at least one filtered candidate label word, and determines the centre word from the at least one filtered candidate label word. The predetermined filtering rule is determined based on at least any one of the following:
the part of speech of the at least one candidate label word;
the word-formation rules of the at least one candidate label word;
the co-occurrence ratio of the at least one candidate label word and the theme.
Specifically, noise may be introduced while the candidate label words are being counted, so the candidate label words need to be filtered. In sub-step S403b, the index establishing equipment 1 filters the at least one candidate label word according to the predetermined filtering rule to obtain at least one filtered candidate label word.
For example, in sub-step S403b, the index establishing equipment 1 filters the at least one candidate label word according to its part of speech, for instance by applying head-word and tail-word filtering.
As another example, in sub-step S403b, the index establishing equipment 1 filters the at least one candidate label word according to its word-formation rules; for instance, the first character of a candidate label word cannot be a function word such as "do", "by" or "than", and the last character cannot be a word such as "when", "to" or "get".
As another example, in sub-step S403b, the index establishing equipment 1 filters the at least one candidate label word according to its co-occurrence ratio with the theme; for instance, it counts, in search logs and in titles from the whole web, the co-occurrence ratio of each candidate label word and the theme, and retains only the candidates that have co-occurred with the theme, or only those whose co-occurrence ratio with the theme is greater than a predetermined threshold.
Preferably, in sub-step S403b, the index establishing equipment 1 combines any two of the above predetermined filtering rules, or considers all three together, when filtering the at least one candidate label word.
Then, in sub-step S403b, the index establishing equipment 1 determines the centre word from the at least one filtered candidate label word.
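A hedged sketch of how the word-formation and co-occurrence rules could be combined is given below. The function name `filter_candidates`, the whitespace tokenization and the sample stop lists are illustrative assumptions; part-of-speech filtering is omitted because it would require a tagger that the text does not specify.

```python
def filter_candidates(candidates, cooccurrence, bad_heads, bad_tails, min_ratio=0.0):
    """Drop candidates with a forbidden first or last token, or with a
    theme co-occurrence ratio that is not greater than min_ratio.

    cooccurrence: {candidate: co-occurrence ratio with the theme} (assumed precomputed).
    """
    kept = []
    for word in candidates:
        tokens = word.split()
        if not tokens:
            continue
        if tokens[0] in bad_heads or tokens[-1] in bad_tails:
            continue  # word-formation rule
        if cooccurrence.get(word, 0.0) <= min_ratio:
            continue  # never (or too rarely) co-occurs with the theme
        kept.append(word)
    return kept
```

With the default `min_ratio` of 0.0 the function keeps any candidate that has co-occurred with the theme at all; raising it corresponds to the predetermined threshold variant.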
Those skilled in the art will understand that the above predetermined filtering rules are only examples; other existing or future predetermined filtering rules, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
In step S404, the index establishing equipment 1 establishes an index for the descriptor and the label words. Specifically, in step S404, the index establishing equipment 1 establishes an index for the descriptor extracted in step S402 and the label words determined in step S403.
For example, suppose the document corresponding to coronary heart disease is ID1 and the importance of a word x in that document is WC1(x), where x may be "disease", "palpitation and short breath" and so on; the document corresponding to myocarditis is ID2, the document corresponding to gastritis is ID3, and the document corresponding to apoplexy is ID4. In step S404, the index establishing equipment 1 establishes a unified inverted index for the descriptors and label words as follows:
Disease - ID1(WC1(x)), ID2(WC2(x)), ID3(WC3(x)), ID4(WC4(x))
Palpitation and short breath - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Shortness of breath and palpitation - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Vomiting - ID3(WC3(x)), ID4(WC4(x))
Spitting - ID3(WC3(x)), ID4(WC4(x))
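The unified inverted index listed above can be illustrated with a small sketch. The function name `build_inverted_index`, the dictionary layout and the numeric importance values are hypothetical; the text does not say how WC(x) is computed.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Build {term: [(doc_id, importance), ...]} from {doc_id: {term: importance}}.

    Descriptors and label words share one index, as in the unified index above.
    """
    index = defaultdict(list)
    for doc_id, terms in documents.items():
        for term, weight in terms.items():
            index[term].append((doc_id, weight))
    for postings in index.values():
        # Most important documents first within each posting list.
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return dict(index)
```

For instance, feeding in a document ID1 with the terms "disease" and "palpitation and short breath" and a document ID3 with the terms "disease" and "vomiting" yields a posting list for "disease" that contains both documents.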
Preferably, the method further includes step S406 (not shown). In step S406, if the label words include multiple semantically consistent label words, the index establishing equipment 1 determines a normalization result for those label words; in step S404, the index establishing equipment 1 then establishes an index for the descriptor, the label words and the normalization result.
Specifically, the label words corresponding to the descriptor "disease" may include multiple semantically consistent label words, such as "spitting" and "nausea and vomiting". In step S406, the index establishing equipment 1 determines that the normalization result of these two label words is "vomiting"; then, in step S404, it establishes an index for the descriptor "disease", the label words "spitting" and "nausea and vomiting", and the normalization result "vomiting".
Those skilled in the art will understand that the above manner of establishing an index is only an example; other existing or future manners of establishing an index, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
In general, indexes are established only for keywords. Here, the index establishing equipment 1 also establishes an index for descriptors, label words and their normalization results, so that the user's query input information can be better matched with resource knowledge.
Preferably, the steps performed by the index establishing equipment 1 operate continuously. Specifically, in step S401, the index establishing equipment 1 determines structured information from the text information; in step S402, it extracts a descriptor from the structured information; in step S403, it determines, from the text information, the label words corresponding to the theme of the descriptor; in step S404, it establishes an index for the descriptor and the label words. Here, those skilled in the art will understand that "continuously" means that the steps of the index establishing equipment 1 respectively carry out the determination of structured information, the extraction of descriptors, the determination of label words and the establishment of the index according to a set or real-time adjusted operating mode, until the index establishing equipment 1 stops determining structured information for a long period of time.
Here, the index establishing equipment 1 determines structured information from the text information, extracts a descriptor from the structured information, determines from the text information the label words corresponding to the theme of the descriptor, and establishes an index for the descriptor and the label words. Based on encyclopaedia-type resource knowledge or other resource knowledge obtained through web mining, the index establishing equipment extracts themes and titles from it and forms an effective description of the content of the resource knowledge. This presents such high-quality resource knowledge better, makes subsequent semantic search over it more efficient, satisfies complex descriptive search needs that users cannot express accurately with keywords, and improves the user experience.
Fig. 5 shows a flowchart of a method, according to another aspect of the present invention, for matching a user's query input information based on an index established from text information.
In step S501, the matching unit 2 obtains the query input information entered by the user. Specifically, the user enters query input information by interacting with a user equipment; in step S501, the matching unit 2 obtains that query input information by calling an application programming interface (API) provided by the user equipment, by using a dynamic page technology such as JSP, ASP or PHP, or through another agreed communication mode.
Here, the query input information includes, but is not limited to, query input information submitted by the user through different input modes such as text input, voice input and image input.
Those skilled in the art will understand that the above manner of obtaining query input information is only an example; other existing or future manners of obtaining query input information, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
In step S502, the matching unit 2 performs theme and label analysis on the query input information to obtain the descriptor and label words corresponding to it. Specifically, in step S502, the matching unit 2 performs theme and label analysis on the query input information obtained in step S501; for example, it feeds the query input information into the previously trained theme classifier to obtain the corresponding descriptor, and performs label analysis on the query input information to obtain the corresponding label words. Here, the manner in which the matching unit 2 performs label analysis on the query input information in step S502 is the same as or similar to the manner in which the index establishing equipment 1 determines the label words of the text information in step S403; it is therefore not repeated here and is incorporated by reference.
In step S503, the matching unit 2 performs a matching query, according to the descriptor and label words, in the index established by the index establishing equipment 1, to obtain candidate text information matching the query input information. Specifically, in step S503, the matching unit 2 performs a matching query in that index according to the query input information obtained in step S501, for example by full or partial matching, and takes the text information that hits the descriptor corresponding to the query input information, or that hits the label words corresponding to it, as the candidate text information matching the query input information.
For example, suppose the query input information entered by the user is "palpitation and short breath". In step S501, the matching unit 2 obtains this query input information; in step S502, the matching unit 2 performs label analysis on it and obtains the label word "palpitation and short breath". The index established by the index establishing equipment 1 for this label word is as follows:
Palpitation and short breath - ID1(WC1(x)), ID2(WC2(x)), ID4(WC4(x))
Here, ID1, ID2 and ID4 are the ID numbers of the text information containing the label word "palpitation and short breath", and WC1(x), WC2(x) and WC4(x) are the importance of that label word in the respective text information.
Then, in step S503, the matching unit 2 performs a matching query in the index established by the index establishing equipment 1 according to the label word "palpitation and short breath" corresponding to the user's query input information and, based on the above index, obtains the candidate text information corresponding to the query input information "palpitation and short breath": text information ID1, ID2 and ID4.
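The lookup in this example can be sketched as a simple exact-match query against the inverted index. The function name `find_candidates`, the index layout and the importance values are illustrative assumptions; partial matching is not shown.

```python
def find_candidates(index, descriptors, label_words):
    """Return {doc_id: best importance} for documents hit by any descriptor
    or label word. index maps a term to a list of (doc_id, importance) pairs.
    """
    candidates = {}
    for term in list(descriptors) + list(label_words):
        for doc_id, weight in index.get(term, []):
            candidates[doc_id] = max(weight, candidates.get(doc_id, 0.0))
    return candidates
```

Querying with the single label word "palpitation and short breath" returns the documents in its posting list, which corresponds to the candidate text information ID1, ID2 and ID4 in the example.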
Those skilled in the art will understand that the above manner of performing a matching query is only an example; other existing or future manners of performing a matching query, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
In step S504, the matching unit 2 determines the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information.
Specifically, there is a certain semantic matching degree between the candidate text information and the query input information. It can be obtained by calculation, or further by calculating the matching degree between the index word set corresponding to the candidate text information and the matching word set corresponding to the query input information. In step S504, the matching unit 2 determines the target text information matching the query input information according to this semantic matching degree, for example by taking the candidate text information with the highest semantic matching degree as the target text information, or by taking the candidate text information whose semantic matching degree is greater than a predetermined matching degree threshold as the target text information.
Here, the predetermined matching degree threshold is the semantic matching degree used to judge whether candidate text information matches the query input information; its value may be fixed in advance or adjusted according to the actual situation.
Preferably, step S504 further includes sub-step S504a (not shown) and sub-step S504b (not shown). In sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the query input information; in sub-step S504b, the matching unit 2 determines, according to the semantic matching degree and in combination with the predetermined matching degree threshold, the target text information matching the query input information.
For example, in sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the user's query input information using an existing matching degree calculation method; when the semantic matching degree is greater than the predetermined matching degree threshold, the matching unit 2 takes the candidate text information as the target text information matching the query input information in sub-step S504b.
Preferably, in step S504, the matching unit 2 also determines the target text information corresponding to the query input information according to the index word set corresponding to the candidate text information and the matching word set corresponding to the query input information. Specifically, candidate text information has a corresponding index word set. Suppose, as in the example above, that the theme corresponding to candidate text information ID1 is "coronary heart disease" and its index words include "disease", "palpitation and short breath" and so on; the set formed by these index words is the index word set corresponding to candidate text information ID1. The user's query input information likewise has a corresponding matching word set. For example, matching words are obtained by performing word segmentation on the query input information, and the set formed by these matching words is taken as the matching word set corresponding to the query input information. Suppose the query input information entered by the user is "palpitation and short breath vomiting"; after the matching unit 2 performs word segmentation on it, the matching words "palpitation and short breath" and "vomiting" are obtained, and the set formed by these two matching words is the matching word set corresponding to the query input information. In step S504, the matching unit 2 determines the target text information matching the user's query input information according to the index word set and the matching word set, for example by taking the text information whose index word set hits the most matching words in the matching word set as the target text information, or by taking the text information whose index word set hits more matching words than a predetermined quantity threshold as the target text information.
For example, for candidate text information ID1, ID2 and ID4 in the example above, the index word set corresponding to ID1 includes the index words "disease" and "palpitation and short breath"; the index word set corresponding to ID2 includes "palpitation and short breath", "vomiting" and "disease"; and the index word set corresponding to ID4 includes "palpitation and short breath". For the query input information "palpitation and short breath vomiting" entered by the user, the matching words are "palpitation and short breath" and "vomiting". The index word set corresponding to ID2 hits the most matching words in the matching word set, so candidate text information ID2 is taken as the target text information that best matches the query input information. Alternatively, suppose the predetermined quantity threshold is 0; the index word sets corresponding to ID1, ID2 and ID4 each hit more matching words than this threshold, so ID1, ID2 and ID4 are all taken as target text information matching the query input information. When the matching unit 2 provides them to the user, it may sort them by the importance of the corresponding index words in the candidate text information.
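The hit-count selection in this example can be reproduced with a short sketch. The function names `hit_counts` and `targets_by_hits` are hypothetical, and exact string equality is assumed as the notion of a "hit".

```python
def hit_counts(matching_words, index_word_sets):
    """Count, for each document, how many matching words its index word set hits."""
    query = set(matching_words)
    return {doc_id: len(query & set(words)) for doc_id, words in index_word_sets.items()}

def targets_by_hits(matching_words, index_word_sets, quantity_threshold=None):
    """Select targets by most hits, or by hits above quantity_threshold if given."""
    counts = hit_counts(matching_words, index_word_sets)
    if not counts:
        return []
    if quantity_threshold is None:
        best = max(counts.values())
        return sorted(doc_id for doc_id, count in counts.items() if count == best)
    return sorted(doc_id for doc_id, count in counts.items() if count > quantity_threshold)
```

With the index word sets of ID1, ID2 and ID4 given above, the most-hits rule selects ID2, while a quantity threshold of 0 selects all three documents.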
Those skilled in the art will understand that the above manner of determining target text information is only an example; other existing or future manners of determining target text information, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Preferably, the steps performed by the matching unit 2 operate continuously. Specifically, in step S501, the matching unit 2 obtains the query input information entered by the user; in step S502, it performs theme and label analysis on the query input information to obtain the corresponding descriptor and label words; in step S503, it performs a matching query, according to the descriptor and label words, in the index established by the index establishing equipment 1 to obtain candidate text information matching the query input information; in step S504, it determines the target text information matching the query input information according to the semantic matching degree between the candidate text information and the query input information. Here, those skilled in the art will understand that "continuously" means that the steps of the matching unit 2 respectively carry out the acquisition of query input information, the theme and label analysis, the matching query for candidate text information and the determination of target text information according to a set or real-time adjusted operating mode, until the matching unit 2 stops obtaining query input information entered by users for a long period of time.
Here, the steps of the index establishing equipment 1 and the matching unit 2 cooperate with one another so that the corresponding target text information is obtained by matching based on the query input information entered by the user. Based on encyclopaedia-type resource knowledge or other resource knowledge obtained through web mining, themes and titles are extracted and an effective description of the content of the resource knowledge is formed. This presents such high-quality resource knowledge better, makes semantic search over it more efficient, satisfies complex descriptive search needs that users cannot express accurately with keywords, and improves the user experience.
Preferably, the descriptor and the label words may also be regarded as two different domains, corresponding respectively to a subject area and a label field. In step S503, the matching unit 2 performs matching queries, according to the descriptor and the label words, in the parts of the aforementioned index corresponding to the subject area and the label field respectively, to obtain candidate text information matching the query input information.
Specifically, in step S503, the matching unit 2 uses per-domain matching: according to the descriptor and label words obtained by analysing the user's query input information in step S502, it performs matching queries in the indexes corresponding to the subject area and the label field respectively, to obtain the candidate text information.
Here, the subject area and the label field can be obtained by analysing the query input information; for example, the query input information entered by the user is analysed with the aforementioned theme classifier to obtain the theme category.
Here, the indexes corresponding to the subject area and the label field are the index established by the index establishing equipment 1. Label words are extracted from the user's query input information according to the previously established labels: a word is extracted if it is contained in the query input information and also appears in the label set. The label words and the theme category are then used to retrieve candidates from the inverted entries of the unified theme-and-label index, and the documents containing the theme category or a label are taken as the candidate text information corresponding to the query input information and participate in subsequent calculation.
Preferably, in step S503, the matching unit 2 may also take into account the weights corresponding to the subject area and the label field: it performs matching queries in the corresponding indexes, combines the weights of the two domains, and finally obtains the candidate text information.
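One possible reading of the per-domain weighting is sketched below. The additive scoring scheme, the default weights 0.6 and 0.4 and the function name `weighted_candidates` are all assumptions made for illustration; the text does not state how the two weights are combined.

```python
def weighted_candidates(subject_index, label_index, descriptor, label_words,
                        subject_weight=0.6, label_weight=0.4):
    """Rank candidate documents by a weighted count of subject and label hits.

    subject_index / label_index: {term: [doc_id, ...]} for each domain.
    Returns doc ids ordered by descending score, ties broken by doc id.
    """
    scores = {}
    for doc_id in subject_index.get(descriptor, ()):
        scores[doc_id] = scores.get(doc_id, 0.0) + subject_weight
    for word in label_words:
        for doc_id in label_index.get(word, ()):
            scores[doc_id] = scores.get(doc_id, 0.0) + label_weight
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A document that hits the descriptor and several label words is ranked ahead of one that hits the descriptor and a single label word.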
Preferably, in step S504, the matching unit 2 determines a target index word set among the index word sets corresponding to the candidate text information according to the matching words included in the matching word set, the target index word set being the one that hits the most matching words in the matching word set; if the similarity between the target index word set and the matching word set is greater than a predetermined threshold, the text information corresponding to the target index word set is taken as the target text information matching the query input information.
Specifically, in step S504, the matching unit 2 counts, for the index word set of each candidate text information, the number of matching words it hits in the matching word set, and takes the index word set with the most hits as the target index word set. The matching unit 2 then calculates the similarity between the target index word set and the matching word set, for example by calculating the similarity between each hit index word and its corresponding matching word and then combining these by simple addition, weighted averaging or the like. When the similarity is greater than the predetermined threshold, the matching unit 2 takes the text information corresponding to the target index word set as the target text information matching the query input information.
Here, the predetermined threshold is the similarity threshold used to judge, from the similarity between the target index word set and the matching word set, whether the text information corresponding to the target index word set is taken as the target text information; its value may be fixed or adjusted according to the actual situation.
Preferably, the method further includes step S505 (not shown). In step S505, the matching unit 2 performs word segmentation on the query input information to obtain segmented words, and merges the segmented words with the descriptor and label words obtained in step S502 to obtain the matching word set corresponding to the query input information, the words included in the matching word set serving as matching words. Then, in sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
Specifically, in step S505, the matching unit 2 performs word segmentation on the query input information obtained in step S501 to obtain segmented words; preferably, it may also apply filtering such as stop-word removal to the segmented words to obtain the final segmented words. The matching unit 2 then merges the segmented words with the descriptor and label words obtained in step S502, removes redundancy and so on, to finally obtain the matching word set corresponding to the query input information; the words in this set serve as the matching words corresponding to the query input information.
Then, in sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
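The construction of the matching word set in step S505 might look like the sketch below. Whitespace splitting stands in for a real word segmenter, and the function name `build_matching_word_set` and the sample stop words are hypothetical.

```python
def build_matching_word_set(query, descriptors, label_words, stop_words=frozenset()):
    """Segment the query, drop stop words, merge with descriptors and label
    words, and remove duplicates while preserving first-seen order.

    Whitespace splitting is an assumed placeholder for word segmentation.
    """
    tokens = [token for token in query.lower().split() if token not in stop_words]
    merged = []
    for word in tokens + list(descriptors) + list(label_words):
        if word not in merged:  # redundancy removal
            merged.append(word)
    return merged
```

Keeping insertion order makes the result deterministic, which is convenient when the matching words are later grouped or expanded.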
More preferably, the method further includes step S506 (not shown). In step S506, the matching unit 2 performs subsequent processing on the matching words to update the matching word set, the subsequent processing including at least any one of the following:
determining the mutually synonymous matching words included in the matching words, and merging the mutually synonymous matching words into a subset of the matching word set;
performing synonym expansion on the matching words, and making the synonyms obtained by the expansion, together with the matching word, a subset of the matching word set.
Specifically, in step S506, the matching unit 2 performs subsequent processing on the matching words in the matching word set determined in step S505 to update the matching word set. For example, since the matching words may include mutually synonymous matching words, such as "vomiting" and "spitting", the matching unit 2 determines these in step S506 and merges them into a subset of the matching word set.
For example, assume that the query input information entered by the user is Q. In step S505, the matching unit 2 performs word segmentation on the query input information and, after filtering such as stop-word removal, the matching word set is represented in the label field as Q = {a, b, c, d, e}, where a, b, c, d and e are the matching words included in the matching word set. Assuming that the matching words a and b are mutually synonymous, in step S506 the matching unit 2 merges a and b into a subset of the matching word set, so that the updated matching word set is expressed as Q = {{a, b}, c, d, e}. The subsequent steps, such as step S503, then carry out the subsequent matching query operation.
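The merging of mutually synonymous matching words can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `merge_synonyms`, the representation of synonym knowledge as word pairs, and the use of a union-find structure are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def merge_synonyms(
    words: List[str], synonym_pairs: Iterable[Tuple[str, str]]
) -> List[FrozenSet[str]]:
    """Group matching words so that mutually synonymous words share one subset.

    `synonym_pairs` is a hypothetical source of synonym knowledge; the text
    does not say where synonym relations come from.
    """
    parent: Dict[str, str] = {w: w for w in words}

    def find(w: str) -> str:
        # Follow parent links to the representative, compressing the path.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for x, y in synonym_pairs:
        if x in parent and y in parent:
            parent[find(y)] = find(x)

    groups: Dict[str, List[str]] = {}
    for w in words:  # input order keeps the output order stable
        groups.setdefault(find(w), []).append(w)
    return [frozenset(g) for g in groups.values()]


# Q = {a, b, c, d, e} with a and b synonymous becomes {{a, b}, c, d, e}.
Q = merge_synonyms(["a", "b", "c", "d", "e"], [("a", "b")])
```

Words without synonyms are kept as single-element subsets here so that every element of the updated set has the same type, which is a convenience of the sketch and not something the text prescribes.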
As another example, in step S506, the matching unit 2 also performs synonym expansion on the matching words, and determines the synonyms obtained by the expansion, together with the matching words, as a subset of the matching word set. Specifically, in step S506, the matching unit 2 may also perform synonym expansion on the matching words in the matching word set corresponding to the query input information, for example expanding "shortness of breath and palpitation" to its synonym "palpitation and short breath"; the matching unit 2 then determines the synonym obtained by the expansion and the matching word as a subset of the matching word set.
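Synonym expansion can be sketched in the same spirit. The `thesaurus` dictionary and the function name `expand_synonyms` are hypothetical; the text does not specify how synonyms are looked up.

```python
from typing import Dict, FrozenSet, Iterable, List


def expand_synonyms(
    groups: Iterable[FrozenSet[str]], thesaurus: Dict[str, List[str]]
) -> List[FrozenSet[str]]:
    """Add the known synonyms of every matching word to that word's subset."""
    expanded = []
    for group in groups:
        new_group = set(group)
        for word in group:
            # Words missing from the thesaurus contribute nothing.
            new_group.update(thesaurus.get(word, ()))
        expanded.append(frozenset(new_group))
    return expanded


# Hypothetical thesaurus built from the example in the text.
thesaurus = {
    "shortness of breath and palpitation": ["palpitation and short breath"],
}
expanded = expand_synonyms(
    [frozenset({"shortness of breath and palpitation"})], thesaurus
)
```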
Continuing the example above, for the matching word set Q = {{a, b}, c, d, e} obtained after synonym merging, in step S506 the matching unit 2 may also perform synonym expansion on the matching word set, obtaining synonyms of the matching words a, b, c, d and e, and determine the synonyms obtained by the expansion, together with the matching words, as subsets of the matching word set. For example, after multiple rounds of synonym expansion, the matching word set Q is expressed as follows:
Then, in step S503, the matching unit 2 performs a matching query, according to the matching word set, in the index established by the index establishing equipment 1, for example retrieving through the inverted index the candidate text information that contains the matching words.
Assume that the index word set that hits the most matching words in the matching word set is denoted C; then C is:
Here, C denotes the set of words semantically mapped to the positions corresponding to the words w1i under the maximal synonym hit.
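The candidate lookup and the choice of the candidate that hits the most matching words can be illustrated with a toy inverted index. This is a sketch under assumptions: the document identifiers, the sample words, and the rule that a synonym subset counts as one hit when any of its members occurs in a document are all invented for the example.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set


def build_inverted_index(docs: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each index word to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return index


def count_group_hits(
    index: Dict[str, Set[str]], groups: List[FrozenSet[str]]
) -> Dict[str, int]:
    """Count, per candidate document, how many synonym subsets it hits."""
    hits: Dict[str, int] = defaultdict(int)
    for group in groups:
        docs_for_group: Set[str] = set()
        for word in group:
            docs_for_group |= index.get(word, set())
        for doc_id in docs_for_group:
            hits[doc_id] += 1  # one hit per subset, however many members match
    return dict(hits)


docs = {
    "doc1": ["vomiting", "fever"],
    "doc2": ["cough"],
    "doc3": ["spitting", "fever", "cough"],
}
index = build_inverted_index(docs)
groups = [frozenset({"vomiting", "spitting"}), frozenset({"fever"}), frozenset({"cough"})]
hits = count_group_hits(index, groups)
best = max(hits, key=hits.get)  # candidate hitting the most subsets
```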
Then, in sub-step S504a, the matching unit 2 calculates the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
The semantic matching degree between Q and C can be calculated by the following formula:
Here, the weight of each word is expressed as (log(TF) + 1) * log(N/DF), and Match(TQ, TC) indicates whether the index word set and the matching word set match in theme.
The value of Match(TQ, TC) can be defined as required; for example, if the index word set and the matching word set match in theme, Match(TQ, TC) takes the value 1, and otherwise 0.5.
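The two ingredients named above, the word weight and the theme-match factor, can be written out directly. The sketch below assumes natural logarithms and a simple equality test on theme labels; neither is stated in the text, and the overall matching formula is deliberately not reproduced because it is not recoverable from the text.

```python
import math


def term_weight(tf: int, df: int, n_docs: int) -> float:
    """Word weight (log(TF) + 1) * log(N / DF).

    tf: term frequency of the word, df: number of documents containing it,
    n_docs: total number of documents N. Natural log is an assumption.
    """
    if tf <= 0 or df <= 0:
        return 0.0  # a word that never occurs carries no weight
    return (math.log(tf) + 1.0) * math.log(n_docs / df)


def theme_match(query_theme: str, candidate_theme: str) -> float:
    """Match(TQ, TC): 1 when the themes match, otherwise 0.5 (values from the text)."""
    return 1.0 if query_theme == candidate_theme else 0.5


w = term_weight(tf=1, df=10, n_docs=1000)  # (0 + 1) * log(100)
```

A word that occurs in every document gets weight 0 under this expression, because log(N/DF) is then log(1).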
Then, assuming that the calculated semantic matching degree is greater than a predetermined threshold, in sub-step S504b the matching unit 2 takes the text information corresponding to the index word set as the target text information matching the query input information.
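The threshold test of sub-step S504b amounts to a filter over the scored candidates. In the sketch below the scores, the threshold value and the descending sort are illustrative assumptions; the text only requires that the matching degree exceed a predetermined threshold.

```python
from typing import Dict, List


def select_targets(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the ids of candidates whose matching degree exceeds the threshold."""
    kept = [(doc_id, s) for doc_id, s in scores.items() if s > threshold]
    # Sorting best-first is a convenience of this sketch, not part of the text.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in kept]


# Hypothetical matching degrees for three candidate texts.
targets = select_targets({"doc1": 0.42, "doc2": 0.91, "doc3": 0.77}, threshold=0.6)
```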
It should be noted that the present invention may be implemented in software and/or a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present invention (including related data structures) may be stored in a computer-readable recording medium, for example a RAM memory, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the present invention may be implemented in hardware, for example as a circuit that cooperates with a processor to perform the respective steps or functions.
In addition, part of the present invention may be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted by a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the present invention includes an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to run the methods and/or technical solutions based on the foregoing embodiments of the present invention.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be regarded in every respect as illustrative and not restrictive, the scope of the present invention being defined by the appended claims rather than by the above description; all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced in the present invention. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Claims (17)
1. A method for establishing an index based on text information, wherein the method comprises the following steps:
A. analyzing directory information and/or subdirectory information included in the text information, and structuring the acquired text information to determine structured information therefrom;
B. extracting a descriptor from the structured information;
C1. determining, from the text information and according to the theme corresponding to the descriptor, at least one candidate label word corresponding to the theme;
C2. determining a corresponding centre word according to the at least one candidate label word;
C3. determining a label word corresponding to the theme according to the distance between the at least one candidate label word and the centre word;
D. establishing an index for the descriptor and the label word.
2. The method according to claim 1, wherein the method further comprises:
obtaining, according to a predetermined theme system, a training corpus corresponding to the predetermined theme system;
training a theme classifier according to the training corpus;
wherein step B comprises:
extracting the descriptor from the structured information according to the theme classifier.
3. The method according to claim 1, wherein step C2 comprises:
filtering the at least one candidate label word according to a predetermined filtering rule, to obtain at least one filtered candidate label word;
determining the centre word according to the at least one filtered candidate label word;
wherein the predetermined filtering rule is determined based on at least any one of the following:
the part of speech of the at least one candidate label word;
the word rule of the at least one candidate label word;
the co-occurrence ratio of the at least one candidate label word with the theme.
4. The method according to any one of claims 1 to 3, wherein the method further comprises:
if the label words include a plurality of semantically consistent label words, determining a normalization result for the plurality of semantically consistent label words;
wherein step D comprises:
establishing an index for the descriptor, the label word and the normalization result.
5. A method for matching a user's query input information against an index established according to claim 1, wherein the method comprises the following steps:
a. obtaining the query input information entered by a user;
b. performing theme and label analysis on the query input information, to obtain the descriptor and label word corresponding to the query input information;
c. performing a matching query, according to the descriptor and label word, in the index established according to claim 1, to obtain candidate text information matching the query input information;
d. determining, according to the semantic matching degree between the candidate text information and the query input information, target text information matching the query input information.
6. The method according to claim 5, wherein step d comprises:
d1. calculating the semantic matching degree between the candidate text information and the query input information;
d2. determining, according to the semantic matching degree and in combination with a predetermined matching-degree threshold, the target text information matching the query input information.
7. The method according to claim 6, wherein the method further comprises:
performing word segmentation on the query input information, to obtain the segmented words;
merging the segmented words with the descriptor and label word obtained in step b, to obtain a matching word set corresponding to the query input information, wherein the words included in the matching word set serve as matching words;
wherein step d1 comprises:
calculating the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
8. The method according to claim 7, wherein the method further comprises:
performing subsequent processing on the matching words, to update the matching word set;
wherein the subsequent processing includes at least any one of the following:
determining the mutually synonymous matching words among the matching words, and merging the mutually synonymous matching words into a subset of the matching word set;
performing synonym expansion on the matching words, and determining the synonyms obtained by the expansion, together with the matching words, as a subset of the matching word set.
9. Index establishing equipment for establishing an index based on text information, wherein the equipment comprises:
an information determining device, configured to analyze directory information and/or subdirectory information included in the text information, and to structure the acquired text information to determine structured information therefrom;
a theme extraction device, configured to extract a descriptor from the structured information;
a label determining device, comprising:
a candidate determination unit, configured to determine, from the text information and according to the theme corresponding to the descriptor, at least one candidate label word corresponding to the theme;
a centre word determination unit, configured to determine a corresponding centre word according to the at least one candidate label word;
a label determination unit, configured to determine a label word corresponding to the theme according to the distance between the at least one candidate label word and the centre word;
an index establishing device, configured to establish an index for the descriptor and the label word.
10. The index establishing equipment according to claim 9, wherein the equipment further comprises a theme training device, configured to:
obtain, according to a predetermined theme system, a training corpus corresponding to the predetermined theme system;
train a theme classifier according to the training corpus;
wherein the theme extraction device is configured to:
extract the descriptor from the structured information according to the theme classifier.
11. The index establishing equipment according to claim 9, wherein the centre word determination unit is configured to:
filter the at least one candidate label word according to a predetermined filtering rule, to obtain at least one filtered candidate label word;
determine the centre word according to the at least one filtered candidate label word;
wherein the predetermined filtering rule is determined based on at least any one of the following:
the part of speech of the at least one candidate label word;
the word rule of the at least one candidate label word;
the co-occurrence ratio of the at least one candidate label word with the theme.
12. The index establishing equipment according to any one of claims 9 to 11, wherein the equipment further comprises:
a normalization device, configured to determine, if the label words include a plurality of semantically consistent label words, a normalization result for the plurality of semantically consistent label words;
wherein the index establishing device is configured to:
establish an index for the descriptor, the label word and the normalization result.
13. Matching equipment for matching a user's query input information against an index established according to claim 9, wherein the equipment comprises:
a query acquisition device, configured to obtain the query input information entered by a user;
an information analysis device, configured to perform theme and label analysis on the query input information, to obtain the descriptor and label word corresponding to the query input information;
a matching query device, configured to perform a matching query, according to the descriptor and label word, in the index established according to claim 10, to obtain candidate text information matching the query input information;
a text determining device, configured to determine, according to the semantic matching degree between the candidate text information and the query input information, target text information matching the query input information.
14. The equipment according to claim 13, wherein the text determining device comprises:
a matching calculation unit, configured to calculate the semantic matching degree between the candidate text information and the query input information;
a text determination unit, configured to determine, according to the semantic matching degree and in combination with a predetermined matching-degree threshold, the target text information matching the query input information.
15. The equipment according to claim 14, wherein the equipment further comprises a word set determining device, configured to:
perform word segmentation on the query input information, to obtain the segmented words;
merge the segmented words with the descriptor and label word obtained by the information analysis device, to obtain a matching word set corresponding to the query input information, wherein the words included in the matching word set serve as matching words;
wherein the matching calculation unit is configured to:
calculate the semantic matching degree between the candidate text information and the query input information according to the matching word set and the index word set corresponding to the candidate text information.
16. The equipment according to claim 15, wherein the equipment further comprises a subsequent processing device, configured to:
perform subsequent processing on the matching words, to update the matching word set;
wherein the subsequent processing includes at least any one of the following:
determining the mutually synonymous matching words among the matching words, and merging the mutually synonymous matching words into a subset of the matching word set;
performing synonym expansion on the matching words, and determining the synonyms obtained by the expansion, together with the matching words, as a subset of the matching word set.
17. A system for establishing an index and matching a user's query input information, comprising the index establishing equipment according to any one of claims 9 to 12 and the matching equipment according to any one of claims 13 to 16.
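The label word selection recited in claims 1 and 3 (filter the candidates, pick a centre word, keep the candidates close to it) can be sketched as follows. Everything concrete here is an assumption made for illustration: the claims do not define the distance measure, how the centre word is chosen, or any threshold values. The sketch filters by co-occurrence ratio only, takes the candidate with the highest ratio as the centre word, and uses one minus the Jaccard similarity of character sets as the distance.

```python
from typing import Dict, List, Optional, Tuple


def char_distance(a: str, b: str) -> float:
    """Illustrative distance: 1 - Jaccard similarity over character sets."""
    sa, sb = set(a), set(b)
    if not (sa | sb):
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(sa & sb) / len(sa | sb)


def select_label_words(
    ratios: Dict[str, float], min_ratio: float, max_distance: float
) -> Tuple[Optional[str], List[str]]:
    """Filter candidates, choose a centre word, keep candidates near it.

    ratios: candidate label word -> co-occurrence ratio with the theme.
    """
    filtered = {w: r for w, r in ratios.items() if r >= min_ratio}
    if not filtered:
        return None, []
    # Assumed rule: the centre word is the candidate with the highest ratio
    # (ties broken alphabetically because the candidates are sorted first).
    centre = max(sorted(filtered), key=lambda w: filtered[w])
    labels = [w for w in sorted(filtered) if char_distance(w, centre) <= max_distance]
    return centre, labels


# Hypothetical co-occurrence ratios for candidates under one theme.
ratios = {"vomiting": 0.9, "vomit": 0.8, "fever": 0.7, "the": 0.1}
centre, labels = select_label_words(ratios, min_ratio=0.5, max_distance=0.5)
```

With these sample values, "the" is removed by the co-occurrence filter and "fever" is dropped for being too far from the centre word.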
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410079818.7A CN103886034B (en) | 2014-03-05 | 2014-03-05 | A kind of method and apparatus of inquiry input information that establishing index and matching user |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103886034A CN103886034A (en) | 2014-06-25 |
CN103886034B CN103886034B (en) | 2019-03-19 |
Family
ID=50954926
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410079818.7A Active CN103886034B (en) | 2014-03-05 | 2014-03-05 | A kind of method and apparatus of inquiry input information that establishing index and matching user |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103886034B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5694523A (en) * | 1995-05-31 | 1997-12-02 | Oracle Corporation | Content processing system for discourse |
CN103177036A (en) * | 2011-12-23 | 2013-06-26 | 盛乐信息技术(上海)有限公司 | Method and system for label automatic extraction |
CN103294780A (en) * | 2013-05-13 | 2013-09-11 | 百度在线网络技术(北京)有限公司 | Directory mapping relationship mining device and directory mapping relationship mining device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7472115B2 (en) * | 2004-04-29 | 2008-12-30 | International Business Machines Corporation | Contextual flyout for search results |
US20120166414A1 (en) * | 2008-08-11 | 2012-06-28 | Ultra Unilimited Corporation (dba Publish) | Systems and methods for relevance scoring |
Also Published As
Publication number | Publication date |
---|---|
CN103886034A (en) | 2014-06-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||