CN106951420A - Literature search method and apparatus, author search method and apparatus - Google Patents
Literature search method and apparatus, author search method and apparatus
- Publication number
- CN106951420A CN201610007271.9A
- Authority
- CN
- China
- Prior art keywords
- theme
- document
- query text
- author
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a literature search method and apparatus and an author search method and apparatus. The literature search method comprises: receiving a query text concerning the documents to be searched for; determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and selecting documents, as search results, from among the documents related to the determined hierarchical semantic topic. Compared with methods that do not use hierarchical information, the invention obtains more accurate search results by using hierarchical topic information.
Description
Technical field
The present invention relates generally to the field of natural language processing. Specifically, it relates to a literature search method and apparatus and an author search method and apparatus that can obtain search results accurately.
Background art
In recent years, with the rapid growth of information storage capacity and web search technology, finding scientific literature and searching for relevant scholars is now largely done through online retrieval platforms. Most of these platforms use a retrieval approach similar to that of general-purpose search engines, based on keyword matching and text similarity. Although this approach performs well in general-purpose search engines, when applied to searching for academic literature or authors it fails to take into account the classification of academic fields and their hierarchical structure, so the returned results are not accurate enough.
For example, sentiment analysis is a specific branch of data mining. A search for academic literature on sentiment analysis inevitably returns some, or even many, documents that focus on the high-level, abstract study of data mining, perhaps merely because they mention sentiment analysis or introduce it briefly. The searcher, however, is not actually interested in abstract data mining research and wants the specific research results at the lower level of sentiment analysis. Likewise, when searching for authors in the field of sentiment analysis, the returned results are mixed with authors who focus on abstract data mining research.
It can be seen that the problem with the prior art is that search results are not accurate enough, and that the root cause is that hierarchical information is not fully used.
Therefore, the present invention aims to perform literature search and author search accurately.
Summary of the invention
A brief summary of the invention is given below in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description given later.
The object of the invention is to propose a literature search method and apparatus and an author search method and apparatus that return accurate search results.
To achieve the above object, according to one aspect of the invention, a literature search method is provided, comprising: receiving a query text concerning the documents to be searched for; determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and selecting documents, as search results, from among the documents related to the determined hierarchical semantic topic.
According to another aspect of the invention, a literature search apparatus is provided, comprising: a query text receiving device configured to receive a query text concerning the documents to be searched for; a topic determining device configured to determine, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and a document selection device configured to select documents, as search results, from among the documents related to the determined hierarchical semantic topic.
According to a further aspect of the invention, an author search method is provided, comprising: receiving a query text concerning the authors to be searched for; determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and selecting authors, as search results, from among the authors related to the determined hierarchical semantic topic.
According to yet another aspect of the invention, an author search apparatus is provided, comprising: a query text receiving device configured to receive a query text concerning the authors to be searched for; a topic determining device configured to determine, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and an author selection device configured to select authors, as search results, from among the authors related to the determined hierarchical semantic topic.
In addition, according to another aspect of the invention, a storage medium is provided. The storage medium contains machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the above methods according to the invention.
In addition, according to a further aspect of the invention, a program product is provided. The program product contains machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the above methods according to the invention.
Brief description of the drawings
The above and other objects, features and advantages of the invention will be more readily understood from the following description of embodiments of the invention with reference to the drawings. The components in the drawings serve only to illustrate the principles of the invention. In the drawings, the same or similar technical features or components are denoted by the same or similar reference numerals. In the drawings:
Fig. 1 shows a flowchart of a literature search method according to an embodiment of the invention;
Fig. 2 shows an example of an implicit hierarchical topic structure obtained by a hierarchical topic model;
Fig. 3 shows an example of an implicit hierarchical topic structure obtained by a hierarchical topic model;
Fig. 4 shows a specific implementation of step S2;
Fig. 5 shows a flowchart of an author search method according to an embodiment of the invention;
Fig. 6 shows a structural block diagram of a literature search apparatus according to an embodiment of the invention;
Fig. 7 shows a structural block diagram of an author search apparatus according to an embodiment of the invention; and
Fig. 8 shows a schematic block diagram of a computer that can be used to implement the methods and apparatus according to embodiments of the invention.
Detailed description of the embodiments
Exemplary embodiments of the invention are described in detail below with reference to the drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in the specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions must be made to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. It should also be understood that although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted that, to avoid obscuring the invention with unnecessary detail, the drawings show only the apparatus structures and/or processing steps closely related to the solution of the invention and omit other details of little relevance. It should further be noted that elements and features described in one drawing or embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments.
The flow of a literature search method according to an embodiment of the invention is described below with reference to Fig. 1.
Fig. 1 shows a flowchart of the literature search method according to an embodiment of the invention. As shown in Fig. 1, the literature search method comprises the following steps: receiving a query text concerning the documents to be searched for (step S1); determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text (step S2); and selecting documents, as search results, from among the documents related to the determined hierarchical semantic topic (step S3).
In step S1, a query text concerning the documents to be searched for is received.
Specifically, the query text is entered by the user and describes the documents to be searched for. It may be one or more keywords, a passage of text, or the like, for example the abstract of a document the user is interested in.
After the query text is received, it needs to be converted into a term vector to facilitate subsequent processing.
The set of terms corresponding to the elements of the term vector equals the union of the set of terms contained in all documents within the literature search scope and a domain term list. A "term" here includes both words and phrases, and the words include those that make up the phrases. For example, the set includes the phrase "sentiment analysis" as well as the words "sentiment" and "analysis".
Preferably, the conversion to a term vector follows the maximum length matching principle. That is, the text to be converted (the query text) is mapped to the longest terms it can be matched against in the set of terms corresponding to the elements of the term vector. For example, if the query text contains the phrase "sentiment analysis", the term vector must reflect that the text contains the phrase "sentiment analysis", rather than the word "sentiment" and the word "analysis". In practice, conversion by maximum length matching proceeds, for example, as follows: "Named Entity Recognition" and "Named Entity" are both terms in the field of natural language processing; under the maximum length matching principle, a match against "Named Entity Recognition" is attempted first, and only if that fails is a match against "Named Entity" attempted.
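The maximum length matching principle described above can be illustrated with a minimal Python sketch. The function name `max_length_match`, the `max_len` cap on phrase length, the lowercase whitespace tokenization and the sample vocabulary are all assumptions made for illustration; the patent does not prescribe any of them.

```python
from collections import Counter

def max_length_match(tokens, vocabulary, max_len=4):
    """Greedy longest-match conversion of a token list into term counts.

    `vocabulary` stands in for the set of terms behind the term vector;
    `max_len` is an assumed upper bound on phrase length in words.
    """
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocabulary:
                counts[candidate] += 1
                i += n  # consume the matched words
                break
        else:
            # No vocabulary term starts at this position; skip the token.
            i += 1
    return counts

vocabulary = {"named entity recognition", "named entity",
              "named", "entity", "recognition"}
query_tokens = "named entity recognition survey".split()
term_counts = max_length_match(query_tokens, vocabulary)
```

With this sample input only the phrase "named entity recognition" is counted; its shorter constituents are not, which is the behaviour the query-side conversion calls for.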
As for the set of terms corresponding to the elements of the term vector: a literature search necessarily has a search scope, and all terms contained in all documents within that scope naturally serve as the basic elements for comparison. They therefore form one source of the term set.
The other source of the term set is the domain term list. One way of obtaining the domain term list is to collect the known technical terms of each field within the literature search scope, that is, the existing domain terms of each field, such as the keyword fields of documents, glossaries provided by experts or practitioners, and glossaries in textbook appendices.
Another way of obtaining the domain term list is to use hot word analysis to extract hot words of various types from the documents of each field. In other words, the high-frequency hot words of each field are mined as domain terms, which is an important supplement to the existing domain terms.
In addition, traditional hot word extraction techniques mainly extract nominal terms (words and phrases). The invention is not limited to this; terms of other types can also be extracted.
In hot word extraction, Chinese documents must first undergo word segmentation and part-of-speech tagging. Segmentation and part-of-speech tagging are well-known routine processes in natural language processing and are not described further here. English documents need no segmentation or part-of-speech tagging. Furthermore, because Chinese and English differ in semantic expressiveness, the hot words extracted from Chinese are preferably word strings of two to four words, and those extracted from English are preferably word strings of two to three words. After these word strings are extracted, substring reduction is applied to them. The substring reduction rule is given by formula (1), where $T_{length}$ is the length of a word string, i.e. the number of words it contains, $T_{frequency}$ is the number of times the word string occurs, and $T_{value}$ is determined by these two factors. If a long word string contains a shorter word string and the $T_{value}$ of the long string is greater than the $T_{value}$ of the short string, the short string is merged into the long one; otherwise the short string is retained and the long string is deleted. Finally, all resulting word strings whose frequency exceeds a set frequency threshold are added to the hot word list. The frequency threshold can be set according to the number of documents in each field within the literature search scope. For example, if a field has 100,000 related documents, it may be decided empirically that word strings occurring fewer than 100 times are not regarded as hot word strings.
$$T_{value} = T_{frequency} \times T_{length} \tag{1}$$
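The substring reduction rule can be sketched in Python as follows. This is only an illustration under stated assumptions: word strings are represented as tuples of words, the rule is applied to every containing pair independently using the original counts, and the function names, sample frequencies and threshold are hypothetical.

```python
def contains(long_words, short_words):
    """True if short_words occurs as a contiguous run inside long_words."""
    n, m = len(long_words), len(short_words)
    return m < n and any(long_words[i:i + m] == short_words
                         for i in range(n - m + 1))

def reduce_substrings(freq, threshold):
    """Apply substring reduction, then keep strings above the threshold.

    `freq` maps a word string (tuple of words) to its occurrence count.
    """
    # Formula (1): T_value = T_frequency * T_length
    value = {s: f * len(s) for s, f in freq.items()}
    removed = set()
    for long_s in freq:
        for short_s in freq:
            if contains(long_s, short_s):
                if value[long_s] > value[short_s]:
                    removed.add(short_s)  # short string merged into the long one
                else:
                    removed.add(long_s)   # short string kept, long one deleted
    return {s for s, f in freq.items() if s not in removed and f > threshold}

freq = {
    ("sentiment", "analysis"): 300,
    ("sentiment", "analysis", "model"): 250,
    ("data", "mining"): 500,
    ("data", "mining", "tools"): 120,
}
hot_words = reduce_substrings(freq, threshold=100)
```

In this sample, "sentiment analysis model" (value 750) absorbs "sentiment analysis" (value 600), while "data mining" (value 1000) is kept and "data mining tools" (value 360) is dropped.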
After hot word extraction is complete, the extracted hot word list is merged with the existing domain terms mentioned above to form the domain term list.
In summary, hot word analysis can be used to enrich the domain term list, which, together with all terms contained in all documents within the search scope, constitutes the set of terms corresponding to the elements of the term vector. The received query text concerning the documents to be searched for can thus be converted into a term vector for subsequent processing.
In step S2, a hierarchical semantic model is used to determine the hierarchical semantic topic related to the query text.
By using a hierarchical semantic model, the invention exploits hierarchical information and thereby improves search precision.
The hierarchical semantic model is trained by converting all documents within the literature search scope into term vectors and feeding the resulting term vectors to the model for training. The trained model is hierarchical; in effect, the model automatically detects the semantic hierarchy implicit in the given data set (all documents within the literature search scope). For a newly input term vector, the trained model can provide the corresponding hierarchical semantic topic in the semantic hierarchy and the corresponding semantic similarity. The hierarchical semantic model is, for example, hierarchical Latent Dirichlet Allocation (hLDA).
The term vector conversion here differs from the earlier conversion of the query text by maximum length matching. To increase coverage of what the user searches for, both words and phrases are kept when all documents within the literature search scope are converted into term vectors. For example, if a document contains "Named Entity", the term vector reflects the presence of "Named", "Entity" and "Named Entity" together. Similarly, if a document contains "Named Entity Recognition", the term vector reflects the presence of "Named Entity Recognition", "Named Entity", "Named", "Entity" and "Recognition".
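The document-side conversion, which keeps phrases and their constituent words at the same time, can be sketched in Python as below. The function name `all_matches`, the `max_len` cap and the whitespace tokenization are assumptions for illustration only.

```python
from collections import Counter

def all_matches(tokens, vocabulary, max_len=4):
    """Count every vocabulary term that occurs in the token list.

    Unlike longest-match conversion, overlapping words and phrases are
    all recorded, so a phrase and its constituent words coexist.
    """
    counts = Counter()
    for i in range(len(tokens)):
        for n in range(1, min(max_len, len(tokens) - i) + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocabulary:
                counts[candidate] += 1
    return counts

doc_vocabulary = {"named entity recognition", "named entity",
                  "named", "entity", "recognition"}
doc_counts = all_matches("named entity recognition".split(), doc_vocabulary)
```

For the three-word sample document, all five vocabulary terms are recorded once each, matching the "Named Entity Recognition" example in the text.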
Fig. 2 and Fig. 3 each show an implicit hierarchical topic structure obtained by a hierarchical topic model. Fig. 2 shows the implicit hierarchical topic structure discovered using the abstracts of 1272 articles from the journal Psychological Review as data. Because psychology covers a very broad range of content, the keywords in the root node in Fig. 2 are common words such as "the", "of" and "and", which represent a "virtual" root node. In other words, there is no obvious relationship among the several second-layer nodes, because the parent node they share contains only these common words. Fig. 3 shows the implicit hierarchical topic structure discovered using more than 200 bird-related groups on the Flickr website as data. In the query process after training, the processing of the invention amounts to first finding, for the query text, a corresponding node (hierarchical topic) in the existing hierarchical semantic structure (step S2), and then searching for results among the documents and authors associated with that node (step S3).
That is, in step S2, a hierarchical semantic model is used to determine the hierarchical semantic topic related to the query text.
Specifically, as shown in Fig. 4, step S2 can be implemented by the following steps.
The query text is input into the hierarchical semantic model to obtain multiple candidate topics and their semantic similarities to the query text (step S41); the hierarchical semantic topic related to the query text is then determined from the multiple candidate topics (step S42).
Because the hierarchical semantic model has been trained, multiple candidate topics that may correspond to the input query text (in term vector form) can be obtained. These candidate topics are determined on the basis of the semantic similarity between the query text and each candidate topic.
The candidate topics can then be filtered by semantic similarity.
Specifically, when the semantic similarity of only one candidate topic exceeds a predetermined threshold, that candidate topic is determined to be the hierarchical semantic topic related to the query text.
When the semantic similarities of all candidate topics are less than or equal to the predetermined threshold, some of the candidate topics are selected according to a predetermined rule as the hierarchical semantic topics related to the query text. The predetermined rule can be set flexibly by those skilled in the art, for example choosing the N candidate topics with the highest semantic similarity, where N is a specified natural number.
When the semantic similarities of more than one candidate topic exceed the predetermined threshold, a predetermined number of candidate topics are selected in descending order of semantic similarity as the hierarchical semantic topics related to the query text.
Alternatively, when the semantic similarities of more than one candidate topic exceed the predetermined threshold, all candidate topics whose semantic similarity exceeds the threshold are presented to the user, who selects the hierarchical semantic topic related to the query text. The main advantage of this embodiment is that it provides user feedback, which can be used to optimize the invention so that it gradually comes closer to the user's actual intent. For example, the semantic similarity between the query text and the candidate topics the user selected can be raised, and the semantic similarity between the query text and the candidate topics the user did not select can be lowered.
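The three automatic selection rules of step S42 can be condensed into a short Python sketch. The function name `select_topics` and the parameters `top_n` (the N of the fallback rule) and `max_topics` (the predetermined number) are hypothetical, and the candidate similarities are assumed to arrive as a plain dictionary; the interactive variant in which the user picks a topic is not modelled.

```python
def select_topics(candidates, threshold, top_n=3, max_topics=2):
    """Pick hierarchical topics from candidates by semantic similarity.

    `candidates` maps a topic identifier to its similarity with the query.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    above = [(topic, sim) for topic, sim in ranked if sim > threshold]
    if len(above) == 1:
        # Exactly one candidate exceeds the threshold: use it directly.
        return [above[0][0]]
    if not above:
        # No candidate exceeds the threshold: fall back to the top N.
        return [topic for topic, _ in ranked[:top_n]]
    # Several candidates exceed the threshold: keep a fixed number of them,
    # in descending order of similarity.
    return [topic for topic, _ in above[:max_topics]]

single = select_topics({"sentiment analysis": 0.9, "data mining": 0.3}, 0.5)
fallback = select_topics({"a": 0.2, "b": 0.4, "c": 0.1}, 0.5, top_n=2)
several = select_topics({"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}, 0.5,
                        max_topics=2)
```

Keeping the three cases in one function makes the role of the threshold explicit: it decides which rule applies, not just which topics pass.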
In step S3, documents are selected, as search results, from among the documents related to the determined hierarchical semantic topic.
When the hierarchical semantic model is initially trained on all documents within the search scope, the association between the hierarchical semantic topics and all documents in the search scope is in fact also constructed: all documents in the search scope are classified by hierarchical semantic topic. Therefore, once the hierarchical semantic topic is determined in step S2, the range of candidate documents filtered by hierarchy is also determined. Because hierarchical semantic topics are used, this range corresponds precisely to the subject field the user wants to search, such as sentiment analysis, and does not involve documents that focus on the higher-level topic of data mining.
Next, in step S3, documents need to be further selected from among the documents related to the determined hierarchical semantic topic as the final search results.
For example, documents can be selected as search results according to the semantic similarity between the determined hierarchical semantic topic and the associated documents.
Documents can also be selected as search results according to that semantic similarity together with at least one of the following: the publication time of the document, the rank of the conference or journal to which the document belongs, and the number of times the document has been cited.
The rank of the conference to which a document belongs can be based on the CORE ranking results. The rank of the journal to which a document belongs can be based on the journal's impact factor.
Formula (2) below gives an example of topic-based document selection and ranking that combines the factors above.
For a hierarchical topic t, $W_{p,t}$ is the ranking weight of document p. $S(t, p)$ is the semantic similarity between document p and hierarchical topic t; its value is obtained from the hierarchical semantic topic model and is a real number between 0 and 1. A time coefficient is assigned to document p according to its publication time, where $y$ is the current time, $y_p$ is the time at which document p was published, $\lambda$ is the time decay coefficient, which can be set to 15 empirically, and $e$ is the base of the natural logarithm. $C_p$ is the number of times document p has been cited so far. $Q_p$ is the rank of the journal or conference to which document p belongs. For conferences, based on the CORE conference ranking data, $Q_p$ can be set to 1 for conferences ranked A*, 0.8 for conferences ranked A, 0.6 for conferences ranked B, 0.4 for conferences ranked C, and 0.2 for conferences of any other rank. For journals, the impact factor can be normalized so that $Q_p = 1 - 0.6e^{-i}$, where $i$ is the impact factor of the journal. After normalization, $Q_p$ is a real number between 0.4 and 1. For journals that currently have no impact factor, $Q_p$ is uniformly set to 0.2. The specific values above are all preferred values; the invention is not limited to them, and they can be set flexibly by those skilled in the art.
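The venue rank $Q_p$ is the one factor whose values are fully stated in the text, so it can be sketched directly. The function name `venue_quality` and its argument names are hypothetical; the numeric values are the preferred values given above.

```python
import math

# Preferred Q_p values for CORE conference ranks, as stated in the text.
CORE_RANK_TO_Q = {"A*": 1.0, "A": 0.8, "B": 0.6, "C": 0.4}

def venue_quality(kind, core_rank=None, impact_factor=None):
    """Return Q_p for a conference or journal.

    `kind` is "conference" or "journal" (hypothetical labels).
    """
    if kind == "conference":
        # Any rank other than A*, A, B or C gets the lowest value.
        return CORE_RANK_TO_Q.get(core_rank, 0.2)
    if kind == "journal":
        if impact_factor is None:
            # Journal without an impact factor.
            return 0.2
        # Normalization Q_p = 1 - 0.6 * e^(-i), which lies in [0.4, 1).
        return 1.0 - 0.6 * math.exp(-impact_factor)
    raise ValueError(f"unknown venue kind: {kind!r}")

q_top_conference = venue_quality("conference", core_rank="A*")
q_journal = venue_quality("journal", impact_factor=3.2)
```

The exponential normalization keeps every journal with an impact factor at or above 0.4, so such journals always rank above journals without one.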
Then, according to the value of $W_{p,t}$, documents are selected on the basis of topic t (either all documents under the topic or the top M, where M is a natural number) and sorted (for example in descending order of $W_{p,t}$), and the results are returned to the user.
Thus, documents are filtered and ranked using hierarchical topic information, yielding search results that reflect the hierarchy and are therefore more targeted and accurate.
Similarly to document search, hierarchical information can also be used to search for authors.
Fig. 5 shows a flowchart of an author search method according to an embodiment of the invention. As shown in Fig. 5, the author search method according to an embodiment of the invention comprises the following steps: receiving a query text concerning the authors to be searched for (step S51); determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text (step S52); and selecting authors, as search results, from among the authors related to the determined hierarchical semantic topic (step S53).
The main difference between the author search method and the literature search method above is that what is established is not the association between all documents in the search scope and the hierarchical semantic topics, but the association between each author in the literature search scope and the hierarchical semantic topics.
An author can be characterized by the documents he or she has written.
For example, all documents of each author are merged into one text representing the author, and this text is converted into the author's term vector, which serves as the representation of the author. In this case each text corresponds to one author, so searching for authors is equivalent to searching for texts, and the literature search method above is applied to the texts.
The author's term vector can also be obtained from all documents of the author by linear weighting. The weight of each document of the author may be related to at least one of the following: the publication time of the document, the rank of the conference or journal to which the document belongs, the number of times the document has been cited, the author's position in the document's author list, and the semantic similarity between the document and all documents of the author.
That is, an author can be regarded as the linearly weighted result of the documents he or she has published, and this linearly weighted result can itself be regarded as a text, so searching for authors is equivalent to searching for texts, to which the literature search method above is applied.
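The linear weighting of an author's documents can be sketched in Python as follows. The sketch assumes each document is already a dictionary of term counts and that the per-document weights have been computed elsewhere from the factors listed above; the function name and the sample weights are hypothetical.

```python
def author_vector(doc_vectors, weights):
    """Weighted sum of per-document term vectors for one author.

    `doc_vectors` is a list of {term: count} dictionaries and `weights`
    holds one weight per document (assumed to be given).
    """
    combined = {}
    for vector, weight in zip(doc_vectors, weights):
        for term, count in vector.items():
            combined[term] = combined.get(term, 0.0) + weight * count
    return combined

docs = [
    {"sentiment analysis": 2, "corpus": 1},
    {"sentiment analysis": 1, "data mining": 3},
]
vector = author_vector(docs, weights=[0.5, 1.0])
```

Setting every weight to 1 reduces this to the simpler variant in which all of an author's documents are merged into one text.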
The corresponding hierarchical semantic model is trained by inputting the term vector of each author within the literature search scope into the hierarchical semantic model.
Then, when an author search is carried out, in step S51 a query text concerning the authors to be searched for is received. The query text is, for example, one or more keywords or a passage of text.
Next, in step S52, a hierarchical semantic model is used to determine the hierarchical semantic topic related to the query text.
Finally, in step S53, authors are selected, as search results, from among the authors related to the determined hierarchical semantic topic.
Similarly to the selection and ranking of documents based on $W_{p,t}$ above, authors can be selected and ranked based on $W(a, t)$.
$W(a, t)$ is given, for example, by formula (3):
Here $W(a, t)$ is the ranking weight of author a for topic t. For all documents the author has published, the $W_{p,t}$ (formula (2)) of each document is multiplied by a coefficient and the results are linearly weighted to obtain $W(a, t)$. $r(a, p)$ is the position of author a in the author list of document p; for example, $r(a, p)$ is 1 for the first author, 2 for the second author, and so on. $set_a$ is the set of all documents published by author a, and $S(p, set_a)$ is the semantic similarity between document p and $set_a$. The semantic similarity of documents can be computed from topic vectors. Specifically, using the topic model, each document can be represented as a topic vector, so the semantic similarity between p and $set_a$ can be computed as the similarity between their corresponding topic vectors. Cosine similarity is typically used for this computation and is not described further here.
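Cosine similarity between topic vectors can be sketched in Python as below. How the document set $set_a$ is turned into a single topic vector is not stated in the text; the sketch assumes, purely for illustration, that it is represented by the element-wise mean of its documents' topic vectors. Both function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def similarity_to_set(doc_vector, set_vectors):
    """Similarity between one document and a set of documents.

    Assumption: the set is summarized by the mean of its topic vectors.
    """
    size = len(set_vectors)
    mean = [sum(column) / size for column in zip(*set_vectors)]
    return cosine_similarity(doc_vector, mean)

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
to_set = similarity_to_set([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Other summaries of the set, such as the topic vector of the concatenated text, would fit the description equally well.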
The results selected and ranked on the basis of $W(a, t)$ can be presented to the user.
Next, a literature search apparatus according to an embodiment of the invention is described with reference to Fig. 6.
Fig. 6 shows a structural block diagram of the literature search apparatus according to an embodiment of the invention. As shown in Fig. 6, the literature search apparatus 600 according to the invention comprises: a query text receiving device 61 configured to receive a query text concerning the documents to be searched for; a topic determining device 62 configured to determine, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and a document selection device 63 configured to select documents, as search results, from among the documents related to the determined hierarchical semantic topic.
In one embodiment, the literature search equipment 600 further includes: a conversion device,
configured to convert all documents within the literature search scope into term vectors; and a
training device, configured to input the resulting term vectors into the Hierarchical Semantic Model
for training, the trained Hierarchical Semantic Model having a hierarchy.
In one embodiment, the conversion device is further configured to convert the query text received by
the query text receiving device 61 into a term vector.
In one embodiment, the set of words corresponding to the elements of the term vector is equal to the
union of the set of words contained in all documents within the literature search scope and the
domain word list.
In one embodiment, the literature search equipment 600 further includes a domain word list building
device, configured to: collect the known domain terms of each field within the literature search
scope; extract various types of hot words from the documents of each field using a hot-word analysis
technique; and combine the collected domain terms and the extracted hot words into the domain word
list.
In one embodiment, the words include phrases and the words that constitute the phrases.
In one embodiment, the conversion to a term vector is performed according to the maximum length
matching principle.
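The maximum length matching principle can be sketched as follows. This is only an illustrative reading of the embodiment: it assumes whitespace-separated tokens, a greedy left-to-right scan, and a count-based term vector. The names max_match, to_term_vector, vocabulary and max_len are hypothetical and do not come from the source.

```python
def max_match(tokens, vocabulary, max_len=4):
    """Greedy forward maximum-length matching over a list of tokens.

    At each position the longest phrase found in `vocabulary` wins;
    a single token is always accepted so the scan cannot stall.
    """
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in vocabulary:
                terms.append(candidate)
                i += n
                break
    return terms


def to_term_vector(terms, vocabulary_order):
    """Count how often each vocabulary entry occurs among the matched terms."""
    return [terms.count(word) for word in vocabulary_order]


# Hypothetical vocabulary containing a phrase and its constituent words.
vocabulary_order = ["topic model", "topic", "model", "search"]
tokens = "hierarchical topic model search".split()
terms = max_match(tokens, set(vocabulary_order))
vector = to_term_vector(terms, vocabulary_order)
```

In this sketch the phrase "topic model" is matched as a whole, so its constituent words "topic" and "model" are not counted separately for that occurrence. Tokens outside the vocabulary are matched as single terms and then ignored when the vector is built.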
In one embodiment, the theme determining device 62 includes: a candidate topic acquiring unit,
configured to input the query text into the Hierarchical Semantic Model to obtain multiple candidate
topics and their semantic similarities to the query text; and a theme selecting unit, configured to
determine, from the multiple candidate topics, the hierarchical semantic theme related to the query
text.
In one embodiment, the theme selecting unit is further configured to: when the semantic similarity
of only one candidate topic is greater than a predetermined threshold, determine that candidate
topic as the hierarchical semantic theme related to the query text.
In one embodiment, the theme selecting unit is further configured to: when the semantic similarities
of all candidate topics are less than or equal to the predetermined threshold, select some of the
candidate topics according to a predetermined rule as the hierarchical semantic theme related to the
query text.
In one embodiment, the theme selecting unit is further configured to: when the semantic similarities
of more than one candidate topic are greater than the predetermined threshold, select a
predetermined number of candidate topics in descending order of semantic similarity as the
hierarchical semantic theme related to the query text.
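The three threshold cases handled by the theme selecting unit can be sketched as a single function. This is a hedged illustration, not the patented logic itself: the source does not specify the "predetermined rule" for the case where no candidate exceeds the threshold, so keeping the best fallback_k candidates is an assumption, as are all names and parameter values.

```python
def select_themes(candidates, threshold, top_k, fallback_k=1):
    """Pick themes from (theme, similarity) pairs using a similarity threshold."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    above = [c for c in ranked if c[1] > threshold]
    if len(above) == 1:
        # Exactly one candidate exceeds the threshold: use it directly.
        return [above[0][0]]
    if not above:
        # No candidate exceeds the threshold. Assumed "predetermined rule":
        # keep the fallback_k most similar candidates.
        return [theme for theme, _ in ranked[:fallback_k]]
    # Several candidates exceed the threshold: keep the top_k most similar.
    return [theme for theme, _ in above[:top_k]]


# Hypothetical candidate topics returned by the model for one query.
candidates = [("machine learning", 0.82), ("databases", 0.35), ("deep learning", 0.74)]
chosen = select_themes(candidates, threshold=0.5, top_k=2)
```

Sorting once up front keeps the three branches simple, since each of them only needs a prefix of the ranked list.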
In one embodiment, the theme selecting unit is further configured to: when the semantic similarities
of more than one candidate topic are greater than the predetermined threshold, present all candidate
topics whose semantic similarity is greater than the predetermined threshold to the user, so that
the user selects the hierarchical semantic theme related to the query text.
In one embodiment, the literature search equipment 600 further includes a similarity adjusting
device, configured to increase the semantic similarity between the candidate topics selected by the
user and the query text, and to decrease the semantic similarity between the candidate topics not
selected by the user and the query text.
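One possible way to realize the similarity adjustment is sketched below. The source only says that similarities are raised for selected topics and lowered for unselected ones. The fixed additive step, the clipping to the range [0, 1], and the function and parameter names are all assumptions made for illustration.

```python
def adjust_similarities(similarities, selected, step=0.1):
    """Raise similarities of user-selected topics and lower the others.

    `similarities` maps each presented candidate topic to its similarity
    to the query text; `selected` is the set of topics the user chose.
    """
    adjusted = {}
    for topic, sim in similarities.items():
        if topic in selected:
            adjusted[topic] = min(1.0, sim + step)  # clip at the assumed upper bound
        else:
            adjusted[topic] = max(0.0, sim - step)  # clip at the assumed lower bound
    return adjusted


# Hypothetical candidates that were presented to the user.
presented = {"machine learning": 0.82, "deep learning": 0.74}
updated = adjust_similarities(presented, selected={"deep learning"})
```

The function returns a new mapping rather than modifying its input, so the original similarities remain available for comparison.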
In one embodiment, the document selection device 63 is further configured to select documents as
search results according to the semantic similarity between the determined hierarchical semantic
theme and its associated documents.
In one embodiment, the document selection device 63 is further configured to select documents as
search results according to at least one of the document's publication time, the rank of the
conference or journal in which the document appeared, and the number of times the document has been
cited, together with the semantic similarity between the determined hierarchical semantic theme and
its associated documents.
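A ranking that combines semantic similarity with publication time, venue rank and citation count could look like the following sketch. The source does not give a combining formula, so the weighted sum, the weights, the recency and citation normalizations, and the names Doc, score and select_documents are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    similarity: float  # similarity to the determined theme, assumed in [0, 1]
    year: int          # publication year
    venue_rank: float  # assumed scale in [0, 1], higher is better
    citations: int     # number of times the document has been cited


def score(doc, current_year=2016, weights=(0.6, 0.15, 0.15, 0.1)):
    """Weighted sum of similarity, recency, venue rank and citations."""
    w_sim, w_time, w_venue, w_cite = weights
    recency = 1.0 / (1 + max(0, current_year - doc.year))
    # Log-scaled citation count, capped at 1.0 (assumed normalization).
    cite = min(1.0, math.log1p(doc.citations) / math.log1p(1000))
    return (w_sim * doc.similarity + w_time * recency
            + w_venue * doc.venue_rank + w_cite * cite)


def select_documents(docs, top_n):
    """Return the top_n documents by combined score, best first."""
    return sorted(docs, key=score, reverse=True)[:top_n]
```

Giving similarity the largest weight keeps the ranking driven mainly by relevance to the theme, with the other factors acting as secondary signals. Other weightings are equally compatible with the description.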
Since the processing in each device and unit included in the literature search equipment 600
according to the present invention is similar to the processing in the corresponding steps of the
literature search method described above, a detailed description of these devices and units is
omitted here for brevity.
Next, author search equipment according to an embodiment of the present invention is described with
reference to Fig. 7.
Fig. 7 shows a block diagram of author search equipment according to an embodiment of the present
invention. As shown in Fig. 7, the author search equipment 700 according to the present invention
includes: a query text receiving device 71, configured to receive a query text concerning the
authors to be searched for; a theme determining device 72, configured to determine, using a
Hierarchical Semantic Model, the hierarchical semantic theme related to the query text; and an
author selection device 73, configured to select authors, as search results, from the authors
related to the determined hierarchical semantic theme.
In one embodiment, the author search equipment 700 further includes a training device, the training
device including: a converting unit, configured to convert all documents of each author within the
literature search scope into a term vector of that author; and a training unit, configured to input
the resulting term vectors into the Hierarchical Semantic Model for training.
In one embodiment, the converting unit is further configured to merge all documents of each author
into one text representing that author, and to convert that text into the term vector of the author.
In one embodiment, the converting unit is further configured to obtain the term vector of each
author from all documents of that author by linear weighting, wherein the weight of each document of
the author is related to at least one of: the document's publication time, the rank of the
conference or journal in which the document appeared, the number of times the document has been
cited, the author's position in the document's author list, and the semantic similarity between the
document and all documents of the author.
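The linear weighting of an author's document vectors can be sketched as below. How the per-document weights are derived from publication time, venue rank, citations, author position or similarity is not specified in the source, so the weights are simply taken as inputs here. Normalizing them to sum to 1 is an additional assumption, and the function name is hypothetical.

```python
def author_term_vector(doc_vectors, weights):
    """Linearly weighted combination of one author's document term vectors."""
    if not doc_vectors or len(doc_vectors) != len(weights):
        raise ValueError("need one weight per document vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(doc_vectors[0])
    result = [0.0] * dim
    for vec, w in zip(doc_vectors, weights):
        if len(vec) != dim:
            raise ValueError("all term vectors must have the same length")
        for i, x in enumerate(vec):
            # Assumption: weights are normalized so that they sum to 1.
            result[i] += (w / total) * x
    return result


# Hypothetical term vectors for two documents of one author; the first
# document is weighted three times as heavily as the second.
author_vec = author_term_vector([[4, 0], [0, 4]], weights=[3, 1])
```

Compared with merging all documents into one text, this variant lets more recent or more influential documents contribute more strongly to the author's representation.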
Since the processing in each device and unit included in the author search equipment 700 according
to the present invention is similar to the processing in the corresponding steps of the author
search method described above, a detailed description of these devices and units is omitted here for
brevity.
In addition, it should be noted that each constituent device and unit in the above equipment may be
configured by software, firmware, hardware, or a combination thereof. The specific means or manner
of configuration is well known to those skilled in the art and is not repeated here. When
implemented by software or firmware, a program constituting the software is installed from a storage
medium or a network onto a computer having a dedicated hardware structure (for example, the
general-purpose computer 800 shown in Fig. 8); with the various programs installed, the computer is
able to perform various functions.
Fig. 8 shows a schematic block diagram of a computer that can be used to implement the method and
equipment according to embodiments of the present invention.
In Fig. 8, a central processing unit (CPU) 801 performs various processing according to a program
stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random
access memory (RAM) 803. The RAM 803 also stores, as needed, the data required when the CPU 801
performs the various processing. The CPU 801, the ROM 802 and the RAM 803 are connected to one
another via a bus 804. An input/output interface 805 is also connected to the bus 804.
The following components are connected to the input/output interface 805: an input section 806
(including a keyboard, a mouse, etc.), an output section 807 (including a display such as a
cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker, etc.), a storage section
808 (including a hard disk, etc.), and a communication section 809 (including a network interface
card such as a LAN card, a modem, etc.). The communication section 809 performs communication
processing via a network such as the Internet. A drive 810 may also be connected to the input/output
interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disc, a
magneto-optical disc, or a semiconductor memory, is mounted on the drive 810 as needed, so that a
computer program read from it is installed into the storage section 808 as needed.
When the above series of processing is implemented by software, the program constituting the
software is installed from a network such as the Internet or from a storage medium such as the
removable medium 811.
Those skilled in the art will understand that this storage medium is not limited to the removable
medium 811 shown in Fig. 8, which stores the program and is distributed separately from the
equipment in order to provide the program to the user. Examples of the removable medium 811 include
a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a
compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc
(including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the
storage medium may be the ROM 802, a hard disk included in the storage section 808, or the like, in
which the program is stored and which is distributed to the user together with the equipment
containing it.
The present invention also proposes a program product storing machine-readable instruction code.
When the instruction code is read and executed by a machine, the above method according to
embodiments of the present invention can be performed.
Accordingly, a storage medium carrying the above program product storing machine-readable
instruction code is also included in the disclosure of the present invention. The storage medium
includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disc, a memory
card, a memory stick, and the like.
In the above description of specific embodiments of the present invention, features described
and/or shown for one embodiment may be used in the same or a similar manner in one or more other
embodiments, combined with features of other embodiments, or substituted for features of other
embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the
presence of a feature, element, step or component, but does not exclude the presence or addition of
one or more other features, elements, steps or components.
In addition, the method of the present invention is not limited to being performed in the
chronological order described in the specification; it may also be performed in another
chronological order, in parallel, or independently. Therefore, the order of execution of the method
described in this specification does not limit the technical scope of the present invention.
Although the present invention has been disclosed above through the description of its specific
embodiments, it should be understood that all of the above embodiments and examples are illustrative
and not restrictive. Those skilled in the art may devise various modifications, improvements or
equivalents of the present invention within the spirit and scope of the appended claims. Such
modifications, improvements or equivalents should also be considered to fall within the scope of
protection of the present invention.
Supplementary notes
1. A literature search method, comprising:
receiving a query text concerning the documents to be searched for;
determining, using a Hierarchical Semantic Model, the hierarchical semantic theme related to the query text; and
selecting documents, as search results, from the documents related to the determined hierarchical semantic theme.
2. The method according to note 1, further comprising:
converting all documents within the literature search scope into term vectors; and
inputting the resulting term vectors into the Hierarchical Semantic Model for training, the trained
Hierarchical Semantic Model having a hierarchy.
3. The method according to note 2, wherein the set of words corresponding to the elements of the
term vector is equal to the union of the set of words contained in all documents within the
literature search scope and the domain word list.
4. The method according to note 3, further comprising:
collecting the known domain terms of each field within the literature search scope;
extracting various types of hot words from the documents of each field using a hot-word analysis technique; and
combining the collected domain terms and the extracted hot words into the domain word list.
5. The method according to note 1, wherein determining, using the Hierarchical Semantic Model, the
hierarchical semantic theme related to the query text comprises:
inputting the query text into the Hierarchical Semantic Model to obtain multiple candidate topics
and their semantic similarities to the query text; and
determining, from the multiple candidate topics, the hierarchical semantic theme related to the query text.
6. The method according to note 5, wherein determining, from the multiple candidate topics, the
hierarchical semantic theme related to the query text comprises:
when the semantic similarity of only one candidate topic is greater than a predetermined threshold,
determining that candidate topic as the hierarchical semantic theme related to the query text;
when the semantic similarities of all candidate topics are less than or equal to the predetermined
threshold, selecting some of the candidate topics according to a predetermined rule as the
hierarchical semantic theme related to the query text; and
when the semantic similarities of more than one candidate topic are greater than the predetermined
threshold, selecting a predetermined number of candidate topics in descending order of semantic
similarity as the hierarchical semantic theme related to the query text.
7. The method according to note 5, wherein determining, from the multiple candidate topics, the
hierarchical semantic theme related to the query text comprises:
when the semantic similarities of more than one candidate topic are greater than a predetermined
threshold, presenting all candidate topics whose semantic similarity is greater than the
predetermined threshold to the user, so that the user selects the hierarchical semantic theme
related to the query text.
8. The method according to note 7, wherein the semantic similarity between the candidate topics
selected by the user and the query text is increased, and the semantic similarity between the
candidate topics not selected by the user and the query text is decreased.
9. The method according to note 1, wherein selecting documents from the documents related to the
determined hierarchical semantic theme comprises:
selecting documents, as search results, according to the semantic similarity between the determined
hierarchical semantic theme and its associated documents.
10. The method according to note 9, wherein documents are selected as search results further
according to at least one of the document's publication time, the rank of the conference or journal
in which the document appeared, and the number of times the document has been cited.
11. An author search method, comprising:
receiving a query text concerning the authors to be searched for;
determining, using a Hierarchical Semantic Model, the hierarchical semantic theme related to the query text; and
selecting authors, as search results, from the authors related to the determined hierarchical semantic theme.
12. Literature search equipment, comprising:
a query text receiving device, configured to receive a query text concerning the documents to be searched for;
a theme determining device, configured to determine, using a Hierarchical Semantic Model, the
hierarchical semantic theme related to the query text; and
a document selection device, configured to select documents, as search results, from the documents
related to the determined hierarchical semantic theme.
13. The equipment according to note 12, further comprising:
a conversion device, configured to convert all documents within the literature search scope into term vectors; and
a training device, configured to input the resulting term vectors into the Hierarchical Semantic
Model for training, the trained Hierarchical Semantic Model having a hierarchy.
14. The equipment according to note 13, wherein the set of words corresponding to the elements of
the term vector is equal to the union of the set of words contained in all documents within the
literature search scope and the domain word list.
15. The equipment according to note 14, further comprising a domain word list building device,
configured to:
collect the known domain terms of each field within the literature search scope;
extract various types of hot words from the documents of each field using a hot-word analysis technique; and
combine the collected domain terms and the extracted hot words into the domain word list.
16. The equipment according to note 12, wherein the theme determining device comprises:
a candidate topic acquiring unit, configured to input the query text into the Hierarchical Semantic
Model to obtain multiple candidate topics and their semantic similarities to the query text; and
a theme selecting unit, configured to determine, from the multiple candidate topics, the
hierarchical semantic theme related to the query text.
17. The equipment according to note 16, wherein the theme selecting unit is further configured to:
when the semantic similarity of only one candidate topic is greater than a predetermined threshold,
determine that candidate topic as the hierarchical semantic theme related to the query text;
when the semantic similarities of all candidate topics are less than or equal to the predetermined
threshold, select some of the candidate topics according to a predetermined rule as the
hierarchical semantic theme related to the query text; and
when the semantic similarities of more than one candidate topic are greater than the predetermined
threshold, select a predetermined number of candidate topics in descending order of semantic
similarity as the hierarchical semantic theme related to the query text.
18. The equipment according to note 16, wherein the theme selecting unit is further configured to:
when the semantic similarities of more than one candidate topic are greater than a predetermined
threshold, present all candidate topics whose semantic similarity is greater than the predetermined
threshold to the user, so that the user selects the hierarchical semantic theme related to the
query text.
19. The equipment according to note 18, further comprising a similarity adjusting device, configured
to increase the semantic similarity between the candidate topics selected by the user and the query
text, and to decrease the semantic similarity between the candidate topics not selected by the user
and the query text.
20. The equipment according to note 12, wherein the document selection device is further configured
to select documents, as search results, according to the semantic similarity between the determined
hierarchical semantic theme and its associated documents.
Claims (10)
1. A literature search method, comprising:
receiving a query text concerning the documents to be searched for;
determining, using a Hierarchical Semantic Model, the hierarchical semantic theme related to the query text; and
selecting documents, as search results, from the documents related to the determined hierarchical semantic theme.
2. The method according to claim 1, further comprising:
converting all documents within the literature search scope into term vectors; and
inputting the resulting term vectors into the Hierarchical Semantic Model for training, the trained
Hierarchical Semantic Model having a hierarchy.
3. The method according to claim 2, wherein the set of words corresponding to the elements of the
term vector is equal to the union of the set of words contained in all documents within the
literature search scope and the domain word list.
4. The method according to claim 3, further comprising:
collecting the known domain terms of each field within the literature search scope;
extracting various types of hot words from the documents of each field using a hot-word analysis technique; and
combining the collected domain terms and the extracted hot words into the domain word list.
5. The method according to claim 1, wherein determining, using the Hierarchical Semantic Model, the
hierarchical semantic theme related to the query text comprises:
inputting the query text into the Hierarchical Semantic Model to obtain multiple candidate topics
and their semantic similarities to the query text; and
determining, from the multiple candidate topics, the hierarchical semantic theme related to the query text.
6. The method according to claim 5, wherein determining, from the multiple candidate topics, the
hierarchical semantic theme related to the query text comprises:
when the semantic similarity of only one candidate topic is greater than a predetermined threshold,
determining that candidate topic as the hierarchical semantic theme related to the query text;
when the semantic similarities of all candidate topics are less than or equal to the predetermined
threshold, selecting some of the candidate topics according to a predetermined rule as the
hierarchical semantic theme related to the query text; and
when the semantic similarities of more than one candidate topic are greater than the predetermined
threshold, selecting a predetermined number of candidate topics in descending order of semantic
similarity as the hierarchical semantic theme related to the query text.
7. The method according to claim 5, wherein determining, from the multiple candidate topics, the
hierarchical semantic theme related to the query text comprises:
when the semantic similarities of more than one candidate topic are greater than a predetermined
threshold, presenting all candidate topics whose semantic similarity is greater than the
predetermined threshold to the user, so that the user selects the hierarchical semantic theme
related to the query text.
8. The method according to claim 7, wherein the semantic similarity between the candidate topics
selected by the user and the query text is increased, and the semantic similarity between the
candidate topics not selected by the user and the query text is decreased.
9. An author search method, comprising:
receiving a query text concerning the authors to be searched for;
determining, using a Hierarchical Semantic Model, the hierarchical semantic theme related to the query text; and
selecting authors, as search results, from the authors related to the determined hierarchical semantic theme.
10. Literature search equipment, comprising:
a query text receiving device, configured to receive a query text concerning the documents to be searched for;
a theme determining device, configured to determine, using a Hierarchical Semantic Model, the
hierarchical semantic theme related to the query text; and
a document selection device, configured to select documents, as search results, from the documents
related to the determined hierarchical semantic theme.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610007271.9A CN106951420A (en) | 2016-01-06 | 2016-01-06 | Literature search method and apparatus, author's searching method and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106951420A true CN106951420A (en) | 2017-07-14 |
Family ID=59465655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610007271.9A Pending CN106951420A (en) | 2016-01-06 | 2016-01-06 | Literature search method and apparatus, author's searching method and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106951420A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829822A (en) * | 2018-06-12 | 2018-11-16 | 腾讯科技(深圳)有限公司 | The recommended method and device of media content, storage medium, electronic device |
CN110334178A (en) * | 2019-03-28 | 2019-10-15 | 平安科技(深圳)有限公司 | Data retrieval method, device, equipment and readable storage medium storing program for executing |
CN111324701A (en) * | 2020-02-24 | 2020-06-23 | 腾讯科技(深圳)有限公司 | Content supplement method, content supplement device, computer equipment and storage medium |
CN111666371A (en) * | 2020-04-21 | 2020-09-15 | 北京三快在线科技有限公司 | Theme-based matching degree determination method and device, electronic equipment and storage medium |
CN113591853A (en) * | 2021-08-10 | 2021-11-02 | 北京达佳互联信息技术有限公司 | Keyword extraction method and device and electronic equipment |
WO2022116324A1 (en) * | 2020-12-04 | 2022-06-09 | 中国科学院深圳先进技术研究院 | Search model training method, apparatus, terminal device, and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102799677A (en) * | 2012-07-20 | 2012-11-28 | 河海大学 | Water conservation domain information retrieval system and method based on semanteme |
CN103136352A (en) * | 2013-02-27 | 2013-06-05 | 华中师范大学 | Full-text retrieval system based on two-level semantic analysis |
CN103425710A (en) * | 2012-05-25 | 2013-12-04 | 北京百度网讯科技有限公司 | Subject-based searching method and device |
CN103678302A (en) * | 2012-08-30 | 2014-03-26 | 北京百度网讯科技有限公司 | Document structuration organizing method and device |
US20150186364A1 (en) * | 2005-05-06 | 2015-07-02 | John M. Nelson | Database and Index Organization for Enhanced Document Retrieval |
CN104834679A (en) * | 2015-04-14 | 2015-08-12 | 苏州大学 | Representation and inquiry method of behavior track and device therefor |
Non-Patent Citations (1)
Title |
---|
贾鹏: "互联网科研文献检索系统设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829822A (en) * | 2018-06-12 | 2018-11-16 | 腾讯科技(深圳)有限公司 | The recommended method and device of media content, storage medium, electronic device |
CN108829822B (en) * | 2018-06-12 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Media content recommendation method and device, storage medium and electronic device |
CN110334178A (en) * | 2019-03-28 | 2019-10-15 | 平安科技(深圳)有限公司 | Data retrieval method, device, equipment and readable storage medium storing program for executing |
CN110334178B (en) * | 2019-03-28 | 2023-06-20 | 平安科技(深圳)有限公司 | Data retrieval method, device, equipment and readable storage medium |
CN111324701A (en) * | 2020-02-24 | 2020-06-23 | 腾讯科技(深圳)有限公司 | Content supplement method, content supplement device, computer equipment and storage medium |
CN111324701B (en) * | 2020-02-24 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Content supplement method, content supplement device, computer equipment and storage medium |
CN111666371A (en) * | 2020-04-21 | 2020-09-15 | 北京三快在线科技有限公司 | Theme-based matching degree determination method and device, electronic equipment and storage medium |
WO2022116324A1 (en) * | 2020-12-04 | 2022-06-09 | 中国科学院深圳先进技术研究院 | Search model training method, apparatus, terminal device, and storage medium |
CN113591853A (en) * | 2021-08-10 | 2021-11-02 | 北京达佳互联信息技术有限公司 | Keyword extraction method and device and electronic equipment |
CN113591853B (en) * | 2021-08-10 | 2024-04-19 | 北京达佳互联信息技术有限公司 | Keyword extraction method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170714 |