CN106951420A - Literature search method and apparatus, author search method and apparatus

Literature search method and apparatus, author search method and apparatus

Publication number
CN106951420A
Authority
CN
China
Prior art keywords
theme
document
query text
author
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610007271.9A
Other languages
Chinese (zh)
Inventor
宋双永
房璐
缪庆亮
孟遥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201610007271.9A
Publication of CN106951420A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a literature search method and apparatus and an author search method and apparatus. The literature search method includes: receiving a query text about the documents to be searched for; determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and selecting documents, as search results, from the documents related to the determined hierarchical semantic topic. Compared with methods that do not use hierarchical information, the invention can obtain more accurate search results by using hierarchical topic information.

Description

Literature search method and apparatus, author search method and apparatus
Technical field
The present invention relates generally to the field of natural language processing. Specifically, it relates to a literature search method and apparatus and an author search method and apparatus that can obtain search results accurately.
Background
In recent years, with the rapid growth of information storage capacity and web search technology, finding scientific literature and searching for relevant scholars are now largely done through online retrieval platforms. Such platforms mostly use retrieval based on keyword matching and text similarity, much like general-purpose search engines. Although this kind of retrieval performs well in general-purpose search engines, for searches of academic literature or authors it fails to take into account the classification of academic disciplines and the hierarchical structure of fields, so the returned results are not accurate enough.
For example, sentiment analysis is a specific branch of data mining. A search for academic literature on sentiment analysis will inevitably return some, or even many, documents that focus on data mining as a higher-level, more abstract research topic, merely because they mention sentiment analysis or introduce it briefly. The searcher, however, is not interested in data mining in the abstract, but wants concrete research results at the lower level of sentiment analysis. Likewise, when searching for authors in the field of sentiment analysis, the returned results will be mixed with authors who focus on abstract data mining research.
It can be seen that the problem with the prior art is that search results are not accurate enough, and the root cause is that hierarchical information is not fully used.
The present invention therefore aims to perform literature search and author search accurately.
Summary of the invention
A brief summary of the present invention is given below in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description given later.
The object of the present invention is to propose a literature search method and apparatus and an author search method and apparatus that return accurate search results.
To achieve this object, according to one aspect of the invention, a literature search method is provided, the method including: receiving a query text about the documents to be searched for; determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and selecting documents, as search results, from the documents related to the determined hierarchical semantic topic.
According to another aspect of the invention, a literature search apparatus is provided, the apparatus including: a query text receiving device configured to receive a query text about the documents to be searched for; a topic determining device configured to determine, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and a document selecting device configured to select documents, as search results, from the documents related to the determined hierarchical semantic topic.
According to a further aspect of the invention, an author search method is provided, the method including: receiving a query text about the authors to be searched for; determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and selecting authors, as search results, from the authors related to the determined hierarchical semantic topic.
According to yet another aspect of the invention, an author search apparatus is provided, the apparatus including: a query text receiving device configured to receive a query text about the authors to be searched for; a topic determining device configured to determine, using a hierarchical semantic model, the hierarchical semantic topic related to the query text; and an author selecting device configured to select authors, as search results, from the authors related to the determined hierarchical semantic topic.
In addition, according to another aspect of the invention, a storage medium is provided. The storage medium includes machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the above methods according to the invention.
In addition, according to a further aspect of the invention, a program product is provided. The program product includes machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the above methods according to the invention.
Brief description of the drawings
The above and other objects, features and advantages of the invention will be more readily understood from the following description of embodiments of the invention given with reference to the drawings. The components in the drawings are intended only to illustrate the principles of the invention. In the drawings, the same or similar technical features or components are denoted by the same or similar reference numerals. In the drawings:
Fig. 1 shows a flowchart of a literature search method according to an embodiment of the invention;
Fig. 2 shows an example of an implicit hierarchical topic structure obtained by a hierarchical topic model;
Fig. 3 shows another example of an implicit hierarchical topic structure obtained by a hierarchical topic model;
Fig. 4 shows a specific implementation of step S2;
Fig. 5 shows a flowchart of an author search method according to an embodiment of the invention;
Fig. 6 shows a block diagram of a literature search apparatus according to an embodiment of the invention;
Fig. 7 shows a block diagram of an author search apparatus according to an embodiment of the invention; and
Fig. 8 shows a schematic block diagram of a computer that can be used to implement the methods and apparatus according to embodiments of the invention.
Detailed description
Exemplary embodiments of the invention are described in detail below with reference to the drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions must be made to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. It should also be understood that, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, to avoid obscuring the invention with unnecessary detail, the drawings show only the apparatus structures and/or processing steps closely related to the solution of the invention and omit other details of little relevance. It should further be noted that elements and features described in one drawing or embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments.
The flow of a literature search method according to an embodiment of the invention is described below with reference to Fig. 1.
Fig. 1 shows a flowchart of a literature search method according to an embodiment of the invention. As shown in Fig. 1, the literature search method includes the following steps: receiving a query text about the documents to be searched for (step S1); determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text (step S2); and selecting documents, as search results, from the documents related to the determined hierarchical semantic topic (step S3).
In step S1, a query text about the documents to be searched for is received.
Specifically, the query text is entered by the user and describes the documents to be searched for. It may be one or more keywords, a passage of text, or the like, for example the abstract of a document the user is interested in.
After the query text is received, it needs to be converted into a term vector to facilitate subsequent processing.
The set of terms corresponding to the elements of the term vector equals the union of the set of terms contained in all documents within the literature search scope and the domain term list. "Terms" here include words, phrases, and so on, and the words include those that make up phrases. For example, the set includes the phrase "sentiment analysis" as well as the words "sentiment" and "analysis".
Preferably, the conversion to a term vector follows the maximum-length matching principle. That is, the text to be converted (the query text) is mapped to the longest terms it can match in the set of terms corresponding to the elements of the term vector. For example, if the query text contains the phrase "sentiment analysis", the term vector should reflect that the text contains the phrase "sentiment analysis", rather than the word "sentiment" and the word "analysis". In practice, maximum-length matching can be carried out as follows: "Named Entity Recognition" and "Named Entity" are both terms in the natural language processing field, so a match against "Named Entity Recognition" is attempted first, and only if that fails is a match against "Named Entity" attempted.
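The maximum-length matching described above can be sketched as a greedy left-to-right scan. This is only an illustrative sketch: the function name `longest_match_terms`, the `max_len` cap on phrase length, whitespace tokenization, and the choice to skip out-of-vocabulary tokens are all assumptions, not details given in the patent.

```python
from collections import Counter

def longest_match_terms(tokens, vocabulary, max_len=4):
    """Greedy maximum-length matching of a token list against a term set.

    Returns a sparse term vector (term -> count). At each position the
    longest vocabulary term is taken and its sub-terms are NOT counted.
    """
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocabulary:
                counts[candidate] += 1
                i += n
                break
        else:
            i += 1  # assumed behavior: skip tokens not in the vocabulary
    return counts

vocabulary = {"named entity recognition", "named entity",
              "named", "entity", "recognition"}
query = "named entity recognition is hard".split()
query_vector = longest_match_terms(query, vocabulary)
```

With this vocabulary, the query is represented only by the phrase "named entity recognition", not by its component words.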
As for the set of terms corresponding to the elements of the term vector: a literature search necessarily has a search scope, so all terms contained in all documents within that scope naturally serve as the basic elements for comparison, and accordingly form one source of the term set.
The other source of the term set is the domain term list. One way of obtaining the domain term list is to collect the known technical terms of each field within the literature search scope, that is, the existing terminology of each field, such as the keyword fields of documents, glossaries provided by experts or practitioners, and glossaries in textbook appendices.
The domain term list can also be obtained by using hot word analysis to extract hot words of various types from the literature of each field. In other words, the high-frequency hot words of each field are mined as domain terms, which is an important supplement to the existing terminology.
In addition, traditional hot word extraction mainly extracts nominal terms (both words and phrases). The invention is not limited to this; other types of terms can also be extracted.
During hot word extraction, Chinese literature first requires word segmentation and part-of-speech tagging of the Chinese text. Word segmentation and part-of-speech tagging are conventional, well-known processing in natural language processing and are not described further here. For English literature, no word segmentation or part-of-speech tagging is needed. Furthermore, because Chinese and English differ in semantic expressiveness, hot words extracted from Chinese are preferably strings of two to four words, while hot words extracted from English are preferably strings of two to three words. After these strings are extracted, substring reduction is applied to them. The rule for substring reduction is given by formula (1), where $T_{length}$ is the length of the string, i.e. the number of words it contains, $T_{frequency}$ is the number of times the string occurs, and $T_{value}$ is determined by these two factors. If a long string contains a shorter string and the $T_{value}$ of the long string is greater than that of the short string, the short string is merged into the long one; otherwise the short string is kept and the long string is deleted. Finally, all remaining strings whose frequency exceeds a set frequency threshold are added to the hot word list. The frequency threshold can be set according to the number of documents in each field within the literature search scope. For example, if a field has 100,000 related documents, it can be decided empirically that strings occurring fewer than 100 times are not treated as hot strings.
$$T_{value} = T_{frequency} \times T_{length} \tag{1}$$
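As a rough illustration of the substring reduction rule and formula (1), the sketch below compares every long/short pair of candidate strings and then applies the frequency threshold. The function names, the word-level containment test, the pairwise (order-independent) evaluation, and the sample frequencies are assumptions made for the example; the patent does not specify these details.

```python
def contains(long_words, short_words):
    """True if short_words occurs as a contiguous run inside long_words."""
    n, m = len(long_words), len(short_words)
    return m < n and any(long_words[i:i + m] == short_words
                         for i in range(n - m + 1))

def reduce_substrings(freq, min_freq):
    """Apply substring reduction, then keep strings with frequency > min_freq.

    freq maps a candidate string to its occurrence count.
    """
    words = {s: tuple(s.split()) for s in freq}
    # Formula (1): T_value = T_frequency * T_length
    value = {s: freq[s] * len(words[s]) for s in freq}
    dropped = set()
    for long_s in freq:
        for short_s in freq:
            if contains(words[long_s], words[short_s]):
                if value[long_s] > value[short_s]:
                    dropped.add(short_s)   # short string merged into the long one
                else:
                    dropped.add(long_s)    # short string kept, long one deleted
    return {s for s in freq if s not in dropped and freq[s] > min_freq}

# Made-up counts for illustration only.
candidates = {
    "sentiment analysis": 300,         # T_value = 600
    "sentiment analysis method": 250,  # T_value = 750, absorbs the shorter string
    "data mining": 500,                # T_value = 1000
    "data mining technology": 120,     # T_value = 360, deleted
    "topic model": 80,                 # below the frequency threshold
}
hot_words = reduce_substrings(candidates, min_freq=100)
```

Here `hot_words` keeps "sentiment analysis method" and "data mining"; the other three candidates are merged, deleted, or filtered out by the threshold.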
After hot word extraction is complete, the extracted hot word list is merged with the existing field terminology described above to form the domain term list.
In summary, hot word analysis can be used to enrich the domain term list, which, together with all terms contained in all documents within the search scope, makes up the set of terms corresponding to the elements of the term vector. The received query text about the documents to be searched for can then be converted into a term vector for subsequent processing.
In step S2, the hierarchical semantic topic related to the query text is determined using a hierarchical semantic model.
By using a hierarchical semantic model, the invention exploits hierarchical information and thereby improves search precision.
The hierarchical semantic model is trained by converting all documents within the literature search scope into term vectors and feeding the resulting term vectors into the model. The trained model is hierarchical: in effect, it automatically detects the semantic hierarchy implicit in the given data set (all documents within the literature search scope). For a newly input term vector, the trained model can provide the corresponding hierarchical semantic topics in that hierarchy together with the corresponding semantic similarities. The hierarchical semantic model is, for example, hierarchical Latent Dirichlet Allocation (hLDA).
The term vector conversion used here differs from the maximum-length matching conversion applied to the query text. To increase coverage of what users may search for, both words and phrases are kept when the documents within the literature search scope are converted into term vectors. For example, if a document contains "Named Entity", the term vector reflects the presence of "Named", "Entity" and "Named Entity" at the same time. Similarly, if a document contains "Named Entity Recognition", the term vector reflects the presence of "Named Entity Recognition", "Named Entity", "Named", "Entity" and "Recognition" at the same time.
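The document-side conversion, which keeps phrases and their component words together, might look like the following sketch. The function name `all_match_terms`, the `max_len` cap, and whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def all_match_terms(tokens, vocabulary, max_len=4):
    """Count every vocabulary term (word or phrase) that occurs in tokens.

    Unlike maximum-length matching, overlapping and nested matches are all
    counted, so a phrase and the words inside it appear side by side.
    """
    counts = Counter()
    for i in range(len(tokens)):
        for n in range(1, min(max_len, len(tokens) - i) + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocabulary:
                counts[candidate] += 1
    return counts

vocabulary = {"named entity recognition", "named entity",
              "named", "entity", "recognition"}
doc_vector = all_match_terms("named entity recognition".split(), vocabulary)
```

For this input, `doc_vector` contains all five vocabulary terms with count 1 each; "entity recognition" is absent only because it is not in the assumed vocabulary.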
Fig. 2 and Fig. 3 each show an implicit hierarchical topic structure obtained by a hierarchical topic model. Fig. 2 shows the implicit hierarchical topic structure discovered from 1,272 article abstracts from the journal Psychological Review. Because psychology covers a very wide range of content, the keywords in the root node of Fig. 2 are common words such as "the", "of" and "and", which means the root is a "virtual" root node. In other words, there is no obvious association among the several leaf nodes of the second layer, because the parent node they share contains only these common words. Fig. 3 shows the implicit hierarchical topic structure discovered using more than 200 bird-related groups on the Flickr website as data. In the query process after training, the processing of the invention amounts to first finding, for the query text, a corresponding node (hierarchical topic) in the existing hierarchical semantic structure (step S2), and then obtaining the search results from the documents or authors associated with that node (step S3).
That is, in step S2, the hierarchical semantic topic related to the query text is determined using the hierarchical semantic model.
Specifically, as shown in Fig. 4, step S2 can be implemented by the following steps.
The query text is input to the hierarchical semantic model to obtain multiple candidate topics and their semantic similarities to the query text (step S41); the hierarchical semantic topic related to the query text is then determined from the multiple candidate topics (step S42).
Because the hierarchical semantic model has been trained, the query text (in term vector form) yields multiple candidate topics to which it may correspond. The candidate topics are determined on the basis of the semantic similarity between the query text and each candidate topic.
The candidate topics can then be screened according to semantic similarity.
Specifically, when the semantic similarity of exactly one candidate topic is greater than a predetermined threshold, that candidate topic is determined to be the hierarchical semantic topic related to the query text.
When the semantic similarities of all candidate topics are less than or equal to the predetermined threshold, some of the candidate topics are selected according to a predetermined rule as the hierarchical semantic topics related to the query text. The predetermined rule can be set flexibly by those skilled in the art, for example choosing the N candidate topics with the highest semantic similarity, where N is a specified natural number.
When the semantic similarities of more than one candidate topic are greater than the predetermined threshold, a predetermined number of candidate topics are selected in descending order of semantic similarity as the hierarchical semantic topics related to the query text.
Alternatively, when the semantic similarities of more than one candidate topic are greater than the predetermined threshold, all candidate topics whose semantic similarity exceeds the threshold are presented to the user, who selects the hierarchical semantic topic related to the query text. The main advantage of this embodiment is that it provides user feedback, which can be used to optimize the invention so that it gradually comes closer to what the user actually intends. For example, the semantic similarity between the query text and the candidate topics the user selected can be raised, and the semantic similarity between the query text and the candidate topics the user did not select can be lowered.
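The three automatic screening cases above (exactly one candidate above the threshold, none above it, several above it) can be sketched as follows. The function name `select_topics`, the parameter names `fallback_n` and `max_topics`, and the sample similarity values are hypothetical; the user-feedback variant is not shown.

```python
def select_topics(candidates, threshold, fallback_n=3, max_topics=2):
    """Pick hierarchical topics from (topic, similarity) candidate pairs.

    - exactly one candidate above the threshold: return it
    - none above the threshold: return the top fallback_n by similarity
      (one possible "predetermined rule")
    - several above the threshold: return the top max_topics of those
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    above = [c for c in ranked if c[1] > threshold]
    if len(above) == 1:
        return above
    if not above:
        return ranked[:fallback_n]
    return above[:max_topics]

# Made-up similarities for illustration only.
candidates = [("data mining", 0.6), ("sentiment analysis", 0.9),
              ("opinion mining", 0.7), ("databases", 0.1)]
selected = select_topics(candidates, threshold=0.5)
```

With these values three candidates exceed the threshold, so the two most similar ones, "sentiment analysis" and "opinion mining", are returned.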
In step S3, documents are selected, as search results, from the documents related to the determined hierarchical semantic topic.
When the hierarchical semantic model was initially trained with all documents within the search scope, the associations between the hierarchical semantic topics and those documents were in effect constructed, which amounts to classifying all documents within the search scope by hierarchical semantic topic. Determining the hierarchical semantic topic in step S2 therefore also determines the range of candidate documents screened by hierarchy. Thanks to the use of the hierarchical semantic topic, this range corresponds precisely to the subject area the user wants to search, such as sentiment analysis, and does not involve documents focused on the higher-level topic of data mining.
Next, in step S3, documents still need to be selected from the documents related to the determined hierarchical semantic topic as the final search results.
For example, documents can be selected as search results according to the semantic similarity between the determined hierarchical semantic topic and each related document.
Documents can also be selected as search results according to the above semantic similarity combined with at least one of the publication time of the document, the rank of the conference or journal in which the document appeared, and the number of times the document has been cited.
The rank of the conference in which a document appeared can be taken from the CORE conference rankings. The rank of the journal in which a document appeared can be based on the journal's impact factor.
Formula (2) gives an example of topic-based document selection and ranking that combines the above factors.
For hierarchical topic t, $W_{p,t}$ is the ranking weight of document p. S(t, p) is the semantic similarity between document p and hierarchical topic t; its value is obtained from the hierarchical semantic topic model and is a real number between 0 and 1. A time coefficient is assigned according to when document p was published, where y is the current time, $y_p$ is the time at which document p was published, λ is the time decay coefficient, which can be set empirically to 15, and e is the base of the natural logarithm. $C_p$ is the number of times document p has been cited so far. $Q_p$ is the rank of the journal or conference in which document p appeared. For conferences, based on the CORE conference ranking data, $Q_p$ can be set to 1 for a conference ranked A*, 0.8 for a conference ranked A, 0.6 for a conference ranked B, 0.4 for a conference ranked C, and 0.2 for a conference with any other ranking. For journals, the impact factor can be normalized by setting $Q_p = 1 - 0.6\,e^{-i}$, where i is the impact factor of the journal. The normalized $Q_p$ is a real number between 0.4 and 1. For journals that currently have no impact factor, $Q_p$ is uniformly set to 0.2. The specific values above are all preferred values; the invention is not limited to them, and they can be set flexibly by those skilled in the art.
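The venue rank $Q_p$ is the one factor whose computation is fully spelled out in the text, so it can be sketched directly. The function names below are hypothetical, and passing `None` for a journal without an impact factor is an assumption about how missing data is represented; the full ranking weight of formula (2) is not reproduced here.

```python
import math

# Preferred values from the description, keyed by CORE conference rank.
CORE_RANK_SCORE = {"A*": 1.0, "A": 0.8, "B": 0.6, "C": 0.4}

def conference_quality(core_rank):
    """Q_p for a conference; any rank other than A*, A, B, C scores 0.2."""
    return CORE_RANK_SCORE.get(core_rank, 0.2)

def journal_quality(impact_factor):
    """Q_p for a journal: 1 - 0.6 * e^(-i), or 0.2 without an impact factor."""
    if impact_factor is None:
        return 0.2
    return 1.0 - 0.6 * math.exp(-impact_factor)
```

An impact factor of 0 maps to 0.4 and the value approaches 1 as the impact factor grows, which matches the stated range of 0.4 to 1.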
Afterwards, according to the value of $W_{p,t}$, documents are selected based on topic t (all documents under a topic, or the top M, may be selected, where M is a natural number) and sorted (for example in descending order of $W_{p,t}$), and the result is returned to the user.
At this point, documents have been screened and ranked using hierarchical topic information, giving search results that reflect the topic hierarchy and are therefore more targeted and accurate.
As with document search, hierarchical information can also be used to search for authors.
Fig. 5 shows a flowchart of an author search method according to an embodiment of the invention. As shown in Fig. 5, the author search method according to an embodiment of the invention includes the following steps: receiving a query text about the authors to be searched for (step S51); determining, using a hierarchical semantic model, the hierarchical semantic topic related to the query text (step S52); and selecting authors, as search results, from the authors related to the determined hierarchical semantic topic (step S53).
The main difference between the author search method and the literature search method described above is that what is established is not the association between all documents within the search scope and the hierarchical semantic topics, but the association between each author within the literature search scope and the hierarchical semantic topics.
An author can be characterized by the documents he or she has written.
For example, all documents of each author are merged into one text representing that author, and this text is converted into the author's term vector, which serves as the characterization of the author. In this case each text corresponds to one author, so searching for authors is equivalent to searching for texts, and the literature search method described above is applied to these texts.
The author's term vector can also be obtained from all documents of the author by linear weighting. The weight of each document of the author may be related to at least one of: the publication time of the document, the rank of the conference or journal in which the document appeared, the number of times the document has been cited, the author's position in the document's author list, and the semantic similarity between the document and all documents of the author.
That is, an author can be regarded as the linearly weighted result of the documents he or she has published. This linearly weighted result can also be regarded as a text, so searching for authors is again equivalent to searching for texts, and the literature search method described above is applied to these texts.
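The linear weighting of an author's documents can be illustrated with a small sketch. The function name `author_term_vector`, the normalization of weights to sum to 1, and the sample weights are assumptions; the patent only states which factors the weights may depend on, not how they are computed.

```python
from collections import defaultdict

def author_term_vector(doc_vectors, weights):
    """Linearly combine an author's document term vectors.

    doc_vectors: list of dicts (term -> count), one per document.
    weights: one non-negative weight per document; they are normalized to
    sum to 1 here, which is an illustrative choice.
    """
    total = sum(weights)
    author = defaultdict(float)
    for vector, weight in zip(doc_vectors, weights):
        for term, count in vector.items():
            author[term] += (weight / total) * count
    return dict(author)

# Hypothetical author with two documents; the first is weighted three times
# as heavily (for example, first-authored and highly cited).
docs = [{"sentiment analysis": 2},
        {"sentiment analysis": 1, "data mining": 4}]
vector = author_term_vector(docs, weights=[3, 1])
```

The resulting vector can then be fed to the hierarchical semantic model in the same way as a document term vector.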
The corresponding hierarchical semantic model is trained by feeding the term vector of each author within the literature search scope into the model.
Then, when an author search is performed, a query text about the authors to be searched for is received in step S51. The query text is, for example, one or more keywords or a passage of text.
Next, in step S52, the hierarchical semantic topic related to the query text is determined using the hierarchical semantic model.
Finally, in step S53, authors are selected, as search results, from the authors related to the determined hierarchical semantic topic.
Just as documents are selected and ranked based on $W_{p,t}$ above, authors can be selected and ranked based on W(a, t).
W(a, t) is given, for example, by formula (3):
Here W(a, t) is the ranking weight of author a with respect to topic t. For all documents published by author a, the $W_{p,t}$ of each document (formula (2)) is multiplied by a coefficient and the results are linearly weighted to obtain W(a, t). r(a, p) is the position of author a in the author list of document p; for example, r(a, p) is 1 for the first author, 2 for the second author, and so on. $set_a$ is the set of all documents published by author a, and S(p, $set_a$) is the semantic similarity between document p and $set_a$. The semantic similarity of documents can be computed from topic vectors. Specifically, with a topic model each document can be represented as a topic vector, so the semantic similarity between p and $set_a$ can be computed as the similarity between their corresponding topic vectors. The cosine similarity method is typically used for this computation and is not described further here.
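For reference, the cosine similarity between two topic vectors can be computed as in the sketch below. The function name and the sample vectors are illustrative, and how the topic vector for $set_a$ is formed from its member documents is not specified in the text, so the example simply takes two vectors as given.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical topic distributions for document p and for the set set_a.
p_topics = [0.7, 0.2, 0.1]
set_a_topics = [0.6, 0.3, 0.1]
similarity = cosine_similarity(p_topics, set_a_topics)
```

Because topic vectors have non-negative entries, the result lies between 0 and 1, with values near 1 indicating that the document is typical of the author's overall output.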
The result that can will be selected and be sorted based on W (a, t), is presented to user.
Literature search equipment according to an embodiment of the invention is described next, with reference to Fig. 6.
Fig. 6 shows the block diagram of literature search equipment according to an embodiment of the invention.As schemed Shown in 6, included according to the literature search equipment 600 of the present invention:Query text reception device 61, quilt It is configured to:Receive the query text on the document to be searched for;Theme determining device 62, is configured For:The Layer semantics theme related to query text is determined using Hierarchical Semantic Model;And document Selection device 63, is configured as:From the document related to identified Layer semantics theme, choosing Document is selected, Search Results are used as.
In one embodiment, the literature search equipment 600 further includes: a conversion device configured to convert all documents within the literature search scope into term vectors; and a training device configured to input the resulting term vectors into the hierarchical semantic model for training, the trained hierarchical semantic model having a hierarchy.
In one embodiment, the conversion device is further configured to convert the query text received by the query text reception device 61 into a term vector.
In one embodiment, the set of words corresponding to the elements of the term vector equals the union of the set of words contained in all documents within the literature search scope and the domain word list.
In one embodiment, the literature search equipment 600 further includes a domain word list construction device configured to: collect the known technical terms of each field within the literature search scope; extract various types of hot words from the documents of each field using hot-word analysis techniques; and combine the collected domain terms and the extracted hot words into the domain word list.
In one embodiment, the words include phrases and the words that make up the phrases.
In one embodiment, the conversion into a term vector follows the maximum-length matching principle.
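The conversion described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the conversion device itself: the text is assumed to be already split into whitespace-separated words, the vocabulary is assumed to be the union of document words and the domain word list, words outside the vocabulary are simply skipped, and the names max_match_tokens, to_term_vector, vocab_order and max_len are hypothetical.

```python
from collections import Counter
from typing import List, Set


def max_match_tokens(words: List[str], vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum-length matching.

    At each position, try the longest candidate phrase first (up to max_len
    words) and fall back to shorter ones, so a phrase in the vocabulary is
    preferred over the individual words that make it up.
    """
    tokens: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in vocab:
                tokens.append(candidate)
                i += n
                break
        else:
            i += 1  # the word is not in the vocabulary: skip it
    return tokens


def to_term_vector(text: str, vocab_order: List[str]) -> List[int]:
    """Count vocabulary entries in text; element k corresponds to vocab_order[k]."""
    counts = Counter(max_match_tokens(text.lower().split(), set(vocab_order)))
    return [counts[term] for term in vocab_order]
```

For example, with the vocabulary ["topic model", "topic", "model", "search"], the text "Topic model search topic" is matched as "topic model", "search", "topic", so the phrase is counted once and the standalone word "model" is not counted.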
In one embodiment, the theme determining device 62 includes: a candidate theme acquiring unit configured to input the query text into the hierarchical semantic model to obtain multiple candidate themes and their semantic similarities to the query text; and a theme selection unit configured to determine, from the multiple candidate themes, the hierarchical semantic theme related to the query text.
In one embodiment, the theme selection unit is further configured to: when the semantic similarity of only one candidate theme is greater than a predetermined threshold, determine that candidate theme to be the hierarchical semantic theme related to the query text.
In one embodiment, the theme selection unit is further configured to: when the semantic similarities of all candidate themes are less than or equal to the predetermined threshold, select some of the candidate themes according to a predetermined rule as the hierarchical semantic themes related to the query text.
In one embodiment, the theme selection unit is further configured to: when the semantic similarity of more than one candidate theme is greater than the predetermined threshold, select a predetermined number of candidate themes in descending order of semantic similarity as the hierarchical semantic themes related to the query text.
In one embodiment, the theme selection unit is further configured to: when the semantic similarity of more than one candidate theme is greater than the predetermined threshold, present all candidate themes whose semantic similarity is greater than the predetermined threshold to the user, and let the user select the hierarchical semantic theme related to the query text.
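The automatic selection cases above (exactly one candidate above the threshold, none above it, or several above it) can be sketched in Python as follows. The function name select_themes, the parameters top_n and fallback_n, and the fallback rule of taking the most similar candidates are assumptions for illustration; the specification leaves the predetermined rule open. The sketch also assumes that, in the third case, the predetermined number is taken from the candidates above the threshold.

```python
from typing import Dict, List


def select_themes(similarity: Dict[str, float], threshold: float,
                  top_n: int = 3, fallback_n: int = 1) -> List[str]:
    """Pick hierarchical semantic themes for a query from candidate themes.

    similarity maps each candidate theme to its semantic similarity
    to the query text.
    """
    ranked = sorted(similarity, key=lambda theme: similarity[theme], reverse=True)
    above = [theme for theme in ranked if similarity[theme] > threshold]
    if len(above) == 1:
        # Only one candidate exceeds the threshold: use it directly.
        return above
    if not above:
        # No candidate exceeds the threshold. Assumed "predetermined rule":
        # keep the fallback_n most similar candidates.
        return ranked[:fallback_n]
    # Several candidates exceed the threshold: keep a predetermined number
    # of them in descending order of similarity.
    return above[:top_n]
```

In the interactive embodiment, the list named above in the sketch would be shown to the user instead of being truncated automatically.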
In one embodiment, the literature search equipment 600 further includes a similarity adjusting device configured to increase the semantic similarity between the query text and the candidate themes selected by the user, and to decrease the semantic similarity between the query text and the candidate themes not selected by the user.
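A minimal Python sketch of such a feedback adjustment is given below. The fixed additive step, the clamping to the range [0, 1], and the name adjust_similarities are assumptions for illustration; the specification does not state how large the adjustment is or how it is applied.

```python
from typing import Dict, Set


def adjust_similarities(similarity: Dict[str, float], chosen: Set[str],
                        step: float = 0.1) -> Dict[str, float]:
    """Return updated similarities for the candidate themes shown to the user.

    Themes the user selected are raised by step, the others are lowered by
    step, and every value is kept within [0, 1].
    """
    adjusted = {}
    for theme, value in similarity.items():
        if theme in chosen:
            adjusted[theme] = min(1.0, value + step)
        else:
            adjusted[theme] = max(0.0, value - step)
    return adjusted
```

Storing the adjusted values for the query would let later searches with the same query favor the themes that users actually chose.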
In one embodiment, the document selection device 63 is further configured to select documents, as search results, according to the semantic similarity between the determined hierarchical semantic theme and its associated documents.
In one embodiment, the document selection device 63 is further configured to select documents, as search results, according to at least one of the publication time of a document, the level of the conference or journal in which the document appeared, and the number of times the document has been cited, together with the semantic similarity between the determined hierarchical semantic theme and its associated documents.
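One possible way to combine these factors is a weighted sum, sketched below in Python. This is not formula (2) of the specification; the weights, the normalizations of recency and citation count, the reference year, and the names Doc, score and rank_documents are all assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Doc:
    title: str
    similarity: float  # semantic similarity to the selected theme, in [0, 1]
    year: int          # publication year
    venue_rank: float  # normalized conference/journal level in [0, 1], higher is better
    citations: int     # number of times the document has been cited


def score(doc: Doc, current_year: int = 2016,
          weights: Tuple[float, float, float, float] = (0.6, 0.15, 0.15, 0.1)) -> float:
    """Hypothetical linear combination of the four factors."""
    w_sim, w_time, w_venue, w_cite = weights
    recency = 1.0 / (1 + max(0, current_year - doc.year))  # newer documents score higher
    cited = doc.citations / (1 + doc.citations)            # squashes counts into [0, 1)
    return (w_sim * doc.similarity + w_time * recency
            + w_venue * doc.venue_rank + w_cite * cited)


def rank_documents(docs: List[Doc], top_k: int = 10) -> List[Doc]:
    """Return the top_k documents in descending order of score."""
    return sorted(docs, key=score, reverse=True)[:top_k]
```

Giving semantic similarity the largest weight keeps the theme match dominant, while the other factors mainly break ties between comparably relevant documents.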
Because the processing in each device and unit included in the literature search equipment 600 according to the present invention is similar to the processing in the corresponding steps of the literature search method described above, a detailed description of these devices and units is omitted here for brevity.
Next, author search equipment according to an embodiment of the present invention is described with reference to Fig. 7.
Fig. 7 is a block diagram of author search equipment according to an embodiment of the present invention. As shown in Fig. 7, the author search equipment 700 according to the present invention includes: a query text reception device 71 configured to receive a query text about the authors to be searched for; a theme determining device 72 configured to determine, using a hierarchical semantic model, the hierarchical semantic theme related to the query text; and an author selection device 73 configured to select authors, as search results, from the authors related to the determined hierarchical semantic theme.
In one embodiment, the author search equipment 700 further includes a training device, the training device including: a conversion unit configured to convert all documents of each author within the literature search scope into a term vector of that author; and a training unit configured to input the resulting term vectors into the hierarchical semantic model for training.
In one embodiment, the conversion unit is further configured to merge all documents of each author into a single text representing that author, and to convert that text into the term vector of the author.
In one embodiment, the conversion unit is further configured to obtain the term vector of each author by linear weighting over all documents of that author, wherein the weight of each document of the author is related to at least one of: the publication time of the document, the level of the conference or journal in which the document appeared, the number of times the document has been cited, the author's position in the document's author list, and the semantic similarity between the document and all documents of the author.
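The linear weighting of an author's documents can be sketched in Python as follows. The function name author_term_vector and the normalization by the sum of the weights are assumptions; how each document weight is derived from the listed factors is left to the caller, since the specification does not fix it.

```python
from typing import List, Sequence


def author_term_vector(doc_vectors: List[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Weighted average of an author's document term vectors.

    doc_vectors[i] is the term vector of the author's i-th document and
    weights[i] is the weight of that document.
    """
    if len(doc_vectors) != len(weights):
        raise ValueError("one weight is required per document")
    if not doc_vectors:
        return []
    dim = len(doc_vectors[0])
    total = sum(weights)
    if total == 0:
        return [0.0] * dim
    return [sum(w * vec[i] for w, vec in zip(weights, doc_vectors)) / total
            for i in range(dim)]
```

With equal weights this reduces to averaging the documents, which gives each document the same influence on the author's vector.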
Because the processing in each device and unit included in the author search equipment 700 according to the present invention is similar to the processing in the corresponding steps of the author search method described above, a detailed description of these devices and units is omitted here for brevity.
It should also be noted that the component devices and units of the above equipment may be configured by software, firmware, hardware, or a combination thereof. The specific means or manners of configuration are well known to those skilled in the art and are not repeated here. When implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 800 shown in Fig. 8), and the computer, with the various programs installed, can perform various functions.
Fig. 8 is a schematic block diagram of a computer that can be used to implement the method and equipment according to embodiments of the present invention.
In Fig. 8, a central processing unit (CPU) 801 performs various processing according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. The RAM 803 also stores, as needed, the data required when the CPU 801 performs the various processing. The CPU 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output interface 805 is also connected to the bus 804.
The following components are connected to the input/output interface 805: an input section 806 (including a keyboard, a mouse, etc.), an output section 807 (including a display such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker, etc.), the storage section 808 (including a hard disk, etc.), and a communication section 809 (including a network interface card such as a LAN card, a modem, etc.). The communication section 809 performs communication processing via a network such as the Internet. A drive 810 may also be connected to the input/output interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory is mounted on the drive 810 as needed, so that a computer program read from it is installed into the storage section 808 as needed.
When the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 811.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 811 shown in Fig. 8, which stores the program and is distributed separately from the equipment in order to provide the program to the user. Examples of the removable medium 811 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 802, a hard disk included in the storage section 808, or the like, which stores the program and is distributed to the user together with the equipment containing it.
The present invention also provides a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above method according to embodiments of the present invention can be performed.
Accordingly, a storage medium carrying the above program product storing machine-readable instruction code is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disk, a memory card, a memory stick, and the like.
In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, element, step, or component, but does not preclude the presence or addition of one or more other features, elements, steps, or components.
In addition, the methods of the present invention are not limited to being performed in the chronological order described in the specification; they may also be performed in another chronological order, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
Although the present invention has been disclosed above through the description of its specific embodiments, it should be understood that all of the above embodiments and examples are illustrative and not restrictive. Those skilled in the art may devise various modifications, improvements, or equivalents of the present invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the scope of protection of the present invention.
Supplementary Notes
1. A literature search method, comprising:
receiving a query text about the documents to be searched for;
determining, using a hierarchical semantic model, the hierarchical semantic theme related to the query text; and
selecting documents, as search results, from the documents related to the determined hierarchical semantic theme.
2. The method according to note 1, further comprising:
converting all documents within the literature search scope into term vectors; and
inputting the resulting term vectors into the hierarchical semantic model for training, the trained hierarchical semantic model having a hierarchy.
3. The method according to note 2, wherein the set of words corresponding to the elements of the term vector equals the union of the set of words contained in all documents within the literature search scope and the domain word list.
4. The method according to note 3, further comprising:
collecting the known technical terms of each field within the literature search scope;
extracting various types of hot words from the documents of each field using hot-word analysis techniques; and
combining the collected domain terms and the extracted hot words into the domain word list.
5. The method according to note 1, wherein determining, using the hierarchical semantic model, the hierarchical semantic theme related to the query text comprises:
inputting the query text into the hierarchical semantic model to obtain multiple candidate themes and their semantic similarities to the query text; and
determining, from the multiple candidate themes, the hierarchical semantic theme related to the query text.
6. The method according to note 5, wherein determining, from the multiple candidate themes, the hierarchical semantic theme related to the query text comprises:
when the semantic similarity of only one candidate theme is greater than a predetermined threshold, determining that candidate theme to be the hierarchical semantic theme related to the query text;
when the semantic similarities of all candidate themes are less than or equal to the predetermined threshold, selecting some of the candidate themes according to a predetermined rule as the hierarchical semantic themes related to the query text; and
when the semantic similarity of more than one candidate theme is greater than the predetermined threshold, selecting a predetermined number of candidate themes in descending order of semantic similarity as the hierarchical semantic themes related to the query text.
7. The method according to note 5, wherein determining, from the multiple candidate themes, the hierarchical semantic theme related to the query text comprises:
when the semantic similarity of more than one candidate theme is greater than a predetermined threshold, presenting all candidate themes whose semantic similarity is greater than the predetermined threshold to the user, the user selecting the hierarchical semantic theme related to the query text.
8. The method according to note 7, wherein the semantic similarity between the query text and the candidate themes selected by the user is increased, and the semantic similarity between the query text and the candidate themes not selected by the user is decreased.
9. The method according to note 1, wherein selecting documents from the documents related to the determined hierarchical semantic theme comprises:
selecting documents, as search results, according to the semantic similarity between the determined hierarchical semantic theme and its associated documents.
10. The method according to note 9, wherein documents are selected as search results further according to at least one of the publication time of a document, the level of the conference or journal in which the document appeared, and the number of times the document has been cited.
11. An author search method, comprising:
receiving a query text about the authors to be searched for;
determining, using a hierarchical semantic model, the hierarchical semantic theme related to the query text; and
selecting authors, as search results, from the authors related to the determined hierarchical semantic theme.
12. Literature search equipment, comprising:
a query text reception device configured to receive a query text about the documents to be searched for;
a theme determining device configured to determine, using a hierarchical semantic model, the hierarchical semantic theme related to the query text; and
a document selection device configured to select documents, as search results, from the documents related to the determined hierarchical semantic theme.
13. The equipment according to note 12, further comprising:
a conversion device configured to convert all documents within the literature search scope into term vectors; and
a training device configured to input the resulting term vectors into the hierarchical semantic model for training, the trained hierarchical semantic model having a hierarchy.
14. The equipment according to note 13, wherein the set of words corresponding to the elements of the term vector equals the union of the set of words contained in all documents within the literature search scope and the domain word list.
15. The equipment according to note 14, further comprising a domain word list construction device configured to:
collect the known technical terms of each field within the literature search scope;
extract various types of hot words from the documents of each field using hot-word analysis techniques; and
combine the collected domain terms and the extracted hot words into the domain word list.
16. The equipment according to note 12, wherein the theme determining device comprises:
a candidate theme acquiring unit configured to input the query text into the hierarchical semantic model to obtain multiple candidate themes and their semantic similarities to the query text; and
a theme selection unit configured to determine, from the multiple candidate themes, the hierarchical semantic theme related to the query text.
17. The equipment according to note 16, wherein the theme selection unit is further configured to:
when the semantic similarity of only one candidate theme is greater than a predetermined threshold, determine that candidate theme to be the hierarchical semantic theme related to the query text;
when the semantic similarities of all candidate themes are less than or equal to the predetermined threshold, select some of the candidate themes according to a predetermined rule as the hierarchical semantic themes related to the query text; and
when the semantic similarity of more than one candidate theme is greater than the predetermined threshold, select a predetermined number of candidate themes in descending order of semantic similarity as the hierarchical semantic themes related to the query text.
18. The equipment according to note 16, wherein the theme selection unit is further configured to: when the semantic similarity of more than one candidate theme is greater than a predetermined threshold, present all candidate themes whose semantic similarity is greater than the predetermined threshold to the user, the user selecting the hierarchical semantic theme related to the query text.
19. The equipment according to note 18, further comprising a similarity adjusting device configured to increase the semantic similarity between the query text and the candidate themes selected by the user, and to decrease the semantic similarity between the query text and the candidate themes not selected by the user.
20. The equipment according to note 12, wherein the document selection device is further configured to select documents, as search results, according to the semantic similarity between the determined hierarchical semantic theme and its associated documents.

Claims (10)

1. A literature search method, comprising:
receiving a query text about the documents to be searched for;
determining, using a hierarchical semantic model, the hierarchical semantic theme related to the query text; and
selecting documents, as search results, from the documents related to the determined hierarchical semantic theme.
2. The method according to claim 1, further comprising:
converting all documents within the literature search scope into term vectors; and
inputting the resulting term vectors into the hierarchical semantic model for training, the trained hierarchical semantic model having a hierarchy.
3. The method according to claim 2, wherein the set of words corresponding to the elements of the term vector equals the union of the set of words contained in all documents within the literature search scope and the domain word list.
4. The method according to claim 3, further comprising:
collecting the known technical terms of each field within the literature search scope;
extracting various types of hot words from the documents of each field using hot-word analysis techniques; and
combining the collected domain terms and the extracted hot words into the domain word list.
5. The method according to claim 1, wherein determining, using the hierarchical semantic model, the hierarchical semantic theme related to the query text comprises:
inputting the query text into the hierarchical semantic model to obtain multiple candidate themes and their semantic similarities to the query text; and
determining, from the multiple candidate themes, the hierarchical semantic theme related to the query text.
6. The method according to claim 5, wherein determining, from the multiple candidate themes, the hierarchical semantic theme related to the query text comprises:
when the semantic similarity of only one candidate theme is greater than a predetermined threshold, determining that candidate theme to be the hierarchical semantic theme related to the query text;
when the semantic similarities of all candidate themes are less than or equal to the predetermined threshold, selecting some of the candidate themes according to a predetermined rule as the hierarchical semantic themes related to the query text; and
when the semantic similarity of more than one candidate theme is greater than the predetermined threshold, selecting a predetermined number of candidate themes in descending order of semantic similarity as the hierarchical semantic themes related to the query text.
7. The method according to claim 5, wherein determining, from the multiple candidate themes, the hierarchical semantic theme related to the query text comprises:
when the semantic similarity of more than one candidate theme is greater than a predetermined threshold, presenting all candidate themes whose semantic similarity is greater than the predetermined threshold to the user, the user selecting the hierarchical semantic theme related to the query text.
8. The method according to claim 7, wherein the semantic similarity between the query text and the candidate themes selected by the user is increased, and the semantic similarity between the query text and the candidate themes not selected by the user is decreased.
9. An author search method, comprising:
receiving a query text about the authors to be searched for;
determining, using a hierarchical semantic model, the hierarchical semantic theme related to the query text; and
selecting authors, as search results, from the authors related to the determined hierarchical semantic theme.
10. Literature search equipment, comprising:
a query text reception device configured to receive a query text about the documents to be searched for;
a theme determining device configured to determine, using a hierarchical semantic model, the hierarchical semantic theme related to the query text; and
a document selection device configured to select documents, as search results, from the documents related to the determined hierarchical semantic theme.
CN201610007271.9A 2016-01-06 2016-01-06 Literature search method and apparatus, author's searching method and equipment Pending CN106951420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610007271.9A CN106951420A (en) 2016-01-06 2016-01-06 Literature search method and apparatus, author's searching method and equipment

Publications (1)

Publication Number Publication Date
CN106951420A true CN106951420A (en) 2017-07-14

Family

ID=59465655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610007271.9A Pending CN106951420A (en) 2016-01-06 2016-01-06 Literature search method and apparatus, author's searching method and equipment

Country Status (1)

Country Link
CN (1) CN106951420A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170714