CN109902152A - Method and apparatus for retrieving information - Google Patents

Method and apparatus for retrieving information

Publication number
CN109902152A
Authority
CN
China
Prior art keywords
keyword
text
retrieved
words
item title
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910217161.9A
Other languages
Chinese (zh)
Other versions
CN109902152B (en)
Inventor
安思宇
刘明浩
朱翰闻
王乐义
黄相凯
张亦鹏
郭江亮
李旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a method and apparatus for retrieving information. One specific embodiment of the method includes: obtaining a text to be retrieved; analyzing the text to be retrieved to generate a keyword set; screening an item name word set and an item feature word set from the keyword set; determining search terms from the item name word set and the item feature word set; and retrieving based on the search terms to obtain information on the items associated with the search terms. This embodiment improves the efficiency of information retrieval.

Description

Method and apparatus for retrieving information
Technical field
The embodiments of the present application relate to the field of computer technology, and in particular to a method and apparatus for retrieving information.
Background
Information retrieval is the main way in which users query and obtain information; it comprises the methods and means of finding information. Information retrieval in the narrow sense refers only to information query, that is, the process by which a user, as needed, uses a certain method and a retrieval tool to find the required information in an information collection. Information retrieval in the broad sense is the process of processing, sorting, organizing, and storing information in a certain way, and then accurately finding the relevant information according to the specific needs of the information user. It is also known as information storage and retrieval. In general, information retrieval refers to information retrieval in the broad sense.
Current information retrieval methods usually support only keyword retrieval: the user inputs a keyword, and information is retrieved according to the keyword and returned to the user. Therefore, when a user browses a text related to an item, the user usually needs to read the text in full, analyze and understand its content, and find from experience the keywords in the text that describe the item. The user then inputs those keywords to perform information retrieval and obtain information related to the item.
Summary of the invention
The embodiments of the present application propose a method and apparatus for retrieving information.
In a first aspect, an embodiment of the present application provides a method for retrieving information, comprising: obtaining a text to be retrieved; analyzing the text to be retrieved to generate a keyword set; screening an item name word set and an item feature word set from the keyword set; determining search terms from the item name word set and the item feature word set; and retrieving based on the search terms to obtain information on the items associated with the search terms.
In some embodiments, analyzing the text to be retrieved to generate a keyword set comprises: extracting at least one keyword of the text to be retrieved; and performing keyword expansion on the at least one keyword to generate the keyword set, wherein the keyword expansion comprises at least one of the following methods: synonym expansion, near-synonym expansion, associated-word expansion, and knowledge graph expansion.
In some embodiments, extracting at least one keyword of the text to be retrieved comprises: inputting the text to be retrieved into a pre-trained classification model to obtain at least one keyword of the text to be retrieved, wherein the classification model is used to extract the keywords of a text.
In some embodiments, the classification model comprises an embedding layer, an encoding layer, a selection layer, and a classification layer; and inputting the text to be retrieved into the pre-trained classification model to obtain at least one keyword of the text to be retrieved comprises: inputting the text to be retrieved into the embedding layer to obtain a dense vector of the text to be retrieved; inputting the dense vector into the encoding layer to obtain an encoding vector of the text to be retrieved; inputting the encoding vector into the selection layer to obtain a weight vector of the text to be retrieved; and inputting the weight vector into the classification layer to obtain at least one keyword of the text to be retrieved.
In some embodiments, the methods for extracting at least one keyword of the text to be retrieved include the term frequency-inverse document frequency (TF-IDF) method and the TextRank algorithm.
In some embodiments, screening the item name word set and the item feature word set from the keyword set comprises: analyzing the keywords in the keyword set based on part of speech and entity type; and, based on the analysis result, screening the item name word set and the item feature word set from the keyword set.
In some embodiments, each keyword in the keyword set has a weight, and the weight characterizes the degree of association between the corresponding keyword and the text to be retrieved; and determining search terms from the item name word set and the item feature word set comprises: sorting the item name word set and the item feature word set separately by weight to obtain an item name word sequence and an item feature word sequence; and applying threshold truncation to the item name word sequence and the item feature word sequence separately to obtain the search terms.
In a second aspect, an embodiment of the present application provides a device for retrieving information, comprising: an acquiring unit configured to obtain a text to be retrieved; a generation unit configured to analyze the text to be retrieved and generate a keyword set; a screening unit configured to screen an item name word set and an item feature word set from the keyword set; a determination unit configured to determine search terms from the item name word set and the item feature word set; and a retrieval unit configured to retrieve based on the search terms and obtain information on the items associated with the search terms.
In some embodiments, the generation unit comprises: an extraction subunit configured to extract at least one keyword of the text to be retrieved; and an expansion subunit configured to perform keyword expansion on the at least one keyword to generate the keyword set, wherein the keyword expansion comprises at least one of the following methods: synonym expansion, near-synonym expansion, associated-word expansion, and knowledge graph expansion.
In some embodiments, the extraction subunit comprises: a classification module configured to input the text to be retrieved into a pre-trained classification model to obtain at least one keyword of the text to be retrieved, wherein the classification model is used to extract the keywords of a text.
In some embodiments, the classification model comprises an embedding layer, an encoding layer, a selection layer, and a classification layer; and the classification module comprises: an embedding submodule configured to input the text to be retrieved into the embedding layer to obtain a dense vector of the text to be retrieved; an encoding submodule configured to input the dense vector into the encoding layer to obtain an encoding vector of the text to be retrieved; a selection submodule configured to input the encoding vector into the selection layer to obtain a weight vector of the text to be retrieved; and a classification submodule configured to input the weight vector into the classification layer to obtain at least one keyword of the text to be retrieved.
In some embodiments, the methods for extracting at least one keyword of the text to be retrieved include the term frequency-inverse document frequency (TF-IDF) method and the TextRank algorithm.
In some embodiments, the screening unit comprises: an analysis subunit configured to analyze the keywords in the keyword set based on part of speech and entity type; and a screening subunit configured to screen, based on the analysis result, the item name word set and the item feature word set from the keyword set.
In some embodiments, each keyword in the keyword set has a weight, and the weight characterizes the degree of association between the corresponding keyword and the text to be retrieved; and the determination unit comprises: a sorting subunit configured to sort the item name word set and the item feature word set separately by weight to obtain an item name word sequence and an item feature word sequence; and a truncation subunit configured to apply threshold truncation to the item name word sequence and the item feature word sequence separately to obtain the search terms.
In a third aspect, an embodiment of the present application provides a server, comprising: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored; when the computer program is executed by a processor, the method described in any implementation of the first aspect is implemented.
The method and apparatus for retrieving information provided by the embodiments of the present application first analyze the acquired text to be retrieved to generate a keyword set; then screen an item name word set and an item feature word set from the keyword set; then determine search terms from the item name word set and the item feature word set; and finally retrieve based on the search terms to obtain information on the items associated with the search terms, thereby improving the efficiency of information retrieval.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture to which the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for retrieving information according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the method for retrieving information shown in Fig. 2;
Fig. 4 is a structural schematic diagram of one embodiment of the classification model;
Fig. 5 is a flowchart of another embodiment of the method for retrieving information according to the present application;
Fig. 6 is a structural schematic diagram of one embodiment of the device for retrieving information according to the present application;
Fig. 7 is a structural schematic diagram of a computer system suitable for implementing the server of the embodiments of the present application.
Detailed description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the drawings.
It should be noted that, where there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for retrieving information or the device for retrieving information of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103. The network 102 is the medium that provides a communication link between the terminal device 101 and the server 103. The network 102 may include various connection types, such as wired or wireless communication links or fiber optic cables.
A user may use the terminal device 101 to interact with the server 103 through the network 102, to receive or send messages and so on. Various client software, such as search applications and shopping applications, may be installed on the terminal device 101.
The terminal device 101 may be hardware or software. When the terminal device 101 is hardware, it may be any of various electronic devices that have a display screen and support information retrieval, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When the terminal device 101 is software, it may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules or as a single piece of software or software module. No specific limitation is made here.
The server 103 may be a server that provides various services, for example an information retrieval server. The information retrieval server may analyze and otherwise process data such as the acquired text to be retrieved, generate a processing result (for example, information on the items associated with the search terms), and push the processing result to the terminal device 101.
It should be noted that the server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the method for retrieving information provided by the embodiments of the present application is generally executed by the server 103; correspondingly, the device for retrieving information is generally provided in the server 103.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as the implementation requires.
With continued reference to Fig. 2, a process 200 of one embodiment of the method for retrieving information according to the present application is shown. The method for retrieving information comprises the following steps:
Step 201, obtain a text to be retrieved.
In this embodiment, the executing subject of the method for retrieving information (for example, the server 103 shown in Fig. 1) may obtain the text to be retrieved. For example, when a user browses to a text to be retrieved on a terminal device (for example, the terminal device 101 shown in Fig. 1), the user may first copy the text, then open a search application or shopping application installed on the terminal device, paste the text into the search box, and click the search button. The terminal device then sends the text to be retrieved to the executing subject. In general, the text to be retrieved may be a text related to an item, for example a text that includes the name of the item, a text that includes features of the item, or a text that describes the item.
Step 202, analyze the text to be retrieved to generate a keyword set.
In this embodiment, the executing subject may analyze the text to be retrieved to generate a keyword set. Usually, each keyword in the keyword set has a weight, and the weight may characterize the degree of association between the corresponding keyword and the text to be retrieved.
In some optional implementations of this embodiment, the executing subject may generate the keyword set directly by extracting at least one keyword of the text to be retrieved.
Here, the methods for extracting at least one keyword of the text to be retrieved may include, but are not limited to, at least one of the following:
1. The TF-IDF (Term Frequency-Inverse Document Frequency) method.
TF-IDF is a weighting technique commonly used in information retrieval and data mining. Specifically, TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus. In general, if a word or phrase appears frequently in one article and rarely in other articles, it is considered to have good discriminating power between classes and to be suitable for classification.
Here, the steps for extracting at least one keyword of the text to be retrieved using the TF-IDF method are as follows:
First, the text to be retrieved is segmented into words, for example using a full segmentation method.
Then, the weight of each segmented word is calculated using the TF-IDF method.
Finally, the words whose weights rank in the top preset positions (for example, the top 100) are selected, yielding at least one keyword of the text to be retrieved.
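The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: whitespace splitting stands in for the segmentation step, the background corpus `corpus` and the function name `tfidf_keywords` are hypothetical, and the smoothed IDF is only one common variant.

```python
import math
from collections import Counter


def tfidf_keywords(text, corpus, top_n=100):
    """Return the top_n (word, weight) pairs of `text` ranked by TF-IDF.

    `corpus` is a hypothetical list of background documents used for the
    document-frequency statistics; whitespace splitting stands in for a
    real word segmenter.
    """
    words = text.lower().split()
    if not words:
        return []
    tf = Counter(words)
    docs = [set(doc.lower().split()) for doc in corpus]
    n_docs = len(docs)

    weights = {}
    for word, count in tf.items():
        # Number of background documents that contain the word.
        df = sum(1 for doc in docs if word in doc)
        # Smoothed IDF (an assumed variant): rarer words get larger values.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = (count / len(words)) * idf

    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

A word such as "the" that appears in every background document receives the lowest IDF, so it ranks below rarer words that occur equally often in the text.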
2. The TextRank (text ranking) algorithm.
TextRank is based on PageRank: it builds a network from the adjacency relations between words, then uses PageRank to iteratively compute the rank value of each node; sorting by rank value yields the keywords.
Here, the steps for extracting at least one keyword of the text to be retrieved using the TextRank algorithm are as follows:
First, the text to be retrieved is split into complete sentences.
Next, each sentence is segmented into words and tagged with parts of speech, stop words are filtered out, and only words of specified parts of speech, such as nouns, verbs, and adjectives, are retained as candidate keywords.
Then, a candidate keyword graph G = (V, E) is constructed, where V is the node set consisting of the candidate keywords. Edges between any two nodes are constructed using the co-occurrence relation: an edge exists between two nodes only when their corresponding words co-occur within a window of length K, where K is the window size, that is, at most K words co-occur.
Then, the weight of each node is propagated iteratively according to the TextRank formula until convergence.
Then, the node weights are sorted in descending order to obtain the most important preset number (for example, 100) of words.
Finally, the most important preset number of words are marked in the text to be retrieved; if they form adjacent phrases, they are combined into multi-word keywords.
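The graph construction and iteration steps above can be sketched as follows. This is an illustrative sketch only: it assumes the input has already been segmented and filtered to candidate words, uses unweighted edges and a damping factor of 0.85 (both assumptions, not values given in the text), and omits the final step of merging adjacent words into multi-word keywords. The function name `textrank_keywords` is hypothetical.

```python
from collections import defaultdict


def textrank_keywords(words, window=2, damping=0.85, top_n=100,
                      max_iter=100, tol=1e-6):
    """Rank candidate keywords with a PageRank-style iteration.

    `words` is the already segmented and filtered candidate sequence;
    two words are linked when they co-occur within a window of `window`
    consecutive words (so `window` must be at least 2 to create edges).
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    if not neighbors:
        return []

    scores = {word: 1.0 for word in neighbors}
    for _ in range(max_iter):
        updated = {}
        for word in scores:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        converged = max(abs(updated[w] - scores[w]) for w in scores) < tol
        scores = updated
        if converged:
            break

    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

In a sequence where one word is adjacent to every other word, that word becomes the hub of the graph and receives the highest score.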
3. A classification model.
The classification model may be used to extract the keywords of a text; it characterizes the correspondence between a text and its keywords. In general, the classification model may be obtained by supervised training of an existing machine learning model (for example, one of various artificial neural networks) using various machine learning methods and training samples. A training sample may include a sample text and a sample text class label. The sample text class label may be used to mark the higher-weight keywords of the sample text.
Here, the executing subject may input the text to be retrieved into the pre-trained classification model to obtain at least one keyword of the text to be retrieved. The at least one keyword extracted by the classification model may include the higher-weight keywords in the text to be retrieved.
In some optional implementations of this embodiment, the executing subject may first extract at least one keyword of the text to be retrieved, and then perform keyword expansion on the at least one keyword to generate the keyword set. Keyword expansion may include, but is not limited to, at least one of the following methods: synonym expansion, near-synonym expansion, associated-word expansion, and knowledge graph expansion. When a keyword is expanded, the weight of the expanded keyword may be the same as or different from the weight of the original keyword. For example, when synonym expansion is performed on a keyword, the weight of the expanded keyword may be the same as the weight of the keyword. When near-synonym expansion is performed, the weight of the expanded keyword may equal the product of the keyword's weight and the similarity between the expanded keyword and the keyword. When associated-word expansion is performed, the weight of the expanded keyword may equal the product of the keyword's weight and the degree of association between the expanded keyword and the keyword. When knowledge graph expansion is performed, the weight of the expanded keyword may equal the product of the keyword's weight and the relation weight between the expanded keyword and the keyword.
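The weight rules for the four expansion methods can be sketched as follows. The data structures, the function name `expand_keywords`, and the rule of keeping the larger weight when a word is reached more than once are assumptions made for illustration; the text does not specify them.

```python
def expand_keywords(keywords, expansions):
    """Expand weighted keywords and propagate weights.

    keywords:   dict mapping keyword -> weight.
    expansions: dict mapping keyword -> list of (expanded_word, kind, factor),
                where kind is "synonym", "near_synonym", "associated" or
                "knowledge_graph" and factor is the similarity, degree of
                association or relation weight (ignored for synonyms).
    """
    result = dict(keywords)
    for keyword, weight in keywords.items():
        for word, kind, factor in expansions.get(keyword, []):
            # Synonyms inherit the weight; other kinds scale it by the factor.
            new_weight = weight if kind == "synonym" else weight * factor
            # Keep the larger weight if a word is reached more than once
            # (an assumed rule, not stated in the text).
            result[word] = max(result.get(word, 0.0), new_weight)
    return result
```

For example, a keyword with weight 0.8 gives a synonym the weight 0.8 and a near-synonym with similarity 0.5 the weight 0.4.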
Step 203, screen an item name word set and an item feature word set from the keyword set.
In this embodiment, the executing subject may screen an item name word set and an item feature word set from the keyword set. The item name word set may include item name words. An item name word is usually a noun, and its entity type is usually a person name, place name, object name, or the like. The item feature word set may include item feature words. An item feature word is usually an adjective, that is, a word that describes the item.
In some optional implementations of this embodiment, the executing subject may analyze the keywords in the keyword set based on part of speech and entity type, and then, based on the analysis result, screen the item name word set and the item feature word set from the keyword set. For example, the executing subject may first select from the keyword set the keywords whose part of speech is noun, and then select from those the keywords whose entity type is person name, place name, or object name, to generate the item name word set. Similarly, the executing subject may first select from the keyword set the keywords whose part of speech is adjective, and then select from those the words that describe the item, to generate the item feature word set.
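The screening by part of speech and entity type can be sketched as follows. The tag names ("noun", "adjective", "person", "place", "object") and the function name `screen_keywords` are hypothetical labels from an assumed upstream tagger; as a simplification, every adjective is treated as an item feature word, whereas the text additionally selects only the adjectives that describe the item.

```python
def screen_keywords(annotated):
    """Split annotated keywords into item-name and item-feature sets.

    annotated: iterable of (word, pos, entity_type) tuples produced by a
    hypothetical part-of-speech and entity tagger.
    """
    name_entity_types = {"person", "place", "object"}
    item_names, item_features = set(), set()
    for word, pos, entity_type in annotated:
        if pos == "noun" and entity_type in name_entity_types:
            item_names.add(word)
        elif pos == "adjective":
            # Simplification: all adjectives are kept as feature words.
            item_features.add(word)
    return item_names, item_features
```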
Step 204, determine search terms from the item name word set and the item feature word set.
In this embodiment, the executing subject may determine search terms from the item name word set and the item feature word set. In general, the executing subject may determine the search terms from the two sets according to weight. For example, the executing subject may sort the item name word set and the item feature word set separately by weight to obtain an item name word sequence and an item feature word sequence, and then apply threshold truncation to each sequence to obtain the search terms.
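The sort-then-truncate step can be sketched as follows. The text does not say whether the threshold applies to the weight or to the rank position; this sketch assumes a weight threshold, and the function name `select_search_terms` and the default value 0.5 are hypothetical.

```python
def select_search_terms(item_names, item_features, threshold=0.5):
    """Sort each weighted set by descending weight and cut at `threshold`.

    item_names, item_features: dicts mapping word -> weight.
    Returns the retained item name words followed by the retained
    item feature words.
    """
    def truncate(weighted):
        ranked = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))
        # Threshold truncation (assumed to be on weight): drop low-weight words.
        return [word for word, weight in ranked if weight >= threshold]

    return truncate(item_names) + truncate(item_features)
```

With item names `{"pan": 0.9, "pot": 0.3}` and item features `{"nonstick": 0.7, "red": 0.6, "big": 0.1}`, the call returns `["pan", "nonstick", "red"]`.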
Step 205, retrieve based on the search terms to obtain information on the items associated with the search terms.
In this embodiment, the executing subject may retrieve based on the search terms to obtain information on the items associated with the search terms. In general, the executing subject may use the search terms to search an item information base. If the item information base contains information on items that include a search term, that information may be pushed to the user as the information on the items associated with the search term. If the item information base contains no information on items that include the search term, information on items that include a word associated with the search term may be pushed to the user instead. For example, if the search term is "pan" and the item information base contains no information on pans, the information on pots in the item information base may be pushed to the user.
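The lookup with fallback to associated words can be sketched as follows. The in-memory dictionaries `item_index` and `associated_words` and the function name `retrieve` are hypothetical stand-ins for the item information base and the word-association source, which the text does not specify.

```python
def retrieve(term, item_index, associated_words):
    """Look up item information for `term`, falling back to associated words.

    item_index:       dict mapping word -> list of item information records.
    associated_words: dict mapping word -> list of associated words.
    """
    hits = item_index.get(term)
    if hits:
        return list(hits)
    # No direct hit: collect items indexed under words associated with the term.
    results = []
    for word in associated_words.get(term, []):
        results.extend(item_index.get(word, []))
    return results
```

With `item_index = {"pot": ["cast-iron pot"]}` and `associated_words = {"pan": ["pot"]}`, `retrieve("pan", item_index, associated_words)` returns `["cast-iron pot"]`, mirroring the "pan" and "pot" example above.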
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for retrieving information shown in Fig. 2. In the application scenario shown in Fig. 3, when a user browses on a mobile phone to a text 301 that introduces an item, the user may copy the text 301, open a shopping application installed on the phone, paste the text 301 into the search box, and click the search button. The phone then sends the text 301 to the server. After receiving the text 301, the server may first use the classification model 302 to extract keyword 303 and keyword 304 from the text 301. Next, keyword expansion is performed on keyword 303 to obtain expanded keyword 305, and on keyword 304 to obtain expanded keyword 306. Then, keyword 303, keyword 304, expanded keyword 305, and expanded keyword 306 are combined into a keyword set 307. Then, an item name word set 308 and an item feature word set 309 are screened from the keyword set 307. Then, search terms 310 are determined from the item name word set 308 and the item feature word set 309. Finally, retrieval is performed based on the search terms 310 to obtain the information 311 on the items associated with the search terms, and the information 311 is pushed to the user.
The method for retrieving information provided by the embodiments of the present application first analyzes the acquired text to be retrieved to generate a keyword set; then screens an item name word set and an item feature word set from the keyword set; then determines search terms from the item name word set and the item feature word set; and finally retrieves based on the search terms to obtain information on the items associated with the search terms, thereby improving the efficiency of information retrieval.
With further reference to Fig. 4, a structural schematic diagram of one embodiment of the classification model is shown. As shown in Fig. 4, the classification model may include an embedding layer, an encoding layer, a selection layer, and a classification layer. The text to be retrieved is input at the input side of the classification model, is processed in turn by the embedding layer, encoding layer, selection layer, and classification layer, and at least one keyword of the text to be retrieved is output at the output side.
With further reference to Fig. 5, a process 500 of another embodiment of the method for retrieving information according to the present application is shown. The method for retrieving information comprises the following steps:
Step 501, obtain a text to be retrieved.
In this embodiment, the specific operation of step 501 has been described in detail for step 201 in the embodiment shown in Fig. 2 and is not repeated here.
Step 502, input the text to be retrieved into the embedding layer to obtain a dense vector of the text to be retrieved.
In this embodiment, the executing subject of the method for retrieving information (for example, the server 103 shown in Fig. 1) may input the text to be retrieved into the embedding layer, which processes it and outputs the dense vector of the text to be retrieved. The values of a dense vector are simply an ordinary array of doubles.
Step 503, input the dense vector into the encoding layer to obtain an encoding vector of the text to be retrieved.
In this embodiment, the executing subject may input the dense vector into the encoding layer, which processes it and outputs the encoding vector of the text to be retrieved. Encoding is the process of converting information from one form or format into another.
Step 504, input the encoding vector into the selection layer to obtain a weight vector of the text to be retrieved.
In this embodiment, the executing subject may input the encoding vector into the selection layer, which processes it and outputs the weight vector of the text to be retrieved. The weights in the weight vector may characterize the degree of association between the corresponding keywords and the text to be retrieved.
Step 505, input the weight vector into the classification layer to obtain at least one keyword of the text to be retrieved.
In this embodiment, the executing subject may input the weight vector into the classification layer, which processes it and outputs at least one keyword of the text to be retrieved. The at least one keyword extracted by the classification model may include the higher-weight keywords in the text to be retrieved.
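The data flow of steps 502 to 505 can be sketched as a toy forward pass. Everything inside each layer is an assumption made for illustration: the text names the four layers but does not specify their internals, so the random embeddings, neighbor averaging, softmax scoring, and above-uniform cutoff below are stand-ins, and the parameters are untrained. The function name `extract_keywords` is hypothetical.

```python
import math
import random


def extract_keywords(tokens, dim=8, seed=0):
    """Toy forward pass: embedding -> encoding -> selection -> classification.

    All parameters are random and untrained; the layer internals are
    stand-ins chosen for illustration only.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    table = {}
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    # Embedding layer: look up a dense vector for each token.
    dense = [table[tok] for tok in tokens]

    # Encoding layer (stand-in): average each vector with its neighbors.
    encoded = []
    for i in range(len(dense)):
        ctx = dense[max(0, i - 1):i + 2]
        encoded.append([sum(col) / len(ctx) for col in zip(*ctx)])

    # Selection layer (stand-in): softmax over dot products with a query vector.
    query = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in encoded]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Classification layer (stand-in): keep tokens weighted above uniform.
    uniform = 1.0 / len(tokens)
    keywords = [tok for tok, w in zip(tokens, weights) if w > uniform]
    return keywords, weights
```

The returned weights sum to 1 and align position by position with the input tokens; a trained model would replace the random table and query vector with learned parameters.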
Step 506, keyword expansion is carried out at least one keyword, generates keyword set.
In the present embodiment, above-mentioned executing subject can carry out keyword expansion at least one keyword, generate crucial Set of words.Wherein, keyword expansion can include but is not limited to following at least one method: synonym extends, near synonym extend, Conjunctive word extension and knowledge mapping extension etc..When being extended to keyword, the weight of the keyword expanded can be with The weight of the keyword is identical, can also be different.For example, when carrying out synonym extension to the keyword, the key that expands The weight of word can be identical as the weight of the keyword.When carrying out near synonym extension to the keyword, the keyword that expands Weight can be equal to the weight of the keyword and the product of similarity (similarity of the keyword that expands and the keyword). When being associated word extension to the keyword, the weight of the keyword expanded can be equal to the weight of the keyword be associated with The product of degree (degree of association of the keyword and the keyword that expand).When carrying out knowledge mapping extension to the keyword, expand The weight of the keyword of exhibition can be equal to the weight and relationship weight (keyword expanded and the keyword of the keyword Relationship weight) product.
Step 507, screen an item title word set and an item feature word set from the keyword set.
Step 508, determine a search term from the item title word set and the item feature word set.
Step 509, perform retrieval based on the search term to obtain information on the item associated with the search term.
In this embodiment, the specific operations of steps 507-509 have been described in detail in steps 203-205 of the embodiment shown in Fig. 2 and are not repeated here.
As can be seen from Fig. 5, compared with the embodiment corresponding to Fig. 2, the flow 500 of the method for retrieving information in this embodiment highlights the step of generating the keyword set. The scheme described in this embodiment thus extracts at least one keyword of the text to be retrieved using a classification model comprising an embedding layer, an encoding layer, a selection layer, and a classification layer, which improves the efficiency of extracting keywords from the text to be retrieved. In addition, performing keyword expansion on the at least one keyword to generate the keyword set introduces more keyword-related data, so that more comprehensive item information can be retrieved.
With further reference to Fig. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for retrieving information. The apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied in various electronic devices.
As shown in Fig. 6, the apparatus 600 for retrieving information of this embodiment may include: an acquisition unit 601, a generation unit 602, a screening unit 603, a determination unit 604, and a retrieval unit 605. The acquisition unit 601 is configured to acquire text to be retrieved; the generation unit 602 is configured to analyze the text to be retrieved to generate a keyword set; the screening unit 603 is configured to screen an item title word set and an item feature word set from the keyword set; the determination unit 604 is configured to determine a search term from the item title word set and the item feature word set; and the retrieval unit 605 is configured to perform retrieval based on the search term to obtain information on the item associated with the search term.
In this embodiment, for the specific processing of the acquisition unit 601, the generation unit 602, the screening unit 603, the determination unit 604, and the retrieval unit 605 of the apparatus 600 for retrieving information, and the technical effects they bring, reference may be made to the descriptions of step 201, step 202, step 203, step 204, and step 205 in the embodiment corresponding to Fig. 2, respectively, which are not repeated here.
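The way the five units chain together can be sketched as a small pipeline. This is a structural illustration only: the class name `RetrievalApparatus`, the use of injected callables, and the toy behaviors in the usage example are assumptions, and the text does not prescribe any particular software structure for the units.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]  # keyword -> weight


@dataclass
class RetrievalApparatus:
    acquire: Callable[[], str]                             # acquisition unit 601
    generate: Callable[[str], Weights]                     # generation unit 602
    screen: Callable[[Weights], Tuple[Weights, Weights]]   # screening unit 603
    determine: Callable[[Weights, Weights], List[str]]     # determination unit 604
    retrieve: Callable[[List[str]], List[str]]             # retrieval unit 605

    def run(self) -> List[str]:
        text = self.acquire()
        keyword_set = self.generate(text)
        title_words, feature_words = self.screen(keyword_set)
        search_terms = self.determine(title_words, feature_words)
        return self.retrieve(search_terms)


# Toy stand-ins for each unit, purely for illustration.
apparatus = RetrievalApparatus(
    acquire=lambda: "buy red shoes",
    generate=lambda text: {word: 1.0 for word in text.split()},
    screen=lambda ks: ({k: v for k, v in ks.items() if k == "shoes"},
                       {k: v for k, v in ks.items() if k == "red"}),
    determine=lambda titles, features: list(titles) + list(features),
    retrieve=lambda terms: ["item matching " + " ".join(terms)],
)
results = apparatus.run()
```

Passing each unit in as a callable keeps the sketch neutral about whether a unit is realized in software or hardware, which the description leaves open.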
In some optional implementations of this embodiment, the generation unit 602 includes: an extraction subunit (not shown in the figure), configured to extract at least one keyword of the text to be retrieved; and an expansion subunit (not shown in the figure), configured to perform keyword expansion on the at least one keyword to generate the keyword set, wherein the keyword expansion includes at least one of the following methods: synonym expansion, near-synonym expansion, related-word expansion, and knowledge-graph expansion.
In some optional implementations of this embodiment, the extraction subunit includes: a classification module (not shown in the figure), configured to input the text to be retrieved into a pre-trained classification model to obtain at least one keyword of the text to be retrieved, wherein the classification model is used to extract keywords of text.
In some optional implementations of this embodiment, the classification model includes an embedding layer, an encoding layer, a selection layer, and a classification layer; and the classification module includes: an embedding submodule (not shown in the figure), configured to input the text to be retrieved into the embedding layer to obtain a dense vector of the text to be retrieved; an encoding submodule (not shown in the figure), configured to input the dense vector into the encoding layer to obtain an encoding vector of the text to be retrieved; a selection submodule (not shown in the figure), configured to input the encoding vector into the selection layer to obtain a weight vector of the text to be retrieved; and a classification submodule (not shown in the figure), configured to input the weight vector into the classification layer to obtain at least one keyword of the text to be retrieved.
In some optional implementations of this embodiment, methods for extracting at least one keyword of the text to be retrieved include the term frequency-inverse document frequency (TF-IDF) method and the TextRank algorithm.
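As an illustration of the first of these methods, the following is a minimal TF-IDF sketch. It assumes the text has already been segmented into tokens and uses the plain, unsmoothed form idf = log(N / df); the function name `tf_idf` and the toy corpus are hypothetical, and practical systems often use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(documents, index):
    """Score each token of documents[index] by term frequency times
    inverse document frequency.

    documents: list of token lists (one list per document).
    Returns a dict mapping token -> TF-IDF score.
    """
    doc = documents[index]
    counts = Counter(doc)
    n_docs = len(documents)
    scores = {}
    for token, count in counts.items():
        # Document frequency is at least 1 because the token occurs in doc.
        df = sum(1 for d in documents if token in d)
        scores[token] = (count / len(doc)) * math.log(n_docs / df)
    return scores


corpus = [
    ["red", "shoes", "red"],
    ["blue", "shoes"],
    ["green", "hat"],
]
scores = tf_idf(corpus, 0)
```

In this toy corpus, "red" scores higher than "shoes" for the first document because it occurs more often there and appears in no other document.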
In some optional implementations of this embodiment, the screening unit 603 includes: an analysis subunit (not shown in the figure), configured to analyze the keywords in the keyword set based on part of speech and entity type; and a screening subunit (not shown in the figure), configured to screen out the item title word set and the item feature word set from the keyword set based on the analysis result.
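Screening by part of speech and entity type can be sketched as follows. The tag values ("noun", "adjective", "PRODUCT", "ATTRIBUTE"), the `Annotated` record, and the specific rules are assumptions made for illustration; the text only states that both kinds of information are used, and the annotations themselves would come from a tagger that is outside this sketch.

```python
from typing import Dict, NamedTuple, Tuple


class Annotated(NamedTuple):
    weight: float
    pos: str          # part-of-speech tag (hypothetical tag set)
    entity_type: str  # entity type (hypothetical tag set)


def screen_keywords(
    keyword_set: Dict[str, Annotated],
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split annotated keywords into item title words and item feature words."""
    title_words: Dict[str, float] = {}
    feature_words: Dict[str, float] = {}
    for word, info in keyword_set.items():
        if info.pos == "noun" and info.entity_type == "PRODUCT":
            title_words[word] = info.weight      # assumed rule for item titles
        elif info.pos == "adjective" or info.entity_type == "ATTRIBUTE":
            feature_words[word] = info.weight    # assumed rule for item features
        # Keywords matching neither rule are dropped.
    return title_words, feature_words


title_words, feature_words = screen_keywords({
    "shoes": Annotated(0.9, "noun", "PRODUCT"),
    "red": Annotated(0.6, "adjective", "ATTRIBUTE"),
    "buy": Annotated(0.3, "verb", "NONE"),
})
```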
In some optional implementations of this embodiment, each keyword in the keyword set has a weight, and the weight characterizes the degree of association between the corresponding keyword and the text to be retrieved; and the determination unit 604 includes: a sorting subunit (not shown in the figure), configured to sort the item title word set and the item feature word set separately based on weight to obtain an item title word sequence and an item feature word sequence; and a truncation subunit (not shown in the figure), configured to perform threshold truncation on the item title word sequence and the item feature word sequence separately to obtain the search term.
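Sorting followed by threshold truncation can be sketched as below. The sketch assumes that truncation keeps words whose weight is at least a threshold; truncation by a maximum number of words would be an equally plausible reading. The function name and the threshold values are hypothetical.

```python
def determine_search_terms(title_words, feature_words,
                           title_threshold=0.5, feature_threshold=0.3):
    """Sort each word set by descending weight, then truncate by threshold.

    title_words, feature_words: dicts mapping word -> weight.
    Returns the retained title words followed by the retained feature words.
    """
    def sort_and_truncate(word_set, threshold):
        ranked = sorted(word_set.items(), key=lambda pair: pair[1], reverse=True)
        return [word for word, weight in ranked if weight >= threshold]

    return (sort_and_truncate(title_words, title_threshold)
            + sort_and_truncate(feature_words, feature_threshold))


terms = determine_search_terms(
    {"shoes": 0.9, "socks": 0.2},
    {"cheap": 0.1, "leather": 0.4, "red": 0.6},
)
```

Sorting and truncating the two sets separately lets title words and feature words use different thresholds, which is one reason to keep them in separate sequences.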
Referring now to Fig. 7, a schematic structural diagram is shown of a computer system 700 of a server (for example, the server 103 shown in Fig. 1) suitable for implementing embodiments of the present application. The server shown in Fig. 7 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the system 700. The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage section 708 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above functions defined in the method of the present application are performed. It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable medium, or any combination of the two. A computer-readable medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of a computer-readable medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than the computer-readable medium described above, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor, which may be described, for example, as: a processor including an acquisition unit, a generation unit, a screening unit, a determination unit, and a retrieval unit. The names of these units do not, in some cases, limit the units themselves; for example, the acquisition unit may also be described as "a unit that acquires text to be retrieved".
In another aspect, the present application also provides a computer-readable medium, which may be included in the server described in the above embodiments, or may exist separately without being assembled into the server. The computer-readable medium carries one or more programs which, when executed by the server, cause the server to: acquire text to be retrieved; analyze the text to be retrieved to generate a keyword set; screen an item title word set and an item feature word set from the keyword set; determine a search term from the item title word set and the item feature word set; and perform retrieval based on the search term to obtain information on the item associated with the search term.
The above description covers only the preferred embodiments of the present application and explains the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (16)

1. A method for retrieving information, comprising:
acquiring text to be retrieved;
analyzing the text to be retrieved to generate a keyword set;
screening an item title word set and an item feature word set from the keyword set;
determining a search term from the item title word set and the item feature word set; and
performing retrieval based on the search term to obtain information on the item associated with the search term.
2. The method according to claim 1, wherein analyzing the text to be retrieved to generate a keyword set comprises:
extracting at least one keyword of the text to be retrieved; and
performing keyword expansion on the at least one keyword to generate the keyword set, wherein the keyword expansion includes at least one of the following methods: synonym expansion, near-synonym expansion, related-word expansion, and knowledge-graph expansion.
3. The method according to claim 2, wherein extracting at least one keyword of the text to be retrieved comprises:
inputting the text to be retrieved into a pre-trained classification model to obtain at least one keyword of the text to be retrieved, wherein the classification model is used to extract keywords of text.
4. The method according to claim 3, wherein the classification model includes an embedding layer, an encoding layer, a selection layer, and a classification layer; and
inputting the text to be retrieved into the pre-trained classification model to obtain at least one keyword of the text to be retrieved comprises:
inputting the text to be retrieved into the embedding layer to obtain a dense vector of the text to be retrieved;
inputting the dense vector into the encoding layer to obtain an encoding vector of the text to be retrieved;
inputting the encoding vector into the selection layer to obtain a weight vector of the text to be retrieved; and
inputting the weight vector into the classification layer to obtain at least one keyword of the text to be retrieved.
5. The method according to claim 1, wherein methods for extracting at least one keyword of the text to be retrieved include the term frequency-inverse document frequency (TF-IDF) method and the TextRank algorithm.
6. The method according to any one of claims 1-5, wherein screening an item title word set and an item feature word set from the keyword set comprises:
analyzing the keywords in the keyword set based on part of speech and entity type; and
screening out the item title word set and the item feature word set from the keyword set based on the analysis result.
7. The method according to any one of claims 1-5, wherein each keyword in the keyword set has a weight, and the weight characterizes the degree of association between the corresponding keyword and the text to be retrieved; and
determining a search term from the item title word set and the item feature word set comprises:
sorting the item title word set and the item feature word set separately based on weight to obtain an item title word sequence and an item feature word sequence; and
performing threshold truncation on the item title word sequence and the item feature word sequence separately to obtain the search term.
8. An apparatus for retrieving information, comprising:
an acquisition unit, configured to acquire text to be retrieved;
a generation unit, configured to analyze the text to be retrieved to generate a keyword set;
a screening unit, configured to screen an item title word set and an item feature word set from the keyword set;
a determination unit, configured to determine a search term from the item title word set and the item feature word set; and
a retrieval unit, configured to perform retrieval based on the search term to obtain information on the item associated with the search term.
9. The apparatus according to claim 8, wherein the generation unit comprises:
an extraction subunit, configured to extract at least one keyword of the text to be retrieved; and
an expansion subunit, configured to perform keyword expansion on the at least one keyword to generate the keyword set, wherein the keyword expansion includes at least one of the following methods: synonym expansion, near-synonym expansion, related-word expansion, and knowledge-graph expansion.
10. The apparatus according to claim 9, wherein the extraction subunit comprises:
a classification module, configured to input the text to be retrieved into a pre-trained classification model to obtain at least one keyword of the text to be retrieved, wherein the classification model is used to extract keywords of text.
11. The apparatus according to claim 10, wherein the classification model includes an embedding layer, an encoding layer, a selection layer, and a classification layer; and
the classification module comprises:
an embedding submodule, configured to input the text to be retrieved into the embedding layer to obtain a dense vector of the text to be retrieved;
an encoding submodule, configured to input the dense vector into the encoding layer to obtain an encoding vector of the text to be retrieved;
a selection submodule, configured to input the encoding vector into the selection layer to obtain a weight vector of the text to be retrieved; and
a classification submodule, configured to input the weight vector into the classification layer to obtain at least one keyword of the text to be retrieved.
12. The apparatus according to claim 8, wherein methods for extracting at least one keyword of the text to be retrieved include the term frequency-inverse document frequency (TF-IDF) method and the TextRank algorithm.
13. The apparatus according to any one of claims 8-12, wherein the screening unit comprises:
an analysis subunit, configured to analyze the keywords in the keyword set based on part of speech and entity type; and
a screening subunit, configured to screen out the item title word set and the item feature word set from the keyword set based on the analysis result.
14. The apparatus according to any one of claims 8-12, wherein each keyword in the keyword set has a weight, and the weight characterizes the degree of association between the corresponding keyword and the text to be retrieved; and
the determination unit comprises:
a sorting subunit, configured to sort the item title word set and the item feature word set separately based on weight to obtain an item title word sequence and an item feature word sequence; and
a truncation subunit, configured to perform threshold truncation on the item title word sequence and the item feature word sequence separately to obtain the search term.
15. A server, comprising:
one or more processors; and
a storage device storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
16. A computer-readable medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-7.
CN201910217161.9A 2019-03-21 2019-03-21 Method and apparatus for retrieving information Active CN109902152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910217161.9A CN109902152B (en) 2019-03-21 2019-03-21 Method and apparatus for retrieving information

Publications (2)

Publication Number Publication Date
CN109902152A true CN109902152A (en) 2019-06-18
CN109902152B CN109902152B (en) 2021-07-06

Family

ID=66952860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910217161.9A Active CN109902152B (en) 2019-03-21 2019-03-21 Method and apparatus for retrieving information

Country Status (1)

Country Link
CN (1) CN109902152B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101860801A (en) * 2010-05-28 2010-10-13 杭州王道电子商务有限公司 Information aggregation and push method and system thereof
WO2013099328A1 (en) * 2011-12-28 2013-07-04 楽天株式会社 Search device, search method, search program, and recording medium
CN103473317A (en) * 2013-09-12 2013-12-25 百度在线网络技术(北京)有限公司 Method and equipment for extracting keywords
CN103744835A (en) * 2014-01-02 2014-04-23 上海大学 Text keyword extracting method based on subject model
CN104933068A (en) * 2014-03-19 2015-09-23 阿里巴巴集团控股有限公司 Method and device for information searching
CN105243143A (en) * 2015-10-14 2016-01-13 湖南大学 Recommendation method and system based on instant voice content detection
CN106156204A (en) * 2015-04-23 2016-11-23 深圳市腾讯计算机系统有限公司 The extracting method of text label and device
WO2017050149A1 (en) * 2015-09-22 2017-03-30 阿里巴巴集团控股有限公司 Information search method and device
CN106651393A (en) * 2016-12-19 2017-05-10 广东技术师范学院 Drug appearance identification-based drug guiding method and system
CN107203507A (en) * 2016-03-17 2017-09-26 阿里巴巴集团控股有限公司 Feature vocabulary extracting method and device
CN108984745A (en) * 2018-07-16 2018-12-11 福州大学 A kind of neural network file classification method merging more knowledge mappings

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MICHAEL GUBANOV等: ""ReadFast: Optimizing structural search relevance for big biomedical text"", 《2013 IEEE 14TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE & INTEGRATION (IRI)》 *
钟敏娟等: ""基于分类和关键词组抽取的信息检索算法"", 《系统仿真学报》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502613A (en) * 2019-08-12 2019-11-26 腾讯科技(深圳)有限公司 A kind of model training method, intelligent search method, device and storage medium
CN110502613B (en) * 2019-08-12 2022-03-08 腾讯科技(深圳)有限公司 Model training method, intelligent retrieval method, device and storage medium
CN112825078A (en) * 2019-11-21 2021-05-21 北京沃东天骏信息技术有限公司 Information processing method and device
CN113379499A (en) * 2021-06-18 2021-09-10 北京沃东天骏信息技术有限公司 Article screening method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN109902152B (en) 2021-07-06

Similar Documents

Publication Publication Date Title
CN107491534B (en) Information processing method and device
US11023505B2 (en) Method and apparatus for pushing information
US20210165955A1 (en) Methods and systems for modeling complex taxonomies with natural language understanding
CN108153901A (en) The information-pushing method and device of knowledge based collection of illustrative plates
CN106960030B (en) Information pushing method and device based on artificial intelligence
CN108090162A (en) Information-pushing method and device based on artificial intelligence
CN106919711B (en) Method and device for labeling information based on artificial intelligence
CN109190124B (en) Method and apparatus for participle
US11651015B2 (en) Method and apparatus for presenting information
CN114385780B (en) Program interface information recommendation method and device, electronic equipment and readable medium
CN109902152A (en) Method and apparatus for retrieving information
CN110347428A (en) A kind of detection method and device of code similarity
US20220121668A1 (en) Method for recommending document, electronic device and storage medium
CN110362815A (en) Text vector generation method and device
CN109858045A (en) Machine translation method and device
CN108073708A (en) Information output method and device
CN109190123A (en) Method and apparatus for output information
JP7172187B2 (en) INFORMATION DISPLAY METHOD, INFORMATION DISPLAY PROGRAM AND INFORMATION DISPLAY DEVICE
CN110245357A (en) Principal recognition methods and device
KR20210084641A (en) Method and apparatus for transmitting information
US20140372076A1 (en) Providing known distribution patterns associated with specific measures and metrics
CN111737607A (en) Data processing method, data processing device, electronic equipment and storage medium
JP5700007B2 (en) Information processing apparatus, method, and program
CN110852078A (en) Method and device for generating title
CN109241296A (en) Method and apparatus for generating information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant