CN105912631B - Search processing method and device - Google Patents

Info

Publication number
CN105912631B
Authority
CN
China
Prior art keywords
theme
information
subject area
purport
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610214481.5A
Other languages
Chinese (zh)
Other versions
CN105912631A (en)
Inventor
吕雅娟
丁长林
肖欣延
朱少杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201610214481.5A
Publication of CN105912631A
Application granted
Publication of CN105912631B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a search processing method and device. The method includes: performing theme cutting on webpage information to determine each subject area; determining the purport information and topic abstraction of each theme according to the content of each subject area; and establishing an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that retrieval can be performed according to the index. By setting the index granularity to the theme, the invention improves the relevance between search results and user needs and improves user satisfaction.

Description

Search processing method and device
Technical field
The present invention relates to the field of Internet technology, and more particularly to a search processing method and device.
Background art
In the related art, a search engine retrieves and displays results on the search results page with the webpage as the minimum granularity. That is, the search engine calculates the relevance between each webpage and the query entered by the user, retrieves the webpages with higher relevance, ranks them by relevance, and shows them on the search results page for the user.
However, with the webpage as the minimum granularity, much of the information in the webpage itself is lost when the relevance between the query and the webpage is calculated, so the retrieved results may not be highly relevant to the user's query. For example, when the search engine retrieves for the query "treatment method of obesity", factors such as title weight may cause it to return a webpage titled "treatment method of obesity" that devotes most of its length to the cause and prevention of obesity and gives no explanation of the treatment method itself. As a result, the webpage information provided to the user cannot meet the user's needs well.
Summary of the invention
The present invention aims to solve at least one of the above technical problems, at least to some extent.
To this end, the first object of the present invention is to propose a search processing method. The method sets the index granularity to the theme, determines the relevant purport information and topic abstraction according to the content of each subject area, and establishes an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that the results retrieved according to the index better match the user's needs and user satisfaction is improved.
The second object of the present invention is to propose a retrieval process device.
To achieve the above objects, the search processing method of the embodiment of the first aspect of the present invention includes: performing theme cutting on webpage information to determine each subject area; determining the purport information and topic abstraction of each theme according to the content of each subject area; and establishing an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that retrieval can be performed according to the index.
The search processing method of the embodiment of the present invention sets the index granularity to the theme, determines the relevant purport information and topic abstraction according to the content of each subject area, and establishes an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that the results retrieved according to the index better match the user's needs and user satisfaction is improved.
In addition, in one embodiment of the present invention, performing theme cutting on the webpage information to determine each subject area includes: performing theme cutting on the webpage information using the cutting features of a pre-trained segmentation model corresponding to the theme type, and determining each subject area.
In one embodiment of the present invention, the theme type includes at least one of: an explicit theme type that contains segmentation marks; a semi-explicit theme type that contains subtitles; an implicit theme type that contains neither subtitles nor segmentation marks; and an unstructured single-theme type.
In one embodiment of the present invention, before theme cutting is performed on the webpage information using the cutting features of the pre-trained segmentation model corresponding to the theme type, the method further includes: converting webpage information of the explicit theme type, according to the actual distribution, into corpora of the other theme types for training the segmentation model.
In one embodiment of the present invention, the method further includes: during training of the segmentation model, randomly emptying cutting features in the training corpus.
In one embodiment of the present invention, determining the purport information of each theme according to the content of each subject area includes: extracting the subtitle of each subject area, or extracting the keywords of the subtitle of each subject area.
In one embodiment of the present invention, determining the purport information of each theme according to the content of each subject area includes: extracting the feature words of each subject area and sorting them by priority; and analyzing the feature words according to a preset knowledge base to obtain the purport information.
In one embodiment of the present invention, determining the topic abstraction of each theme according to the content of each subject area includes: fitting the content of each subject area using the extraction features of a pre-trained analysis model to obtain the topic abstraction of each theme.
In one embodiment of the present invention, the method further includes: receiving input retrieval information; and obtaining, according to the index, the topic abstraction and purport information relevant to the retrieval information, and displaying them on the search results page.
In one embodiment of the present invention, the method further includes: when the purport information on the search results page is triggered, jumping to the information interface corresponding to the purport information.
To achieve the above objects, the retrieval process device of the embodiment of the second aspect of the present invention includes: a first determining module, configured to perform theme cutting on webpage information and determine each subject area; a second determining module, configured to determine the purport information and topic abstraction of each theme according to the content of each subject area; and an establishing module, configured to establish an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that retrieval can be performed according to the index.
The retrieval process device of the embodiment of the present invention sets the index granularity to the theme, determines the relevant purport information and topic abstraction according to the content of each subject area, and establishes an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that the results retrieved according to the index better match the user's needs and user satisfaction is improved.
In addition, in one embodiment of the present invention, the first determining module is specifically configured to: perform theme cutting on the webpage information using the cutting features of a pre-trained segmentation model corresponding to the theme type, and determine each subject area.
In one embodiment of the present invention, the theme type includes at least one of: an explicit theme type that contains segmentation marks; a semi-explicit theme type that contains subtitles; an implicit theme type that contains neither subtitles nor segmentation marks; and an unstructured single-theme type.
In one embodiment of the present invention, the device further includes: a conversion module, configured to convert webpage information of the explicit theme type, according to the actual distribution, into corpora of the other theme types for training the segmentation model.
In one embodiment of the present invention, the device further includes: an emptying module, configured to randomly empty cutting features in the training corpus during training of the segmentation model.
In one embodiment of the present invention, the second determining module includes: a first extraction unit, configured to extract the subtitle of each subject area, or to extract the keywords of the subtitle of each subject area.
In one embodiment of the present invention, the second determining module includes: a second extraction unit, configured to extract the feature words of each subject area and sort them by priority; and a first acquisition unit, configured to analyze the feature words according to a preset knowledge base to obtain the purport information.
In one embodiment of the present invention, the second determining module includes: a second acquisition unit, configured to fit the content of each subject area using the extraction features of a pre-trained analysis model to obtain the topic abstraction of each theme.
In one embodiment of the present invention, the device further includes: a receiving module, configured to receive input retrieval information; and an acquisition and display module, configured to obtain, according to the index, the topic abstraction and purport information relevant to the retrieval information, and to display them on the search results page.
In one embodiment of the present invention, the device further includes: a jump module, configured to jump to the information interface corresponding to the purport information when the purport information on the search results page is triggered.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a search processing method according to one embodiment of the present invention;
Fig. 2 (a) to Fig. 2 (c) are example diagrams of webpage information of different theme types;
Fig. 3 is a schematic flow diagram of retrieval processing performed by the search processing method according to an embodiment of the present invention;
Fig. 4 is a flowchart of a search processing method according to a specific embodiment of the present invention;
Fig. 5 (a) and Fig. 5 (b) are example diagrams of online search results pages of the search processing method according to an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of a retrieval process device according to one embodiment of the present invention;
Fig. 7 is a structural schematic diagram of a retrieval process device according to a specific embodiment of the present invention;
Fig. 8 is a structural schematic diagram of a retrieval process device according to another embodiment of the present invention;
Fig. 9 is a structural schematic diagram of a retrieval process device according to yet another embodiment of the present invention;
Fig. 10 is a structural schematic diagram of a retrieval process device according to still another embodiment of the present invention; and
Fig. 11 is a structural schematic diagram of a retrieval process device according to a further embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
The search processing method and device of the embodiments of the present invention are described below with reference to the accompanying drawings.
An embodiment of the present invention proposes a search processing method, including: performing theme cutting on webpage information to determine each subject area; determining the purport information and topic abstraction of each theme according to the content of each subject area; and establishing an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that retrieval can be performed according to the index.
Fig. 1 is the flow chart of search processing method according to an embodiment of the invention.
As shown in Figure 1, the search processing method includes:
S110: perform theme cutting on the webpage information and determine each subject area.
It can be understood that the content of webpage information may express one or more themes. For example, in webpage information about obesity, the first paragraph may explain the treatment method of obesity and the second paragraph may explain the cause of obesity. The search processing method of the embodiment of the present invention aims to show the first paragraph directly on the search results page to a user who searches for the treatment method of obesity, and the second paragraph to a user who searches for the cause of obesity. In other words, cutting the webpage information by theme makes the retrieved information more targeted and better matched to the user's retrieval needs.
Therefore, to improve retrieval accuracy, theme cutting needs to be performed on the webpage information in advance (the webpage information may be information about the webpage itself, such as its uniform resource locator, its text content, and its download time), so as to determine the themes expressed by the webpage information and the paragraph range in which each theme lies. For example, webpage information titled "treatment method of obesity" is cut according to its specific content into three themes, "cause", "prevention" and "treatment", and the paragraph range of each of the three themes is determined.
Specifically, themes come in several types, such as the explicit theme type that contains segmentation marks, the semi-explicit theme type that contains subtitles, the implicit theme type that contains neither subtitles nor segmentation marks, and the unstructured single-theme type, and different theme types have different theme structures. To cut the webpage information accurately, a different segmentation model therefore needs to be selected for each theme type.
Therefore, the cutting features of a pre-trained segmentation model corresponding to the theme type can be used to perform theme cutting on the webpage information and determine each subject area. That is, the choice of segmentation model depends on the theme type, and the cutting features of different segmentation models are expressed differently. For example, for the explicit theme type that contains segmentation marks, the corresponding segmentation model may consider cutting features such as the proportion of the article taken up by the top-level lists in the webpage content and the distribution of the items in each list, and may use a classification model to fit the webpage content. As another example, for the semi-explicit theme type, the corresponding segmentation model may add further features and process the webpage information by fitting a sequence labeling model.
To describe more clearly how the cutting features of the pre-trained segmentation model corresponding to the theme type are used to perform theme cutting on the webpage information and determine each subject area, examples are given below with reference to Fig. 2 (a) to Fig. 2 (c):
If the theme type is the explicit theme type that contains segmentation marks, for example marks formed by obvious, regular HTML tags, or cut points that carry obvious group tags, the cutting features of the pre-trained segmentation model corresponding to the explicit theme type can be used to cut the content by theme. For example, for the clearly marked webpage information shown in Fig. 2 (a), the corresponding segmentation model performs theme cutting according to the segmentation marks A and B in Fig. 2 (a): the content of the webpage information is cut into themes A and B, and the range of each is determined to be the range in which its corresponding content lies, that is, the subject area of theme A is section A1 and the subject area of theme B is section B1;
If the theme type is the semi-explicit theme type that contains subtitles, that is, the cut points are the subtitles of the corresponding themes, the cutting features of the pre-trained segmentation model corresponding to that theme type can be used to perform theme cutting on the content of the webpage information. For example, for the webpage information with subtitles C and D shown in Fig. 2 (b), the segmentation model corresponding to the semi-explicit theme type performs theme cutting according to subtitles C and D: the webpage information is cut into themes C and D, and the subject area of each is determined to be the region in which its corresponding content lies, that is, the subject area of theme C is section C1 and the subject area of theme D is section D1;
If the theme type is the implicit theme type that contains neither subtitles nor segmentation marks, that is, no subtitle marks the transition between themes, the cutting features of the pre-trained segmentation model corresponding to the implicit theme type are used to perform theme cutting on the content of the webpage information. For example, the webpage information shown in Fig. 2 (c) contains neither subtitles nor segmentation marks, but each paragraph addresses one theme: the theme expressed by paragraph E is the advantages of the viton seal ring, and the theme expressed by paragraph F is the determination of the viton seal ring. The pre-trained segmentation model corresponding to the implicit theme type can be used to cut it by theme: the content of paragraphs E and F in Fig. 2 (c) is analyzed to obtain themes E and F, and the subject area of each theme is determined, that is, the subject area of theme E is the paragraph in which theme E lies and the subject area of theme F is the paragraph in which theme F lies;
If the theme type is the unstructured single-theme type, for example when the content of the entire webpage information expresses one theme, the cutting features of the pre-trained segmentation model corresponding to the unstructured single-theme type can be used to cut the webpage information, compute its theme, and determine the subject area of the theme.
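The type-dependent cutting described above can be illustrated with a minimal Python sketch. It is not the trained segmentation model of the embodiment: `looks_like_subtitle` is a hypothetical hand-written heuristic standing in for learned cutting features, and the names `SubjectArea`, `CUTTERS` and `theme_cut` are assumptions introduced only for illustration. Only the semi-explicit and single-theme cases are shown.

```python
from dataclasses import dataclass, field


@dataclass
class SubjectArea:
    theme: str                                   # subtitle text, or "" if unnamed
    paragraphs: list = field(default_factory=list)


def looks_like_subtitle(paragraph: str) -> bool:
    # Hypothetical stand-in for trained cutting features: a short
    # paragraph that does not end with sentence-final punctuation.
    text = paragraph.strip()
    return 0 < len(text) <= 30 and not text.endswith((".", "!", "?", ":", ";"))


def cut_semi_explicit(paragraphs):
    # Each detected subtitle opens a new subject area; the paragraphs
    # that follow it belong to that area.
    areas = []
    for p in paragraphs:
        if looks_like_subtitle(p):
            areas.append(SubjectArea(theme=p.strip()))
        elif areas:
            areas[-1].paragraphs.append(p)
        else:
            areas.append(SubjectArea(theme="", paragraphs=[p]))
    return areas


def cut_single(paragraphs):
    # Unstructured single-theme page: the whole page is one subject area.
    return [SubjectArea(theme="", paragraphs=list(paragraphs))]


# One cutter per theme type, selected by the (assumed known) type label.
CUTTERS = {"semi_explicit": cut_semi_explicit, "single": cut_single}


def theme_cut(paragraphs, theme_type):
    return CUTTERS[theme_type](paragraphs)


page = [
    "Cause",
    "Obesity is caused by genetic factors and diet.",
    "Treatment",
    "Treatment includes exercise.",
    "Diet control also helps.",
]
areas = theme_cut(page, "semi_explicit")
```

In this sketch the theme type is passed in by the caller; how the type of a page is recognized is not specified here.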
S120: determine the purport information and topic abstraction of each theme according to the content of each subject area.
It can be understood that, after the paragraph range of each theme is determined, the purport information and topic abstraction of each theme can be determined according to the content of that paragraph range, so that the search results provided to the user satisfy the user's retrieval needs more directly. The purport information of a theme may be its subtitle, keywords and the like; its purpose is to summarize the main content of the theme in a brief sentence or phrase, for example summarizing the theme "cause of obesity" with the purport information "cause". There may be multiple pieces of purport information, from which the user can learn the structure of the whole content in which the current theme lies.
In addition, the topic abstraction refers to representative sentences and the like extracted from the content of the corresponding subject area, from which the user can see at a glance the content most relevant to the corresponding theme. For example, for the theme "cause of obesity", the corresponding topic abstraction may be the following sentence from the content of the subject area: "The causes of obesity are as follows: 1. inherent factors; 2. psychoneural factors; 3. hyperinsulinemia; 4. abnormal brown adipose tissue".
Specifically, there are many ways to determine the purport information of each theme according to the content of the subject area. Two are described below as examples: determining the purport information from subtitles and keywords, and determining it by ranking the feature words of the content of each subject area:
First, determining the purport information of a theme from subtitles and keywords:
In one embodiment of the present invention, the subtitle of each subject area can be extracted from the content of the subject area to determine the purport information of each theme. Not every cut point in the webpage information is the subtitle of a theme; for example, a closing summary in the webpage content is not the subtitle of its theme. Therefore, before the purport information of a theme is determined, the probability that the theme's cut point in the webpage information is a subtitle needs to be calculated. In addition, subtitles often contain redundant content that makes them unsuitable for display on the search results page, so after a subtitle is identified and extracted it also needs to be compressed appropriately without distorting its content. In embodiments of the present invention, the subtitle can be compressed using the results of lexical analysis and syntactic analysis, the features of the subtitle itself, and so on, or by fitting a sequence labeling model.
Based on the above embodiment, as one specific way of compressing, the keywords of the subtitle of each subject area can be extracted according to the content of the subject area to compress the subtitle. The purpose of extracting keywords is to choose appropriate words to describe the core content of the theme, so the keywords should be both key and representative. For example, the subtitles "cause of obesity", "prevention of obesity" and "treatment of obesity" can be compressed to "cause", "prevention" and "treatment". As another example, a complicated subtitle in a news article, "The following are the 25 passwords most easily cracked in 2015, as announced by SplashData:", can be compressed to "the most easily cracked passwords of 2015".
In addition, the keywords in the above embodiment may be words that appear in the content of the current subject area, or other words that summarize the theme content. For example, suppose the content of a subject area describes several ways of contacting a company, including phone, mailbox and address. To determine the purport information of that subject area, the content can be summarized when keywords are extracted; that is, the extracted theme keywords "phone", "email" and "address" can be summarized to generate the keyword "contact method".
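As a rough illustration of subtitle compression and keyword summarization, the following Python sketch uses two hand-written rules. It assumes, purely for illustration, that compression drops filler words and words already present in the page topic, and that summarization looks up a small hypothetical table (`GENERALIZATIONS`); the embodiment itself refers to lexical and syntactic analysis or a sequence labeling model, which this sketch does not implement.

```python
# Hypothetical filler words and generalization table, for illustration only.
FILLER = {"of", "the", "a", "an"}
GENERALIZATIONS = {
    frozenset({"phone", "email", "address"}): "contact method",
}


def compress_subtitle(subtitle, page_topic):
    # Drop filler words and words that merely repeat the page topic.
    drop = FILLER | set(page_topic.lower().split())
    kept = [w for w in subtitle.lower().split() if w not in drop]
    # Fall back to the original subtitle if nothing would remain.
    return " ".join(kept) if kept else subtitle


def generalize(keywords):
    # Replace a known group of keywords with one summarizing keyword.
    found = {k.lower() for k in keywords}
    for members, label in GENERALIZATIONS.items():
        if members <= found:
            return label
    return " ".join(keywords)


compressed = [compress_subtitle(s, "obesity")
              for s in ("cause of obesity", "prevention of obesity",
                        "treatment of obesity")]
summary_keyword = generalize(["phone", "email", "address"])
```

A real system would need a far larger knowledge source than the single-entry table used here.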
Further, determining the purport information of each theme from the feature words of the content of the subject area is described below:
In one embodiment of the present invention, the feature words of each subject area can first be extracted and sorted by priority. A feature word may be a relatively representative word in the content of the subject area, a word that may represent the core idea of that content, and so on. When feature words are selected, the importance and the discriminative power of the extracted words need to be considered, and a ranking model can be fitted to extract the feature words that best match the core content of the subject area. After the feature words are extracted, they can be analyzed according to a preset knowledge base to obtain the purport information.
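One simple way to combine importance and discriminative power is a TF-IDF style score computed across the subject areas of one page. The Python sketch below uses that score as an assumed stand-in for the ranking model mentioned above; the stopword list, the scoring formula and the function names are illustrative assumptions, and the knowledge-base analysis step is not shown.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "is", "are", "to"}  # illustrative


def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def rank_feature_words(areas, index, top_k=3):
    """Rank the words of areas[index] by frequency in that area (importance)
    times a smoothed inverse frequency across all areas (discrimination)."""
    docs = [Counter(tokenize(a)) for a in areas]
    n = len(docs)
    scores = {}
    for word, tf in docs[index].items():
        df = sum(1 for d in docs if word in d)      # df >= 1 by construction
        scores[word] = tf * math.log((1 + n) / df)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


areas = [
    "Genetic factors cause obesity. Genetic risk runs in families.",
    "Exercise treats obesity. Exercise and diet help.",
]
top_cause = rank_feature_words(areas, 0, top_k=1)
top_treatment = rank_feature_words(areas, 1, top_k=1)
```

Words shared by every subject area, such as "obesity" here, score low because they do not distinguish one theme from another.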
Based on the above embodiments, after the purport information of each theme has been determined, the topic abstraction of each theme also needs to be determined according to the content of each subject area. Specifically, the content of each subject area is fitted using the extraction features of a pre-trained analysis model to obtain the topic abstraction of each theme. To determine the topic abstraction that best matches the current theme, modeling can be done at sentence granularity, with features representing the importance and representativeness of each sentence; a ranking model, a graph model or the like can then be chosen to fit the importance and representativeness of the sentences, so as to accurately determine the sentences that best fit the topic abstraction of the current theme.
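Sentence-granularity extraction can be sketched as follows. The score used here, the average in-area frequency of a sentence's words, is only an assumed proxy for the importance and representativeness features; it is not the trained analysis model, ranking model or graph model of the embodiment, and the function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    # Split after sentence-final punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


def topic_abstraction(content, max_sentences=1):
    sentences = split_sentences(content)
    freq = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        ws = words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    # Pick the best-scoring sentences, then restore document order.
    best = sorted(range(len(sentences)),
                  key=lambda i: (-score(sentences[i]), i))[:max_sentences]
    return [sentences[i] for i in sorted(best)]


content = ("Obesity has several causes. "
           "The causes of obesity include genetic factors. "
           "Some people like tea.")
abstract = topic_abstraction(content)
```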
S130: establish an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that retrieval can be performed according to the index.
Specifically, to finally provide the user with a search results page whose retrieval granularity is the theme, an index corresponding to the webpage information needs to be established according to the purport information and topic abstraction of each theme, so that retrieval can be performed according to the index. That is, when the user's retrieval information is received, the search results page that best meets the user's retrieval needs can be retrieved according to the index.
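An index at theme granularity can be pictured as an inverted index whose postings point to themes rather than to whole webpages. The Python sketch below is one assumed layout, not the index structure of the embodiment: `ThemeEntry`, `ThemeIndex`, the term-overlap ranking and the example URL are all hypothetical.

```python
import re
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ThemeEntry:
    url: str            # webpage the theme was cut from
    purport: str        # purport information of the theme
    abstraction: str    # topic abstraction of the theme


def terms(text):
    return re.findall(r"[a-z]+", text.lower())


class ThemeIndex:
    def __init__(self):
        self.entries = []
        self.postings = defaultdict(set)    # term -> ids of theme entries

    def add(self, entry):
        entry_id = len(self.entries)
        self.entries.append(entry)
        for term in terms(entry.purport + " " + entry.abstraction):
            self.postings[term].add(entry_id)

    def search(self, query):
        # Rank themes by the number of distinct query terms they contain.
        hits = Counter()
        for term in set(terms(query)):
            for entry_id in self.postings.get(term, ()):
                hits[entry_id] += 1
        ranked = sorted(hits, key=lambda i: (-hits[i], i))
        return [self.entries[i] for i in ranked]


index = ThemeIndex()
url = "https://example.com/obesity"   # hypothetical page
index.add(ThemeEntry(url, "cause", "Causes of obesity include genetic factors."))
index.add(ThemeEntry(url, "treatment",
                     "Treatment of obesity includes diet and exercise."))
results = index.search("obesity treatment")
```

Because both themes come from the same page, a webpage-granularity index could not tell them apart; here the "treatment" theme ranks first for the query.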
It should be noted that, in the search processing method described in the above embodiments, a segmentation model needs to be established before each subject area is determined in step S110. When the segmentation model is established, various data can be taken into account so that the model can accurately cut the themes in webpage information, for example the user's click counts on webpages and browsing history. Training the segmentation model with a corpus and the cutting features in the corpus is described below as an example:
To establish a segmentation model that cuts the themes of webpage information more accurately, a corpus can be used in advance to train the model, and to solve the problem of insufficient corpus, training corpus needs to be generated from existing corpus. Specifically, to ensure that the training corpus is large and that its structure is close to the distribution of real webpage information, webpage information of the explicit theme type can be converted in advance, according to the actual distribution, into corpora of the other types. For example, the content of webpage information with obvious theme information, such as Baidu Baike, can be converted into corpora of the semi-explicit theme type, the implicit theme type and so on, which better match the distribution of general webpage information. In addition, a large training corpus can also be constructed by splicing together the content of different webpages.
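The corpus conversion can be sketched as follows, assuming that an explicit-type page is available as a list of (title, paragraphs) sections and that each training sample is a (paragraph, label) pair in which label 1 marks the start of a theme. The data layout, the function names and the distribution weights are illustrative assumptions, not details given in the source.

```python
import random


def to_semi_explicit(sections):
    # Keep each title as a plain paragraph that starts a theme.
    samples = []
    for title, paragraphs in sections:
        samples.append((title, 1))
        samples.extend((p, 0) for p in paragraphs)
    return samples


def to_implicit(sections):
    # Drop titles; the first paragraph of each section starts the theme.
    samples = []
    for _title, paragraphs in sections:
        for i, p in enumerate(paragraphs):
            samples.append((p, 1 if i == 0 else 0))
    return samples


CONVERTERS = {"semi_explicit": to_semi_explicit, "implicit": to_implicit}


def build_corpus(pages, distribution, seed=0):
    # Choose a target theme type for each page following the given
    # (assumed) real-world distribution of theme types.
    rng = random.Random(seed)
    kinds = list(distribution)
    weights = [distribution[k] for k in kinds]
    corpus = []
    for sections in pages:
        kind = rng.choices(kinds, weights=weights)[0]
        corpus.append((kind, CONVERTERS[kind](sections)))
    return corpus


sections = [("Cause", ["a", "b"]), ("Treatment", ["c"])]
semi = to_semi_explicit(sections)
implicit = to_implicit(sections)
corpus = build_corpus([sections] * 3, {"implicit": 1.0, "semi_explicit": 0.0})
```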
Further, after constructing a large amount of training corpus, according to the cutting feature of training corpus training segmentation model.Tool It, can be according to the essential characteristic of training corpus, list characteristics, title probability, hint information, webpage format information, word point for body The cuttings feature such as cloth characteristic and bout length distribution character is trained segmentation model.Wherein, essential characteristic may include section Fall length, paragraph position, word/part of speech distribution, clause's number and similarity of title etc.;Corresponding list characteristics are to identify Continuous serial number structure in webpage information content, and the partially cutting feature of the identification of explicit chapter;Title probability is corresponding Be the cutting feature that a possibility that paragraph may be a title is identified with prior probability;Corresponding hint information is title In common word or mode, such as ordinal number, punctuate, typical words (explanation, such as inferior), part of speech arrange in pairs or groups cutting feature;Webpage lattice It is the overstriking grade of paragraph, the cuttings feature such as whether have hyperlink that formula information is corresponding;Corresponding word distribution characteristics is with up and down The cutting feature for a possibility that previous paragraphs are theme transition sections is worked as in the word distribution of text, estimation;Bout length distribution characteristics is corresponding It is to use context paragraph distribution of lengths, the cutting feature for a possibility that previous paragraphs are current topic subtitles is worked as in estimation.Namely It says, is trained by the cutting feature to training corpus, the cutting feature and the cutting feature phase one in real web pages information It causes, to ensure that the segmentation model trained can complete the theme 
cutting to webpage information, and to the cutting of training corpus spy Sign is trained, and be ensure that and is carried out theme cutting to webpage information according to the cutting feature of the corresponding segmentation model of type of theme Feasibility.
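A few of the segmentation (cutting) features listed above can be computed per paragraph roughly as in the sketch below. The feature names, the cue-word list and the ordinal pattern are assumptions chosen for illustration; the source does not specify how each feature is quantized.

```python
import re

# Assumption: a tiny cue-word list and an ordinal pattern stand in for the
# "cue information" feature; a real system would use a much richer set.
CUE_WORDS = ("description", "as follows")
ORDINAL = re.compile(r"^\s*(\d+[.)、]|[(（]\d+[)）])")


def paragraph_features(paragraphs, index, page_title):
    """Return a dict of simple segmentation features for one paragraph."""
    text = paragraphs[index]
    words = set(text.lower().split())
    title_words = set(page_title.lower().split())
    neighbor_lengths = [len(p) for i, p in enumerate(paragraphs) if i != index]
    mean_neighbor = (
        sum(neighbor_lengths) / len(neighbor_lengths) if neighbor_lengths else 0.0
    )
    return {
        # basic features: paragraph length and relative position in the page
        "length": len(text),
        "position": index / max(len(paragraphs) - 1, 1),
        # cue information: leading ordinal number or typical title words
        "starts_with_ordinal": bool(ORDINAL.match(text)),
        "has_cue_word": any(cue in text.lower() for cue in CUE_WORDS),
        # basic feature: similarity to the page title (share of title words)
        "title_overlap": len(words & title_words) / len(title_words)
        if title_words else 0.0,
        # paragraph length distribution: short relative to its context
        "relative_length": len(text) / mean_neighbor if mean_neighbor else 0.0,
    }
```

A short paragraph that starts with an ordinal and overlaps the page title yields a feature vector typical of a subtitle, which is the kind of signal a trained segmentation model could exploit.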
In addition, in the above embodiments, while the segmentation model is being trained on the segmentation features of the training corpus, the segmentation features in the training corpus can be randomly emptied to address training bias and over-fitting. This helps ensure that the trained segmentation model is practical, segments the themes of webpage information accurately, and quantizes feature values with maximum discrimination.
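Randomly emptying features during training can be sketched as follows. The function name `randomly_empty`, the drop probability and the choice of `0.0` as the empty value are assumptions for illustration; the source only states that features are emptied at random.

```python
import random


def randomly_empty(features, drop_prob, rng, empty_value=0.0):
    """Return a copy of a feature dict with some values emptied at random.

    Each feature is independently replaced by `empty_value` with probability
    `drop_prob`, so the model cannot rely on any single feature being present.
    """
    return {
        name: (empty_value if rng.random() < drop_prob else value)
        for name, value in features.items()
    }


# Usage: apply a fresh random mask to every training sample in every epoch.
rng = random.Random(0)
sample = {"length": 19, "title_overlap": 0.67, "starts_with_ordinal": 1.0}
masked = randomly_empty(sample, drop_prob=0.2, rng=rng)
```

This is the same idea as input dropout: it is applied only during training, and the unmodified features are used when segmenting real webpage information.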
Further, when the segmentation features of the segmentation model are actually applied to segment the themes of webpage information, the specific representation of the features depends on the segmentation model selected, and a feature can be computed in several ways. For example, the similarity between a paragraph and the title can be computed with methods based on term co-occurrence or on word vectors.
It should be understood that, in step S120, when determining from the content of each subject area whether the purport information of a theme is a subtitle, the segmentation features and corpus construction methods described above for theme segmentation can be reused; they are not repeated here.
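The two ways of computing paragraph-title similarity mentioned above can be sketched as follows. Jaccard overlap is used here as one possible co-occurrence measure, and cosine similarity of averaged word vectors as one possible vector-based measure; both choices, and the toy `vectors` table, are assumptions rather than the measures used in the source.

```python
import math


def cooccurrence_similarity(paragraph, title):
    """Term co-occurrence similarity: Jaccard overlap of the two word sets."""
    a, b = set(paragraph.lower().split()), set(title.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def vector_similarity(paragraph, title, vectors):
    """Word-vector similarity: cosine of the averaged word vectors.

    `vectors` maps a word to a list of floats (a toy embedding table here).
    """
    def mean_vector(text):
        found = [vectors[w] for w in text.lower().split() if w in vectors]
        if not found:
            return None
        return [sum(column) / len(found) for column in zip(*found)]

    u, v = mean_vector(paragraph), mean_vector(title)
    if u is None or v is None:
        return 0.0
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

The co-occurrence variant needs no trained resources, while the vector variant can relate paragraphs and titles that share no literal words, at the cost of requiring an embedding table.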
So that those skilled in the art can understand the search processing method of the embodiments of the present invention more clearly, its workflow is illustrated below with reference to Fig. 3. As shown in Fig. 3, the search processing method performs theme segmentation (31) on conventional webpage information to obtain the theme structure (32) of that webpage information, where the theme structure includes each subject area and its content. Theme representation processing (33) is then performed according to the content of each determined subject area; that is, the purport information (34) of each theme is determined. After the purport information has been determined, multi-theme summarization (35) is performed according to the content of each determined subject area, yielding the full-text abstract of the webpage information and the topic abstraction of each theme (36). Finally, after retrieval processing of the webpage information, an index (37) containing the purport information and topic abstraction of each theme is obtained. It can be understood that Fig. 3 shows, as an example, the offline flow of the search processing method of the embodiment of the present invention.
In summary, the search processing method of the embodiment of the present invention sets the index granularity to the theme, determines the relevant purport information and topic abstraction from the content of each subject area, and establishes an index corresponding to the webpage information according to the purport information and topic abstraction of each theme. Results retrieved from this index better match the user's needs, which improves user satisfaction.
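The offline flow of Fig. 3 ends with an index whose entries are themes rather than whole pages. A minimal sketch of such a theme-level index follows. The `ThemeEntry` structure, the `build_index` function and the first-paragraph placeholder summarizer are all assumptions made for illustration; the example URL is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThemeEntry:
    url: str             # webpage the theme was segmented from
    purport: str         # purport information, e.g. a subtitle or keyword
    abstraction: str     # topic abstraction of this theme
    page_purports: list  # purport information of all themes on the page


def build_index(url, themes):
    """Build one index entry per theme.

    `themes` is a list of (purport, paragraphs) pairs, i.e. the output of
    theme segmentation plus theme representation for one webpage.
    """
    page_purports = [purport for purport, _ in themes]
    entries = []
    for purport, paragraphs in themes:
        # Placeholder summarizer (assumption): use the first paragraph of the
        # subject area as its topic abstraction.
        abstraction = paragraphs[0] if paragraphs else ""
        entries.append(ThemeEntry(url, purport, abstraction, page_purports))
    return entries
```

Storing the page-level list of purport information on every entry is one way to let a result for a single theme still show the structure of the whole document.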
To describe the search processing method of the embodiment of the present invention more clearly, the online presentation part is explained below. Specifically, based on the above embodiments, when retrieval is performed according to the search terms entered by the user and the index, the webpage information corresponding to the most relevant index entry retrieved is shown to the user on the search results page. Fig. 4 is a flowchart of the search processing method according to a specific embodiment of the present invention. As shown in Fig. 4, on the basis of Fig. 1, the search processing method further includes:
S140, receive the input retrieval information.
The user can enter the retrieval information by voice, by text input, or by other means. For example, the retrieval information that the user enters in the search box of a search engine can be received.
S150, obtain the topic abstraction and purport information related to the retrieval information according to the index, and display them on the search results page.
According to the received search term, the index entry that is closest to and most relevant to the search term is looked up, and the topic abstraction and purport information corresponding to that index entry are displayed on the search results page. As shown in Fig. 5(a), after the retrieval information "cause of obesity" entered by the user is received, the retrieved result is displayed on the search results page. In Fig. 5(a), G is the theme and H is the topic abstraction part; the content of the abstract mainly explains the causes of obesity, and the theme is directly related to the "cause of obesity" entered by the user. I is the purport information related to the theme; it covers the themes of the whole document to which the index entry belongs. From the purport information in I, the user can learn that the document mainly describes several themes such as the causes and clinical manifestations of obesity, and the purport information also serves to stimulate further user demand.
It should be noted that, to make the search results page more readable, the positions of the theme, the purport information and so on can be highlighted, for example by changing the color or font of the theme and purport information, or by displaying the theme in red.
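The lookup in step S150 can be sketched as a search over theme-level index entries. The scoring by shared words, the dictionary layout of an entry and the function name `search` are assumptions for illustration; the source does not specify the relevance measure.

```python
def search(index, query):
    """Return the display data of the theme entry most relevant to the query.

    `index` is a list of dicts with the keys "purport", "abstraction" and
    "page_purports". Relevance is the number of query words that also occur
    in the entry's purport information or topic abstraction.
    """
    query_terms = set(query.lower().split())

    def score(entry):
        text = (entry["purport"] + " " + entry["abstraction"]).lower()
        return len(query_terms & set(text.split()))

    best = max(index, key=score, default=None)
    if best is None or score(best) == 0:
        return None  # nothing in the index matches the query
    return {
        "theme": best["purport"],               # G in Fig. 5(a)
        "abstraction": best["abstraction"],     # H in Fig. 5(a)
        "purport_links": best["page_purports"], # I in Fig. 5(a)
    }
```

Returning the purport information of the whole page alongside the matched theme mirrors the layout of Fig. 5(a), where the user sees the matched theme first and the other themes as links.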
S160, when the purport information on the search results page is triggered, jump to the information interface corresponding to the purport information.
It can be understood that the purport information on the search results page is a link; by triggering the purport information, the user obtains the corresponding information interface.
For example, each item of purport information in I of Fig. 5(a) corresponds to a link, and by clicking a link the user can learn more content related to the theme. If the user clicks "clinical manifestation" in I, the corresponding link is triggered and the search results page shown in Fig. 5(b) is obtained. The theme of that page is the clinical manifestations of obesity, and it provides the results related to the clinical manifestations of obesity.
It should be understood that the above example is only one display form of the theme structure and topic abstraction, and the topic abstraction chosen in the example is not optimal. As a better example, the topic abstraction in Fig. 5(a) can be replaced, according to the content of the subject area, with: "External causes are mainly overeating and lack of exercise. Internal causes are factors within the human body that disorder fat metabolism and lead to obesity. Specifically: 1. genetic factors; 2. neuropsychiatric factors; 3. hyperinsulinemia; 4. brown adipose tissue abnormality".
In summary, the search processing method of the embodiment of the present invention receives the retrieval information entered by the user, obtains the topic abstraction and purport information related to the retrieval information according to the index, and displays them on the search results page. This makes the search results page more targeted and readable and meets the user's search need more directly.
To implement the above embodiments, the present invention also proposes a search processing device. Fig. 6 is a schematic structural diagram of the search processing device according to an embodiment of the present invention. As shown in Fig. 6, the search processing device includes: a first determining module 610, a second determining module 620 and an establishing module 630.
The first determining module 610 is configured to perform theme segmentation on webpage information and determine each subject area.
To improve retrieval accuracy, the first determining module 610 first needs to perform theme segmentation on the webpage information (which may be information about the webpage itself, such as its uniform resource locator, its text content and its download time) in order to determine the themes expressed by the webpage information and the paragraph range of each theme. For example, for webpage information titled "treatment of obesity", the first determining module 610 can segment it, according to its specific content, into the three themes "cause", "prevention" and "treatment", and determine the paragraph range of each of the three themes.
Specifically, themes are of several types, such as the explicit theme type containing segmentation marks, the semi-explicit theme type containing subtitles, the implicit theme type containing neither subtitles nor segmentation marks, and the unstructured single-theme type, and different theme types have different theme structures. Therefore, to segment webpage information accurately, the first determining module 610 needs to select a different segmentation model for each theme type.
Accordingly, the first determining module 610 can use the segmentation features of a pre-trained segmentation model corresponding to the theme type to perform theme segmentation on the webpage information and determine each subject area. That is, the choice of segmentation model depends on the theme type, and the segmentation features of different segmentation models are represented differently; to segment webpage information accurately, the segmentation model must be selected according to the theme type.
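Selecting a segmentation model by theme type can be sketched as a simple dispatch table. In this sketch, two rule-based functions stand in for trained segmentation models, and the `== Heading ==` marker, the type names and all function names are assumptions for illustration only.

```python
import re

# Assumption: explicit-theme pages use "== Heading ==" as segmentation mark.
MARK = re.compile(r"^==\s*(.+?)\s*==$")


def segment_explicit(paragraphs):
    """Stand-in for the explicit-type model: split at segmentation marks."""
    areas = []
    for i, paragraph in enumerate(paragraphs):
        if MARK.match(paragraph) or not areas:
            areas.append([i, i])  # a mark (or the first paragraph) opens an area
        else:
            areas[-1][1] = i      # otherwise extend the current area
    return [tuple(area) for area in areas]


def segment_single(paragraphs):
    """Stand-in for the single-theme model: the whole page is one area."""
    return [(0, len(paragraphs) - 1)] if paragraphs else []


# One segmenter per theme type; trained models would be registered here.
SEGMENTERS = {"explicit": segment_explicit, "single": segment_single}


def determine_subject_areas(paragraphs, theme_type):
    """Return (first, last) paragraph index pairs, one per subject area."""
    return SEGMENTERS[theme_type](paragraphs)
```

Semi-explicit and implicit pages would register their own segmenters in the same table, so the caller only needs to know the theme type of the page.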
The second determining module 620 is configured to determine the purport information and topic abstraction of each theme according to the content of each subject area.
It can be understood that, after the first determining module 610 has determined the paragraph range of each theme, and so that the search results provided to the user meet the user's search need more directly, the second determining module 620 can determine the purport information and topic abstraction of each theme according to the content of the theme's paragraph range. The purport information of a theme may be the theme's subtitle, keywords and so on; its purpose is to summarize the main content of the theme in a brief sentence or phrase, for example summarizing the theme "cause of obesity" with the purport information "cause". There may be multiple items of purport information, and from them the user can learn the structure of the whole content in which the current theme is located.
In addition, the topic abstraction refers to representative sentences and the like extracted from the content of the corresponding subject area; from the topic abstraction the user can see at a glance the content most relevant to the corresponding theme. For example, for a theme whose content is "cause of obesity", the corresponding topic abstraction may be the following sentence from the content of the subject area: "The causes of obesity are as follows: 1. genetic factors; 2. neuropsychiatric factors; 3. hyperinsulinemia; 4. brown adipose tissue abnormality".
Specifically, as shown in Fig. 7, the second determining module 620 may include a first extraction unit 621. In an embodiment of the present invention, the first extraction unit 621 can extract the subtitle of each subject area from the content of the subject area to determine the purport information of each theme. Not every theme segmentation point in the webpage information is a subtitle of the theme; for example, a concluding summary theme in the webpage content has no subtitle of its own. Therefore, before the purport information of a theme is determined according to keywords, the second determining module 620 needs to calculate how likely a theme segmentation point in the webpage information is to be a subtitle. In addition, because subtitles often contain redundant content that makes them hard to display on the search results page, an extracted subtitle also needs to be compressed appropriately without distorting its content. In embodiments of the present invention, the subtitle can be compressed using the results of lexical analysis and syntactic analysis, features of the subtitle itself and so on, or by fitting a sequence labeling model.
Based on the above embodiments, as one specific compression approach, the first extraction unit 621 can extract the keywords of the subtitle of each subject area from the content of the subject area to compress the subtitle. The purpose of extracting keywords is to select appropriate words that describe the core content of the theme; a keyword should both reflect the key content and be representative.
In addition, the keywords extracted by the first extraction unit 621 may be original words from the content of the current subject area, or other words summarized from the theme content. For example, if the content of a subject area describes several ways of contacting a company, including its phone, mailbox and address, then to determine the purport information of that subject area the first extraction unit 621 can generalize the content when extracting keywords; that is, it can generalize the extracted theme keywords "phone, mailbox, address" into the keyword "contact information".
Further, as shown in Fig. 8, the second determining module 620 may include a second extraction unit 622 and a first acquisition unit 623.
In an embodiment of the present invention, the second extraction unit 622 can first extract the feature words of each subject area and rank them by priority. The feature words may be words that are relatively representative of the subject area's content, words that may express its core idea, and so on. When selecting feature words, their importance and discrimination need to be considered, and a ranking model can be fitted to extract the feature words that best match the core content of the subject area. After the feature words have been extracted, the first acquisition unit 623 can analyze them according to a preset knowledge base to obtain the purport information.
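Ranking feature words and mapping them through a knowledge base can be sketched as follows. The scoring formula (in-area frequency divided by frequency in other areas), the toy knowledge base and the function names are assumptions for illustration; the source leaves the ranking model and the knowledge base unspecified.

```python
from collections import Counter


def rank_feature_words(area_words, other_area_words, top_k=3):
    """Rank the words of one subject area by a simple priority score.

    Importance is the frequency inside the area; discrimination is modeled by
    dividing by (1 + frequency in the other areas of the same page).
    """
    inside = Counter(area_words)
    outside = Counter(other_area_words)
    scores = {word: inside[word] / (1 + outside[word]) for word in inside}
    return sorted(scores, key=lambda word: (-scores[word], word))[:top_k]


def purport_from_knowledge_base(feature_words, knowledge_base):
    """Map feature words to the concept that most of them belong to.

    `knowledge_base` maps a word to a broader concept. If no feature word is
    known, fall back to the top-ranked feature word itself.
    """
    concepts = Counter(
        knowledge_base[word] for word in feature_words if word in knowledge_base
    )
    if not concepts:
        return feature_words[0] if feature_words else ""
    return concepts.most_common(1)[0][0]
```

With a knowledge base that maps "phone", "mailbox" and "address" to "contact information", the three feature words collapse into the single purport "contact information", matching the example in the text.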
Based on the above embodiments, after the purport information of each theme has been determined, the topic abstraction of each theme also needs to be determined according to the content of each subject area. Fig. 9 is a schematic structural diagram of the search processing device according to another embodiment of the present invention. As shown in Fig. 9, the second determining module 620 may further include a second acquisition unit 624, configured to fit the content of each subject area using the extraction features of a pre-trained analysis model and obtain the topic abstraction of each theme. To determine the topic abstraction that best matches the current theme more accurately, the second acquisition unit 624 can model at sentence granularity, represent the importance and representativeness of each sentence with features, and then fit that importance and representativeness with a ranking model, a graph model or the like, so as to accurately determine the sentences that best suit the topic abstraction of the current theme.
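Sentence-level selection of a topic abstraction can be sketched as below. Here importance is approximated by overlap with the purport information and representativeness by average word overlap with the other sentences (a crude graph-centrality measure); both proxies and the function names are assumptions, standing in for the trained ranking or graph model mentioned in the text.

```python
def sentence_scores(sentences, purport):
    """Score each sentence by importance plus representativeness."""
    token_sets = [set(s.lower().split()) for s in sentences]
    purport_terms = set(purport.lower().split())
    scores = []
    for i, tokens in enumerate(token_sets):
        # importance: share of purport words that the sentence contains
        importance = (
            len(tokens & purport_terms) / len(purport_terms)
            if purport_terms else 0.0
        )
        # representativeness: mean Jaccard overlap with the other sentences
        others = [t for j, t in enumerate(token_sets) if j != i]
        overlap = sum(len(tokens & t) / len(tokens | t) for t in others if tokens | t)
        representativeness = overlap / len(others) if others else 0.0
        scores.append(importance + representativeness)
    return scores


def topic_abstraction(sentences, purport, max_sentences=1):
    """Pick the best-scoring sentences and return them in original order."""
    scores = sentence_scores(sentences, purport)
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Keeping the chosen sentences in their original order keeps a multi-sentence abstraction readable even though the sentences were selected by score.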
The establishing module 630 is configured to establish an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that retrieval can be performed according to the index.
Specifically, in order to ultimately provide the user with a search results page whose retrieval granularity is the theme, the establishing module 630 needs to establish an index corresponding to the webpage information according to the purport information and the topic abstraction of each theme, so that retrieval can be performed on that index. That is, when a user's retrieval information is received, the search results page that best satisfies the user's retrieval need can be found according to the index.
Regarding the search processing device described in the above embodiments, it should be noted that a segmentation model needs to be established before the first determining module 610 determines each subject area. While the segmentation model is being established, to ensure that it can accurately segment the themes in webpage information, various kinds of data need to be considered, for example the user's click-count information and browsing-history information on webpages. The following describes, as an example, training the segmentation model using a corpus and information such as the segmentation features in that corpus:
Fig. 10 is a schematic structural diagram of the search processing device according to yet another embodiment of the present invention. As shown in Fig. 10, on the basis of Fig. 6, the search processing device further includes: a conversion module 640 and an emptying module 650.
To ensure that the training corpus of the segmentation model is large and that its structure is close to the real distribution of webpage information, the conversion module 640 is configured to convert webpage information of the explicit theme type into corpora of other types according to the actual distribution.
While the segmentation model is being trained on the segmentation features of the training corpus, the emptying module 650 can randomly empty the segmentation features in the training corpus to address training bias and over-fitting. This helps ensure that the trained segmentation model is practical, segments the themes of webpage information accurately, and quantizes feature values with maximum discrimination.
It should be noted that the specific steps and principles for establishing the segmentation model in the search processing device of the embodiment of the present invention are the same as those for establishing the segmentation model in the search processing method embodiments, and are not repeated here.
In summary, the search processing device of the embodiment of the present invention sets the index granularity to the theme, determines the relevant purport information and topic abstraction from the content of each subject area, and establishes an index corresponding to the webpage information according to the purport information and topic abstraction of each theme. Results retrieved from this index better match the user's needs, which improves user satisfaction.
To describe the search processing device of the embodiment of the present invention more clearly, the online presentation part is explained below. Specifically, based on the above embodiments, when retrieval is performed according to the search terms entered by the user and the index, the webpage information corresponding to the most relevant index entry retrieved is shown to the user on the search results page. Fig. 11 is a schematic structural diagram of the search processing device according to a further embodiment of the present invention. As shown in Fig. 11, on the basis of Fig. 6, the search processing device further includes: a receiving module 660, an obtaining and display module 670 and a jump module 680.
The receiving module 660 is configured to receive the input retrieval information.
The user can enter the retrieval information by voice, by text input, or by other means. For example, the retrieval information that the user enters in the search box of a search engine can be received.
The obtaining and display module 670 is configured to obtain the topic abstraction and purport information related to the retrieval information according to the index, and display them on the search results page.
According to the received search term, the obtaining and display module 670 looks up the index entry that is closest to and most relevant to the search term, and displays the topic abstraction and purport information corresponding to that index entry on the search results page.
The jump module 680 is configured to jump to the information interface corresponding to the purport information when the purport information on the search results page is triggered.
It can be understood that the purport information on the search results page is a link; the jump module 680 can jump to the information interface corresponding to the purport information according to the user's trigger operation on the purport information.
In summary, the search processing device of the embodiment of the present invention receives the retrieval information entered by the user, obtains the topic abstraction and purport information related to the retrieval information according to the index, and displays them on the search results page. This makes the search results page more targeted and readable and meets the user's search need more directly.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless specifically defined otherwise.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such schematic expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the present invention.

Claims (18)

1. A search processing method, comprising the following steps:
performing theme segmentation on webpage information and determining each subject area;
determining the purport information and topic abstraction of each theme according to the content of each subject area;
establishing an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that retrieval is performed according to the index;
wherein performing theme segmentation on the webpage information and determining each subject area comprises:
performing theme segmentation on the webpage information using the segmentation features of a pre-trained segmentation model corresponding to the theme type, and determining each subject area, wherein the segmentation model is established in combination with the user's click-count information and browsing-history information on webpages.
2. The method according to claim 1, wherein the theme type comprises at least one of the following:
an explicit theme type containing segmentation marks;
a semi-explicit theme type containing subtitles;
an implicit theme type containing neither subtitles nor segmentation marks;
an unstructured single-theme type.
3. The method according to claim 2, wherein, before performing theme segmentation on the webpage information using the segmentation features of the pre-trained segmentation model corresponding to the theme type, the method further comprises:
converting webpage information of the explicit theme type into corpora of other theme types according to the actual distribution, and training the segmentation model on them.
4. The method according to claim 3, further comprising:
randomly emptying the segmentation features in the training corpus during training of the segmentation model.
5. The method according to claim 1, wherein determining the purport information of each theme according to the content of each subject area comprises:
extracting the subtitle of each subject area, or
extracting the keywords of the subtitle of each subject area.
6. The method according to claim 1, wherein determining the purport information of each theme according to the content of each subject area comprises:
extracting the feature words of each subject area and ranking them by priority;
analyzing the feature words according to a preset knowledge base to obtain the purport information.
7. The method according to claim 1, wherein determining the topic abstraction of each theme according to the content of each subject area comprises:
fitting the content of each subject area using the extraction features of a pre-trained analysis model to obtain the topic abstraction of each theme.
8. The method according to any one of claims 1 to 7, further comprising:
receiving input retrieval information;
obtaining the topic abstraction and purport information related to the retrieval information according to the index, and displaying them on a search results page.
9. The method according to claim 8, further comprising:
when the purport information on the search results page is triggered, jumping to the information interface corresponding to the purport information.
10. A search processing device, comprising:
a first determining module, configured to perform theme segmentation on webpage information and determine each subject area;
a second determining module, configured to determine the purport information and topic abstraction of each theme according to the content of each subject area;
an establishing module, configured to establish an index corresponding to the webpage information according to the purport information and topic abstraction of each theme, so that retrieval is performed according to the index;
wherein the first determining module is specifically configured to:
perform theme segmentation on the webpage information using the segmentation features of a pre-trained segmentation model corresponding to the theme type, and determine each subject area, wherein the segmentation model is established in combination with the user's click-count information and browsing-history information on webpages.
11. The device according to claim 10, wherein the theme type comprises at least one of the following:
an explicit theme type containing segmentation marks;
a semi-explicit theme type containing subtitles;
an implicit theme type containing neither subtitles nor segmentation marks;
an unstructured single-theme type.
12. The device according to claim 11, further comprising:
a conversion module, configured to convert webpage information of the explicit theme type into corpora of other theme types according to the actual distribution, and to train the segmentation model on them.
13. The device according to claim 12, further comprising:
an emptying module, configured to randomly empty the segmentation features in the training corpus during training of the segmentation model.
14. The device according to claim 10, wherein the second determining module comprises:
a first extraction unit, configured to extract the subtitle of each subject area, or
to extract the keywords of the subtitle of each subject area.
15. The device according to claim 10, wherein the second determining module comprises:
a second extraction unit, configured to extract the feature words of each subject area and rank them by priority;
a first acquisition unit, configured to analyze the feature words according to a preset knowledge base to obtain the purport information.
16. The device according to claim 10, wherein the second determining module comprises:
a second acquisition unit, configured to fit the content of each subject area using the extraction features of a pre-trained analysis model to obtain the topic abstraction of each theme.
17. The device as claimed in any one of claims 10-16, characterized in that it further comprises:
A receiving module, configured to receive input retrieval information;
An acquisition and display module, configured to obtain, according to the index, the topic abstraction and purport information relevant to the retrieval information, and to display them in the search results page.
18. The device as claimed in claim 17, characterized in that it further comprises:
A jump module, configured to jump, when the purport information in the search results page is triggered, to the information interface corresponding to the purport information.
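
As an illustrative aside to claim 13, the random emptying of cutting features in the training corpus might be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the feature names in `CUTTING_FEATURES`, the `drop_prob` parameter, and the function `empty_features_at_random` are all hypothetical, and the claims do not specify which features exist or at what rate they are emptied.

```python
import random
from typing import Any, Dict, List

# Hypothetical cutting-feature names; the claims do not enumerate them.
CUTTING_FEATURES = ("has_segmentation_mark", "has_subtitle", "is_list_item")


def empty_features_at_random(
    corpus: List[Dict[str, Any]],
    drop_prob: float,
    rng: random.Random,
) -> List[Dict[str, Any]]:
    """Return a copy of the corpus in which each cutting feature is
    independently set to None (emptied) with probability drop_prob.

    Keys that are not cutting features are copied through unchanged,
    and the input corpus itself is not modified.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    emptied = []
    for sample in corpus:
        new_sample = dict(sample)  # shallow copy of one training sample
        for name in CUTTING_FEATURES:
            # rng.random() lies in [0.0, 1.0), so 0.0 never empties
            # a feature and 1.0 always does.
            if name in new_sample and rng.random() < drop_prob:
                new_sample[name] = None
        emptied.append(new_sample)
    return emptied


if __name__ == "__main__":
    corpus = [
        {"has_segmentation_mark": True, "has_subtitle": True, "text": "Section one"},
        {"has_segmentation_mark": False, "has_subtitle": True, "text": "Section two"},
    ]
    # A fixed seed keeps the sketch reproducible between runs.
    for sample in empty_features_at_random(corpus, 0.3, random.Random(0)):
        print(sample)
```

One plausible motivation for this step is that a model trained with some cutting features missing may rely less on explicit cues such as segmentation marks, which could help on semi-explicit and implicit theme types; the claims themselves do not state the rationale.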
CN201610214481.5A 2016-04-07 2016-04-07 Search processing method and device Active CN105912631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610214481.5A CN105912631B (en) 2016-04-07 2016-04-07 Search processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610214481.5A CN105912631B (en) 2016-04-07 2016-04-07 Search processing method and device

Publications (2)

Publication Number Publication Date
CN105912631A (en) 2016-08-31
CN105912631B (en) 2019-07-05

Family

ID=56744713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610214481.5A Active CN105912631B (en) 2016-04-07 2016-04-07 Search processing method and device

Country Status (1)

Country Link
CN (1) CN105912631B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844340B (en) * 2017-01-10 2020-04-07 北京百度网讯科技有限公司 News abstract generating and displaying method, device and system based on artificial intelligence
CN110633407B (en) * 2018-06-20 2022-05-24 百度在线网络技术(北京)有限公司 Information retrieval method, device, equipment and computer readable medium
CN109800326B (en) * 2019-01-24 2021-07-02 广州虎牙信息科技有限公司 Video processing method, device, equipment and storage medium
CN117668206A (en) * 2022-08-24 2024-03-08 华为云计算技术有限公司 Knowledge searching method and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716244A (en) * 2003-12-29 2006-01-04 西安迪戈科技有限责任公司 Intelligent search, intelligent files system and automatic intelligent assistant
CN103678277A (en) * 2013-12-04 2014-03-26 东软集团股份有限公司 Theme-vocabulary distribution establishing method and system based on document segmenting
CN104504069A (en) * 2014-12-22 2015-04-08 北京奇虎科技有限公司 Building method and device for file index
CN104679730A (en) * 2015-02-13 2015-06-03 刘秀磊 Webpage summarization extraction method and device thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于主题词迭代提取的信息检索算法";赵英环 郭贵锁;《华南理工大学学报(自然科学版)》;20041130;第32卷;77-80

Also Published As

Publication number Publication date
CN105912631A (en) 2016-08-31

Similar Documents

Publication Publication Date Title
CN110059271B (en) Searching method and device applying tag knowledge network
CN108280155B (en) Short video-based problem retrieval feedback method, device and equipment
CN110298033B (en) Keyword corpus labeling training extraction system
US8005815B2 (en) Search engine
CN105912631B (en) Search processing method and device
US8918717B2 (en) Method and sytem for providing collaborative tag sets to assist in the use and navigation of a folksonomy
CN111783394B (en) Training method of event extraction model, event extraction method, system and equipment
CN107315738B (en) Method for assessing the degree of innovation of text information
CN104428769B (en) The information of text file reader is provided
CN102043812A (en) Method and system for retrieving medical information
CN102119385A (en) Method and subsystem for searching media content within a content-search-service system
CN102163213B (en) Voice browsing method and browser
JP6033697B2 (en) Image evaluation device
CN112948575B (en) Text data processing method, apparatus and computer readable storage medium
CN110489649B (en) Method and device for associating content with tag
CN109451147A (en) Information display method and device
CN112035675A (en) Medical text labeling method, device, equipment and storage medium
CN105653562A (en) Calculation method and apparatus for correlation between text content and query request
CN106777080B (en) Short abstract generation method, database establishment method and man-machine conversation method
CN109918555A (en) Method, apparatus, device and medium for providing search suggestions
CN107239564A (en) Text label recommendation method based on a supervised topic model
KR101607468B1 (en) Keyword tagging method and system for contents
WO2021257178A1 (en) Provide knowledge answers for knowledge-intention queries
CN105260396A (en) Word retrieval method and apparatus
JP6409071B2 (en) Sentence sorting method and calculator

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant