CN107301253A - Method and device for improving multi-site search keyword accuracy - Google Patents
- Publication number
- CN107301253A; CN201710732432.5A
- Authority
- CN
- China
- Prior art keywords
- information
- site
- search
- targeted website
- search key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method and device for improving multi-site search keyword accuracy, and relates to the field of Internet information. The method includes: obtaining association information between site information of a target website and preset search terms, where the site information is the latest site information of the target website at the current time and includes website content and website addresses; performing word segmentation on search information entered by a user to obtain search keywords, where the search information is information for searching the data of the target website; and looking up, according to the association information, the website content that matches the search keywords and pushing that content to the user. The invention alleviates the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching search keywords.
Description
Technical field
The present invention relates to the technical field of Internet information, and in particular to a method and device for improving multi-site search keyword accuracy.
Background
Since its commercial operation began in the mid-1990s, the Internet has developed rapidly around the world. With this rapid growth, the Internet has penetrated every area of daily life. It lets us follow current news promptly, obtain the latest knowledge and information, broaden our horizons, and enrich our everyday leisure.
However, along with this convenience comes the sheer volume and complexity of Internet content. That content not only covers a wide range of subjects but is also updated very quickly and changes constantly, through modification, addition, and deletion of content. With content changing daily, a great deal of duplicate content inevitably exists on the Internet.
Under existing search technology, after a user enters a search keyword in the search box of a web page, the following situations can occur: the desired content cannot be found, the content found is unrelated to the search keyword, or multiple copies of duplicate content are returned. As a result, prior-art search methods often suffer from poor accuracy when searching for web page content that matches a search keyword.
Summary of the invention
In view of this, an object of the present invention is to provide a method and device for improving multi-site search keyword accuracy, so as to alleviate the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching a search keyword.
In a first aspect, an embodiment of the present invention provides a method for improving multi-site search keyword accuracy, including:
obtaining association information between site information of a target website and preset search terms, where the site information is the latest site information of the target website at the current time and includes website content and website addresses;
performing word segmentation on search information entered by a user to obtain search keywords, where the search information is information for searching the data of the target website;
looking up, according to the association information, the website content that matches the search keywords, and pushing the website content to the user.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, in which, before the association information between the site information of the target website and the preset search terms is obtained, the method further includes:
obtaining a target crawl time;
controlling a crawler to execute the current crawl task at the target crawl time, so as to crawl the target website and obtain first site information;
determining preset search terms from the website content included in the first site information, and establishing association information between the preset search terms and the first site information;
storing the association information in a data server.
With reference to the first possible implementation of the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, in which controlling the crawler to execute the current crawl task at the target crawl time, so as to crawl the target website and obtain the first site information, includes:
when executing the current crawl task, crawling the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces included in the homepage information of the target website;
analyzing each hyperlink interface to determine whether it is a target hyperlink interface, where a target hyperlink interface is an interface that has not yet been crawled, is a correct hyperlink interface, and corresponds to web page content that includes preset web page content;
when a hyperlink interface is determined to be a target hyperlink interface, traversing the web page corresponding to that hyperlink interface to obtain the website content of the target hyperlink interface;
taking the website content and website address of each target hyperlink interface as the first site information.
With reference to the second possible implementation of the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, in which crawling the homepage of the target website includes:
determining whether this is the first time a crawl task is executed on the target website;
if not, analyzing second site information to determine whether the web page indicated by a target web address can be accessed through the target website, or whether web page content exists in the web page indicated by the target web address, where the second site information is the information crawled when the crawler executed a first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target web address is any website address in the second site information;
if the determination is yes, crawling the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces included in the homepage information;
if the determination is no, deleting the association information associated with the target web address from the data server.
With reference to the first possible implementation of the first aspect, an embodiment of the present invention further provides a fourth possible implementation of the first aspect, in which the method further includes:
determining whether this is the first time the current crawl task is executed on the target website;
if not, analyzing second site information to determine whether the web page indicated by a target web address can be accessed through the target website, or whether web page content exists in the web page indicated by the target web address, where the second site information is the information crawled when the crawler executed a first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target web address is any website address in the second site information;
if the determination is yes, performing the step of crawling the homepage of the target website;
if the determination is no, deleting the association information associated with the target web address from the data server.
With reference to the second possible implementation of the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, in which obtaining the target crawl time includes:
configuring the Java scheduler Quartz in advance to set the crawl times of the crawler, where Quartz is used to trigger the crawler on a schedule to execute crawl tasks;
extracting the target crawl time from the crawl times.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation of the first aspect, in which performing word segmentation on the search information entered by the user to obtain the search keywords includes:
performing word segmentation on the search information entered by the user with the IKAnalyzer segmenter to obtain the search keywords.
With reference to the first aspect, an embodiment of the present invention provides a seventh possible implementation of the first aspect, in which looking up, according to the association information, the website content that matches the search keywords includes: looking up the search keywords among the search terms of the association information; and determining, according to the degree of matching between the search terms and the search keywords, the site information associated with the search terms;
and pushing the website content to the user includes: pushing the website content in the site information to the user according to the degree of matching.
In a second aspect, an embodiment of the present invention further provides a device for improving multi-site search keyword accuracy, including:
a first acquisition module, configured to obtain association information between site information of a target website and preset search terms, where the site information is the latest site information of the target website at the current time and includes website content and website addresses;
a word segmentation module, configured to perform word segmentation on search information entered by a user to obtain search keywords, where the search information is information for searching the data of the target website;
a push module, configured to look up, according to the association information, the website content that matches the search keywords, and to push the website content to the user.
In a third aspect, an embodiment of the present invention further provides a computer-readable medium storing non-volatile program code executable by a processor, where the program code causes the processor to execute the method for improving multi-site search keyword accuracy described in the first aspect.
The embodiments of the present invention bring the following beneficial effects. Association information between site information of a target website and preset search terms is obtained, where the site information is the latest site information of the target website at the current time and includes website content and website addresses; word segmentation is performed on search information entered by a user to obtain search keywords, where the search information is information for searching the data of the target website; and the website content that matches the search keywords is looked up according to the association information and pushed to the user. In the embodiments of the present invention, the site information is the latest site information of the target website at the current time, so the association information is also the latest association information at the current time. The real-time nature of the association information ensures its accuracy, which alleviates the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching search keywords.
Other features and advantages of the present invention will be set out in the description that follows, and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention are realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
To make the above objects, features, and advantages of the present invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for improving multi-site search keyword accuracy provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of a method for crawling the homepage of a target website provided by Embodiment one of the present invention;
Fig. 3 is a schematic diagram of a device for improving multi-site search keyword accuracy provided by Embodiment two of the present invention;
Fig. 4 is a schematic diagram of another device for improving multi-site search keyword accuracy provided by Embodiment two of the present invention.
Reference numerals: 100: first acquisition module; 200: word segmentation module; 300: push module; 400: second acquisition module; 500: crawl module; 600: establishment module; 700: storage module.
Detailed description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Internet content not only covers a wide range of subjects but is also updated very quickly. Consequently, after a user enters a search keyword, the desired content may not be found, the content found may be unrelated to the search keyword, or multiple copies of duplicate content may be returned, so the web page content matched to the search keyword often suffers from poor accuracy. On this basis, the embodiments of the present invention provide a method and device for improving multi-site search keyword accuracy, which can alleviate the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching search keywords.
Embodiment one
An embodiment of the present invention provides a method for improving multi-site search keyword accuracy. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain association information between site information of a target website and preset search terms, where the site information is the latest site information of the target website at the current time and includes website content and website addresses.
Specifically, the target website may be a single website or multiple websites.
Further, because the information on many websites is updated in real time, and the site information above is the latest site information of the target website at the current time, the site information is real-time site information.
Step S104: perform word segmentation on the search information entered by the user to obtain search keywords, where the search information is information for searching the data of the target website.
Specifically, the search information entered by the user is usually a character string; word segmentation of that string yields the search keywords.
Step S106: look up, according to the association information, the website content that matches the search keywords, and push the website content to the user.
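Steps S102 to S106 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `association` dictionary, the `segment` and `search` helpers, and the example URLs are all hypothetical, and whitespace splitting stands in for a real segmenter.

```python
from typing import Dict, List, Tuple

# Hypothetical association information (S102):
# preset search term -> list of (website address, website content).
AssociationInfo = Dict[str, List[Tuple[str, str]]]

def segment(search_info: str) -> List[str]:
    """Stand-in for word segmentation (S104): lowercase and split on whitespace."""
    return search_info.lower().split()

def search(association: AssociationInfo, search_info: str) -> List[Tuple[str, str]]:
    """S106: return (address, content) pairs whose preset search term equals a keyword."""
    results: List[Tuple[str, str]] = []
    seen = set()
    for keyword in segment(search_info):
        for address, content in association.get(keyword, []):
            if address not in seen:  # avoid pushing the same page twice
                seen.add(address)
                results.append((address, content))
    return results

association = {
    "patent": [("https://example.com/a", "Patent filing guide")],
    "search": [("https://example.com/b", "Site search tips"),
               ("https://example.com/a", "Patent filing guide")],
}
hits = search(association, "Patent search")
```

The `seen` set is one simple way to keep duplicate content out of the results, which is one of the accuracy problems the background section describes.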
Note that steps S102 to S106 can be carried out by an execution device. The execution device can sit between a company intranet and the target website (the target website being on the external network relative to the company). The execution device communicates with the external network to obtain the association information between the site information of the target website and the preset search terms, and saves that association information. The execution device is also preconfigured with the segmentation rules used to segment the search information entered by the user. When a user on the company intranet wants to search the data of the target website, the execution device communicates with the client on the intranet to obtain the search information entered by the user, retrieves the saved association information, looks up the website content matching the search keywords according to the association information, and pushes that content to the user.
It should be emphasized that the site information is the latest site information of the target website at the current time, so the association information is also the latest association information at the current time. The real-time nature of the association information ensures its accuracy, which alleviates the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching search keywords.
An optional implementation of this embodiment describes in detail how the execution device communicates with the external network to obtain the association information between the site information of the target website and the preset search terms. It includes the following steps:
before obtaining the association information between the site information of the target website and the preset search terms, obtain a target crawl time;
control the crawler to execute the current crawl task at the target crawl time, so as to crawl the target website and obtain first site information;
determine preset search terms from the website content included in the first site information, and establish association information between the preset search terms and the first site information;
store the association information in a data server.
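The step of establishing association information can be pictured as building a small inverted index. The patent does not say how preset search terms are derived from website content, so the sketch below assumes a fixed `vocabulary` of candidate terms; `build_association`, `first_site_info`, and `data_server` are hypothetical names, and a plain dictionary stands in for the data server.

```python
from collections import defaultdict
from typing import Dict, List, Set

def build_association(pages: Dict[str, str], vocabulary: Set[str]) -> Dict[str, List[str]]:
    """Map each preset search term to the addresses whose content contains it."""
    association = defaultdict(list)
    for address in sorted(pages):                      # deterministic order
        words = set(pages[address].lower().split())
        for term in sorted(words & vocabulary):        # terms present on this page
            association[term].append(address)
    return dict(association)

# Hypothetical first site information: website address -> website content.
first_site_info = {
    "https://example.com/news": "crawler release news",
    "https://example.com/docs": "crawler documentation",
}
data_server = build_association(first_site_info, {"crawler", "news"})
```

Here each term maps to website addresses; mapping terms to the content itself is the other storage option the description mentions.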
Specifically, web page entity information, such as the address depth of a website and the information used to judge whether an address belongs to the target website, can also be stored in the data server so that later crawl tasks can crawl more efficiently.
Note that the association information stored in the data server covers two cases. In the first case it is the association between website content and preset search terms; in the second case it is the association between website addresses and preset search terms. In the first case, after the user enters search information, the website content matching the search keywords is looked up directly in the association information and pushed to the user. In the second case, after the user enters search information, the website address matching the search keywords is looked up in the association information, the website content of the web page indicated by that address is then retrieved, and that content is pushed to the user.
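The two storage cases differ only in whether one extra resolution step is needed. A minimal sketch, assuming hypothetical helpers `lookup_by_content` and `lookup_by_address` and a caller-supplied `fetch` function that returns the content for an address:

```python
from typing import Callable, Dict, List

def lookup_by_content(term_to_content: Dict[str, List[str]], keyword: str) -> List[str]:
    """Case 1: the association maps a preset search term directly to website content."""
    return list(term_to_content.get(keyword, []))

def lookup_by_address(term_to_address: Dict[str, List[str]], keyword: str,
                      fetch: Callable[[str], str]) -> List[str]:
    """Case 2: the association maps a term to addresses; content is resolved afterwards."""
    return [fetch(address) for address in term_to_address.get(keyword, [])]

# An in-memory page store stands in for retrieving the page behind an address.
page_store = {"https://example.com/a": "Page A text"}
case1 = lookup_by_content({"alpha": ["Page A text"]}, "alpha")
case2 = lookup_by_address({"alpha": ["https://example.com/a"]}, "alpha",
                          page_store.__getitem__)
```

Both calls yield the same content; the second keeps the stored association smaller at the cost of a lookup per address.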
The detailed way in which the crawler is controlled to execute the current crawl task at the target crawl time, and in which the target crawl time is obtained, is given in another optional implementation of this embodiment, as follows:
The Java scheduler Quartz is configured in advance to set the crawl times of the crawler, where Quartz is used to trigger the crawler on a schedule to execute crawl tasks; the target crawl time is then extracted from the crawl times.
Note that Quartz holds the preset times at which the crawler is triggered to execute crawl tasks, and the target crawl time determined before a crawl task is executed is the preset time that most closely precedes the current time.
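Choosing the target crawl time from the preset times can be illustrated without any scheduler at all. The sketch below is plain Python, not the Quartz API; `target_crawl_time` and the example `schedule` are hypothetical, and the dates are arbitrary.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional

def target_crawl_time(preset_times: List[datetime], now: datetime) -> Optional[datetime]:
    """Return the latest preset time that is not after `now`, or None if there is none."""
    ordered = sorted(preset_times)
    index = bisect_right(ordered, now)   # number of preset times <= now
    return ordered[index - 1] if index > 0 else None

# Hypothetical schedule: four crawl times on one day.
schedule = [datetime(2017, 8, 24, hour) for hour in (2, 8, 14, 20)]
picked = target_crawl_time(schedule, datetime(2017, 8, 24, 9, 30))
```

In a real deployment the scheduler itself would fire the job; this function only shows the "nearest preceding preset time" rule stated above.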
In another optional implementation of this embodiment, controlling the crawler to execute the current crawl task at the target crawl time, so as to crawl the target website and obtain the first site information, includes:
When executing the current crawl task, crawl the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces included in the homepage information of the target website, for example href interfaces and src interfaces.
Analyze each hyperlink interface to determine whether it is a target hyperlink interface, where a target hyperlink interface is an interface that has not yet been crawled, is a correct hyperlink interface, and corresponds to web page content that includes preset web page content. Here the preset web page content is the web page content one wants to obtain; if the content of a web page is of no interest, the hyperlink interface of that web page is not a target hyperlink interface.
When a hyperlink interface is determined to be a target hyperlink interface, traverse the web page corresponding to that hyperlink interface to obtain the website content of the target hyperlink interface.
Take the website content and website address of each target hyperlink interface as the first site information.
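The three conditions for a target hyperlink interface (not yet crawled, well formed, pointing at wanted content) can be sketched with the Python standard library. `LinkCollector`, `target_links`, and the `is_wanted` predicate are hypothetical; in particular, `is_wanted` here judges the URL only, whereas a real crawler would presumably inspect the fetched page content.

```python
from html.parser import HTMLParser
from typing import Callable, List, Set
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href and src attribute values from a page."""
    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def target_links(base_url: str, html: str, crawled: Set[str],
                 is_wanted: Callable[[str], bool]) -> List[str]:
    """Keep links that are well formed, not yet crawled, and point to wanted content."""
    collector = LinkCollector()
    collector.feed(html)
    result: List[str] = []
    for raw in collector.links:
        url = urljoin(base_url, raw)          # resolve relative links
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue                          # not a correct hyperlink
        if url in crawled or url in result:
            continue                          # already crawled or already queued
        if is_wanted(url):
            result.append(url)
    return result

homepage = ('<a href="/news">News</a><img src="/logo.png">'
            '<a href="mailto:x@example.com">Mail</a><a href="/old">Old</a>')
links = target_links("https://example.com/", homepage,
                     crawled={"https://example.com/old"},
                     is_wanted=lambda u: not u.endswith(".png"))
```

Only `/news` survives: the image is unwanted, the `mailto:` link is not an HTTP hyperlink, and `/old` was crawled before.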
Note that this embodiment describes how the crawler crawls the homepage of a website. For websites of different depths, the crawler can crawl deeper web pages in the same way.
In another optional implementation of this embodiment, as shown in Fig. 2, crawling the homepage of the target website includes the following steps:
Step S201: determine whether this is the first time a crawl task is executed on the target website; if not, perform step S202; if so, perform step S203.
Step S202: analyze the second site information to determine whether the web page indicated by a target web address can be accessed through the target website (that is, whether the web page at the target web address exists), or whether web page content exists in the web page indicated by the target web address (that is, whether the page information at the target web address exists). The second site information is the information crawled when the crawler executed the first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target web address is any website address in the second site information. If the determination is yes, perform step S203; if no, perform step S204.
Step S203: crawl the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces included in the homepage information.
Step S204: delete the association information associated with the target web address from the data server, then return to step S203 and continue.
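The verification in steps S201 to S204 amounts to pruning stale entries before the next crawl. In this sketch, `prepare_crawl` is a hypothetical helper, `fetch` is an injected function that returns page content or `None` when a page is unreachable, and a dictionary again stands in for the data server.

```python
from typing import Callable, Dict, List, Optional

def prepare_crawl(is_first_crawl: bool,
                  previous_addresses: List[str],
                  association: Dict[str, List[str]],
                  fetch: Callable[[str], Optional[str]]) -> Dict[str, List[str]]:
    """Drop associations for previously crawled addresses that no longer resolve."""
    if is_first_crawl:                               # S201: nothing to verify yet
        return association
    dead = {address for address in previous_addresses
            if not fetch(address)}                   # S202: unreachable or empty page
    cleaned: Dict[str, List[str]] = {}
    for term, addresses in association.items():      # S204: delete stale entries
        kept = [a for a in addresses if a not in dead]
        if kept:
            cleaned[term] = kept
    return cleaned                                   # caller then crawls the homepage (S203)

live_pages = {"https://example.com/a": "content"}
stored = {"alpha": ["https://example.com/a", "https://example.com/gone"],
          "beta": ["https://example.com/gone"]}
cleaned = prepare_crawl(False,
                        ["https://example.com/a", "https://example.com/gone"],
                        stored, live_pages.get)
```

After pruning, a search for "beta" returns nothing instead of a link that leads nowhere.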
Note that in this embodiment the target web address is any website address in the second site information, and the second site information is the information crawled when the crawler executed the first crawl task. The steps above therefore verify the site information obtained by the previous crawl. This avoids the situation in which a search result lists the target web address but following its link yields no related website content, and so avoids the search errors that would follow.
In another optional implementation of this embodiment, before the association information is stored in the data server, the method for improving multi-site search keyword accuracy further includes:
determining whether this is the first time the data server stores association information;
if it is, clearing the data already stored in the data server, so that the data server does not retain association information obtained earlier by crawling websites other than the target website, or other dirty data.
In another optional implementation of this embodiment, performing word segmentation on the search information entered by the user to obtain the search keywords includes:
performing word segmentation on the search information entered by the user with the IKAnalyzer segmenter to obtain the search keywords.
Specifically, the common segmentation routines are first managed through a general segmentation interface. The return value of segmentation needs to cover several cases, including: first, the result is returned as a Map collection of key-value pairs; second, the result is returned as a Set collection whose elements are String tokens.
In addition, segmentation with the IKAnalyzer segmenter supports two segmentation types, smart segmentation and fine-grained segmentation, so a character string can be split in different ways as needed.
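The difference between coarse ("smart") and fine-grained segmentation can be illustrated with a toy dictionary-based segmenter. This is not IKAnalyzer's actual algorithm or API: forward maximum matching is swapped in for the coarse mode and exhaustive dictionary lookup for the fine-grained mode, and `smart_segment`, `fine_segment`, and the word list are all hypothetical.

```python
from typing import List, Set

def smart_segment(text: str, dictionary: Set[str]) -> List[str]:
    """Coarse mode: forward maximum matching, one longest word per position."""
    longest = max(map(len, dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:   # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens

def fine_segment(text: str, dictionary: Set[str]) -> List[str]:
    """Fine-grained mode: every dictionary word found at any position."""
    longest = max(map(len, dictionary), default=1)
    return [text[i:i + size]
            for i in range(len(text))
            for size in range(1, min(longest, len(text) - i) + 1)
            if text[i:i + size] in dictionary]

words = {"search", "key", "keyword", "word"}
smart = smart_segment("searchkeyword", words)   # unspaced text, as in Chinese input
fine = fine_segment("searchkeyword", words)
```

The coarse mode yields two tokens (`search`, `keyword`), while the fine-grained mode also surfaces the shorter words `key` and `word` contained in `keyword`, which widens what a query can match.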
In addition, the IKAnalyzer dictionary can be updated promptly so that segmentation quality improves.
This embodiment uses the IKAnalyzer segmenter, which works by text matching. It does not require large amounts of manual training and labeling, supports custom dictionaries so that domain-specific words are easy to add, and can produce segmentation results at multiple granularities.
In another optional implementation of this embodiment, looking up, according to the association information, the website content that matches the search keywords includes: looking up the search keywords among the search terms of the association information; and determining, according to the degree of matching between the search terms and the search keywords, the site information associated with the search terms.
Pushing the website content to the user includes: pushing the website content in the site information to the user according to the degree of matching.
Specifically, this can be implemented with a search server (for example Solr; Solr is a standalone data server that, based on the search information entered by the user, builds indexes to find and return results quickly). The user can also submit a search request containing the search information via HTTP GET.
When the search keywords consist of multiple words, they can be joined with OR, and the preset search terms are then matched through a Client to obtain the website content. When the website content in the site information is pushed to the user, the filtered website content can be sorted by how many search keywords its corresponding preset search terms contain, and the search results displayed page by page in that order.
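Joining keywords with OR and ranking by keyword hits can be sketched as follows. The code does not call Solr; `or_query` only builds a query string in the common `a OR b` style, and `rank_contents` ranks in memory. Both helpers and the page names are hypothetical.

```python
from typing import Dict, List, Set

def or_query(keywords: List[str]) -> str:
    """Join search keywords with OR into a single query string."""
    return " OR ".join(keywords)

def rank_contents(keywords: List[str], content_terms: Dict[str, Set[str]]) -> List[str]:
    """Order contents by how many search keywords their preset search terms contain."""
    wanted = set(keywords)
    scored = [(len(wanted & terms), content) for content, terms in content_terms.items()]
    scored = [item for item in scored if item[0] > 0]    # drop non-matching content
    scored.sort(key=lambda item: (-item[0], item[1]))    # most hits first, then by name
    return [content for _, content in scored]

query = or_query(["patent", "search"])
ranking = rank_contents(["patent", "search"],
                        {"page-a": {"search"},
                         "page-b": {"patent", "search"},
                         "page-c": {"crawler"}})
```

Sorting by hit count is one straightforward reading of the "degree of matching"; a search server would normally apply its own relevance scoring instead.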
Embodiment two
An embodiment of the present invention provides a device for improving multi-site search keyword accuracy which, as shown in Fig. 3, includes:
a first acquisition module 100, configured to obtain association information between site information of a target website and preset search terms, where the site information is the latest site information of the target website at the current time and includes website content and website addresses;
a word segmentation module 200, configured to perform word segmentation on search information entered by a user to obtain search keywords, where the search information is information for searching the data of the target website;
a push module 300, configured to look up, according to the association information, the website content that matches the search keywords, and to push the website content to the user.
In this embodiment, the first acquisition module 100 obtains the association information between the site information of the target website and the preset search terms, where the site information is the latest site information of the target website at the current time and includes website content and website addresses; the word segmentation module 200 performs word segmentation on the search information entered by the user to obtain search keywords, where the search information is information for searching the data of the target website; and the push module 300 looks up, according to the association information, the website content that matches the search keywords and pushes the website content to the user. Because the site information is the latest site information of the target website at the current time, the association information is also the latest association information at the current time. The real-time nature of the association information ensures its accuracy, which alleviates the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching search keywords.
In another optional implementation of this embodiment, as shown in Figure 4, the device for improving multi-site search keyword accuracy further includes:
A second acquisition module 400, configured to obtain a target crawl time;
A crawling module 500, configured to control a crawler, at the target crawl time, to execute the current crawl task so as to crawl the target website and obtain first site information;
An establishing module 600, configured to determine preset search terms from the website content included in the first site information, and to establish association information between the preset search terms and the first site information;
A storage module 700, configured to store the association information in a data server.
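As a rough illustration of how the establishing module 600 and the storage module 700 might cooperate, the following Python sketch derives search terms from crawled content and records which website addresses each term is associated with. The `SitePage` type, the whitespace-based `extract_terms` helper, and the in-memory dictionary standing in for the data server are all assumptions made for illustration; the patent does not specify how the preset search terms are chosen or how the data server stores them.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SitePage:
    """One crawled page: website address plus website content (hypothetical type)."""
    url: str
    content: str


def extract_terms(content: str) -> set[str]:
    # Hypothetical term extraction: lowercase whitespace-separated tokens.
    # The patent does not say how preset search terms are derived.
    return {token.lower() for token in content.split()}


def build_associations(pages: list[SitePage]) -> dict[str, set[str]]:
    """Map each preset search term to the addresses of pages containing it."""
    associations: dict[str, set[str]] = defaultdict(set)
    for page in pages:
        for term in extract_terms(page.content):
            associations[term].add(page.url)
    return dict(associations)


pages = [
    SitePage("http://example.test/a", "firewall policy guide"),
    SitePage("http://example.test/b", "firewall release notes"),
]
# In-memory stand-in for the data server of the storage module.
data_server = build_associations(pages)
```

Keeping the mapping keyed by search term means a later lookup by search keyword is a single dictionary access, which is one plausible reason to store the association in this direction.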
In another optional implementation of this embodiment, the crawling module includes:
A crawling unit, configured to crawl the homepage of the target website when the current crawl task is executed, obtaining the homepage content of the target website and the hyperlink interfaces contained in the homepage information of the target website;
A determining unit, configured to analyze each hyperlink interface and determine whether it is a target hyperlink interface, where a target hyperlink interface is an interface that has not been crawled before, is a valid hyperlink interface, and corresponds to a web page whose content includes preset web page content;
A traversal unit, configured to, when a target hyperlink interface is determined, traverse the web page corresponding to that hyperlink interface to obtain the website content of the target hyperlink interface;
A determining unit, configured to take the website content and website address of each target hyperlink interface as the first site information.
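The three conditions that define a target hyperlink interface can be sketched as a simple filtered traversal. In the Python sketch below, the in-memory `SITE` table, the `fetch` helper, and the `required_text` parameter are hypothetical stand-ins for real HTTP access and for the preset web page content. Following the links found on each accepted page is also an assumption, since the patent only states that the corresponding web page is traversed.

```python
from __future__ import annotations

# Hypothetical in-memory website: address -> (page content, outgoing hyperlinks).
SITE = {
    "/": ("home", ["/news", "/broken", "/ads", "/news"]),
    "/news": ("product news", ["/", "/news/1"]),
    "/news/1": ("product launch", []),
    "/ads": ("sponsored", []),
}


def fetch(url: str):
    """Return (content, links) for an address, or None if the link is invalid."""
    return SITE.get(url)


def crawl(home: str, required_text: str) -> dict[str, str]:
    """Crawl from the homepage and collect pages behind target hyperlinks."""
    first_site_info: dict[str, str] = {}
    crawled = {home}
    home_page = fetch(home)
    if home_page is None:
        return first_site_info
    pending = list(home_page[1])
    while pending:
        url = pending.pop(0)
        if url in crawled:                # condition 1: not crawled before
            continue
        crawled.add(url)
        page = fetch(url)
        if page is None:                  # condition 2: the hyperlink is valid
            continue
        content, links = page
        if required_text not in content:  # condition 3: preset content present
            continue
        first_site_info[url] = content    # website address -> website content
        pending.extend(links)             # assumed: keep traversing from this page
    return first_site_info


result = crawl("/", "product")
```

In this example the broken link and the page without the preset content are skipped, and the duplicate link to `/news` is visited only once.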
In another optional implementation of this embodiment, the crawling unit is further configured to:
Judge whether the crawl task is being executed on the target website for the first time;
If not, analyze second site information to determine whether the web page indicated by a target URL can be accessed through the target website, or whether web page content exists in the web page indicated by the target URL, where the second site information is the information crawled when the crawler executed a first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target URL is any website address in the second site information;
If the determination is yes, crawl the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces contained in the homepage information;
If the determination is no, delete the association information associated with the target URL from the data server.
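The clean-up performed before a repeat crawl can be sketched as follows. The function name `prune_stale`, the two callback parameters, and the dictionary standing in for the data server are assumptions for illustration. On a first crawl the list of previously crawled addresses is simply empty, so nothing is removed.

```python
from __future__ import annotations

from typing import Callable, Iterable


def prune_stale(
    associations: dict[str, set[str]],
    previous_urls: Iterable[str],
    is_reachable: Callable[[str], bool],
    has_content: Callable[[str], bool],
) -> set[str]:
    """Delete associations for addresses that are unreachable or empty.

    `previous_urls` plays the role of the second site information, that is,
    the addresses collected by the previous crawl task.
    """
    stale = {
        url for url in previous_urls
        if not (is_reachable(url) and has_content(url))
    }
    for term in list(associations):
        associations[term] -= stale
        if not associations[term]:   # no page left for this search term
            del associations[term]
    return stale


# Hypothetical state: "/b" was crawled last time but no longer exists.
live_pages = {"/a": "firewall overview"}
associations = {"firewall": {"/a", "/b"}, "guide": {"/b"}}
removed = prune_stale(
    associations,
    previous_urls=["/a", "/b"],
    is_reachable=lambda url: url in live_pages,
    has_content=lambda url: bool(live_pages.get(url)),
)
```

Removing stale entries before the new crawl is what keeps a search from returning pages that have since disappeared from the target website.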
In another optional implementation of this embodiment, the device for improving multi-site search keyword accuracy further includes:
A judging module, configured to judge whether the data server is storing the association information for the first time;
An emptying module, configured to clear the data stored in the data server when the data server is judged to be storing the association information for the first time.
In another optional implementation of this embodiment, the second acquisition module is configured to:
Configure the Java timer Quartz in advance to set the crawl times of the crawler, where the Java timer Quartz is used to trigger the crawler on schedule to execute crawl tasks;
Extract the target crawl time from the crawl times.
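The patent relies on the Java timer Quartz for scheduling. The Python sketch below does not use Quartz; it only illustrates the idea of extracting a target crawl time from a set of configured crawl times. The daily schedule in `configured` and the interpretation of the target crawl time as the next upcoming configured time are assumptions.

```python
from __future__ import annotations

from datetime import datetime, time, timedelta


def next_crawl_time(crawl_times: list[time], now: datetime) -> datetime:
    """Return the earliest configured daily crawl time strictly after `now`."""
    if not crawl_times:
        raise ValueError("at least one crawl time must be configured")
    candidates = []
    for crawl_time in crawl_times:
        candidate = datetime.combine(now.date(), crawl_time)
        if candidate <= now:
            candidate += timedelta(days=1)  # already passed today; use tomorrow
        candidates.append(candidate)
    return min(candidates)


configured = [time(2, 0), time(14, 0)]  # hypothetical daily schedule
target = next_crawl_time(configured, datetime(2017, 8, 23, 9, 30))
```

In a Quartz deployment the same schedule would normally be expressed as a trigger configuration rather than computed by hand; the function above only makes the selection step explicit.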
In another optional implementation of this embodiment, the word segmentation module is configured to:
Perform word segmentation on the search information entered by the user with the IKAnalyzer word segmenter to obtain the search keyword.
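To make the word segmentation step concrete, the Python sketch below uses forward maximum matching against a small dictionary. This is a deliberately simple stand-in and is not IKAnalyzer's algorithm or API; the `segment` function, the `max_len` parameter, and the four-word `DICTIONARY` are hypothetical.

```python
from __future__ import annotations


def segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Split text into words by forward maximum matching against a dictionary."""
    words: list[str] = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical word list: "website", "search", "keyword", "accuracy".
DICTIONARY = {"网站", "搜索", "关键词", "准确性"}
keywords = segment("网站搜索关键词", DICTIONARY)
```

Here the query "网站搜索关键词" ("website search keyword") is split into three dictionary words, each of which can then be used as a search keyword.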
In another optional implementation of this embodiment, the pushing module is configured to:
Look up the search keyword among the search terms of the association information, and determine, according to the degree of match between each search term and the search keyword, the site information associated with that search term;
Push the website content in the site information to the user according to the degree of match.
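The lookup and push steps can be sketched as a scoring pass over the stored search terms followed by a sort. The patent does not define how the degree of match is computed, so the `match_degree` score below (1.0 for an exact match, partial credit when the keyword is contained in the term) is purely an assumption, as are the function names and the sample data.

```python
from __future__ import annotations


def match_degree(term: str, keyword: str) -> float:
    """Hypothetical score: 1.0 if identical, partial credit for containment."""
    if term == keyword:
        return 1.0
    if keyword and keyword in term:
        return len(keyword) / len(term)
    return 0.0


def rank_content(
    associations: dict[str, set[str]],
    contents: dict[str, str],
    keywords: list[str],
) -> list[tuple[str, str]]:
    """Return (address, content) pairs ordered by their best degree of match."""
    best: dict[str, float] = {}
    for term, urls in associations.items():
        score = max((match_degree(term, k) for k in keywords), default=0.0)
        if score == 0.0:
            continue  # this search term does not match any keyword
        for url in urls:
            best[url] = max(best.get(url, 0.0), score)
    ordered = sorted(best, key=lambda url: (-best[url], url))
    return [(url, contents[url]) for url in ordered]


associations = {"firewall": {"/a"}, "firewalls": {"/b"}, "router": {"/c"}}
contents = {"/a": "Firewall guide", "/b": "Firewalls compared", "/c": "Router setup"}
pushed = rank_content(associations, contents, ["firewall"])
```

Sorting by descending score with the address as a tie-breaker gives a deterministic push order, so the best-matching website content reaches the user first.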
Embodiment three
An embodiment of the present invention provides a computer-readable medium storing non-volatile program code executable by a processor, where the program code causes the processor to perform the method for improving multi-site search keyword accuracy described in the foregoing embodiment. Because the site information is the latest website information of the target website at the current time, the association information is likewise the latest association information at the current time. The real-time nature of the association information ensures its accuracy, which alleviates the technical problem of poor accuracy that arises when web page content matching a search keyword is retrieved with prior-art search methods.
The computer program product of the method and device for improving multi-site search keyword accuracy provided by the embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to perform the method described in the foregoing method embodiment; for the specific implementation, refer to the method embodiment, which is not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system and device described above may be understood by reference to the corresponding processes in the foregoing method embodiment, and are not repeated here.
In addition, in the description of the embodiments of the present invention, unless otherwise expressly specified and limited, the terms "installed", "connected with", and "connected" should be interpreted broadly. For example, a connection may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediary, or an internal connection between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In the description of the present invention, it should be noted that orientation or positional terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the present invention.
In addition, the terms "first", "second", and "third" are used for descriptive purposes only and should not be understood as indicating or implying relative importance.
Finally, it should be noted that the embodiments described above are merely specific implementations of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent replacements of some of their technical features. Such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for improving multi-site search keyword accuracy, characterized by comprising:
obtaining association information between site information of a target website and preset search terms, wherein the site information is the latest website information of the target website at the current time, and the site information includes website content and website addresses;
performing word segmentation on search information entered by a user to obtain a search keyword, wherein the search information is information used to search the data of the target website;
looking up, according to the association information, website content that matches the search keyword, and pushing the website content to the user.
2. The method according to claim 1, characterized in that, before obtaining the association information between the site information of the target website and the preset search terms, the method further comprises:
obtaining a target crawl time;
controlling a crawler, at the target crawl time, to execute a current crawl task so as to crawl the target website and obtain first site information;
determining the preset search terms from the website content included in the first site information, and establishing the association information between the preset search terms and the first site information;
storing the association information in a data server.
3. The method according to claim 2, characterized in that controlling the crawler, at the target crawl time, to execute the current crawl task so as to crawl the target website and obtain the first site information comprises:
when the current crawl task is executed, crawling the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces contained in the homepage information of the target website;
analyzing each hyperlink interface to determine whether it is a target hyperlink interface, wherein a target hyperlink interface is an interface that has not been crawled before, is a valid hyperlink interface, and corresponds to a web page whose content includes preset web page content;
when a target hyperlink interface is determined, traversing the web page corresponding to that hyperlink interface to obtain the website content of the target hyperlink interface;
taking the website content and website address of each target hyperlink interface as the first site information.
4. The method according to claim 3, characterized in that crawling the homepage of the target website comprises:
judging whether the crawl task is being executed on the target website for the first time;
if not, analyzing second site information to determine whether the web page indicated by a target URL can be accessed through the target website, or whether web page content exists in the web page indicated by the target URL, wherein the second site information is the information crawled when the crawler executed a first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target URL is any website address in the second site information;
wherein, if the determination is yes, crawling the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces contained in the homepage information;
if the determination is no, deleting the association information associated with the target URL from the data server.
5. The method according to claim 2, characterized in that, before storing the association information in the data server, the method further comprises:
judging whether the data server is storing the association information for the first time;
when the data server is judged to be storing the association information for the first time, clearing the data stored in the data server.
6. The method according to claim 2, characterized in that obtaining the target crawl time comprises:
configuring the Java timer Quartz in advance to set the crawl times of the crawler, wherein the Java timer Quartz is used to trigger the crawler on schedule to execute crawl tasks;
extracting the target crawl time from the crawl times.
7. The method according to claim 1, characterized in that performing word segmentation on the search information entered by the user to obtain the search keyword comprises:
performing word segmentation on the search information entered by the user with the IKAnalyzer word segmenter to obtain the search keyword.
8. The method according to claim 1, characterized in that:
looking up, according to the association information, the website content that matches the search keyword comprises: looking up the search keyword among the search terms of the association information, and determining, according to the degree of match between each search term and the search keyword, the site information associated with that search term;
pushing the website content to the user comprises: pushing the website content in the site information to the user according to the degree of match.
9. A device for improving multi-site search keyword accuracy, characterized by comprising:
a first acquisition module, configured to obtain association information between site information of a target website and preset search terms, wherein the site information is the latest website information of the target website at the current time, and the site information includes website content and website addresses;
a word segmentation module, configured to perform word segmentation on search information entered by a user to obtain a search keyword, wherein the search information is information used to search the data of the target website;
a pushing module, configured to look up, according to the association information, website content that matches the search keyword, and to push the website content to the user.
10. A computer-readable medium storing non-volatile program code executable by a processor, characterized in that the program code causes the processor to perform the method for improving multi-site search keyword accuracy according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710732432.5A CN107301253B (en) | 2017-08-23 | 2017-08-23 | Method and device for improving accuracy of multi-site search keywords |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710732432.5A CN107301253B (en) | 2017-08-23 | 2017-08-23 | Method and device for improving accuracy of multi-site search keywords |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107301253A (en) | 2017-10-27 |
CN107301253B (en) | 2020-02-04 |
Family
ID=60132524
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710732432.5A Active CN107301253B (en) | 2017-08-23 | 2017-08-23 | Method and device for improving accuracy of multi-site search keywords |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107301253B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108519984A (en) * | 2018-02-07 | 2018-09-11 | 平安科技(深圳)有限公司 | weather data processing method, server and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102135967A (en) * | 2010-01-27 | 2011-07-27 | 华为技术有限公司 | Webpage keywords extracting method, device and system |
CN102446225A (en) * | 2012-01-11 | 2012-05-09 | 深圳市爱咕科技有限公司 | Real-time search method, device and system |
CN102456057A (en) * | 2010-11-01 | 2012-05-16 | 阿里巴巴集团控股有限公司 | Retrieval method, device and server based on online trading platform |
CN102591992A (en) * | 2012-02-15 | 2012-07-18 | 苏州亚新丰信息技术有限公司 | Webpage classification identifying system and method based on vertical search and focused crawler technology |
CN103778122A (en) * | 2012-10-17 | 2014-05-07 | 腾讯科技(深圳)有限公司 | Searching method and system |
Also Published As
Publication number | Publication date |
---|---|
CN107301253B (en) | 2020-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10146785B2 (en) | Operator-guided application crawling architecture | |
US8751466B1 (en) | Customizable answer engine implemented by user-defined plug-ins | |
Kumar et al. | Keyword query based focused Web crawler | |
US8386495B1 (en) | Augmented resource graph for scoring resources | |
CN103365924B (en) | A kind of method of internet information search, device and terminal | |
JP4848388B2 (en) | How to calculate a score for a search query | |
US20130290319A1 (en) | Performing application searches | |
CN102760151B (en) | Implementation method of open source software acquisition and searching system | |
CN107145496A (en) | The method for being matched image with content item based on keyword | |
CN108052632B (en) | Network information acquisition method and system and enterprise information search system | |
CN102930059A (en) | Method for designing focused crawler | |
CN103049542A (en) | Domain-oriented network information search method | |
CN105159930A (en) | Search keyword pushing method and apparatus | |
US20230229714A1 (en) | Identifying Information Using Referenced Text | |
CN102385613A (en) | Web page positioning method and system | |
CN107679226B (en) | Tourism body constructing method based on theme | |
CN102270331A (en) | Network shopping navigating method based on visual search | |
CN103530339A (en) | Mobile application information push method and device | |
CN112948547B (en) | Logging knowledge graph construction query method, device, equipment and storage medium | |
CN108572971B (en) | Method and device for mining keywords related to search terms | |
CN102760150A (en) | Webpage extraction method based on attribute reproduction and labeled path | |
CN102222098A (en) | Method and system for pre-fetching webpage | |
CN105022775A (en) | Apparatus and method for structuring web page access history | |
US20160103913A1 (en) | Method and system for calculating a degree of linkage for webpages | |
CN103942211B (en) | A kind of recognition methods of text page and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 310051 No. 188 Lianhui Street, Xixing Street, Binjiang District, Hangzhou City, Zhejiang Province. Applicant after: Hangzhou Annan information technology Limited by Share Ltd. Address before: Zhejiang Zhongcai Building No. 68 Binjiang District road Hangzhou City, Zhejiang Province, the 310051 and 15 layer. Applicant before: Dbappsecurity Co.,ltd. |
GR01 | Patent grant | ||