CN107301253B - Method and device for improving accuracy of multi-site search keywords - Google Patents

Publication number
CN107301253B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710732432.5A
Other languages
Chinese (zh)
Other versions
CN107301253A (en)
Inventor
李成
范渊
黄进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dbappsecurity Technology Co Ltd
Original Assignee
Hangzhou Dbappsecurity Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention provides a method and a device for improving accuracy of multi-site search keywords, relating to the field of internet information. The method comprises: acquiring associated information between website information of a target website and a preset search word, wherein the website information is the latest website information of the target website at the current moment and comprises website content and a website address; performing word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching data of the target website; and searching, according to the associated information, for the website content matched with the search keywords, and pushing the website content to the user. The invention alleviates the technical problem of poor accuracy that arises when prior-art search methods look up webpage content matched with a search keyword.

Description

Method and device for improving accuracy of multi-site search keywords
Technical Field
The invention relates to the technical field of internet information, in particular to a method and a device for improving accuracy of multi-site search keywords.
Background
The Internet has developed rapidly around the world since it entered commercial operation in the mid-1990s. With this rapid development, the Internet has penetrated every field of daily life. It helps people follow current affairs in time, acquire the latest knowledge and information, broaden their horizons, and enrich their everyday entertainment.
However, while people enjoy the convenience of the Internet, they also find its content complicated. Internet content not only covers a wide range of subjects but is also updated quickly: it changes constantly, with content being modified, added, and deleted. Moreover, as content changes every day, a large amount of duplicate content accumulates on the Internet.
Against this background, after a user inputs a search keyword in the search box of a web page with existing search technology, the following situations occur: the content to be searched cannot be found, the content found is unrelated to the search keyword, or many duplicate results are returned. Prior-art search methods therefore often suffer from poor accuracy when searching for webpage content matched with a search keyword.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for improving accuracy of a multi-site search keyword, so as to alleviate the technical problem of poor accuracy when searching for web content matched with the search keyword by using a search method in the prior art.
In a first aspect, an embodiment of the present invention provides a method for improving accuracy of a multi-site search keyword, including:
acquiring associated information between website information of a target website and a preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address;
performing word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching the target website data;
and searching the website content matched with the search keyword according to the associated information, and pushing the website content to the user.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where before acquiring association information between website information of a target website and a preset search term, the method further includes:
acquiring target crawling time;
controlling a crawler to execute a current crawling task at the target crawling time so as to crawl a target website to obtain first website information;
determining a preset search word according to website content included in the first website information, and establishing associated information between the preset search word and the first website information;
and storing the associated information in a data server.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where controlling, at the target crawling time, a crawler to execute a current crawling task to crawl the target website to obtain first website information includes:
when the current crawling task is executed, crawling the home page of the target website to obtain the home page content of the target website and a hyperlink interface contained in the home page information of the target website;
analyzing the hyperlink interface to determine whether the hyperlink interface is a target hyperlink interface, wherein the target hyperlink interface is an interface which is not crawled, the target hyperlink interface is a correct hyperlink interface, and the webpage content corresponding to the target hyperlink interface contains preset webpage content;
traversing the webpage corresponding to the hyperlink interface under the condition that the target hyperlink interface is determined to obtain the website content of the target hyperlink interface;
and taking the website content and the website address of each target hyperlink interface as the first website information.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where crawling the top page of the target website includes:
judging whether the current crawling task is the first crawling task executed on the target website;
if not, analyzing second website information to determine whether the webpage indicated by a target website address can be accessed through that address, or whether webpage content exists in the webpage indicated by the target website address, wherein the second website information is the information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target website address is any website address in the second website information;
if the determination is yes, crawling the home page of the target website to obtain the home page content of the target website and the hyperlink interfaces contained in the home page information;
and if the determination is no, deleting the associated information associated with the target website address from the data server.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention further provides a fourth possible implementation manner of the first aspect, where the method further includes:
judging whether the current crawling task is the first crawling task executed on the target website;
if not, analyzing second website information to determine whether the webpage indicated by a target website address can be accessed through that address, or whether webpage content exists in the webpage indicated by the target website address, wherein the second website information is the information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target website address is any website address in the second website information;
if the determination is yes, crawling the home page of the target website;
and if the determination is no, deleting the associated information associated with the target website address from the data server.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the obtaining a target crawling time includes:
setting a timer quartz of Java in advance to set the crawling time of the crawler, wherein the timer quartz of Java is used for triggering the crawler to execute a crawling task at regular time;
and extracting target crawling time from the crawling time.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where performing a word segmentation process on search information input by a user to obtain a search keyword, where the word segmentation process includes:
and performing word segmentation processing on the search information input by the user through an IKAnalyzer word segmenter to obtain search keywords.
With reference to the first aspect, an embodiment of the present invention provides a seventh possible implementation manner of the first aspect, where finding, according to the associated information, website content that matches the search keyword includes: searching for the search keyword among the search words of the associated information; and determining the website information associated with the search words according to the matching degree between the search words and the search keyword;
pushing the website content to the user comprises: and pushing the website content in the website information to the user according to the matching degree.
In a second aspect, an embodiment of the present invention further provides an apparatus for improving accuracy of a multi-site search keyword, where the apparatus includes:
a first acquisition module, configured to acquire associated information between website information of a target website and a preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address;
the word segmentation module is used for carrying out word segmentation on search information input by a user to obtain search keywords, wherein the search information is information for searching the target website data;
and the pushing module is used for searching the website content matched with the search keyword according to the associated information and pushing the website content to the user.
In a third aspect, an embodiment of the present invention further provides a computer-readable medium having a non-volatile program code executable by a processor, where the program code causes the processor to execute the method for improving accuracy of a multi-site search keyword according to the first aspect.
The embodiment of the invention has the following beneficial effects: acquiring associated information between website information of a target website and a preset search word, wherein the website information is latest website information of the target website at the current moment, and the website information comprises website content and a website address; performing word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching target website data; and searching the website content matched with the search keyword according to the associated information, and pushing the website content to the user. In the embodiment of the invention, the website information is the latest website information of the target website at the current moment, the associated information is also the latest associated information at the current moment, and the accuracy of the associated information is ensured by the real-time property of the associated information, so that the technical problem of poor accuracy when the webpage content matched with the search keyword is searched by the search method in the prior art is solved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a method for improving accuracy of a multi-site search keyword according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for crawling a home page of a target website according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an apparatus for improving accuracy of a multi-site search keyword according to a second embodiment of the present invention;
Fig. 4 is a schematic diagram of another apparatus for improving accuracy of a multi-site search keyword according to a second embodiment of the present invention.
Reference numerals: 100-first acquisition module; 200-word segmentation module; 300-pushing module; 400-second acquisition module; 500-crawling module; 600-establishing module; 700-storage module.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Internet content not only covers a wide range of subjects but is also updated quickly. As a result, after a user inputs a search keyword, the content to be searched may not be found, the content found may be unrelated to the search keyword, or many duplicate results may be returned, so searching for webpage content matched with a search keyword often suffers from poor accuracy. On this basis, the method and the device for improving the accuracy of multi-site search keywords provided by the embodiments of the invention can alleviate the technical problem of poor accuracy that arises when prior-art search methods look up webpage content matched with a search keyword.
Example one
The method for improving the accuracy of the multi-site search keyword provided by the embodiment of the invention comprises the following steps as shown in fig. 1:
Step S102, acquiring the associated information between the website information of the target website and the preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address.
Specifically, the target website includes a single website site or a plurality of website sites.
In addition, since the information of some websites is updated in real time, the website information is taken to be the latest website information of the target website at the current moment; it is therefore real-time website information.
Step S104, performing word segmentation processing on the search information input by the user to obtain search keywords, wherein the search information is information for searching data of the target website.
Specifically, most search information input by users consists of character strings, and the search keywords are obtained by performing word segmentation processing on these character strings.
Step S106, searching for the website content matched with the search keywords according to the associated information, and pushing the website content to the user.
It should be noted that steps S102 to S106 may be executed by an execution device. The execution device may be disposed between a company's intranet and the target website (the target website being on the network external to the company). By communicating with the external network, the execution device acquires the associated information between the website information of the target website and the preset search word, and stores it. The execution device is also configured in advance with a word segmentation rule for segmenting the search information input by the user. When a user on the company intranet searches for data of the target website, the execution device obtains the search information input by the user through communication with a client on the intranet, retrieves the pre-stored associated information, searches for the website content matched with the search keywords according to the associated information, and pushes the website content to the user.
It should be emphasized that the website information is the latest website information of the target website at the current time, the associated information is also the latest associated information at the current time, and the accuracy of the associated information is ensured by the real-time property of the associated information, so that the technical problem of poor accuracy existing when the web page content matched with the search keyword is searched by the search method in the prior art is solved.
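Steps S102 to S106 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the dictionary layout of the associated information, the whitespace-based `segment` stand-in for word segmentation, and all names and sample data are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout of the associated information:
# preset search word -> list of (website address, website content) pairs.
Page = Tuple[str, str]
AssociationInfo = Dict[str, List[Page]]


def segment(search_information: str) -> List[str]:
    """Stand-in for word segmentation (S104): lowercase and split on whitespace."""
    return [word.lower() for word in search_information.split()]


def find_content(association: AssociationInfo, keywords: List[str]) -> List[Page]:
    """S106: collect the pages whose preset search word equals a search keyword."""
    results: List[Page] = []
    for keyword in keywords:
        for page in association.get(keyword, []):
            if page not in results:  # do not push the same page twice
                results.append(page)
    return results


def search(association: AssociationInfo, search_information: str) -> List[Page]:
    """Segment the user's input, then look the keywords up in the stored association."""
    return find_content(association, segment(search_information))


# S102: associated information assumed to have been built by an earlier crawl.
association = {
    "firewall": [("http://example.com/a", "Firewall product overview")],
    "audit": [
        ("http://example.com/b", "Database audit guide"),
        ("http://example.com/a", "Firewall product overview"),
    ],
}
pushed = search(association, "Firewall audit")
```

Because the lookup only reads the stored association, the freshness of the results depends entirely on how recently that association was rebuilt by the crawler.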
Regarding the execution device obtaining the associated information between the website information of the target website and the preset search term through communication with the external network, a detailed implementation is given in an optional implementation of the embodiment of the present invention, which specifically includes the following steps:
acquiring target crawling time before acquiring the website information of a target website and the associated information between preset search terms;
controlling a crawler to execute a current crawling task at a target crawling time so as to crawl a target website to obtain first website information;
determining a preset search word according to website content included in the first website information, and establishing associated information between the preset search word and the first website information;
the association information is stored in a data server.
Specifically, the data server may also store the address depth of the website and the entity webpage information used to judge whether an address belongs to the target website, so that subsequent crawling tasks can crawl the website more efficiently.
It should be noted that the association information stored in the data server includes the following two cases: the first case is the association information between the contents of the website and the preset search words, and the second case is the association information between the address of the website and the preset search words. For the first case, after the user inputs the search information, directly searching the website content matched with the search keyword from the associated information, and pushing the website content to the user; for the second case, after the user inputs the search information, the website address matched with the search keyword is searched from the associated information, then the website content of the webpage indicated by the website address is searched, and the website content is pushed to the user.
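The two storage cases can be contrasted with a short Python sketch. It assumes, purely for illustration, that each case is held as a dictionary keyed by preset search word and that a `fetch` callable returns the content of the webpage at an address (or `None` if nothing is found); none of these names come from the patent.

```python
from typing import Callable, Dict, List, Optional


def push_from_content_index(content_index: Dict[str, List[str]], keyword: str) -> List[str]:
    """Case 1: preset search word -> website content, returned directly."""
    return list(content_index.get(keyword, []))


def push_from_address_index(
    address_index: Dict[str, List[str]],
    keyword: str,
    fetch: Callable[[str], Optional[str]],
) -> List[str]:
    """Case 2: preset search word -> website address; content is looked up afterwards."""
    contents = []
    for address in address_index.get(keyword, []):
        content = fetch(address)  # hypothetical page lookup; None means nothing found
        if content is not None:
            contents.append(content)
    return contents


# Illustrative data only.
pages = {"http://example.com/a": "Firewall product overview"}
content_index = {"firewall": ["Firewall product overview"]}
address_index = {"firewall": ["http://example.com/a", "http://example.com/missing"]}

direct = push_from_content_index(content_index, "firewall")
via_address = push_from_address_index(address_index, "firewall", pages.get)
```

The second case costs an extra lookup per result but keeps the stored association small, since only addresses are held alongside the preset search words.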
In another optional implementation manner of the embodiment of the present invention, the target crawling time, at which the crawler is controlled to execute the current crawling task, is acquired as follows:
setting the Java timer Quartz in advance to set the crawling times of the crawler, wherein Quartz is used to trigger the crawler to execute crawling tasks on a schedule; and then extracting the target crawling time from the crawling times.
It should be noted that Quartz holds the preset times for triggering the crawler to execute crawling tasks, and the target crawling time for the current crawling task is the preset time that precedes the current time and is closest to it.
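The selection rule for the target crawling time can be illustrated as below. The sketch is in Python and does not use or imitate the Quartz API; it only assumes that the preset crawling times are available as a list of `datetime` values, and the function name is hypothetical.

```python
from datetime import datetime
from typing import List, Optional


def target_crawl_time(preset_times: List[datetime], now: datetime) -> Optional[datetime]:
    """Return the preset time closest to `now` among those not later than `now`.

    Returns None when no preset time has been reached yet.
    """
    reached = [t for t in preset_times if t <= now]
    return max(reached) if reached else None


# Illustrative schedule: three crawls on the same day.
schedule = [
    datetime(2017, 8, 23, 2, 0),
    datetime(2017, 8, 23, 10, 0),
    datetime(2017, 8, 23, 18, 0),
]
selected = target_crawl_time(schedule, datetime(2017, 8, 23, 12, 30))
```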
In another optional implementation manner of the embodiment of the present invention, controlling a crawler to execute a current crawling task at a target crawling time to crawl a target website to obtain first website information includes:
when the current crawling task is executed, the home page of the target website is crawled to obtain the home page content of the target website and hyperlink interfaces, such as an href interface and an src interface, contained in the home page information of the target website.
Analyzing the hyperlink interface to determine whether the hyperlink interface is a target hyperlink interface, wherein the target hyperlink interface is an interface which has not been crawled, the target hyperlink interface is a correct hyperlink interface, and the webpage content corresponding to the target hyperlink interface contains preset webpage content. The preset webpage content is the webpage content designated in advance as content to be obtained; if a webpage contains only content that is not of interest, the hyperlink interface of that webpage is not a target hyperlink interface.
In the case where the hyperlink interface is determined to be a target hyperlink interface, traversing the webpage corresponding to the hyperlink interface to obtain the website content of the target hyperlink interface;
and taking the website content and the website address of each target hyperlink interface as first website information.
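The home page crawl described above can be sketched with the Python standard library. Several points are assumptions made for illustration: a "correct" hyperlink interface is taken to mean an absolute http or https URL, the preset webpage content is modeled as a substring test, and `fetch` is a hypothetical callable that returns page text or `None`.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the values of href and src attributes as absolute URLs."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


def is_correct_link(url: str) -> bool:
    """Assumed meaning of a 'correct' hyperlink: absolute http(s) URL with a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def crawl_home_page(
    home_url: str,
    fetch: Callable[[str], Optional[str]],
    preset_content: str,
) -> Dict[str, str]:
    """Return first website information as {website address: website content}."""
    crawled = {home_url}
    collector = LinkCollector(home_url)
    collector.feed(fetch(home_url) or "")
    first_website_information: Dict[str, str] = {}
    for url in collector.links:
        if url in crawled or not is_correct_link(url):
            continue  # already crawled, or not a usable link
        crawled.add(url)
        page = fetch(url)
        if page is not None and preset_content in page:
            first_website_information[url] = page
    return first_website_information


# Illustrative in-memory "website".
pages = {
    "http://example.com/": (
        '<a href="/news">news</a><img src="/logo.png">'
        '<a href="mailto:x@example.com">mail</a>'
        '<a href="/news">news again</a><a href="/about">about</a>'
    ),
    "http://example.com/news": "security news today",
    "http://example.com/about": "about us",
}
info = crawl_home_page("http://example.com/", pages.get, "security")
```

Only the news page survives the three checks: the duplicate link is skipped as already crawled, the mailto link is not a correct link, and the other pages are missing or lack the preset content.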
It should be noted that the embodiment of the present invention describes how the crawler crawls the home page of a website; for websites of different depths, the crawler can crawl deeper webpages in the same way.
In another optional implementation manner of the embodiment of the present invention, as shown in fig. 2, crawling the home page of the target website includes the following steps:
Step S201, judging whether the current crawling task is the first crawling task executed on the target website, wherein step S202 is executed if the judgment is no, and step S203 is executed if the judgment is yes;
Step S202, analyzing second website information to determine whether the webpage indicated by a target website address can be accessed through that address (namely, whether the webpage exists) or whether webpage content exists in the webpage indicated by the target website address (namely, whether the webpage information exists), wherein the second website information is the information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target website address is any website address in the second website information. Step S203 is executed if the determination is yes, and step S204 is executed if the determination is no;
Step S203, crawling the home page of the target website to obtain the home page content of the target website and the hyperlink interfaces contained in the home page information;
Step S204, deleting the associated information associated with the target website address from the data server, after which the process returns and continues with step S203.
It should be noted that, in the embodiment of the present invention, the target website address is any website address in the second website information, and the second website information is the information crawled by the crawler when executing the first crawling task. The above steps therefore verify the website information obtained by the previous crawl. This avoids the situation in which a target website address appears in a search result but no related website content is obtained after its link is clicked, and thus avoids subsequent search errors.
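The verification in steps S202 and S204 can be sketched as follows. The sketch treats an address as still valid only when its page can be fetched and is not empty, which is one possible reading of the condition in step S202; the `fetch` callable and the dictionary used for the data server are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Optional


def verify_previous_addresses(
    second_website_information: List[str],
    association_store: Dict[str, List[str]],
    fetch: Callable[[str], Optional[str]],
) -> List[str]:
    """Check every address from the previous crawl and drop stale associations.

    association_store maps preset search word -> list of website addresses.
    Returns the addresses that are still accessible and have content.
    """
    still_valid = []
    for address in second_website_information:
        page = fetch(address)  # None models a page that cannot be accessed
        if page is not None and page.strip():
            still_valid.append(address)
            continue
        # S204: delete the associated information tied to this address.
        for word in list(association_store):
            remaining = [a for a in association_store[word] if a != address]
            if remaining:
                association_store[word] = remaining
            else:
                del association_store[word]
    return still_valid


# Illustrative data: one live page, one missing page, one empty page.
a, gone, empty = "http://example.com/a", "http://example.com/gone", "http://example.com/empty"
pages = {a: "Firewall product overview", empty: "   "}
store = {"firewall": [a, gone], "audit": [gone], "vpn": [empty, a]}
valid = verify_previous_addresses([a, gone, empty], store, pages.get)
```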
In another optional implementation manner of the embodiment of the present invention, before storing the association information in the data server, the method for improving accuracy of the multi-site search keyword further includes:
judging whether the data server stores the associated information for the first time;
and in the case where it is judged that the data server is storing the associated information for the first time, emptying the data already stored in the data server, so that associated information previously obtained by crawling websites other than the target website, or other dirty data, does not remain in the data server.
In another optional implementation manner of the embodiment of the present invention, performing word segmentation processing on search information input by a user to obtain a search keyword includes:
and performing word segmentation processing on the search information input by the user through an IKAnalyzer word segmenter to obtain search keywords.
Specifically, the common word segmentation methods are first managed through a common word segmentation interface. The result returned by word segmentation processing needs to be handled in several forms, including: first, a Map of key-value pairs; second, a Set whose elements are String values.
In addition, when word segmentation is carried out through the IKAnalyzer word segmenter, two segmentation modes are available, namely smart segmentation and fine-grained segmentation, so that character strings can be segmented in different ways as needed.
In addition, the dictionary of IKAnalyzer can be updated in time, so that word segmentation achieves a better result.
The embodiment of the invention adopts the IKAnalyzer word segmenter, which works by text matching and therefore does not require a large investment of manpower for training and labeling; it supports a user-defined dictionary, which makes it convenient to add domain-specific words, and it can produce segmentation results at multiple granularities.
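The difference between coarse and fine-grained dictionary segmentation can be illustrated with a small Python sketch. It is not IKAnalyzer's algorithm or API: the coarse mode below is plain forward maximum matching, the fine-grained mode simply lists every dictionary word found in the text, and the dictionary is a made-up example. ASCII text is used for brevity; the same matching applies to Chinese character strings.

```python
from typing import List, Set


def segment_smart(text: str, dictionary: Set[str], max_len: int = 8) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word.

    A single character is emitted when no dictionary word starts at that position.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def segment_fine(text: str, dictionary: Set[str], max_len: int = 8) -> List[str]:
    """Fine-grained mode: every dictionary word occurring anywhere in the text."""
    words = []
    for i in range(len(text)):
        for size in range(1, min(max_len, len(text) - i) + 1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                words.append(candidate)
    return words


# A user-defined dictionary can be extended with domain-specific words.
dictionary = {"web", "site", "website"}
coarse = segment_smart("website", dictionary)
fine = segment_fine("website", dictionary)
```

The fine-grained output favors recall, because a page indexed under "site" can still be found, while the coarse output favors precision.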
In another optional implementation manner of the embodiment of the present invention, searching for website content matching a search keyword according to associated information includes: searching search keywords from the search terms of the associated information; determining website information associated with the search terms according to the matching degree between the search terms and the search keywords;
pushing website content to the user includes: and pushing the website content in the website information to the user according to the matching degree.
Specifically, this can be implemented by a search server (for example, Solr, an independent data server that builds an index so that results can be quickly searched and returned according to the search information input by the user); the user can also issue a search request containing the search information through HTTP GET.
When word segmentation yields multiple search keywords, the search keywords may be joined with OR, and the client then matches them against the preset search words to obtain the website content. When the website content in the website information is pushed to the user, the screened website content can be displayed on the search results pages ordered by the number of search keywords contained in the preset search words corresponding to each piece of website content.
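Joining keywords with OR and ordering results by the number of matched keywords can be sketched as below. The field name `content`, the base URL, and the in-memory ranking are assumptions for illustration; only the `q` and `wt` request parameters are standard Solr query parameters, and the sketch does not contact a server.

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_or_query(keywords: List[str], field: str = "content") -> str:
    """Join the search keywords with OR, e.g. content:"a" OR content:"b"."""
    return " OR ".join(f'{field}:"{keyword}"' for keyword in keywords)


def build_search_url(base_url: str, keywords: List[str]) -> str:
    """Build an HTTP GET URL carrying the OR query (base_url is hypothetical)."""
    return base_url + "?" + urlencode({"q": build_or_query(keywords), "wt": "json"})


def rank_by_matches(association: Dict[str, List[str]], keywords: List[str]) -> List[str]:
    """Order website addresses by how many search keywords their preset words match."""
    counts: Dict[str, int] = {}
    for preset_word, addresses in association.items():
        if preset_word in keywords:
            for address in addresses:
                counts[address] = counts.get(address, 0) + 1
    # Most matches first; ties broken alphabetically for a stable order.
    return sorted(counts, key=lambda address: (-counts[address], address))


keywords = ["firewall", "audit"]
query = build_or_query(keywords)
url = build_search_url("http://localhost:8983/solr/sites/select", keywords)
association = {"firewall": ["a", "b"], "audit": ["b"], "vpn": ["c"]}
ranked = rank_by_matches(association, keywords)
```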
Example two
The apparatus for improving accuracy of a multi-site search keyword provided by an embodiment of the present invention, as shown in fig. 3, includes:
the first obtaining module 100 is configured to obtain associated information between website information of a target website and a preset search term, where the website information is latest website information of the target website at a current time, and the website information includes website content and a website address;
the word segmentation module 200 is configured to perform word segmentation processing on search information input by a user to obtain search keywords, where the search information is information for searching target website data;
the pushing module 300 is configured to search for website content matched with the search keyword according to the association information, and push the website content to the user.
In the embodiment of the present invention, the first obtaining module 100 obtains associated information between website information of a target website and a preset search term, where the website information is latest website information of the target website at a current time, and the website information includes website content and a website address; the word segmentation module 200 performs word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching target website data; the pushing module 300 searches for the website content matched with the search keyword according to the association information, and pushes the website content to the user. In the embodiment of the invention, the website information is the latest website information of the target website at the current moment, the associated information is also the latest associated information at the current moment, and the accuracy of the associated information is ensured by the real-time property of the associated information, so that the technical problem of poor accuracy when the webpage content matched with the search keyword is searched by the search method in the prior art is solved.
In another optional implementation manner of the embodiment of the present invention, as shown in fig. 4, the apparatus for improving accuracy of a multi-site search keyword further includes:
a second obtaining module 400, configured to obtain a target crawling time;
the crawling module 500 is used for controlling a crawler to execute a current crawling task at a target crawling time so as to crawl a target website to obtain first website information;
an establishing module 600, configured to determine a preset search term according to website content included in the first website information, and establish associated information between the preset search term and the first website information;
a storage module 700 for storing the association information in the data server.
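One possible way for the establishing module to derive preset search terms and build the associated information is sketched below. The embodiment does not fix how a search term is determined from website content; the rule used here (every distinct word of at least four letters) and the function names are hypothetical.

```python
import re
from collections import defaultdict


def determine_search_terms(content, min_length=4):
    """Hypothetical rule: each distinct lowercase word of at least min_length letters."""
    words = re.findall(r"[a-z]+", content.lower())
    return sorted({word for word in words if len(word) >= min_length})


def establish_associations(first_website_info):
    """Map each preset search term to the addresses whose content produced it."""
    associations = defaultdict(list)
    for address, content in first_website_info.items():
        for term in determine_search_terms(content):
            associations[term].append(address)
    return dict(associations)


# First website information: website address -> website content (example data).
first_website_info = {
    "https://example.com/a": "Web scan report",
    "https://example.com/b": "Scan the site",
}
associations = establish_associations(first_website_info)
```

In this example the term "scan" is associated with both addresses, while the three-letter words "web" and "the" are not kept as search terms.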
In another optional implementation of the embodiment of the present invention, the crawling module includes:
the crawling unit is used for crawling the home page of the target website when executing the current crawling task to obtain the home page content of the target website and a hyperlink interface contained in the home page information of the target website;
the determining unit is used for analyzing the hyperlink interface and determining whether the hyperlink interface is a target hyperlink interface or not, wherein the target hyperlink interface is an interface which is not crawled, the target hyperlink interface is a correct hyperlink interface, and the webpage content corresponding to the target hyperlink interface comprises preset webpage content;
the traversing unit is used for traversing the webpage corresponding to the hyperlink interface under the condition that the target hyperlink interface is determined to obtain the website content of the target hyperlink interface;
and the determining unit is used for taking the website content and the website address of each target hyperlink interface as the first website information.
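The crawling flow of the units above may be sketched as follows. This is an illustrative simplification: `fetch` is a hypothetical callable that returns the HTML of a page or `None` when the page cannot be accessed, a "correct" hyperlink is assumed to mean an http or https address, and the home page itself is assumed not to be part of the first website information.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)


def extract_links(base_url, html):
    extractor = LinkExtractor()
    extractor.feed(html)
    return [urljoin(base_url, link) for link in extractor.links]


def crawl(home_url, fetch, preset_content):
    """Return {address: content} for every target hyperlink reachable from the home page."""
    home_html = fetch(home_url)
    if home_html is None:
        return {}
    crawled = {home_url}
    first_website_info = {}
    pending = extract_links(home_url, home_html)
    while pending:
        url = pending.pop(0)
        if url in crawled:
            continue  # already crawled, so not a target hyperlink
        crawled.add(url)
        if urlparse(url).scheme not in ("http", "https"):
            continue  # not a correct hyperlink (assumed criterion)
        html = fetch(url)
        if html is None or preset_content not in html:
            continue  # inaccessible, or the preset webpage content is absent
        first_website_info[url] = html
        pending.extend(extract_links(url, html))  # traverse the target page
    return first_website_info


# A small fake site so that the sketch runs without network access.
SITE = {
    "https://example.com/": '<a href="/news">News</a> <a href="/about">About</a> '
                            '<a href="mailto:admin@example.com">Mail</a>',
    "https://example.com/news": '<p>security bulletin</p> <a href="/">Home</a> '
                                '<a href="/news/1">Item</a>',
    "https://example.com/about": "<p>company profile</p>",
    "https://example.com/news/1": "<p>security patch released</p>",
}
first_website_info = crawl("https://example.com/", SITE.get, "security")
```

Only the two pages containing the preset content "security" are kept; the mail link, the page without the preset content, and the link back to the home page are skipped.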
In another optional implementation manner of the embodiment of the present invention, the crawling unit is further configured to:
judging whether the crawling task is being executed on the target website for the first time;
if not, analyzing the second website information to determine whether the webpage indicated by a target web address can be accessed through that address, or whether webpage content exists in that webpage, wherein the second website information is the information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target web address is any website address in the second website information;
if so, crawling the home page of the target website to obtain the home page content of the target website and the hyperlink interface contained in the home page information;
if not, deleting the associated information associated with the target web address from the data server.
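The check performed before a repeated crawl may be sketched as follows. The data layout (search term mapped to a list of addresses) and the `fetch` callable, which returns page text or `None` when the page cannot be accessed, are assumptions made for illustration.

```python
def prune_stale_associations(second_website_info, associations, fetch):
    """Delete associations whose address is no longer accessible or has no content.

    Returns the list of addresses that were removed.
    """
    removed = []
    for address in second_website_info:
        html = fetch(address)
        if html is not None and html.strip():
            continue  # page is accessible and has content: keep its associations
        removed.append(address)
        for term in list(associations):
            remaining = [a for a in associations[term] if a != address]
            if remaining:
                associations[term] = remaining
            else:
                del associations[term]  # no address left for this search term
    return removed


# Information from the previous crawling task (example data).
second_website_info = {
    "https://example.com/a": "Web scan report",
    "https://example.com/b": "Scan the site",
}
associations = {
    "scan": ["https://example.com/a", "https://example.com/b"],
    "site": ["https://example.com/b"],
}
# Only page /a still exists at the time of the current crawling task.
live_pages = {"https://example.com/a": "<p>Web scan report</p>"}
removed = prune_stale_associations(second_website_info, associations, live_pages.get)
```

After the call, the associations of the vanished page `/b` have been deleted, so a later search can no longer return it.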
In another optional implementation manner of the embodiment of the present invention, the apparatus for improving accuracy of a multi-site search keyword further includes:
the judging module is used for judging whether the data server stores the associated information for the first time;
and the clearing module is used for clearing the data stored in the data server under the condition that the data server is judged to store the associated information for the first time.
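A minimal sketch of the judging and clearing modules follows, assuming a hypothetical in-memory `DataServer` class that tracks whether associated information has been stored before.

```python
class DataServer:
    """In-memory stand-in for the data server (hypothetical)."""

    def __init__(self, existing=None):
        self.records = dict(existing or {})
        self.has_stored_associations = False

    def store(self, associations):
        if not self.has_stored_associations:
            # First store of associated information: clear leftover data first.
            self.records.clear()
            self.has_stored_associations = True
        self.records.update(associations)


server = DataServer({"stale": ["https://example.com/old"]})
server.store({"scan": ["https://example.com/a"]})
after_first_store = dict(server.records)
server.store({"site": ["https://example.com/b"]})
```

The leftover record is discarded by the first store only; later stores add to the data already present.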
In another optional implementation manner of the embodiment of the present invention, the second obtaining module is configured to:
configuring the Java Quartz timer in advance to set the crawling times of the crawler, wherein the Quartz timer is used to trigger the crawler to execute crawling tasks on a schedule;
and extracting the target crawling time from the crawling times.
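Extracting the target crawling time from preset crawling times may be illustrated as below. The embodiment uses the Java Quartz timer; this Python sketch is only a stand-in for the scheduling logic, and the twice-daily schedule is a made-up example.

```python
from datetime import datetime, time, timedelta

# Hypothetical preset crawling times: every day at 02:00 and 14:00.
CRAWL_TIMES = [time(2, 0), time(14, 0)]


def next_crawl_time(now, crawl_times):
    """Return the earliest scheduled crawling time strictly after `now`."""
    candidates = []
    for day_offset in (0, 1):  # today and tomorrow cover every daily schedule
        day = now.date() + timedelta(days=day_offset)
        candidates.extend(datetime.combine(day, t) for t in crawl_times)
    return min(c for c in candidates if c > now)


target = next_crawl_time(datetime(2017, 8, 23, 9, 30), CRAWL_TIMES)
```

With Quartz itself, such a schedule would typically be expressed through a trigger, for example a cron-style trigger, rather than computed by hand.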
In another optional implementation manner of the embodiment of the present invention, the word segmentation module is configured to:
performing word segmentation processing on the search information input by the user through the IKAnalyzer word segmenter to obtain the search keywords.
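To illustrate what dictionary-based word segmentation produces, the following sketch uses forward maximum matching. It is a simplified stand-in and is not IKAnalyzer's actual algorithm or API; the dictionary and the maximum word length are assumptions.

```python
def segment(text, dictionary, max_word_length=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    keywords = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for length in range(min(max_word_length, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if not match.isspace():
            keywords.append(match)
        i += len(match)
    return keywords


# Example dictionary: "website", "vulnerability", "scan", "vulnerability scan".
DICTIONARY = {"网站", "漏洞", "扫描", "漏洞扫描"}
keywords = segment("网站漏洞扫描", DICTIONARY)
```

The search information "网站漏洞扫描" ("website vulnerability scan") is split into "网站" and the longer dictionary entry "漏洞扫描", rather than into "漏洞" and "扫描" separately.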
In another optional implementation manner of the embodiment of the present invention, the pushing module is configured to:
looking up the search keywords among the search terms of the associated information; determining the website information associated with the search terms according to the matching degree between the search terms and the search keywords;
and pushing the website content in the website information to the user according to the matching degree.
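The embodiment does not define how the matching degree is computed. As one hypothetical choice, the sketch below takes the matching degree of a page to be the fraction of search keywords that equal a search term associated with that page, and pushes content in descending order of that value.

```python
def rank_by_matching_degree(keywords, associations, contents):
    """Return (content, matching degree) pairs, best match first."""
    hits = {}
    for keyword in keywords:
        for address in associations.get(keyword, []):
            hits[address] = hits.get(address, 0) + 1
    # Sort by descending hit count, then by address for a stable order.
    ranked = sorted(hits, key=lambda address: (-hits[address], address))
    return [(contents[address], hits[address] / len(keywords)) for address in ranked]


# Example association information and website content.
associations = {
    "scan": ["https://example.com/a", "https://example.com/b"],
    "report": ["https://example.com/a"],
    "site": ["https://example.com/b"],
}
contents = {
    "https://example.com/a": "Web scan report",
    "https://example.com/b": "Scan the site",
}
pushed = rank_by_matching_degree(["scan", "report"], associations, contents)
```

The page matching both keywords is pushed first with a matching degree of 1.0, followed by the page matching one keyword with 0.5.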
EXAMPLE III
This embodiment of the invention provides a computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to execute the method for improving the accuracy of multi-site search keywords provided by the foregoing embodiment. The website information is the latest website information of the target website at the current time, and the associated information is likewise the latest associated information at the current time. This real-time property keeps the associated information accurate, which solves the technical problem in the prior art of poor accuracy when searching for webpage content that matches the search keywords.
The computer program product of the method and the apparatus for improving accuracy of multi-site search keywords provided by the embodiments of the present invention includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and specific implementation may refer to the method embodiments, and will not be described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be, for example, a fixed connection, a removable connection, or an integral connection; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description, do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A method for improving accuracy of a multi-site search keyword, comprising:
acquiring associated information between website information of a target website and a preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address;
performing word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching the target website data;
searching website content matched with the search keywords according to the associated information, and pushing the website content to the user;
before acquiring the associated information between the website information of the target website and the preset search word, the method further comprises the following steps:
acquiring target crawling time;
controlling a crawler to execute a current crawling task at the target crawling time so as to crawl a target website to obtain first website information;
determining a preset search word according to website content included in the first website information, and establishing associated information between the preset search word and the first website information;
storing the associated information in a data server;
wherein controlling the crawler to execute the current crawling task at the target crawling time so as to crawl the target website and obtain the first website information comprises:
when the current crawling task is executed, crawling the home page of the target website to obtain the home page content of the target website and a hyperlink interface contained in the home page information of the target website;
analyzing the hyperlink interface to determine whether the hyperlink interface is a target hyperlink interface, wherein the target hyperlink interface is an interface which is not crawled, the target hyperlink interface is a correct hyperlink interface, and the webpage content corresponding to the target hyperlink interface contains preset webpage content;
traversing the webpage corresponding to the hyperlink interface under the condition that the target hyperlink interface is determined to obtain the website content of the target hyperlink interface;
taking the website content and the website address of each target hyperlink interface as the first website information;
wherein, crawling the home page of the target website comprises:
judging whether the crawling task is being executed on the target website for the first time;
if not, analyzing second website information to determine whether the webpage indicated by a target web address can be accessed through that address, or whether webpage content exists in that webpage, wherein the second website information is the information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target web address is any website address in the second website information;
if so, crawling the home page of the target website to obtain the home page content of the target website and the hyperlink interface contained in the home page information;
and if not, deleting the associated information associated with the target web address from the data server.
2. The method of claim 1, wherein prior to storing the association information in a data server, the method further comprises:
judging whether the data server stores the associated information for the first time;
and under the condition that the data server is judged to store the associated information for the first time, emptying the data stored in the data server.
3. The method of claim 1, wherein obtaining a target crawl time comprises:
configuring the Java Quartz timer in advance to set the crawling times of the crawler, wherein the Quartz timer is used to trigger the crawler to execute crawling tasks on a schedule;
and extracting the target crawling time from the crawling times.
4. The method of claim 1, wherein performing a word segmentation process on the search information input by the user to obtain a search keyword comprises:
performing word segmentation processing on the search information input by the user through the IKAnalyzer word segmenter to obtain the search keywords.
5. The method of claim 1,
searching the website content matched with the search keywords according to the associated information comprises: looking up the search keywords among the search words of the associated information; and determining the website information associated with the search words according to the matching degree between the search words and the search keywords;
pushing the website content to the user comprises: pushing the website content in the website information to the user according to the matching degree.
6. An apparatus for improving accuracy of a multi-site search keyword, comprising:
a first acquisition module, used for acquiring associated information between website information of a target website and a preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address;
the word segmentation module is used for carrying out word segmentation on search information input by a user to obtain search keywords, wherein the search information is information for searching the target website data;
the pushing module is used for searching the website content matched with the search keyword according to the associated information and pushing the website content to the user;
wherein the apparatus further comprises: the second acquisition module is used for acquiring target crawling time;
the crawling module is used for controlling a crawler to execute a current crawling task at a target crawling time so as to crawl a target website to obtain first website information;
the establishing module is used for determining a preset search word according to website content included in the first website information and establishing associated information between the preset search word and the first website information;
the storage module is used for storing the associated information in the data server;
wherein, the crawling module comprises:
the crawling unit is used for crawling the home page of the target website when executing the current crawling task to obtain the home page content of the target website and a hyperlink interface contained in the home page information of the target website;
the determining unit is used for analyzing the hyperlink interface and determining whether the hyperlink interface is a target hyperlink interface or not, wherein the target hyperlink interface is an interface which is not crawled, the target hyperlink interface is a correct hyperlink interface, and the webpage content corresponding to the target hyperlink interface comprises preset webpage content;
the traversing unit is used for traversing the webpage corresponding to the hyperlink interface under the condition that the target hyperlink interface is determined to obtain the website content of the target hyperlink interface;
the determining unit is used for taking the website content and the website address of each target hyperlink interface as first website information;
wherein the crawling unit is further configured to:
judging whether the crawling task is being executed on the target website for the first time;
if not, analyzing the second website information to determine whether the webpage indicated by a target web address can be accessed through that address, or whether webpage content exists in that webpage, wherein the second website information is the information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target web address is any website address in the second website information;
if so, crawling the home page of the target website to obtain the home page content of the target website and the hyperlink interface contained in the home page information;
and if not, deleting the associated information associated with the target web address from the data server.
7. A computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of improving accuracy of a multi-site search keyword of any one of claims 1 to 5.
CN201710732432.5A 2017-08-23 2017-08-23 Method and device for improving accuracy of multi-site search keywords Active CN107301253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710732432.5A CN107301253B (en) 2017-08-23 2017-08-23 Method and device for improving accuracy of multi-site search keywords

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710732432.5A CN107301253B (en) 2017-08-23 2017-08-23 Method and device for improving accuracy of multi-site search keywords

Publications (2)

Publication Number Publication Date
CN107301253A CN107301253A (en) 2017-10-27
CN107301253B true CN107301253B (en) 2020-02-04

Family

ID=60132524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710732432.5A Active CN107301253B (en) 2017-08-23 2017-08-23 Method and device for improving accuracy of multi-site search keywords

Country Status (1)

Country Link
CN (1) CN107301253B (en)


Also Published As

Publication number Publication date
CN107301253A (en) 2017-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 310051 No. 188 Lianhui Street, Xixing Street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou Annan information technology Limited by Share Ltd

Address before: Zhejiang Zhongcai Building No. 68 Binjiang District road Hangzhou City, Zhejiang Province, the 310051 and 15 layer

Applicant before: Dbappsecurity Co.,ltd.

GR01 Patent grant