CN111159527A - Method, device, equipment and storage medium for identifying and processing homepage - Google Patents

Publication number
CN111159527A
Authority
CN
China
Prior art keywords
url
frequency
predicted
result
search request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811321529.8A
Other languages
Chinese (zh)
Inventor
陈雪飞
谢海华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Original Assignee
Pku Founder Information Industry Group Co ltd
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pku Founder Information Industry Group Co ltd, Peking University Founder Group Co Ltd filed Critical Pku Founder Information Industry Group Co ltd
Priority to CN201811321529.8A priority Critical patent/CN111159527A/en
Publication of CN111159527A publication Critical patent/CN111159527A/en
Pending legal-status Critical Current

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The application provides a homepage identification processing method, device, equipment and storage medium. The method comprises the following steps: acquiring feature data to be predicted corresponding to a user search request, wherein the feature data to be predicted comprises keyword features and URL identification features; determining, based on the feature data to be predicted, a target URL corresponding to the user search request by using a pre-trained Xgboost model and a pre-trained SVM model; and responding to the user search request based on the target URL. Because the target URL is determined by the two pre-trained models and the user search request is answered based on that target URL, the most useful webpage information can be provided to the user, which improves the accuracy of the search result and the user experience.

Description

Method, device, equipment and storage medium for identifying and processing homepage
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for identifying and processing a homepage.
Background
As internet-related technologies mature, network information is growing explosively. When searching for information, a user finds that most of it has little value and is effectively junk. Valuable information is hidden among a large amount of junk information and is presented in many different forms, so the user cannot obtain useful information effectively.
Therefore, how to obtain the information a user most wants from the internet has become an urgent technical problem.
Disclosure of Invention
The application provides a homepage identification processing method, device, equipment and storage medium, which address defects of the prior art such as inaccurate user search results.
A first aspect of the present application provides a home page identification processing method, including:
acquiring feature data to be predicted corresponding to a user search request, wherein the feature data to be predicted comprises keyword features and URL identification features;
determining a target URL corresponding to the user search request by adopting a pre-trained Xgboost model and an SVM model based on the characteristic data to be predicted;
responding to the user search request based on the target URL.
A second aspect of the present application provides a home page identification processing apparatus, including:
an acquisition module, configured to acquire feature data to be predicted corresponding to a user search request, wherein the feature data to be predicted comprises keyword features and URL identification features;
a determining module, configured to determine, based on the feature data to be predicted, a target URL corresponding to the user search request by using a pre-trained Xgboost model and a pre-trained SVM model;
a processing module, configured to respond to the user search request based on the target URL.
A third aspect of the present application provides a terminal device, including: at least one processor and a memory;
the memory stores a computer program; the at least one processor executes the computer program stored by the memory to implement the method provided by the first aspect.
A fourth aspect of the present application provides a search server, comprising: at least one processor and a memory;
the memory stores a computer program; the at least one processor executes the computer program stored by the memory to implement the method provided by the first aspect.
A fifth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed, implements the method provided by the first aspect.
According to the homepage identification processing method, device, equipment and storage medium provided by the present application, the target URL corresponding to the user search request is determined, based on the feature data to be predicted, by using the pre-trained Xgboost model and SVM model, and the user search request is answered based on the target URL. The most useful webpage information can therefore be provided to the user, which improves the accuracy of the search result and the user experience.
Drawings
To illustrate the embodiments of the present application or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic structural diagram of a search system suitable for use in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a home page identification processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an HTML document provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a conference home page identification search result list according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a school homepage identification search result list according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a scholar homepage identification search result list according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a home page identification processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a home page identification processing device according to another embodiment of the present application;
FIG. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a search server according to an embodiment of the present application.
The above figures show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but to illustrate those concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms referred to in this application are explained first:
Xgboost: Xgboost is an extension and improvement of GBDT (Gradient Boosting Decision Tree); the Xgboost algorithm is faster and has higher accuracy. GBDT is a member of the Boosting family of ensemble learning methods; its iterations use a forward stagewise algorithm, and its weak learners are CART regression trees.
SVM: Support Vector Machine, a discriminative method. In the field of machine learning, it is a supervised learning model typically used for pattern recognition, classification and regression analysis.
URL: Uniform Resource Locator, a compact representation of the location of a resource available on the internet and of the method for accessing it; it is the standard address of a resource on the internet. Each file on the internet has a unique URL, which contains information indicating where the file is located and how a browser should handle it.
The homepage identification processing method provided in the embodiments of the present application is suitable for the search system shown in FIG. 1, which is a schematic structural diagram of a search system to which the embodiments apply. The search system may comprise a search server and at least one terminal device. A user inputs a search keyword into a search engine through the terminal device, and the terminal device sends a search request to the search server. The search server retrieves the result URLs of related webpages according to the search keyword and returns them to the terminal device; the terminal device displays a search result list, and the user can click to view each search result. The homepage identification processing method may be executed by the search server or by the terminal device, as actual requirements dictate; the embodiments of the present application do not limit this. For example, if the method is executed by the search server, then after obtaining the related result URLs according to the user's search keyword, the search server applies the method to select from them the target URL the user most wants and returns the target URL to the terminal device, which displays the related information corresponding to the target URL. If the method is executed by the terminal device, the search server returns the retrieved result URLs to the terminal device, and the terminal device applies the method to screen out the target URL the user most wants and displays its related information to the user.
Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the following examples, "plurality" means two or more unless specifically limited otherwise.
The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Example one
This embodiment provides a homepage identification processing method for obtaining the webpage address the user most wants. The method is executed by a homepage identification processing device, which may be provided in a terminal device or in a search server.
FIG. 2 is a schematic flowchart of the homepage identification processing method provided in this embodiment. The method includes:
Step 101, acquiring feature data to be predicted corresponding to a user search request, wherein the feature data to be predicted comprises keyword features and URL identification features.
Specifically, when a user wants to perform a search, the user may input a search keyword, such as "XX university" or "nlpc 2018", into a search engine through the terminal device. After the user clicks search, the terminal device sends a user search request carrying the search keyword to the search server. The search server retrieves the result URLs of one or more webpages related to the search keyword; these result URLs then need to be screened further to obtain the target URL the user most wants. First, the feature data corresponding to the result URLs, that is, the feature data to be predicted corresponding to the user search request, is acquired. The feature data to be predicted includes keyword features and URL identification features, and a keyword feature and a URL identification feature are generated for each result URL. The keyword features may be obtained by segmenting the user's search keyword: if the search keyword is Chinese, each character is treated as one word; if the search keyword is English, it is split on spaces and each English word is one word. The number of words in the search keyword, the frequency with which each word occurs in the HTML text corresponding to the result URL, and similar quantities are used as the keyword features.
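The segmentation rule described above can be sketched in Python as follows. This is only an illustrative sketch: the function name split_keyword and the choice of the Unicode range U+4E00 to U+9FFF to detect Chinese characters are assumptions and are not specified in the application.

```python
import re
from typing import List

# One CJK ideograph is one word; any other run of non-space characters
# (for example an English word or a number) is one word.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")


def split_keyword(keyword: str) -> List[str]:
    """Split a search keyword into words (hypothetical helper)."""
    return _TOKEN.findall(keyword)


words = split_keyword("nlpc 2018")
n_words = len(words)  # N, the number of words in the search keyword
```

Under these assumptions, an English keyword is split on whitespace, while each Chinese character becomes its own word.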
The URL identification feature may be a feature generated from the identifier of the result URL. Illustratively, a unique number is generated for each result URL; if there are 10 result URLs, numbered 0-9, a 10-dimensional feature is created from the number of each result URL. For example, the 10-dimensional feature corresponding to number 0 is "1000100001". This is only an example, and the specific feature may be set according to actual requirements.
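One possible way to build a 10-dimensional feature from a URL number is a one-hot vector, sketched below. Note that this is an assumed encoding chosen for illustration: the example value "1000100001" given above is not a one-hot vector, and the application leaves the specific encoding open. The name url_id_feature is hypothetical.

```python
from typing import List


def url_id_feature(number: int, group_size: int = 10) -> List[int]:
    """Return an assumed one-hot encoding of a result URL's number."""
    if not 0 <= number < group_size:
        raise ValueError("URL number must lie in [0, group_size)")
    feature = [0] * group_size
    feature[number] = 1  # mark the position given by the URL's number
    return feature
```

With this encoding, the URL numbered 0 maps to a vector whose first component is 1 and whose other nine components are 0.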
Step 102, determining, based on the feature data to be predicted, a target URL corresponding to the user search request by using a pre-trained Xgboost model and a pre-trained SVM model.
After the feature data to be predicted is obtained, a target URL corresponding to a user search request can be determined by adopting a pre-trained Xgboost model and an SVM model.
Illustratively, the feature data to be predicted comprises the keyword features and URL identification features of 10 result URLs, and the Xgboost model and the SVM model are used to predict which of the result URLs is the target URL the user most wants.
It should be noted that the Xgboost model and the SVM model are two independent classification models. They may be obtained by training a pre-established Xgboost network and a pre-established SVM network with the same training feature data, or by training the two networks with different training feature data, as actual requirements dictate.
When the prediction results of the two models are the same, that prediction result can be regarded as the target URL the user most wants.
The prediction probability threshold values of the models may be set to be the same or different, and may be specifically set according to actual requirements, which is not limited in this embodiment.
For example, a first probability threshold, such as 0.8, may be set for both models. When the Xgboost model predicts that the probability of a result URL being the homepage is greater than 0.8, the homepage label of that result URL is set to 1; if the probability is less than 0.8, the label is set to 0. The model output may be every URL identifier with its label value, or only the URL identifiers whose label value is 1, or other related information; the specific output form may be set according to actual requirements and is not limited in this embodiment. The SVM model is handled in the same way: when it predicts that the probability of a result URL being the homepage is greater than 0.8, the label is set to 1, and if the probability is less than 0.8, the label is set to 0; the remaining details match those of the Xgboost model and are not repeated here. Alternatively, different thresholds may be set for the two models, for example a second probability threshold of 0.9 for the Xgboost model and a third probability threshold of 0.85 for the SVM model. The thresholds may be set according to the actual situation.
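The thresholding described above can be sketched as follows. The function name label_urls and the sample probabilities are hypothetical. The text does not say how a probability exactly equal to the threshold is labeled; this sketch assumes it is labeled 0.

```python
from typing import Dict


def label_urls(probabilities: Dict[int, float], threshold: float = 0.8) -> Dict[int, int]:
    """Map each URL number to label 1 (homepage) or 0 (non-homepage)."""
    # Assumption: a probability equal to the threshold is labeled 0.
    return {number: int(p > threshold) for number, p in probabilities.items()}


# Hypothetical predicted probabilities for three result URLs.
xgb_labels = label_urls({0: 0.93, 1: 0.41, 2: 0.88}, threshold=0.9)
svm_labels = label_urls({0: 0.91, 1: 0.30, 2: 0.88}, threshold=0.85)
```

Here the two calls use the example thresholds 0.9 and 0.85 mentioned above for the Xgboost model and the SVM model respectively.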
Optionally, there may be one or more target URLs; for example, the 3 most relevant target URLs may be selected for the user to view.
Step 103, responding to the user search request based on the target URL.
Specifically, after the target URL is determined, the user search request is answered based on the target URL. For example, related information of the target URL, such as its name and a brief description, is presented to the user, who can then click through to the detailed webpage. The specific display mode is not limited.
According to the method for identifying and processing the homepage, the target URL corresponding to the user search request is determined by adopting the pre-trained Xgboost model and SVM model based on the characteristic data to be predicted, and the user search request is responded based on the target URL, so that the most useful webpage information can be provided for the user, the accuracy of the user search result is improved, and the user experience is improved.
Example two
The present embodiment further supplements the method provided in the first embodiment.
As an implementable manner, on the basis of the first embodiment, optionally, step 101 specifically includes:
Step 1011, obtaining the search keyword corresponding to the user search request, and obtaining, according to the search keyword, the identifiers of at least two result URLs and the HTML text corresponding to each identifier.
Step 1012, obtaining the keyword features according to the search keyword and the HTML text corresponding to each identifier, and obtaining the URL identification features according to each identifier.
Optionally, obtaining the keyword features according to the search keyword and the HTML text corresponding to each identifier, including:
Step 2011, acquiring the number N of words included in the search keyword;
Step 2012, for each result URL, calculating the number of times n1 that the words in the search keyword appear in the result URL and a first frequency, the number of times n2 that the words appear in the title tag of the HTML text corresponding to the result URL and a second frequency, and the number of times n3 that the words appear outside the title tag of the HTML text corresponding to the result URL and a third frequency, and calculating the tfidf value of each word.
The keyword feature includes n1, n2, n3, the first frequency, the second frequency, the third frequency and the tfidf value corresponding to each result URL (the tfidf value corresponding to one result URL is the sum of the tfidf values of the words in the keyword). That is, the keyword feature corresponding to each result URL is a 7-dimensional feature. Of course, the keyword features may also include other features.
The HTML text refers to a source file of a web page corresponding to the URL. Exemplarily, as shown in fig. 3, a schematic diagram of HTML text provided for the present embodiment is provided.
tfidf (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency and IDF stands for inverse document frequency. The specific calculation of the tfidf value is prior art and is not described here. For each result URL, the tfidf feature in the keyword features is the sum of the tfidf values of the words in the search keyword.
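Since the application treats the tfidf calculation as prior art, the sketch below shows only one common variant, with a smoothed inverse document frequency. The function name tfidf_sum, the choice of smoothing, and the use of the tokenized HTML texts of the result URLs as the document collection are all assumptions.

```python
import math
from typing import List


def tfidf_sum(words: List[str], doc: List[str], corpus: List[List[str]]) -> float:
    """Sum the tf-idf values of the keyword's words for one tokenized document."""
    total = 0.0
    n_docs = len(corpus)
    for word in words:
        # Term frequency: share of the document's tokens equal to the word.
        tf = doc.count(word) / len(doc) if doc else 0.0
        # Document frequency: number of documents that contain the word.
        df = sum(1 for other in corpus if word in other)
        # Smoothed inverse document frequency (one common variant).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        total += tf * idf
    return total


# Hypothetical tokenized texts of three result pages.
corpus = [["nlpc", "2018", "conference"], ["news", "2018"], ["weather"]]
score = tfidf_sum(["nlpc", "2018"], corpus[0], corpus)
```

The summed value is what would occupy the tfidf component of the keyword feature for that result URL.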
The first frequency is n1/N, the second frequency is n2/N, and the third frequency is n3/N.
Optionally, n3 may also be the number of times that a word appears in the first 50% of lines in the non-title tag in the HTML text corresponding to the result URL, and the specific proportion may be set according to actual requirements, for example, may also be 40%, 60%, and the like, which is not limited herein.
Optionally, the keyword feature may also include some other feature.
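A minimal sketch of the 7-dimensional keyword feature is given below. Several details are assumptions made for illustration: counts are summed over all words of the keyword, matching is case-insensitive substring matching on the raw HTML (so markup is counted too), the title is located with a regular expression, and the tfidf sum is passed in as a precomputed value. The names count_words, keyword_features and head_ratio are hypothetical; head_ratio models the optional "first 50% of lines" variant of n3.

```python
import math
import re
from typing import List

_TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def count_words(words: List[str], text: str) -> int:
    """Total number of case-insensitive occurrences of the words in text."""
    lowered = text.lower()
    return sum(lowered.count(word.lower()) for word in words)


def keyword_features(words: List[str], url: str, html: str,
                     tfidf_total: float = 0.0, head_ratio: float = 1.0) -> List[float]:
    """Return [n1, n2, n3, n1/N, n2/N, n3/N, tfidf] for one result URL."""
    n_words = len(words)
    if n_words == 0:
        raise ValueError("search keyword has no words")
    match = _TITLE.search(html)
    title = match.group(1) if match else ""
    # Text outside the title tag, optionally limited to its leading lines.
    lines = _TITLE.sub("", html).splitlines()
    kept = lines[: math.ceil(len(lines) * head_ratio)]
    n1 = count_words(words, url)
    n2 = count_words(words, title)
    n3 = count_words(words, "\n".join(kept))
    return [n1, n2, n3, n1 / n_words, n2 / n_words, n3 / n_words, tfidf_total]


# Hypothetical page content used only to exercise the function.
html = ("<html><head><title>NLPC 2018</title></head>\n<body>\n"
        "<p>Welcome to nlpc 2018</p>\n<p>2018 dates</p>\n</body></html>")
features = keyword_features(["nlpc", "2018"],
                            "http://tcci.ccf.org.cn/conference/2018/", html)
```

A production implementation would more likely parse the HTML properly and count matches in visible text only; the substring approach is kept here for brevity.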
As another implementable manner, on the basis of the first embodiment, optionally, the step 102 specifically includes:
Step 1021, inputting the feature data to be predicted into the Xgboost model and the SVM model respectively for prediction.
Specifically, for each result URL, the keyword feature and the URL identification feature are concatenated before being input into the models; for example, the 10-dimensional URL identification feature and the 7-dimensional keyword feature are concatenated into one vector and input into each model.
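The concatenation step can be sketched as below. The order of the two parts (URL identification feature first) and the function name build_feature_vector are assumptions; the sample values are hypothetical.

```python
from typing import List, Sequence


def build_feature_vector(url_id_feature: Sequence[float],
                         keyword_feature: Sequence[float]) -> List[float]:
    """Concatenate the two per-URL features into a single model input row."""
    return [float(x) for x in url_id_feature] + [float(x) for x in keyword_feature]


row = build_feature_vector([1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [1, 2, 3, 0.5, 1.0, 1.5, 0.99])
```

With a 10-dimensional URL identification feature and a 7-dimensional keyword feature, each row has 17 components.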
Step 1022, if the Xgboost model and the SVM model predict the same result URL, using that result URL as the target URL corresponding to the user search request.
Specifically, the Xgboost model and the SVM model are both used for prediction. When the two models predict the same result URL as the homepage, that result URL is used as the target URL corresponding to the user search request. Combining the two models for prediction improves the accuracy of the prediction result.
Optionally, if no target URL is found in this way, every result URL to which either model assigns a label value of 1 may be used as a target URL; or the result URL predicted by one of the models may be used as the target URL; or the probability threshold of the Xgboost model may be reset and the Xgboost model used to predict again; and so on.
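The agreement rule of step 1022 and two of the optional fallbacks can be sketched as follows. The function name select_targets and the fallback option names are hypothetical, and the fallback that resets the Xgboost threshold and predicts again is not modeled here.

```python
from typing import Dict, List


def select_targets(xgb_labels: Dict[int, int], svm_labels: Dict[int, int],
                   fallback: str = "union") -> List[int]:
    """Return the URL numbers chosen as target URLs."""
    xgb_pos = {number for number, label in xgb_labels.items() if label == 1}
    svm_pos = {number for number, label in svm_labels.items() if label == 1}
    agreed = sorted(xgb_pos & svm_pos)
    if agreed:  # both models predict the same result URL
        return agreed
    if fallback == "union":  # every URL labeled 1 by either model
        return sorted(xgb_pos | svm_pos)
    if fallback == "xgboost":  # trust only the Xgboost prediction
        return sorted(xgb_pos)
    if fallback == "svm":  # trust only the SVM prediction
        return sorted(svm_pos)
    raise ValueError("unknown fallback: " + fallback)
```

For example, if both models label URL 0 as the homepage, the result is [0]; if they disagree, the "union" fallback returns both candidates.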
As another practical way, on the basis of the first embodiment, optionally before prediction is performed by using the Xgboost model and the SVM model, the method further includes:
Step 2021, obtaining training feature data, where the training feature data includes keyword training features, URL identification training features, and label data.
Step 2022, training the pre-established Xgboost network and SVM network based on the training feature data to obtain an Xgboost model and an SVM model.
The training feature data is obtained as follows:
1. Data crawling and data annotation
(1) Search keyword preparation (a named entity plus some related information), such as: conference name, company name, school name + location, scholar name + employer, etc.
(2) Retrieving by search keyword and crawling HTML pages: each keyword is searched in a browser, advertisement URLs are removed, and the HTML pages of the URLs ranked in the top 10 of the search results are crawled (the specific number can be set according to actual needs; 10 is only an example). The URLs are numbered 0-9: the URL ranked first is number 0, the URL ranked second is number 1, and so on. Each keyword thus corresponds to one group of URLs.
(3) Labeling the URLs: each URL is labeled manually, with a label value of 1 for a homepage and 0 for a non-homepage. In the group of 10 URLs corresponding to a keyword, one URL has a label value of 1 and the other 9 have a label value of 0.
2. Feature extraction
Traversing a set of URLs, for each URL performing the following steps:
(1) according to the number of the URL, a 10-dimensional feature is created for each URL, for example, the 10-dimensional feature corresponding to the number 0 is "1000100001", which is only an exemplary illustration here, and the specific feature may be set according to actual requirements.
(2) Count the number of words in the search keyword as N. Calculate the number of times n1 that the words in the search keyword appear in the result URL and the first frequency, the number of times n2 that the words appear in the title tag of the HTML text corresponding to the result URL and the second frequency, and the number of times n3 that the words appear outside the title tag of the HTML text corresponding to the result URL and the third frequency; then calculate the tfidf value of each word and sum these values.
(3) Check whether any URLs remain untraversed. If so, perform steps (1)-(3) again on the untraversed URLs.
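The traversal above can be sketched as a loop that turns labeled URL groups into training rows. The data layout (a list of dictionaries with "keyword" and "results" keys), the one-hot encoding of the URL number, and the names build_training_set and keyword_features are assumptions; the keyword feature extractor is passed in as a callable so that the sketch stays self-contained.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def build_training_set(groups: List[Dict],
                       keyword_features: Callable[[str, str, str], Sequence[float]],
                       group_size: int = 10) -> Tuple[List[List[float]], List[int]]:
    """Build feature rows X and labels y from labeled groups of result URLs."""
    X: List[List[float]] = []
    y: List[int] = []
    for group in groups:
        results = group["results"]
        if len(results) > group_size:
            raise ValueError("group has more URLs than group_size")
        # Each group is expected to contain exactly one homepage (label 1).
        if [r["label"] for r in results].count(1) != 1:
            raise ValueError("each group needs exactly one URL labeled 1")
        for number, result in enumerate(results):
            # Assumed one-hot URL identification feature from the rank number.
            url_id = [1.0 if i == number else 0.0 for i in range(group_size)]
            kw = keyword_features(group["keyword"], result["url"], result["html"])
            X.append(url_id + [float(v) for v in kw])
            y.append(result["label"])
    return X, y
```

The resulting rows and labels can then be handed to whichever training routine is used for the two models.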
3. Model training
The pre-established Xgboost network and SVM network are trained based on the obtained keyword features and URL identification features to obtain the Xgboost model and the SVM model, which are then stored.
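A minimal training sketch is shown below, assuming the third-party xgboost and scikit-learn packages are used for the two classifiers; the application does not name any library. The hyperparameters, the function name train_models, and the factory arguments (which allow other implementations to be substituted) are all assumptions.

```python
from typing import Any, Callable, List, Optional, Tuple


def train_models(X: List[List[float]], y: List[int],
                 xgb_factory: Optional[Callable[[], Any]] = None,
                 svm_factory: Optional[Callable[[], Any]] = None) -> Tuple[Any, Any]:
    """Fit two independent classifiers on the same training rows."""
    if xgb_factory is None:
        from xgboost import XGBClassifier  # third-party package (assumed)
        xgb_factory = lambda: XGBClassifier(n_estimators=100, max_depth=4)
    if svm_factory is None:
        from sklearn.svm import SVC  # third-party package (assumed)
        # probability=True enables predict_proba for later thresholding.
        svm_factory = lambda: SVC(kernel="rbf", probability=True)
    xgb_model = xgb_factory()
    xgb_model.fit(X, y)
    svm_model = svm_factory()
    svm_model.fit(X, y)
    return xgb_model, svm_model
```

The fitted models could then be stored, for instance with Python's pickle module, and reloaded at prediction time.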
It should be noted that the respective implementable modes in the present embodiment may be implemented individually, or may be implemented in combination in any combination without conflict, and the present application is not limited thereto.
Prediction example:
1. Conference homepage identification
The homepage of "nlpc 2018" is predicted. The search keyword "nlpc 2018" is input into the search engine; FIG. 4 is a schematic diagram of the conference homepage identification search result list provided in this embodiment, that is, the list of result URLs obtained by the search server. The method provided by the embodiment of the present application is used to predict the homepage of "nlpc 2018", that is, the webpage the user most wants to access. The predicted homepage of nlpc 2018 is http://tcci.ccf.org.cn/conference/2018/. It was confirmed that the homepage address of nlpc 2018 is http://tcci.ccf.org.cn/conference/2018/, so the prediction is correct.
2. School homepage identification
The homepage of Beijing University is predicted with the search keyword "beijing university"; FIG. 5 is a schematic diagram of the school homepage identification search result list provided in this embodiment. The homepage of "Beijing university" is predicted based on the search result list. The predicted homepage of Beijing University is http://english.pku.edu.cn/. After confirmation, the homepage of Beijing University is http://english.pku.edu.cn/, so the prediction is correct.
3. Scholar homepage identification
FIG. 6 is a schematic diagram of the scholar homepage identification search result list provided in this embodiment. The homepage of the scholar "Christina C. Christara" is predicted based on the search result list. The predicted homepage of Christina C. Christara is http://www.cs.toronto.edu/~ccc/. It was confirmed that the homepage address of Christina C. Christara is http://www.cs.toronto.edu/~ccc/, so the prediction is correct.
According to the method for identifying and processing the homepage, the target URL corresponding to the user search request is determined by adopting the pre-trained Xgboost model and SVM model based on the characteristic data to be predicted, and the user search request is responded based on the target URL, so that the most useful webpage information can be provided for the user, the accuracy of the user search result is improved, and the user experience is improved.
Example three
The present embodiment provides a device for identifying and processing a homepage, which is used for executing the method of the first embodiment.
FIG. 7 is a schematic structural diagram of the homepage identification processing device provided in this embodiment. The homepage identification processing device 30 includes an acquisition module 31, a determining module 32, and a processing module 33.
The acquiring module 31 is configured to acquire feature data to be predicted corresponding to a user search request, where the feature data to be predicted includes a keyword feature and a URL identification feature; the determining module 32 is configured to determine, based on the feature data to be predicted, a target URL corresponding to the user search request by using a pre-trained Xgboost model and an SVM model; the processing module 33 is used to respond to user search requests based on the target URL.
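The division into three modules can be sketched as a simple composition of callables. The class name HomepageIdentificationDevice and the callable-based design are assumptions made for illustration; the application does not prescribe how the modules are implemented.

```python
from typing import Any, Callable


class HomepageIdentificationDevice:
    """Hypothetical composition of the three modules described above."""

    def __init__(self, acquire: Callable[[Any], Any],
                 determine: Callable[[Any], Any],
                 respond: Callable[[Any], Any]) -> None:
        self.acquire = acquire      # role of acquisition module 31
        self.determine = determine  # role of determining module 32
        self.respond = respond      # role of processing module 33

    def handle(self, search_request: Any) -> Any:
        """Run the three modules in order for one user search request."""
        features = self.acquire(search_request)
        target_urls = self.determine(features)
        return self.respond(target_urls)
```

Each callable can wrap the corresponding step of the method: feature acquisition, prediction with the two models, and construction of the response.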
The specific manner in which each module of the device in this embodiment performs its operations has been described in detail in the embodiments of the method, and is not elaborated here.
According to the device for identifying and processing the homepage provided by the embodiment, the target URL corresponding to the user search request is determined by adopting the pre-trained Xgboost model and SVM model based on the characteristic data to be predicted, and the user search request is responded based on the target URL, so that the most useful webpage information can be provided for the user, the accuracy of the user search result is improved, and the user experience is improved.
Example four
The present embodiment further supplements the description of the apparatus provided in the third embodiment.
As an implementable manner, on the basis of the third embodiment, optionally, the obtaining module is specifically configured to:
acquiring the search keyword corresponding to a user search request, and acquiring, according to the search keyword, the identifiers of at least two result URLs and the HTML text corresponding to each identifier;
and acquiring the keyword features according to the search keyword and the HTML text corresponding to each identifier, and acquiring the URL identification features according to each identifier.
Optionally, the obtaining module is specifically configured to:
acquiring the number N of words included in the search keyword;
for each result URL, calculating the number of times n1 that the words in the search keyword appear in the result URL and a first frequency, the number of times n2 that the words appear in the title tag of the HTML text corresponding to the result URL and a second frequency, and the number of times n3 that the words appear outside the title tag of the HTML text corresponding to the result URL and a third frequency, and calculating the tfidf value of each word;
the keyword features include n1, n2, n3, a first frequency, a second frequency, a third frequency, and tfidf values corresponding to each result URL.
As another implementable manner, on the basis of the third embodiment, optionally, the determining module is specifically configured to:
respectively inputting the characteristic data to be predicted into an Xgboost model and an SVM model for prediction;
and if the results predicted by the Xgboost model and the SVM model are the same result URL, taking the result URL as a target URL corresponding to the user search request.
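The agreement rule between the two models might look like the sketch below. It assumes that each model yields one score per candidate result URL and that a model's "predicted result" is its top-scoring candidate; the function name `pick_target_url`, the scoring callables, and the choice to return `None` when the models disagree are hypothetical, because the text does not say what happens in that case.

```python
from typing import Callable, List, Optional, Sequence

# A scorer maps one feature row per candidate URL to one score per candidate.
Scorer = Callable[[Sequence[Sequence[float]]], Sequence[float]]


def pick_target_url(
    urls: List[str],
    features: Sequence[Sequence[float]],
    xgb_score: Scorer,
    svm_score: Scorer,
) -> Optional[str]:
    """Return the URL that both models rank first, or None if they disagree."""
    if not urls:
        return None
    xgb_scores = xgb_score(features)
    svm_scores = svm_score(features)
    indices = range(len(urls))
    best_xgb = max(indices, key=lambda i: xgb_scores[i])
    best_svm = max(indices, key=lambda i: svm_scores[i])
    # Only accept the candidate when both models pick the same result URL.
    return urls[best_xgb] if best_xgb == best_svm else None
```

Requiring agreement trades recall for precision: a target URL is reported only when two differently biased classifiers reach the same conclusion.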
In another implementation, on the basis of the third embodiment, the apparatus optionally further includes a training module 34. Fig. 8 is a schematic structural diagram of the homepage identification processing apparatus provided in this embodiment.
The obtaining module is further configured to acquire training feature data, which includes keyword training features, URL identification training features, and labeling data. The training module is configured to train the pre-established Xgboost network and SVM network based on the training feature data to obtain the Xgboost model and the SVM model.
With regard to the apparatus in this embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
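A rough sketch of how the training feature data could be assembled and passed to the two learners is given below. The `TrainingSample` layout, the label convention (1 for a homepage, 0 otherwise), and the factory parameters `make_xgb` and `make_svm` are assumptions for illustration; the only requirement placed on the estimators here is a `fit(X, y)` method.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class TrainingSample:
    keyword_features: Sequence[float]   # e.g. n1, n2, n3, frequencies, tfidf
    url_id_features: Sequence[float]    # URL identification features
    label: int                          # assumption: 1 = homepage, 0 = not


def build_matrix(samples: Sequence[TrainingSample]) -> Tuple[List[List[float]], List[int]]:
    """Concatenate both feature groups into one row per sample."""
    X = [list(s.keyword_features) + list(s.url_id_features) for s in samples]
    y = [s.label for s in samples]
    return X, y


def train_models(
    samples: Sequence[TrainingSample],
    make_xgb: Callable[[], Any],
    make_svm: Callable[[], Any],
) -> Tuple[Any, Any]:
    """Fit both estimators on the same labelled feature matrix."""
    X, y = build_matrix(samples)
    xgb_model = make_xgb()
    xgb_model.fit(X, y)
    svm_model = make_svm()
    svm_model.fit(X, y)
    return xgb_model, svm_model
```

With the corresponding libraries installed, `make_xgb` could be `xgboost.XGBClassifier` and `make_svm` could be `sklearn.svm.SVC`, after converting `X` and `y` to NumPy arrays; both classes provide `fit(X, y)`.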
It should be noted that the respective implementable modes in the present embodiment may be implemented individually, or may be implemented in combination in any combination without conflict, and the present application is not limited thereto.
The homepage identification processing apparatus of this embodiment determines the target URL corresponding to the user search request by applying the pre-trained Xgboost model and SVM model to the feature data to be predicted, and responds to the user search request based on that target URL. This provides the user with the most useful webpage information, improves the accuracy of the search results, and improves the user experience.
Example five
The present embodiment provides a terminal device, configured to execute the method provided in the foregoing embodiment.
Fig. 9 is a schematic structural diagram of the terminal device provided in this embodiment. The terminal device 50 includes at least one processor 51 and a memory 52.
The memory stores a computer program, and the at least one processor executes the computer program stored in the memory to implement the method provided by the above embodiments.
The terminal device of this embodiment determines the target URL corresponding to the user search request by applying the pre-trained Xgboost model and SVM model to the feature data to be predicted, and responds to the user search request based on that target URL. This provides the user with the most useful webpage information, improves the accuracy of the search results, and improves the user experience.
Example six
The present embodiment provides a search server for executing the method provided by the above embodiment.
Fig. 10 is a schematic structural diagram of the search server provided in this embodiment. The search server 60 includes at least one processor 61 and a memory 62.
The memory stores a computer program, and the at least one processor executes the computer program stored in the memory to implement the method provided by the above embodiments.
The search server of this embodiment determines the target URL corresponding to the user search request by applying the pre-trained Xgboost model and SVM model to the feature data to be predicted, and responds to the user search request based on that target URL. This provides the user with the most useful webpage information, improves the accuracy of the search results, and improves the user experience.
Example seven
The present embodiment provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed, the method provided by any one of the above embodiments is implemented.
With the computer-readable storage medium of this embodiment, the target URL corresponding to the user search request is determined by applying the pre-trained Xgboost model and SVM model to the feature data to be predicted, and the user search request is responded to based on that target URL. This provides the user with the most useful webpage information, improves the accuracy of the search results, and improves the user experience.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A method for identifying a homepage, comprising:
acquiring feature data to be predicted corresponding to a user search request, wherein the feature data to be predicted comprises keyword features and URL identification features;
determining a target URL corresponding to the user search request by adopting a pre-trained Xgboost model and an SVM model based on the characteristic data to be predicted;
responding to the user search request based on the target URL.
2. The method according to claim 1, wherein the obtaining of the feature data to be predicted corresponding to the user search request comprises:
acquiring search keywords corresponding to a user search request, and acquiring identifiers of at least two result URLs and HTML texts corresponding to the identifiers according to the search keywords;
and acquiring the keyword features according to the search keywords and the HTML text corresponding to each identifier, and acquiring the URL identification features according to the identifiers.
3. The method of claim 2, wherein the obtaining the keyword features according to the search keywords and the HTML text corresponding to each identifier comprises:
acquiring the number N of words included in the search keyword;
for each result URL, calculating the number of times n1 and the first frequency with which each word in the search keyword occurs in the result URL, the number of times n2 and the second frequency with which each word occurs in a title tag of the HTML text corresponding to the result URL, and the number of times n3 and the third frequency with which each word occurs in a non-title tag of that HTML text, and calculating the tfidf value of each word;
the keyword features include n1, n2, n3, the first frequency, the second frequency, the third frequency, and the tfidf value corresponding to each result URL.
4. The method of claim 1, wherein determining a target URL corresponding to the user search request by using a pre-trained Xgboost model and a pre-trained SVM model based on the feature data to be predicted comprises:
inputting the characteristic data to be predicted into the Xgboost model and the SVM model respectively for prediction;
and if the results predicted by the Xgboost model and the SVM model are the same result URL, taking the result URL as a target URL corresponding to the user search request.
5. The method according to any one of claims 1-4, further comprising:
acquiring training characteristic data, wherein the training characteristic data comprises keyword training characteristics, URL identification training characteristics and labeling data;
and training the pre-established Xgboost network and SVM network based on the training characteristic data to obtain the Xgboost model and the SVM model.
6. An apparatus for identifying a homepage, comprising:
an obtaining module for acquiring feature data to be predicted corresponding to a user search request, wherein the feature data to be predicted comprises keyword features and URL identification features;
a determining module for determining a target URL corresponding to the user search request by adopting a pre-trained Xgboost model and a pre-trained SVM model based on the feature data to be predicted;
a processing module for responding to the user search request based on the target URL.
7. The apparatus of claim 6, wherein the obtaining module is specifically configured to:
acquiring search keywords corresponding to a user search request, and acquiring identifiers of at least two result URLs and HTML texts corresponding to the identifiers according to the search keywords;
and acquiring the keyword features according to the search keywords and the HTML text corresponding to each identifier, and acquiring the URL identification features according to the identifiers.
8. The apparatus of claim 7, wherein the obtaining module is specifically configured to:
acquiring the number N of words included in the search keyword;
for each result URL, calculating the number of times n1 and the first frequency with which each word in the search keyword occurs in the result URL, the number of times n2 and the second frequency with which each word occurs in a title tag of the HTML text corresponding to the result URL, and the number of times n3 and the third frequency with which each word occurs in a non-title tag of that HTML text, and calculating the tfidf value of each word;
the keyword features include n1, n2, n3, the first frequency, the second frequency, the third frequency, and the tfidf value corresponding to each result URL.
9. The apparatus of claim 6, wherein the determining module is specifically configured to:
inputting the characteristic data to be predicted into the Xgboost model and the SVM model respectively for prediction;
and if the results predicted by the Xgboost model and the SVM model are the same result URL, taking the result URL as a target URL corresponding to the user search request.
10. The apparatus of any one of claims 6-9, further comprising a training module;
the acquisition module is further used for acquiring training characteristic data, wherein the training characteristic data comprises keyword training characteristics, URL identification training characteristics and labeling data;
and the training module is used for training the pre-established Xgboost network and the SVM network based on the training characteristic data to obtain the Xgboost model and the SVM model.
11. A terminal device, comprising: at least one processor and memory;
the memory stores a computer program; the at least one processor executes the memory-stored computer program to implement the method of any of claims 1-5.
12. A search server, comprising: at least one processor and memory;
the memory stores a computer program; the at least one processor executes the memory-stored computer program to implement the method of any of claims 1-5.
13. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when executed, implements the method of any of claims 1-5.
CN201811321529.8A 2018-11-07 2018-11-07 Method, device, equipment and storage medium for identifying and processing homepage Pending CN111159527A (en)


Publications (1)

Publication Number Publication Date
CN111159527A (en) 2020-05-15

Family

ID=70554956


Country Status (1)

Country Link
CN (1) CN111159527A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870573A (en) * 2014-03-18 2014-06-18 北京奇虎科技有限公司 Method and device for website analysis
CN107463704A (en) * 2017-08-16 2017-12-12 北京百度网讯科技有限公司 Searching method and device based on artificial intelligence
CN107526744A (en) * 2016-06-21 2017-12-29 北京搜狗科技发展有限公司 A kind of information displaying method and device based on search
CN108182186A (en) * 2016-12-08 2018-06-19 广东精点数据科技股份有限公司 A kind of Web page sequencing method based on random forests algorithm



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230705

Address after: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New District, Zhuhai City, Guangdong Province, 519030

Applicant after: New founder holdings development Co.,Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Applicant before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Applicant before: PKU FOUNDER INFORMATION INDUSTRY GROUP CO.,LTD.
