CN114036263A - Website identification method and device and electronic equipment - Google Patents


Publication number
CN114036263A
Authority
CN
China
Prior art keywords
website
candidate
target type
occurrence frequency
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111362224.3A
Other languages
Chinese (zh)
Inventor
薛昌熵
杨骏伟
刘晓庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111362224.3A
Publication of CN114036263A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a website identification method, a website identification device and electronic equipment, and relates to the field of data processing, in particular to the technical field of data mining. The specific implementation scheme is as follows: acquiring a plurality of keywords of a target type website; aiming at each keyword, performing website retrieval by using the keyword to obtain a retrieval result of the keyword; determining the occurrence frequency of the candidate websites in each retrieval result; and if the occurrence frequency is greater than a preset first threshold value, determining the candidate website as the website of the target type. The efficiency of website identification can be improved.

Description

Website identification method and device and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to the field of data mining technologies.
Background
According to actual requirements, a user can divide the websites into a plurality of different types according to different classification standards. For example, websites are classified into various categories according to the industry they serve, such as websites serving the financial industry, websites serving the tourism industry, and so on.
Disclosure of Invention
The disclosure provides a website identification method and device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a website identification method, including:
acquiring a plurality of keywords of a target type website;
aiming at each keyword, performing website retrieval by using the keyword to obtain a retrieval result of the keyword;
determining the occurrence frequency of the candidate websites in each retrieval result;
and if the occurrence frequency is greater than a preset first threshold value, determining the candidate website as the website of the target type.
According to a second aspect of the present disclosure, there is provided a website identifying apparatus including:
the keyword acquisition module is used for acquiring a plurality of keywords of the target type website;
the retrieval module is used for carrying out website retrieval on each keyword by using the keyword to obtain a retrieval result of the keyword;
the occurrence frequency counting module is used for determining the occurrence frequency of the candidate websites in each retrieval result;
and the first judgment module is used for determining the candidate website as the target type website if the occurrence frequency is greater than a preset first threshold value.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to any one of the above first aspects.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above first aspects.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow chart diagram of a website identification method provided in accordance with the present disclosure;
FIG. 2 is a schematic flow chart diagram of another website identification method provided in accordance with the present disclosure;
FIG. 3 is a schematic flow chart diagram of another website identification method provided in accordance with the present disclosure;
FIG. 4 is a schematic diagram of a website identification apparatus according to the present disclosure;
FIG. 5 is a block diagram of an electronic device for implementing a website identification method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to illustrate the website identification method provided by the present disclosure more clearly, one possible application scenario is described below by way of example. It is understood that the following example is only one possible application scenario; in other possible embodiments, the website identification method provided by the present disclosure may also be applied to other application scenarios, and the following example does not limit this.
In order to manage and analyze websites, a user needs to identify the category of a website. For convenience of description, it is assumed that the user needs to identify whether a website is a website serving a target industry (hereinafter referred to as an industry website). The titles of a plurality of industry websites and the keywords in their content can be extracted in advance as target features, and a recognition model can be trained on the target features. The recognition model maps input features to a recognition result, and the recognition result indicates whether the website to which the input features belong is an industry website.
The title of a website to be identified and the keywords in its content are then extracted as the features to be identified, where a website to be identified is a website that has not yet been determined to be an industry website. The features to be identified are input into the pre-trained recognition model to obtain the recognition result output by the model, and whether the website to be identified is an industry website is determined according to the recognition result.
However, in this scheme, each time a website is identified, its title and the keywords in its content must be extracted to obtain the features to be identified, and those features must then be mapped by the recognition model. This occupies considerable computing resources, so the recognition efficiency is low. With limited computing resources it is difficult to identify a large number of websites, so the scheme is applicable only to scenarios in which a small number of websites are identified.
Based on this, the present disclosure provides a website identification method, which may be applied to any electronic device with website identification capability, including but not limited to a personal computer, a server, and the like, and the website identification method may be as shown in fig. 1, including:
S101, acquiring a plurality of keywords of the target type website.
S102, for each keyword, performing website retrieval by using the keyword to obtain a retrieval result of the keyword.
S103, determining the occurrence frequency of the candidate website in the retrieval results.
S104, if the occurrence frequency is greater than a preset first threshold, determining the candidate website as a target type website.
In this embodiment, since the keywords used for retrieval are keywords of the target type website, a website appearing in a retrieval result obtained with such a keyword can be considered to have a certain similarity to the target type website. Therefore, the greater the occurrence frequency of a candidate website across the retrieval results, the higher the similarity between the candidate website and the target type website. When the occurrence frequency of the candidate website is greater than the preset first threshold, the similarity is considered sufficiently high, and the candidate website can be determined to be a target type website.
In addition, the identification process does not need to extract the website features of each candidate website; whether a candidate website is a target type website is judged by keyword retrieval and retrieval result statistics. Keyword retrieval and retrieval result statistics can process a large number of candidate websites in batches, so the website identification method provided by the present disclosure can identify a large number of candidate websites in batches, which effectively improves identification efficiency. That is, the method can improve the efficiency of website identification while still identifying websites accurately.
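Steps S101 to S104 can be sketched as follows. This is a minimal illustration rather than the implementation of the disclosure: the function name `identify_websites`, the `search` callable and the in-memory `FAKE_INDEX` that stands in for a search engine are all assumptions introduced here.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Set


def identify_websites(
    keywords: Iterable[str],
    search: Callable[[str], Set[str]],
    first_threshold: float,
) -> Dict[str, float]:
    """Return websites whose occurrence frequency exceeds first_threshold."""
    # S102: one retrieval per keyword; each retrieval result is a set of websites.
    results = [search(keyword) for keyword in keywords]
    # S103: count the number of retrieval results in which each website occurs.
    counts = Counter(site for result in results for site in result)
    total = len(results)
    # S104: keep websites whose frequency is strictly greater than the threshold.
    return {
        site: count / total
        for site, count in counts.items()
        if count / total > first_threshold
    }


# Hypothetical stand-in for a search engine: keyword -> set of retrieved websites.
FAKE_INDEX = {
    "loan": {"bank.example", "fund.example", "news.example"},
    "deposit": {"bank.example", "fund.example"},
    "interest rate": {"bank.example", "news.example"},
    "stock": {"fund.example", "bank.example"},
}

identified = identify_websites(FAKE_INDEX, lambda k: FAKE_INDEX[k], 0.7)
```

Note that all candidate websites are scored in a single pass over the n retrieval results, which is what allows batch identification.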
For example, take a scenario in which m websites are to be identified. With the method in the foregoing example, the features of the m websites must be extracted and input into the recognition model one by one, the recognition model must perform m mappings to obtain the recognition results, and the websites belonging to the target category among the m websites are determined according to those results.
Assuming the number of keywords is n, the website identification method provided by the present disclosure needs only n retrievals. The retrieval results obtained by the n retrievals are counted, and the websites belonging to the target category among the m websites can be determined according to the counted occurrence frequency of each website.
Assume that the calculation amount required to extract the features of one website and perform one mapping with the recognition model is A, and the calculation amount required to perform one retrieval with one keyword is B. Then the method of the foregoing example consumes approximately m × A, whereas the website identification method provided by the present disclosure consumes approximately n × B.
It can be understood that when m is large, n is often much smaller than m. For example, in practice a user may need to identify tens of thousands or hundreds of thousands of websites, while the number of keywords used for retrieval is often far below ten thousand. Moreover, the calculation amount of one model mapping is often larger than or similar to that of one retrieval, so m × A is clearly larger than n × B, and the gap between the two grows as m increases, that is, as the number of websites to be identified increases.
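The comparison can be made concrete with a small calculation. The figures below (100,000 websites, 200 keywords, and equal unit costs A = B = 1) are illustrative assumptions, not values taken from the disclosure.

```python
def model_cost(m: int, a: float) -> float:
    """Approximate cost of the model-based scheme: one mapping per website."""
    return m * a


def retrieval_cost(n: int, b: float) -> float:
    """Approximate cost of the retrieval-based scheme: one retrieval per keyword."""
    return n * b


m, n = 100_000, 200  # assumed number of websites and of keywords
A, B = 1.0, 1.0      # assumed per-unit costs, taken as equal here

ratio = model_cost(m, A) / retrieval_cost(n, B)
```

Under these assumptions the model-based scheme costs m / n = 500 times as much, and the ratio grows linearly with m when n is fixed.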
The foregoing S101 to S104 will be described below, respectively:
In S101, the target type website may be any type of website, such as an industry website serving a specific industry, or a functional website providing a specific function. The plurality of keywords may be input by the user based on experience, or extracted from websites of the target type.
For example, in one possible embodiment, the user considers from practical experience that the keywords A, B and C often appear in websites of the target category, and inputs the keywords A to C through an operation instruction, so that the execution subject obtains the keywords A to C. In another possible embodiment, the execution subject, or a device other than the execution subject, extracts the words present in a plurality of target type websites, counts the occurrence frequency of each extracted word, and selects a plurality of words as keywords in descending order of occurrence frequency.
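The second embodiment, selecting keywords by word frequency, might look like the sketch below. The function name `top_keywords` is hypothetical, and splitting on whitespace is a simplifying assumption; text in languages without word delimiters would need a proper word segmenter.

```python
from collections import Counter
from typing import Iterable, List


def top_keywords(site_texts: Iterable[str], k: int) -> List[str]:
    """Return the k most frequent words across the given website texts."""
    counts: Counter = Counter()
    for text in site_texts:
        # Assumption: lowercase and split on whitespace as a crude tokenizer.
        counts.update(text.lower().split())
    # most_common sorts from the highest count to the lowest.
    return [word for word, _ in counts.most_common(k)]


texts = ["loan deposit loan", "Loan deposit rate", "loan"]
keywords = top_keywords(texts, 2)
```

In practice one would probably also filter out stop words before ranking, which the disclosure does not discuss.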
In S102, each keyword is input to a search engine to perform website retrieval, and the set of retrieved websites is used as the retrieval result of that keyword. For example, if website retrieval with keyword A returns 100 websites in total, denoted website 1 to website 100, the set of websites 1 to 100 is the retrieval result of keyword A.
It can be understood that when a keyword is used for website retrieval, each website in the obtained retrieval result theoretically contains the keyword or is associated with it. Because the keywords are keywords of the target type website, they reflect the features of the target type website to a certain extent. Therefore, each website in a retrieval result is similar to the target type website in a certain feature dimension.
In S103, a website appearing in a retrieval result means that the retrieval result includes the website. The occurrence frequency is the ratio of the number of retrieval results in which the candidate website appears to the total number of retrieval results. For example, assuming that there are 100 retrieval results in total and the candidate website appears in 80 of them, the occurrence frequency of the candidate website is 80/100 = 0.8.
Although the occurrence frequency in the present disclosure refers to the ratio of the number of occurrences to the number of retrieval results, it may be expressed in any form that reflects this ratio. For example, the occurrence frequency may be expressed as "high" when the ratio belongs to [0.7, 1], as "medium" when the ratio belongs to (0.3, 0.7), and as "low" when the ratio belongs to [0, 0.3].
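The ratio and its categorical form can be sketched as follows, using the intervals [0.7, 1], (0.3, 0.7) and [0, 0.3] for "high", "medium" and "low". The function names are hypothetical, and the data reproduces the 80-out-of-100 example above.

```python
from typing import List, Set


def occurrence_frequency(candidate: str, results: List[Set[str]]) -> float:
    """Share of retrieval results that contain the candidate website."""
    if not results:
        return 0.0
    hits = sum(1 for result in results if candidate in result)
    return hits / len(results)


def frequency_label(ratio: float) -> str:
    """Map a ratio in [0, 1] to 'high', 'medium' or 'low'."""
    if ratio >= 0.7:
        return "high"
    if ratio > 0.3:
        return "medium"
    return "low"


# 100 retrieval results, with the candidate present in 80 of them.
results = [{"candidate.example"} for _ in range(80)] + [set() for _ in range(20)]
ratio = occurrence_frequency("candidate.example", results)
label = frequency_label(ratio)
```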
In S104, as in the analysis in S102 described above, each website in the search result is similar to a website of the target type in a certain feature dimension, and therefore, if the frequency of occurrence of the candidate website is higher, the candidate website may be considered to be similar to a website of the target type in more feature dimensions. When the occurrence frequency is greater than the preset first threshold, the candidate website can be considered to be similar to the target type website in more feature dimensions, and therefore the candidate website can be considered to be the target type website at this time.
If the occurrence frequency is not greater than the preset first threshold, in one possible embodiment, it may be determined that the candidate website is not a target type website. In another possible embodiment, it may also be further determined whether the candidate website is a target type website. Illustratively, as shown in fig. 2, the method includes:
S201, acquiring a plurality of keywords of the target type website.
The step is the same as the step S101, and reference may be made to the related description of the step S101, which is not described herein again.
S202, aiming at each keyword, carrying out website retrieval by the keyword to obtain a retrieval result of the keyword.
The step is the same as the step S102, and reference may be made to the related description of the step S102, which is not described herein again.
S203, determining the frequency of the candidate websites appearing in each retrieval result.
The step is the same as the step S103, and reference may be made to the related description of the step S103, which is not described herein again.
S204, if the occurrence frequency is greater than a preset first threshold, determining the candidate website as a target type website.
The step is the same as the step S104, and reference may be made to the related description of the step S104, which is not described herein again.
S205, if the occurrence frequency is not greater than the preset first threshold, determining, among all the retrieval results, the retrieval results in which a seed website appears as seed retrieval results.
Here, a seed website is a website known to be of the target type. There may be one or more seed websites. If there is one seed website, a seed website appearing in a retrieval result means that this seed website appears in the retrieval result. If there are multiple seed websites, it means that at least x seed websites appear in the retrieval result at the same time, where x is any positive integer in the range [1, r] and r is the number of seed websites.
The seed websites can be set by the user according to actual needs or experience, or determined according to preset rules. For example, if the user knows from experience that websites A and B are target type websites, websites A and B may be set as seed websites. For another example, the execution subject, or a device other than the execution subject, may count the traffic of multiple target type websites within a preset time window and select at least one of them as a seed website in descending order of traffic.
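Selecting the seed retrieval results in S205 can be sketched as a filter over the retrieval results. The function name and the `min_seeds` parameter (playing the role of x) are assumptions made for illustration.

```python
from typing import List, Set


def seed_retrieval_results(
    results: List[Set[str]], seeds: Set[str], min_seeds: int = 1
) -> List[Set[str]]:
    """Keep the retrieval results containing at least min_seeds seed websites."""
    # x must lie in [1, r], where r is the number of seed websites.
    if not 1 <= min_seeds <= len(seeds):
        raise ValueError("min_seeds must be between 1 and the number of seeds")
    return [result for result in results if len(result & seeds) >= min_seeds]


results = [{"a.example", "s1.example"}, {"s1.example", "s2.example"}, {"c.example"}]
seeds = {"s1.example", "s2.example"}

with_one_seed = seed_retrieval_results(results, seeds, min_seeds=1)
with_two_seeds = seed_retrieval_results(results, seeds, min_seeds=2)
```

A larger x makes the filter stricter: fewer retrieval results qualify, but each qualifying result is more strongly tied to the target type.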
S206, determining the co-occurrence frequency of the candidate website in the seed retrieval results.
The co-occurrence frequency is the ratio of the number of seed retrieval results in which the candidate website appears to the total number of seed retrieval results. For example, assuming that there are 50 seed retrieval results in total and the candidate website appears in 40 of them, the co-occurrence frequency is 40/50 = 0.8.
Similarly to the aforementioned occurrence frequency, the co-occurrence frequency in the present disclosure refers to the ratio of the number of co-occurrences to the number of seed retrieval results, but it may be expressed in any form that reflects this ratio.
S207, if the co-occurrence frequency is greater than a preset second threshold, determining the candidate website as a target type website.
The preset second threshold may be equal to the preset first threshold, may also be greater than the preset first threshold, and may also be smaller than the preset first threshold, which is not limited in this embodiment.
It can be understood that, limited by the way the keywords are obtained, some of the obtained keywords may not accurately reflect the features of the target type website. As a result, even if a candidate website is a target type website, it may not appear in the retrieval results of those keywords, and its occurrence frequency is therefore low.
A seed retrieval result includes a seed website, and the seed website is a target type website. Therefore, if the retrieval result of a keyword is a seed retrieval result, the keyword can be considered to reflect the features of the target type website relatively accurately. Compared with the occurrence frequency, the co-occurrence frequency of the candidate website in the seed retrieval results thus reflects the similarity between the candidate website and the target type website more accurately. Identifying candidate websites according to the co-occurrence frequency effectively reduces the probability of identification errors; that is, this embodiment further improves the accuracy of website identification.
For example, assume that the preset first threshold and the preset second threshold are both 0.75, that there are 100 keywords in total, of which 30 cannot accurately reflect the features of the target type website, and that the occurrence frequency of the candidate website is 0.63.
If a candidate website were determined not to be a target type website whenever its occurrence frequency is not greater than the preset first threshold, then in this example the candidate website would be determined not to be a target type website, because 0.63 is less than 0.75.
It can be understood that, since the 30 keywords (hereinafter referred to as invalid keywords) cannot accurately reflect the features of the target type website, a candidate website that is a target type website will, with high probability, not appear in the retrieval results of the 30 invalid keywords. Its occurrence frequency may therefore be below the preset first threshold even though it is a target type website, so determining that the candidate website is not a target type website solely because its occurrence frequency is below the preset first threshold is not accurate enough.
In this embodiment, assume that the retrieval results of the remaining 70 keywords (hereinafter referred to as valid keywords) are the seed retrieval results. If the co-occurrence frequency of the candidate website is 0.8, the candidate website is determined to be a target type website, because the co-occurrence frequency is greater than the preset second threshold.
It is understood that in this case the candidate website appears in 56 of the 70 seed retrieval results and in 7 of the retrieval results of the 30 invalid keywords. Since the invalid keywords cannot accurately reflect the features of the target type website, their retrieval results do not reflect the similarity between the candidate website and the target type website. The candidate website can therefore be considered similar to the target type website in 56 of 70 feature dimensions, that is, in most feature dimensions, so it is relatively accurate to determine that the candidate website is a target type website.
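The numbers in this example can be reproduced with synthetic data. The sketch below builds 70 seed retrieval results (the candidate appears in 56 of them) and 30 other retrieval results (the candidate appears in 7), then applies the first and second thresholds in turn. The website names and the `frequency` helper are hypothetical.

```python
from typing import List, Set


def frequency(candidate: str, results: List[Set[str]]) -> float:
    """Share of the given retrieval results that contain the candidate."""
    if not results:
        return 0.0
    return sum(candidate in result for result in results) / len(results)


SEED, CANDIDATE = "seed.example", "candidate.example"

# 70 valid keywords: every result contains the seed; 56 also contain the candidate.
valid = [{SEED, CANDIDATE} for _ in range(56)] + [{SEED} for _ in range(14)]
# 30 invalid keywords: no seed website; the candidate appears in 7 results.
invalid = [{CANDIDATE} for _ in range(7)] + [set() for _ in range(23)]
results = valid + invalid

occurrence = frequency(CANDIDATE, results)            # 63 / 100
seed_results = [r for r in results if SEED in r]      # the 70 seed retrieval results
co_occurrence = frequency(CANDIDATE, seed_results)    # 56 / 70

FIRST_THRESHOLD = SECOND_THRESHOLD = 0.75
is_target = occurrence > FIRST_THRESHOLD or co_occurrence > SECOND_THRESHOLD
```

The first test fails (0.63 is not greater than 0.75), but the second succeeds (0.8 is greater than 0.75), so the candidate is identified as a target type website.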
In one possible embodiment, if the co-occurrence frequency is not greater than the preset second threshold, the candidate website is determined not to be a target type website.
It will be appreciated that although all target type websites theoretically share some common features (hereinafter referred to as commonalities), each target type website often also has some features of its own (hereinafter referred to as personalities). Therefore, even a keyword that accurately reflects the features of a target type website may reflect only the personality of that website rather than the commonalities of the target type websites.
Consequently, even if the retrieval result of a keyword is a seed retrieval result, the keyword may reflect only the personality of the seed website. In that case, a candidate website that is a target type website may still be absent from the retrieval result of the keyword because its personality differs from that of the seed website, which lowers the co-occurrence frequency of the candidate website.
It can be seen that even if the candidate website is a target type website, its co-occurrence frequency may be lower than the preset second threshold. Therefore, determining that the candidate website is not a target type website merely because the co-occurrence frequency is not greater than the preset second threshold is not accurate enough.
Based on this, the present disclosure provides a website identification method, which may be as shown in fig. 3, including:
S301, acquiring a plurality of keywords of the target type website.
The step is the same as the step S101, and reference may be made to the related description of the step S101, which is not described herein again.
S302, aiming at each keyword, performing website retrieval by using the keyword to obtain a retrieval result of the keyword.
The step is the same as the step S102, and reference may be made to the related description of the step S102, which is not described herein again.
S303, determining the occurrence frequency of the candidate websites in each retrieval result.
The step is the same as the step S103, and reference may be made to the related description of the step S103, which is not described herein again.
S304, if the occurrence frequency is larger than a preset first threshold value, determining the candidate website as a target type website.
The step is the same as the step S104, and reference may be made to the related description of the step S104, which is not described herein again.
S305, if the occurrence frequency is not greater than the preset first threshold, determining, among all the retrieval results, the retrieval results in which a seed website appears as seed retrieval results.
The step is the same as the step S205, and reference may be made to the related description of the step S205, which is not described herein again.
S306, determining the co-occurrence frequency of the candidate websites appearing in the seed retrieval result.
The step is the same as the step S206, and reference may be made to the related description of the step S206, which is not described herein again.
S307, if the co-occurrence frequency is larger than a preset second threshold value, determining the candidate website as a target type website.
The step is the same as S207, and reference may be made to the related description of S207, which is not described herein again.
S308, if the co-occurrence frequency is not greater than the preset second threshold, extracting the website features of the candidate website.
The website features include, but are not limited to, any one or more of the following: titles, keywords in the content, referenced links, pictures, tables, styles, feature vectors extracted by any feature extraction algorithm, etc.
S309, determining the confidence degree of the candidate website as the website of the target type according to the matching degree between the website characteristics of the candidate website and the website characteristics of the website of the target type.
The way of calculating the confidence level may be different according to different application scenarios, but it should be ensured that the confidence level is positively correlated with the matching degree. The positive correlation between the confidence degree and the matching degree means that the greater the matching degree, the greater the confidence degree under the condition that other factors influencing the confidence degree are unchanged.
For example, in one possible embodiment, the confidence is the degree of matching k, where k is a preset coefficient and k is greater than 0. In another possible embodiment, the confidence may be determined according to the matching degree, the occurrence frequency and the co-occurrence frequency, and the matching degree is positively correlated with the occurrence frequency and the co-occurrence frequency, for example, the confidence is a1Degree of matching + a2Frequency of occurrence + a3Co-occurrence frequency of, wherein, a1、a2、a3Is a predetermined coefficient, and a1、a2、a3Are all greater than 0.
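The weighted form of the confidence can be sketched as below. The default coefficient values 0.6, 0.2 and 0.2 are arbitrary assumptions chosen only so that the example runs; the disclosure requires only that the coefficients be greater than 0.

```python
def confidence(
    matching: float,
    occurrence: float,
    co_occurrence: float,
    a1: float = 0.6,
    a2: float = 0.2,
    a3: float = 0.2,
) -> float:
    """Weighted sum of matching degree, occurrence and co-occurrence frequency."""
    # All coefficients must be positive so that the confidence is positively
    # correlated with each of the three inputs.
    if min(a1, a2, a3) <= 0:
        raise ValueError("coefficients must all be greater than 0")
    return a1 * matching + a2 * occurrence + a3 * co_occurrence


score = confidence(matching=0.9, occurrence=0.63, co_occurrence=0.8)
```

Because every coefficient is positive, raising any one input while holding the others fixed raises the confidence, which is the positive correlation required above.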
With this embodiment, the confidence is determined by combining the occurrence frequency and the co-occurrence frequency. As analyzed above, the occurrence frequency and the co-occurrence frequency each reflect, to some extent, the similarity between the candidate website and websites of the target type, so the higher they are, the higher the confidence that the candidate website is a website of the target type. Combining them therefore improves the accuracy of the determined confidence, so that candidate websites can be classified more accurately.
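The two confidence formulas above can be illustrated with a minimal Python sketch. The disclosure does not specify how the matching degree itself is computed, so the Jaccard similarity used here is only an assumption for illustration, as are the function names and the default coefficient values.

```python
def matching_degree(candidate_features: set[str], target_features: set[str]) -> float:
    """Assumed matching measure: Jaccard similarity between two feature sets.

    The disclosure leaves the matching computation open; this is one option.
    """
    combined = candidate_features | target_features
    if not combined:
        return 0.0  # no features on either side, so nothing matches
    return len(candidate_features & target_features) / len(combined)


def confidence_scaled(match: float, k: float = 1.0) -> float:
    """First embodiment: confidence = k * matching degree, with k > 0."""
    if k <= 0:
        raise ValueError("k must be greater than 0")
    return k * match


def confidence_combined(
    match: float,
    occurrence: float,
    co_occurrence: float,
    a1: float = 0.6,  # hypothetical coefficient values, not from the disclosure
    a2: float = 0.2,
    a3: float = 0.2,
) -> float:
    """Second embodiment: a1*match + a2*occurrence + a3*co_occurrence."""
    if min(a1, a2, a3) <= 0:
        raise ValueError("a1, a2 and a3 must all be greater than 0")
    return a1 * match + a2 * occurrence + a3 * co_occurrence
```

Because every coefficient is positive, raising any one input while holding the others fixed raises the confidence, which is the positive correlation the text requires.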
S310, if the confidence is greater than a preset third threshold, determining the candidate website as a website of the target type.
It is understood that the confidence level reflects the probability that the candidate website is the target type website, and when the confidence level is greater than the preset third threshold value, the probability that the candidate website is the target type website may be considered to be high enough, and the candidate website may be considered to be the target type website.
With this embodiment, when the occurrence frequency and the co-occurrence frequency cannot accurately determine whether the candidate website is a website of the target type, the features of the candidate website are further extracted, and whether the candidate website is a website of the target type is determined by feature matching. This further reduces the cases in which a candidate website that is in fact of the target type is mistakenly determined to be of a non-target type, and thus further improves the accuracy of website identification. Meanwhile, because this embodiment only needs to extract features and perform feature matching for candidate websites whose occurrence frequency is lower than the preset first threshold and whose co-occurrence frequency is lower than the preset second threshold, the amount of computation required is still low compared with the method in the foregoing example application scenario.
For the case where the confidence level is not greater than the preset third threshold, in one possible embodiment, it may be determined that the candidate website is not a target type of website. In another possible embodiment, other means may be further adopted to identify the candidate website, which is not limited in this disclosure.
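The three-stage decision described in this embodiment can be sketched as a cascade in which each later, more expensive stage runs only if the earlier one did not already accept the candidate. This is a minimal sketch under stated assumptions: the names `Thresholds` and `identify_candidate` are hypothetical, the later stages are passed in as callables so they are evaluated lazily, a value exactly equal to a threshold falls through to the next stage (the disclosure does not address equality), and a confidence not greater than the third threshold is treated as "not a target type website", which is only one of the embodiments mentioned above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Thresholds:
    first: float   # compared with the occurrence frequency
    second: float  # compared with the co-occurrence frequency
    third: float   # compared with the confidence


def identify_candidate(
    occurrence: float,
    get_co_occurrence: Callable[[], float],
    get_confidence: Callable[[], float],
    thresholds: Thresholds,
) -> tuple[bool, str]:
    """Return (is_target_type, name of the stage that made the decision)."""
    # Stage 1: occurrence frequency across all keyword retrieval results.
    if occurrence > thresholds.first:
        return True, "occurrence"

    # Stage 2 (S307): co-occurrence frequency in the seed retrieval results.
    if get_co_occurrence() > thresholds.second:
        return True, "co-occurrence"

    # Stage 3 (S308 to S310): feature extraction and matching, only reached
    # by candidates that the two cheaper stages could not accept.
    return get_confidence() > thresholds.third, "feature-matching"
```

Passing the later stages as callables reflects the computational argument made above: feature extraction is performed only for candidates that reach the third stage.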
Referring to fig. 4, fig. 4 is a schematic structural diagram of a website identification apparatus provided in the present disclosure, which may include:
a keyword obtaining module 401, configured to obtain a plurality of keywords of a target type website;
a retrieval module 402, configured to perform, for each keyword, website retrieval using the keyword to obtain a retrieval result of the keyword;
an occurrence frequency statistics module 403, configured to determine an occurrence frequency of the candidate websites appearing in each search result;
a first determining module 404, configured to determine the candidate website as the target type website if the frequency of occurrence is greater than a preset first threshold.
In a possible embodiment, further comprising:
a screening module, configured to determine, if the occurrence frequency is less than the preset first threshold, the retrieval results that contain a seed website, among all the retrieval results, as seed retrieval results, where the seed website is a preset website of the target type;
a co-occurrence frequency statistics module, configured to determine the co-occurrence frequency with which the candidate website appears in the seed retrieval results;
a second determining module, configured to determine the candidate website as a website of the target type if the co-occurrence frequency is greater than a preset second threshold.
In a possible embodiment, further comprising:
a feature extraction module, configured to extract the website features of the candidate website if the co-occurrence frequency is less than the preset second threshold;
a feature matching module, configured to determine the confidence that the candidate website is a website of the target type according to the matching degree between the website features of the candidate website and the website features of websites of the target type, where the confidence is positively correlated with the matching degree;
a third determining module, configured to determine the candidate website as a website of the target type if the confidence is greater than a preset third threshold.
In a possible embodiment, the feature matching module is specifically configured to determine the confidence that the candidate website is a website of the target type according to the occurrence frequency, the co-occurrence frequency, and the matching degree between the website features of the candidate website and the website features of websites of the target type, where the confidence is positively correlated with the occurrence frequency and with the co-occurrence frequency.
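The occurrence frequency statistics, screening, and co-occurrence frequency statistics modules described above could be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, retrieval results are modeled as a mapping from each keyword to the list of websites returned for it, and each frequency is assumed to be the fraction of retrieval results that contain the candidate (this section does not fix whether a frequency is a count or a ratio).

```python
def occurrence_frequency(candidate: str, results: dict[str, list[str]]) -> float:
    """Fraction of keyword retrieval results in which the candidate appears."""
    if not results:
        return 0.0
    hits = sum(1 for sites in results.values() if candidate in sites)
    return hits / len(results)


def seed_results(results: dict[str, list[str]], seeds: set[str]) -> dict[str, list[str]]:
    """Screening: keep only retrieval results that contain at least one seed website."""
    return {kw: sites for kw, sites in results.items() if seeds.intersection(sites)}


def co_occurrence_frequency(
    candidate: str, results: dict[str, list[str]], seeds: set[str]
) -> float:
    """Fraction of seed retrieval results in which the candidate also appears."""
    return occurrence_frequency(candidate, seed_results(results, seeds))


# Hypothetical data: four keywords and the websites each retrieval returned.
example_results = {
    "k1": ["seed.example", "cand.example"],
    "k2": ["cand.example", "other.example"],
    "k3": ["seed.example"],
    "k4": ["other.example"],
}
```

With `seeds = {"seed.example"}`, only the results for `k1` and `k3` are seed retrieval results, so the co-occurrence frequency is computed over those two results rather than all four.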
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. The RAM 503 may also store various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the methods and processes described above, such as the website identification method. For example, in some embodiments, the website identification method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the website identification method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the website identification method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (11)

1. A website identification method, comprising:
acquiring a plurality of keywords of a target type website;
for each keyword, performing website retrieval using the keyword to obtain a retrieval result of the keyword;
determining the occurrence frequency of the candidate websites in each retrieval result;
and if the occurrence frequency is greater than a preset first threshold value, determining the candidate website as the website of the target type.
2. The method of claim 1, further comprising:
if the occurrence frequency is less than the preset first threshold, determining the retrieval results that contain a seed website, among all the retrieval results, as seed retrieval results, wherein the seed website is a preset website of the target type;
determining the co-occurrence frequency of the candidate websites in each seed retrieval result;
and if the co-occurrence frequency is greater than a preset second threshold value, determining the candidate website as the target type website.
3. The method of claim 2, further comprising:
if the co-occurrence frequency is smaller than the preset second threshold, extracting the website characteristics of the candidate website;
determining the confidence degree of the candidate website as the website of the target type according to the matching degree between the website features of the candidate website and the website features of the website of the target type, wherein the confidence degree is positively correlated with the matching degree;
and if the confidence coefficient is larger than a preset third threshold value, determining the candidate website as the website of the target type.
4. The method of claim 3, wherein the determining the confidence level that the candidate website is the target type website according to the matching degree between the website features of the candidate website and the website features of the target type website comprises:
determining the confidence level that the candidate website is the website of the target type according to the occurrence frequency, the co-occurrence frequency, and the matching degree between the website features of the candidate website and the website features of the website of the target type, wherein the confidence level is positively correlated with the occurrence frequency and the confidence level is positively correlated with the co-occurrence frequency.
5. A website identification apparatus comprising:
a keyword acquisition module, configured to acquire a plurality of keywords of a target type website;
a retrieval module, configured to perform, for each keyword, website retrieval using the keyword to obtain a retrieval result of the keyword;
an occurrence frequency statistics module, configured to determine the occurrence frequency of the candidate websites in each retrieval result;
and a first determining module, configured to determine the candidate website as the target type website if the occurrence frequency is greater than a preset first threshold.
6. The apparatus of claim 5, further comprising:
a screening module, configured to determine, if the occurrence frequency is less than the preset first threshold, the retrieval results that contain a seed website, among all the retrieval results, as seed retrieval results, wherein the seed website is a preset target type website;
a co-occurrence frequency statistics module, configured to determine the co-occurrence frequency of the candidate websites in each seed retrieval result;
and a second determining module, configured to determine the candidate website as the target type website if the co-occurrence frequency is greater than a preset second threshold.
7. The apparatus of claim 6, further comprising:
a feature extraction module, configured to extract website features of the candidate websites if the co-occurrence frequency is less than the preset second threshold;
a feature matching module, configured to determine the confidence level that the candidate website is the website of the target type according to the matching degree between the website features of the candidate website and the website features of the website of the target type, wherein the confidence level is positively correlated with the matching degree;
and a third determining module, configured to determine the candidate website as the target type website if the confidence level is greater than a preset third threshold.
8. The apparatus according to claim 7, wherein the feature matching module is specifically configured to determine a confidence level that the candidate website is the target type website according to a matching degree between the website features of the candidate website and the website features of the target type website, the frequency of occurrence, and the frequency of co-occurrence, the confidence level being positively correlated with the frequency of occurrence, and the confidence level being positively correlated with the frequency of co-occurrence.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-4.
CN202111362224.3A 2021-11-17 2021-11-17 Website identification method and device and electronic equipment Pending CN114036263A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111362224.3A CN114036263A (en) 2021-11-17 2021-11-17 Website identification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111362224.3A CN114036263A (en) 2021-11-17 2021-11-17 Website identification method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114036263A true CN114036263A (en) 2022-02-11

Family

ID=80137951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111362224.3A Pending CN114036263A (en) 2021-11-17 2021-11-17 Website identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114036263A (en)

Similar Documents

Publication Publication Date Title
CN110020422B (en) Feature word determining method and device and server
CN113590645B (en) Searching method, searching device, electronic equipment and storage medium
CN114549874A (en) Training method of multi-target image-text matching model, image-text retrieval method and device
CN112989235B (en) Knowledge base-based inner link construction method, device, equipment and storage medium
CN112380847A (en) Interest point processing method and device, electronic equipment and storage medium
CN113806660A (en) Data evaluation method, training method, device, electronic device and storage medium
CN113963197A (en) Image recognition method and device, electronic equipment and readable storage medium
CN112699237B (en) Label determination method, device and storage medium
CN112989170A (en) Keyword matching method applied to information search, information search method and device
CN114691918B (en) Radar image retrieval method and device based on artificial intelligence and electronic equipment
CN114491232B (en) Information query method and device, electronic equipment and storage medium
CN114647739B (en) Entity chain finger method, device, electronic equipment and storage medium
CN114329210A (en) Information recommendation method and device and electronic equipment
CN114048315A (en) Method and device for determining document tag, electronic equipment and storage medium
CN114036263A (en) Website identification method and device and electronic equipment
CN113239273A (en) Method, device, equipment and storage medium for generating text
CN114048376A (en) Advertisement service information mining method and device, electronic equipment and storage medium
CN113807391A (en) Task model training method and device, electronic equipment and storage medium
CN113807091A (en) Word mining method and device, electronic equipment and readable storage medium
CN112528644A (en) Entity mounting method, device, equipment and storage medium
CN112784600A (en) Information sorting method and device, electronic equipment and storage medium
CN112989190A (en) Commodity mounting method and device, electronic equipment and storage medium
CN114547448B (en) Data processing method, model training method, device, equipment, storage medium and program
CN113656393B (en) Data processing method, device, electronic equipment and storage medium
CN114861062B (en) Information filtering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination