CN106294535B - Website identification method and device - Google Patents
- Publication number
- CN106294535B
- Authority
- CN
- China
- Prior art keywords
- website
- verified
- page
- information entropy
- history
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
Embodiments of the invention disclose a method and a device for identifying a website. The method includes: within a set time period, obtaining at least two historical update pages associated with a website to be verified; parsing the content of each historical update page to obtain at least one content domain corresponding to each historical update page; calculating the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages; and performing anomaly identification on the website to be verified according to the information entropy calculation result. The information entropy feature used by the technical solution discriminates well, is simple to calculate, and is timely. It can address the technical problems of existing cheating-website identification techniques, namely low discrimination, poor real-time performance, and the need for additional manual labeling or data preparation work, thereby optimizing existing website identification techniques and improving the recognition accuracy for abnormal websites.
Description
Technical field
Embodiments of the present invention relate to computer processing technology, and in particular to a method and a device for identifying a website.
Background art
Information retrieval is the process of finding, within a collection of information resources, the documents a user needs or the information contained in those documents. A search engine is an information retrieval tool for finding information on the Internet. The emergence of search engines has made it convenient for people to obtain information from vast resources. Web page cheating followed soon after. For economic or other gain, cheating websites mislead search engines by various means in order to raise the position of their pages in search engine rankings. Because cheating websites are generally of low quality and usually contain advertisements, especially pornographic and gambling advertisements, they seriously harm the user experience. Identifying cheating websites is therefore an important problem in information retrieval, and improving the identification technique is of great significance for improving search engine effectiveness.
At present, the cheating methods used by cheating websites change frequently, but they can generally be grouped into two major categories: content cheating and link cheating. Content cheating generally improves a page's ranking in search engine results by stuffing the page with popular queries (Query). Link cheating mainly targets page scoring algorithms that compute PageRank (that is, graph algorithms modeled on PageRank) and raises a website's weight by constructing link relationships; link cheating also includes cheating by page redirection. Cheating-website identification has long been a research hotspot in the industry. Many machine learning methods have been applied, including naive Bayes, logistic regression, SVM (Support Vector Machine), ensemble learning, and deep learning, and the features used include content features, link features, and so on. There are also approaches that use external information such as user click behavior.
The main drawbacks of existing cheating-website identification techniques are as follows. Cheating pages whose page structure features are not distinctive and whose text content is not stuffed with cheating words are difficult to identify in time. Graph model algorithms that rely on link relationship features are complex and can hardly meet the demand for real-time identification. Distinguishing newly emerging ordinary websites and relatively niche websites from newly emerging cheating websites is another difficulty. In addition, a major challenge is that cheating websites evolve quickly, so existing identification schemes or models gradually lose effectiveness over time. Reinforcement learning and active learning can partially solve this problem, but they require additional manual labeling or data preparation work.
Summary of the invention
In view of this, embodiments of the present invention provide a method and a device for identifying a website, so as to optimize existing website identification techniques and improve the recognition accuracy for abnormal websites.
In a first aspect, an embodiment of the present invention provides a method for identifying a website, comprising:
within a set time period, obtaining at least two historical update pages associated with a website to be verified;
parsing the content of each historical update page to obtain at least one content domain corresponding to each historical update page;
calculating the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages; and
performing anomaly identification on the website to be verified according to the information entropy calculation result.
In a second aspect, an embodiment of the present invention further provides a device for identifying a website, comprising:
a historical update page acquisition module, configured to obtain, within a set time period, at least two historical update pages associated with a website to be verified;
a content domain acquisition module, configured to parse the content of each historical update page to obtain at least one content domain corresponding to each historical update page;
a content domain information entropy calculation module, configured to calculate the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages; and
an anomaly identification module, configured to perform anomaly identification on the website to be verified according to the information entropy calculation result.
In the embodiments of the present invention, at least two historical update pages associated with a website to be verified are obtained within a set time period; the content of each historical update page is parsed to obtain at least one content domain corresponding to each page; the information entropy of each content domain is calculated according to how the content of the same content domain changes across the pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation result. Because the information entropy feature discriminates well, is simple to calculate, and is timely, this addresses the technical problems of existing cheating-website identification techniques (low discrimination, poor real-time performance, and the need for additional manual labeling or data preparation work), optimizes existing website identification techniques, and improves the recognition accuracy for abnormal websites.
Brief description of the drawings
Fig. 1 is a flowchart of a website identification method provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a website identification method provided by Embodiment 2 of the present invention;
Fig. 3 is a flowchart of a website identification method provided by Embodiment 3 of the present invention;
Fig. 4 is a structural diagram of a website identification device provided by Embodiment 4 of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention and do not limit it.
It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire content. Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as sequential processing, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations can be rearranged. The processing may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. The processing may correspond to a method, function, procedure, subroutine, subprogram, and so on.
For ease of understanding, the inventive concept of the present invention is first briefly introduced:
Through research, the inventors found that, in terms of purpose, a cheating website seeks a higher ranking so that the advertising content embedded in the website receives more visits. The advertisement categories of cheating websites are generally concentrated; most are gambling, pornography, cosmetic medicine, firearms and equipment, and the like. The cheating behavior of cheating websites follows recognizable patterns. To get search engines to index them and to obtain high ranking positions, cheating websites frequently update page content and add currently popular high-frequency queries to their pages; for cost reasons, cheating websites generally duplicate the same page content. To cope with search engines' anti-cheating strategies, the content, style, and URLs of cheating websites also need to be updated frequently.
The above analysis shows that cheating websites are updated frequently and contain advertising information, and that this advertising information is not updated frequently within a given period. That is, cheating websites carry unreasonable redundancy in certain important positions, whereas normal websites, especially high-quality ones, have no need for such redundancy because it does not provide more valuable information.
Shannon, the founder of information theory, introduced the concept of entropy into information theory as a measure of the amount of information. The amount of information is related to the degree of uncertainty: the higher the entropy, the higher the uncertainty, and the larger the amount of additional information required to describe something clearly.
In other words, from the perspective of information theory, if a normal website is updated frequently, it provides a large amount of information and its entropy is large; if it is updated infrequently, it provides little information and its entropy is small. A cheating website is updated often, so its entropy is expected to be large. However, certain content domains or certain objects contain advertising information that is updated slowly, which lowers their entropy; that is, the actual entropy of certain content domains differs from the expected entropy. Calculating the entropy of the different content domains of a cheating website, and the degree of difference between them, can help identify cheating websites effectively.
Based on the above analysis, the inventors creatively propose introducing the concept of information entropy into the identification of abnormal websites: anomaly identification is performed on a website by calculating the information entropy of one or more content domains in that website.
Embodiment 1
Fig. 1 is a flowchart of a website identification method provided by Embodiment 1 of the present invention. The method of this embodiment can be executed by a website identification device, which can be implemented in hardware and/or software and can generally be integrated into a server used to implement the abnormal-website identification function. The method of this embodiment specifically includes:
110. Within a set time period, obtain at least two historical update pages associated with a website to be verified.
In this embodiment, the website to be verified is a website on which anomaly identification needs to be performed. All websites indexed by a search engine could be treated as websites to be verified. However, considering that abnormal websites (typically cheating websites) frequently update page content in order to obtain higher positions in search engine results, it is possible to select only websites that have newly generated pages or updated pages as websites to be verified, which also helps reduce the amount of computation.
As mentioned above, the core of the present invention is to perform anomaly identification on a website to be verified by analyzing the information entropy of each content domain in that website. Information entropy mainly measures the degree of uncertainty of the content appearing in a content domain. It is therefore necessary to obtain at least two historical update pages associated with the website to be verified within a set time period (for example, 1 hour, 1 day, or 1 week) and to determine the information entropy of each content domain in the website by analyzing the content updated in those pages.
The at least two historical update pages associated with the website to be verified may include: at least two historical update pages corresponding to the site domain name of the website to be verified; and/or at least two historical update pages corresponding to the same web page address within the website to be verified.
In a specific example, the site domain name of a website to be verified is www.A.com. All historical update pages corresponding to that domain name within the set time period can be obtained as the historical update pages associated with the website to be verified. Further, considering that a website may contain several different types of subpages (for example, a news website may contain subpages such as "current affairs", "entertainment", and "sports"), for finer-grained analysis it is also possible to obtain all historical update pages corresponding to the same web page address in the website to be verified (for example, www.A.com/B) as the historical update pages associated with the website to be verified.
120. Parse the content of each historical update page to obtain at least one content domain corresponding to each historical update page.
In general, a page contains different types of data content. In this embodiment, these different types of data content are defined as domains, for example: text title, text body, picture title, picture, and picture description text. Through page parsing, that is, by analyzing the page's HTML (HyperText Markup Language) file, a page can be divided into different domains, and the text, pictures, and other content contained in those domains can be extracted.
Considering the computational complexity of the subsequent entropy calculation, in this embodiment the content domains selected for calculating information entropy may include at least one of the following: a text title domain, a picture domain, a picture title domain, and a picture description text domain.
The text title domain refers to the page position where one or more text titles are located; the picture domain refers to the page position where one or more pictures are located; the picture title domain refers to the page position where one or more picture titles corresponding to pictures are located; and the picture description text domain refers to the page position where one or more pieces of precise descriptive text corresponding to pictures are located.
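The page-parsing step can be sketched in Python with the standard-library HTML parser. This is only an illustration under stated assumptions: the patent does not say which HTML elements make up each content domain, so the mapping used here (headings for the text title domain, and the `src`, `title`, and `alt` attributes of `<img>` for the picture, picture title, and picture description text domains) and the names `ContentDomainParser` and `parse_content_domains` are hypothetical.

```python
from html.parser import HTMLParser


class ContentDomainParser(HTMLParser):
    """Collects a few content domains from one HTML page.

    Assumed mapping (not from the patent): h1-h3 headings -> text title
    domain, <img src> -> picture domain, <img title> -> picture title
    domain, <img alt> -> picture description text domain.
    """

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.domains = {
            "text_title": [],
            "picture": [],
            "picture_title": [],
            "picture_description": [],
        }
        self._heading_parts = None  # a list while inside a heading, else None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._heading_parts = []
        elif tag == "img":
            attr = dict(attrs)
            if attr.get("src"):
                self.domains["picture"].append(attr["src"])
            if attr.get("title"):
                self.domains["picture_title"].append(attr["title"])
            if attr.get("alt"):
                self.domains["picture_description"].append(attr["alt"])

    def handle_data(self, data):
        # Only text that appears inside a heading is kept.
        if self._heading_parts is not None:
            self._heading_parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._heading_parts is not None:
            text = "".join(self._heading_parts).strip()
            if text:
                self.domains["text_title"].append(text)
            self._heading_parts = None


def parse_content_domains(html):
    """Return a dict mapping content-domain name -> list of extracted items."""
    parser = ContentDomainParser()
    parser.feed(html)
    parser.close()
    return parser.domains
```

Calling `parse_content_domains` on each historical update page yields one dictionary per page, so the same domain can then be compared across pages.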
130. Calculate the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages.
From the concept of information entropy, the more frequently the content of a content domain changes, the greater the uncertainty of the content in that domain and the larger its information entropy; conversely, the more fixed the content of a content domain, the smaller the uncertainty and the smaller its information entropy.
The information entropy is calculated as:
$$H(x) = -\sum_{i=1}^{n} P(x_i)\,\log_2 P(x_i)$$
where x takes n possible values x_1, …, x_i, …, x_n with corresponding probabilities P(x_1), …, P(x_i), …, P(x_n).
Typically, the information entropy of each content domain can be calculated from the number of times different content appears in that content domain across the historical update pages.
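A minimal Python sketch of this frequency-based calculation follows. It assumes that the occurrence probability of each distinct item is its count divided by the total number of items observed in the domain, and that the logarithm is base 2; the function name `information_entropy` is illustrative and not taken from the patent.

```python
import math
from collections import Counter


def information_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of `items`.

    `items` is every piece of content observed in one content domain
    across all historical update pages (repeats included).
    """
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # P * log2(1/P) is the same as -P * log2(P), with P = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A domain whose content never changes gives an entropy of 0, while a domain in which every observed item is different gives the maximum value, the base-2 logarithm of the number of items.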
140. Perform anomaly identification on the website to be verified according to the information entropy calculation result.
In one preferred implementation of this embodiment, the information entropy calculated for each content domain of the website to be verified can be compared with the information entropy of the corresponding content domains of a trusted website, and anomaly identification is performed on that basis.
In another preferred implementation of this embodiment, the information entropies of different content domains within the website to be verified can be compared with one another, and anomaly identification is performed on that basis.
In yet another preferred implementation of this embodiment, the information entropy calculation result can be used as at least one information entropy feature value, and the information entropy feature value can be combined with other abnormal-website identification feature values to perform anomaly identification on the website to be verified.
In general, the prior art mainly uses a classifier to perform anomaly identification on a website to be verified: one or more abnormal-website identification feature values (typically content features, link features, and so on) are fed into the classifier to identify abnormal websites. In this embodiment, besides identifying abnormal websites directly from the information entropy, it is also possible to build on existing abnormal-website identification techniques by treating the information entropy calculated for each content domain of the website to be verified as one or more information entropy feature values and feeding them into the classifier together with the other abnormal-website identification feature values. Combining the two in this way further improves the recognition accuracy for abnormal websites.
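As a rough illustration of how entropy values might be combined with other features before classification, the sketch below concatenates them into one fixed-order feature vector. The feature names, the fixed ordering, and the default of 0.0 for missing entries are all assumptions of this sketch; the patent does not specify a feature layout or a particular classifier, and the classifier itself is left out.

```python
def build_feature_vector(domain_entropies, other_features, domain_order, other_order):
    """Concatenate per-domain entropy features with other site features.

    domain_entropies: dict of content-domain name -> information entropy.
    other_features:   dict of feature name -> value (e.g. content or link
                      features produced by an existing pipeline).
    domain_order, other_order: lists fixing the position of each feature.
    Missing entries default to 0.0, which is an assumption of this sketch.
    """
    entropy_part = [float(domain_entropies.get(name, 0.0)) for name in domain_order]
    other_part = [float(other_features.get(name, 0.0)) for name in other_order]
    return entropy_part + other_part
```

The resulting list can be passed to whatever classifier an existing identification pipeline already uses.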
In this embodiment of the present invention, at least two historical update pages associated with a website to be verified are obtained within a set time period; the content of each historical update page is parsed to obtain at least one content domain corresponding to each page; the information entropy of each content domain is calculated according to how the content of the same content domain changes across the pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation result. Because the information entropy feature discriminates well, is simple to calculate, and is timely, this addresses the technical problems of existing cheating-website identification techniques (low discrimination, poor real-time performance, and the need for additional manual labeling or data preparation work), optimizes existing website identification techniques, and improves the recognition accuracy for abnormal websites.
Embodiment 2
Fig. 2 is a flowchart of a website identification method provided by Embodiment 2 of the present invention. This embodiment is an optimization based on the above embodiment. In this embodiment, obtaining at least two historical update pages associated with a website to be verified within a set time period is specifically optimized as: within a set time period, crawling newly generated pages and/or updated pages on the network with a web crawler; clustering the crawled pages by site domain name and taking the website corresponding to a cluster as the website to be verified; and obtaining, from the pages included in the cluster, at least two historical update pages associated with the website to be verified.
Meanwhile, calculating the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages is specifically optimized as: extracting at least one comparison object from the same target content domain of each historical update page; calculating the occurrence probability of each comparison object according to the number of times it appears in the target content domain of the historical update pages; and calculating the information entropy corresponding to the target content domain according to the occurrence probabilities of the comparison objects.
Accordingly, the method of this embodiment specifically includes:
210. Within a set time period, crawl newly generated pages and/or updated pages on the network with a web crawler.
In this embodiment, it is considered that abnormal websites, especially cheating websites, are generally websites that are updated relatively frequently. Therefore, the newly generated pages and updated pages crawled by the web crawler can be obtained first, and these pages can then be merged and clustered by website to determine the corresponding websites to be verified.
220. After clustering the crawled pages by site domain name, take the website corresponding to a cluster as the website to be verified.
230. Obtain, from the pages included in the cluster, at least two historical update pages associated with the website to be verified.
If the at least two historical update pages associated with the website to be verified are at least two historical update pages corresponding to the site domain name of the website to be verified, then obtaining them from the pages included in the cluster may specifically include:
taking all pages included in the cluster directly as the historical update pages associated with the website to be verified.
If the at least two historical update pages associated with the website to be verified are at least two historical update pages corresponding to the same web page address within the website to be verified, then obtaining them from the pages included in the cluster may specifically include:
grouping the pages included in the cluster by URL (Uniform Resource Locator) address, where the pages in the same group correspond to the same URL address; and taking the pages included in the same group as the historical update pages associated with the website to be verified.
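Steps 220 and 230 can be sketched in Python as follows. The sketch assumes that the crawler output is a list of `(url, html)` pairs with absolute URLs, and that "clustering by site domain name" simply means grouping by the host part of the URL; both are assumptions, as are the function names and the `min_pages` parameter.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_by_domain(pages):
    """Group crawled pages by site domain name (step 220).

    `pages` is an iterable of (url, html) pairs; the pair layout is an
    assumption of this sketch. URLs must include a scheme such as
    http://, otherwise the host cannot be extracted.
    """
    clusters = defaultdict(list)
    for url, html in pages:
        host = urlsplit(url).hostname or ""  # hostname is returned in lower case
        clusters[host].append((url, html))
    return dict(clusters)


def sites_to_verify(clusters, min_pages=2):
    """Keep only sites with at least two pages, as the method requires."""
    return {host: pages for host, pages in clusters.items() if len(pages) >= min_pages}


def group_by_url(cluster):
    """Within one site's cluster, group page snapshots by identical URL (step 230)."""
    groups = defaultdict(list)
    for url, html in cluster:
        groups[url].append(html)
    return dict(groups)
```

For the domain-level variant, every page of a cluster is used as a historical update page; for the finer-grained variant, each group returned by `group_by_url` is analyzed on its own.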
240. Parse the content of each historical update page to obtain at least one content domain corresponding to each historical update page.
250. Extract at least one comparison object from the same target content domain of each historical update page.
In this embodiment, if the content of the target content domain includes text, the comparison object may include: original text, a semantic signature, or a semantic category; if the content of the target content domain includes pictures, the comparison object may include: an original picture or a picture category.
The original text refers to the text content that appears directly in a content domain. For example, if the text content in the text title domain is "On 2016.6.17, XX company went public in the United States", then that text content is the original text.
A semantic signature is a refinement of the original text: semantic recognition and processing are applied to the original text, the core semantic content is retained and expressed as a combination of several core words, and this combination of core words is called the semantic signature. Continuing the example, the semantic signature corresponding to the original text "On 2016.6.17, XX company went public in the United States" is "XX company, United States, listing".
The semantic category refers to the semantic category of the original text content. Continuing the example, the semantic category corresponding to the original text "On 2016.6.17, XX company went public in the United States" is "finance and economics".
It can be understood that original text, semantic signature, and semantic category represent information types of different granularity. Accordingly, calculating the information entropy of these three information types yields information content measurements of different granularity. In practical applications, those skilled in the art can select the information type of appropriate granularity as the comparison object according to the required abnormal-website identification accuracy.
Similarly, the original picture refers to the picture content that appears directly in a content domain, and the picture category refers to the category to which a picture belongs under a given classification system.
Of course, those skilled in the art will understand that other forms of comparison objects can also be obtained from a content domain. In fact, any page section or page information type that can be clearly defined and identified can be used as the comparison object, and this embodiment does not limit this.
260. Calculate the occurrence probability of each comparison object according to the number of times it appears in the target content domain of the historical update pages.
In a specific example, within one day the website to be verified corresponds to three historical update pages: historical update page 1, historical update page 2, and historical update page 3. The selected target content domain is the text title domain, and the selected comparison object is the original text.
The original text appearing in the text title domain of historical update page 1 includes: text title 1, text title 2, and text title 3; the original text appearing in the text title domain of historical update page 2 includes: text title 1, text title 3, and text title 4; and the original text appearing in the text title domain of historical update page 3 includes: text title 3 and text title 5.
Accordingly, 8 text titles appear in total across the three historical update pages. Text title 1 appears 2 times in total, so its occurrence probability is 2/8; text title 2 appears 1 time, so its occurrence probability is 1/8; text title 3 appears 3 times, so its occurrence probability is 3/8; text title 4 appears 1 time, so its occurrence probability is 1/8; and text title 5 appears 1 time, so its occurrence probability is 1/8.
270. Calculate the information entropy corresponding to the target content domain according to the occurrence probabilities of the comparison objects.
According to the information entropy formula, the information entropy H corresponding to the target content domain is:
$$H = \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8 + \tfrac{3}{8}\log_2\tfrac{8}{3} + \tfrac{1}{8}\log_2 8 + \tfrac{1}{8}\log_2 8$$
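For reference, the sum can be evaluated numerically once each term is written in the form P(x_i) log2(1/P(x_i)), using the probabilities 2/8, 1/8, 3/8, 1/8, and 1/8 from the example; the rounding below is ours.

```latex
\begin{aligned}
H &= \tfrac{1}{4}\cdot 2 \;+\; 3\cdot\tfrac{1}{8}\cdot 3 \;+\; \tfrac{3}{8}\log_2\tfrac{8}{3} \\
  &\approx 0.5 + 1.125 + 0.375 \times 1.415 \\
  &\approx 2.156 \text{ bits}
\end{aligned}
```

For comparison, eight titles that were all different would give the maximum of log2 8 = 3 bits, and eight identical titles would give 0 bits.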
280, according to comentropy calculated result, anomalous identification is carried out to the website to be verified.
It is found after the characteristics of inventor is by analyzing various cheating websites: if in multiple history corresponding with same website
It updates in the page, the main picture of the page largely repeats (comentropy of picture is small), and picture describes text or text header is fresh
See repetition (picture describes text or the comentropy of text header is big), then the website has greater probability for cheating website;In addition,
If the other comentropy of picture category and the comentropy of picture header, there are notable difference, which also has greater probability for cheating
Website.
Accordingly, in a preferred implementation of this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation result may include:
If the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or
If the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or
If the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.
The first threshold, the second threshold and the third threshold may be preset according to the actual situation, which is not limited in this embodiment.
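The three alternative threshold rules can be sketched as follows. This is only an illustration under stated assumptions: the domain names, the threshold values and the function name `is_abnormal_by_thresholds` are hypothetical, and the choice of which domain is the numerator of the ratio is an assumption, since the text leaves it open.

```python
def is_abnormal_by_thresholds(entropy, sum_threshold, domain_threshold,
                              ratio_threshold, target_domains, ratio_pair):
    """entropy maps a content-domain name to its information entropy."""
    # Rule 1: the total entropy over all content domains is too small.
    if sum(entropy.values()) < sum_threshold:
        return True
    # Rule 2: at least one target content domain has too little entropy.
    if any(entropy[d] < domain_threshold for d in target_domains):
        return True
    # Rule 3: the entropy ratio of two target content domains is too small.
    numerator, denominator = ratio_pair
    if entropy[denominator] > 0:
        if entropy[numerator] / entropy[denominator] < ratio_threshold:
            return True
    return False

# Hypothetical entropies: pictures barely change, titles change a lot.
suspect = {"picture": 0.6, "picture_title": 3.5, "text_title": 3.0}
normal = {"picture": 2.5, "picture_title": 2.8, "text_title": 3.0}
args = dict(sum_threshold=2.0, domain_threshold=0.5, ratio_threshold=0.2,
            target_domains=("picture",),
            ratio_pair=("picture", "picture_title"))
suspect_flag = is_abnormal_by_thresholds(suspect, **args)
normal_flag = is_abnormal_by_thresholds(normal, **args)
```

In this example the suspect site passes the first two rules but is flagged by the ratio rule, which matches the observation that a repeated picture combined with varied titles is characteristic of cheating websites.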
In the technical solution of this embodiment, pages newly generated or updated within a certain time period are screened, pages from the same website are aggregated, and the websites to be verified are selected according to the aggregation result. Compared with performing anomaly identification on all websites indexed by a search engine, this greatly reduces the amount of computation without significantly increasing the miss rate. In addition, because anomaly identification is based on the information entropy differences between the content domains within one website, no reference website needs to be introduced; the information entropy differences between the different content domains of the website to be verified are sufficient to identify abnormal websites simply and accurately.
On the basis of the above embodiments, before calculating the occurrence probability of the comparison object according to its number of occurrences in each target content domain, the method may further include:
If the comparison object is determined to be a time-sensitive, simply repeated text, obtaining the body content associated with the comparison object in each historical update page; and if the body content corresponding to the same target comparison object differs between historical update pages, marking the target comparison object as different comparison objects.
The reason for this is that texts which are identical in form but time-sensitive need special handling when the information entropy is calculated. For example, news titles such as "One-Week News Flash" and "Domestic Briefs" correspond to different body content at different times, so the body content must be taken into account when the information entropy is calculated. That is, if the comparison object "One-Week News Flash" appears in both historical update page 1 and historical update page 2, and only the number of occurrences of "One-Week News Flash" is counted, the occurrence probability of the comparison object is 1. However, because "One-Week News Flash" is a time-sensitive text, the body content corresponding to "One-Week News Flash" in historical update page 1 and in historical update page 2 must also be compared. If the two differ, the "One-Week News Flash" in historical update page 1 and the "One-Week News Flash" in historical update page 2 can be identified as different comparison objects, and the occurrence probability of each of them is then 1/2.
This improves the accuracy of the information entropy calculation and, in turn, the accuracy with which abnormal websites are identified.
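The handling of time-sensitive titles described above can be sketched as follows. The sketch assumes that time-sensitive titles are recognized from a fixed list and that body content is compared through a hash digest; both choices, and all names used, are hypothetical, since the text does not specify how such titles are detected or how body content is compared.

```python
import hashlib
from collections import Counter

# Hypothetical list of titles treated as time-sensitive, simply repeated text.
TIME_SENSITIVE_TITLES = {"One-Week News Flash", "Domestic Briefs"}

def comparison_key(title, body):
    """Return the comparison object for a title. A time-sensitive title is
    split into distinct objects according to its associated body content."""
    if title in TIME_SENSITIVE_TITLES:
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
        return f"{title}#{digest}"
    return title

def occurrence_probabilities(pages):
    """pages: list of dicts with 'title' and 'body' for one content domain."""
    keys = [comparison_key(p["title"], p["body"]) for p in pages]
    counts = Counter(keys)
    return {key: count / len(keys) for key, count in counts.items()}

pages = [
    {"title": "One-Week News Flash", "body": "Body text published in week 1"},
    {"title": "One-Week News Flash", "body": "Body text published in week 2"},
]
probabilities = occurrence_probabilities(pages)
```

With the two pages above, the same title yields two distinct comparison objects, each with probability 1/2, instead of one object with probability 1.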
Embodiment three
Fig. 3 is a flow chart of a website identification method provided by Embodiment three of the present invention. This embodiment is an optimization of the above embodiments. In this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation result is specifically optimized as: obtaining, from a reliable website list, a reference website associated with the website to be verified according to the data characteristics of the website to be verified; obtaining the information entropy of at least one content domain corresponding to the reference website; selecting at least one key content domain in the website to be verified and the reference website; calculating a difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website; and, if the difference factor satisfies a set threshold condition, determining that the website to be verified is an abnormal website.
Correspondingly, the method for the present embodiment specifically includes:
310, it in the set time period, obtains and updates the page with associated at least two history in website to be verified.
320, the page is updated to each history and carries out Context resolution, obtained corresponding extremely with each history refresh page face
A few content domain.
330, the content change in the page in identical content domain is updated according to each history, calculates each content domain
Comentropy.
340, it according to the data characteristics of the website to be verified, is obtained and the website to be verified in reliable website list
Associated reference site.
In this embodiment, the data characteristics of the website to be verified may include at least one of the following: the web page update frequency within a set time period, the number of newly added pages within a set time period, and the content topic.
The reliable website list specifically refers to a batch of reliable websites determined by methods such as user behavior log mining or manual curation.
This embodiment takes into account that reliable websites with a similar update frequency, a similar number of newly added pages within a set time period, or a similar content topic also tend to have similar information entropies for the content domains of their web pages. Therefore, by obtaining from the reliable website list a reference website whose data characteristics are close to those of the website to be verified, and comparing the information entropy of each content domain in the reference website and in the website to be verified, abnormal websites can be identified.
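One possible way to select a reference website with similar data characteristics is sketched below. The text does not define a similarity measure, so the relative-gap score, the preference for sites with the same topic, and all names and values here are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class SiteFeatures:
    domain: str
    update_frequency: float  # page updates per day within the set period
    new_page_count: int      # pages added within the set period
    topic: str

def relative_gap(a, b):
    """Scale-free difference between two non-negative quantities."""
    return abs(a - b) / max(abs(a), abs(b), 1e-9)

def pick_reference_site(target, reliable_sites):
    """Pick the reliable site whose data characteristics are closest."""
    same_topic = [s for s in reliable_sites if s.topic == target.topic]
    candidates = same_topic or list(reliable_sites)
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda s: relative_gap(s.update_frequency, target.update_frequency)
        + relative_gap(s.new_page_count, target.new_page_count),
    )

target = SiteFeatures("suspect.example", 10.0, 120, "news")
reliable = [
    SiteFeatures("a.example", 12.0, 100, "news"),
    SiteFeatures("b.example", 50.0, 900, "news"),
    SiteFeatures("c.example", 10.0, 120, "sports"),
]
reference = pick_reference_site(target, reliable)
```

Here `a.example` is chosen: it shares the topic of the site under test and has the smallest combined gap in update frequency and new page count.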
350. Obtain the information entropy of at least one content domain corresponding to the reference website.
360. Select at least one key content domain in the website to be verified and the reference website.
Here, all content domains included in the website to be verified and the reference website may be used as the key content domains, or one or more important content domains that both of them include (for example, the picture domain and the text title domain) may be used as the key content domains, which is not limited in this embodiment.
370. Calculate the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website.
In a preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website may specifically include:
In the website to be verified and the reference website, obtaining the information entropy difference corresponding to the same key content domain as the difference factor.
For example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference website, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.
Then |A-C| and |B-D| can be used as the difference factors, where | | denotes the absolute value.
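The per-domain differences |A-C| and |B-D| can be computed as in the following sketch. The numeric entropies and the domain names are made-up values standing in for A, B, C and D.

```python
def entropy_differences(target_entropy, reference_entropy, key_domains):
    """Absolute entropy difference for each key content domain."""
    return {
        domain: abs(target_entropy[domain] - reference_entropy[domain])
        for domain in key_domains
    }

# Hypothetical values: A = 0.4, B = 3.1 for the website to be verified,
# and C = 2.6, D = 2.9 for the reference website.
target_entropy = {"domain_1": 0.4, "domain_2": 3.1}
reference_entropy = {"domain_1": 2.6, "domain_2": 2.9}
differences = entropy_differences(
    target_entropy, reference_entropy, ["domain_1", "domain_2"]
)
```

With these values the difference for key content domain 1 is about 2.2 and the difference for key content domain 2 is about 0.2, so only the first domain deviates noticeably from the reference.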
In another preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website may specifically include:
In the website to be verified, forming a first information vector from the information entropies corresponding to at least two key content domains;
In the reference website, forming a second information vector from the information entropies corresponding to the at least two key content domains;
Calculating the distance value between the first information vector and the second information vector as the difference factor.
Continuing the previous example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference website, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.
Then the first information vector corresponding to the website to be verified is [A, B], and the second information vector corresponding to the reference website is [C, D].
The distance value between the two vectors can be calculated in various ways, typically from the cosine of the angle between them, and the calculated distance value is used as the difference factor.
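A cosine-based distance between the two information vectors can be sketched as follows. Defining the distance as one minus the cosine similarity is an assumption made here; the text only states that the cosine of the angle is typically used. The vector values are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention assumed here for an all-zero entropy vector.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

first_vector = [0.4, 3.1]   # [A, B], website to be verified (made-up values)
second_vector = [2.6, 2.9]  # [C, D], reference website (made-up values)
distance = cosine_distance(first_vector, second_vector)
```

A cosine-based distance compares only the direction of the two vectors, so two sites whose entropies differ by a common scale factor have a distance of zero. Where the magnitudes matter as well, another distance such as the Euclidean distance could be substituted.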
380. Determine whether the difference factor satisfies the set threshold condition. If so, execute 390; otherwise, execute 3100.
If the difference factor is the information entropy difference, determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition may specifically include:
If a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determining that the website to be verified is an abnormal website; or
If the accumulated difference value obtained by a weighted summation of at least two information entropy differences exceeds a set threshold, determining that the website to be verified is an abnormal website.
If the difference factor is the distance value, determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition may specifically include:
If the distance value exceeds a set threshold, determining that the website to be verified is an abnormal website.
390. Determine that the website to be verified is an abnormal website.
3100. Determine that the website to be verified is a normal website.
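The threshold conditions of step 380 can be sketched as follows. The sketch covers the count rule, the weighted-sum rule and the distance rule; the rule for a single set key content domain is omitted for brevity. All threshold values, weights and function names are hypothetical.

```python
def abnormal_by_differences(differences, per_domain_threshold, min_exceeding,
                            weights=None, weighted_sum_threshold=None):
    """differences maps a key content domain to its entropy difference."""
    # Count rule: enough per-domain differences exceed the threshold.
    exceeding = sum(1 for d in differences.values() if d > per_domain_threshold)
    if exceeding >= min_exceeding:
        return True
    # Weighted-sum rule: the accumulated weighted difference is too large.
    if weights is not None and weighted_sum_threshold is not None:
        accumulated = sum(weights[k] * v for k, v in differences.items())
        return accumulated > weighted_sum_threshold
    return False

def abnormal_by_distance(distance_value, distance_threshold):
    """Distance rule for the vector-based difference factor."""
    return distance_value > distance_threshold

differences = {"picture": 2.2, "text_title": 0.2}
by_count = abnormal_by_differences(differences, 1.0, min_exceeding=1)
by_weight = abnormal_by_differences(
    differences, 1.0, min_exceeding=2,
    weights={"picture": 0.5, "text_title": 0.5}, weighted_sum_threshold=1.0,
)
```

A return value of True corresponds to step 390 (abnormal website) and False to step 3100 (normal website).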
In the technical solution of this embodiment, after the information entropy of each content domain in the website to be verified is calculated, the information entropy of each content domain in a reliable website whose data characteristics are similar to those of the website to be verified is obtained, the difference factor between the two is calculated from their information entropies, and anomaly identification is then performed on the website to be verified. Abnormal websites can thus be identified simply and quickly from the information entropy differences between abnormal websites and reliable websites, with high identification accuracy and good real-time performance.
Embodiment four
Fig. 4 is a structural diagram of a website identification device provided by Embodiment four of the present invention. As shown in Fig. 4, the device includes: a historical update page acquisition module 41, a content domain acquisition module 42, a content domain information entropy calculation module 43 and an anomaly identification module 44, wherein:
The historical update page acquisition module 41 is configured to obtain, within a set time period, at least two historical update pages associated with the website to be verified.
The content domain acquisition module 42 is configured to parse the content of each historical update page to obtain at least one content domain corresponding to each historical update page.
The content domain information entropy calculation module 43 is configured to calculate the information entropy of each content domain according to the content changes in the same content domain across the historical update pages.
The anomaly identification module 44 is configured to perform anomaly identification on the website to be verified according to the information entropy calculation result.
In this embodiment of the present invention, at least two historical update pages associated with the website to be verified are obtained within a set time period; the content of each historical update page is parsed to obtain at least one content domain corresponding to each historical update page; the information entropy of each content domain is calculated according to the content changes in the same content domain across the historical update pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation result. Because information entropy features discriminate well, are simple to calculate and are timely, this solves the technical problems of existing cheating website identification techniques, namely poor discrimination, poor real-time performance and the need for additional manual labeling or data preparation work. It optimizes existing website identification techniques and improves the identification accuracy for abnormal websites.
On the basis of the above embodiments, the at least two historical update pages associated with the website to be verified may include:
At least two historical update pages corresponding to the website domain name of the website to be verified; and/or
At least two historical update pages corresponding to the same web page address in the website to be verified.
On the basis of the above embodiments, the historical update page acquisition module may be specifically configured to:
Within a set time period, crawl pages in the network that are newly generated and/or updated by means of a web crawler;
After clustering the crawled pages by website domain name, take the website corresponding to a cluster as the website to be verified;
Obtain, according to the pages included in the cluster, at least two historical update pages associated with the website to be verified.
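The clustering of crawled pages by website domain name can be sketched as follows. The crawling itself is not shown; the sketch assumes a list of URLs of newly generated or updated pages is already available. Grouping by the exact host name rather than by the registered domain is a simplification, and the URLs are made up.

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_pages_by_domain(urls, min_pages=2):
    """Cluster crawled page URLs by host name and keep only clusters with
    at least min_pages pages, since at least two historical update pages
    are needed per website to be verified."""
    clusters = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname  # lower-cased host, or None
        if host:
            clusters[host].append(url)
    return {host: pages for host, pages in clusters.items()
            if len(pages) >= min_pages}

crawled = [
    "http://a.example/news/1",
    "http://a.example/news/2",
    "http://b.example/index.html",
]
sites_to_verify = group_pages_by_domain(crawled)
```

In this example only `a.example` has enough pages to become a website to be verified; `b.example` is skipped because a single page gives no content change to measure.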
On the basis of the above embodiments, the content domain may include at least one of the following:
Text title domain, picture domain, picture title domain, picture description text domain.
On the basis of the above embodiments, the content domain information entropy calculation module may be specifically configured to:
Extract at least one comparison object from the same target content domain of each historical update page;
Calculate the occurrence probability of the comparison object according to its number of occurrences in the target content domain of each historical update page;
Calculate the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
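The three operations of the calculation module can be sketched end to end as follows. The sketch assumes that each historical update page has already been parsed into one comparison object per content domain; the page data and the domain names are hypothetical.

```python
import math
from collections import Counter

def domain_entropies(history_pages):
    """history_pages: list of dicts mapping a content-domain name to the
    comparison object extracted from that domain of one page."""
    # Collect the comparison objects of each content domain across pages.
    by_domain = {}
    for page in history_pages:
        for domain, obj in page.items():
            by_domain.setdefault(domain, []).append(obj)
    # Turn occurrence counts into probabilities and then into entropy.
    result = {}
    for domain, objects in by_domain.items():
        total = len(objects)
        result[domain] = sum(
            (count / total) * math.log2(total / count)
            for count in Counter(objects).values()
        )
    return result

# Hypothetical site: the same picture on every page, a new title each time.
history_pages = [
    {"picture": "img_A", "picture_title": "title 1"},
    {"picture": "img_A", "picture_title": "title 2"},
    {"picture": "img_A", "picture_title": "title 3"},
    {"picture": "img_A", "picture_title": "title 4"},
]
entropies = domain_entropies(history_pages)
```

For this input the picture domain has an entropy of 0 bits and the picture title domain an entropy of 2 bits, which is the pattern of repeated pictures and varied titles described for cheating websites.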
On the basis of the above embodiments, if the content of the target content domain includes text, the comparison object may include: the original text, a semantic signature or a semantic category;
If the content of the target content domain includes a picture, the comparison object may include: the original picture or a picture category.
On the basis of the above embodiments, the device may further include a body content association comparison module, configured to:
Before the occurrence probability of the comparison object is calculated according to its number of occurrences in each target content domain, if the comparison object is determined to be a time-sensitive, simply repeated text, obtain the body content associated with the comparison object in each historical update page;
If the body content corresponding to the same target comparison object differs between historical update pages, mark the target comparison object as different comparison objects.
On the basis of the above embodiments, the anomaly identification module may specifically include:
A reference website acquisition unit, configured to obtain, from a reliable website list, a reference website associated with the website to be verified according to the data characteristics of the website to be verified;
A reference website information entropy acquisition unit, configured to obtain the information entropy of at least one content domain corresponding to the reference website;
A key content domain selection unit, configured to select at least one key content domain in the website to be verified and the reference website;
A difference factor calculation unit, configured to calculate the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website;
An abnormal website identification subunit, configured to determine that the website to be verified is an abnormal website if the difference factor satisfies a set threshold condition.
On the basis of the above embodiments, the difference factor calculation unit may be specifically configured to:
In the website to be verified and the reference website, obtain the information entropy difference corresponding to the same key content domain as the difference factor;
The abnormal website identification subunit may be specifically configured to:
If a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determine that the website to be verified is an abnormal website; or
If the accumulated difference value obtained by a weighted summation of at least two information entropy differences exceeds a set threshold, determine that the website to be verified is an abnormal website.
On the basis of the above embodiments, the difference factor calculation unit may be specifically configured to:
In the website to be verified, form a first information vector from the information entropies corresponding to at least two key content domains;
In the reference website, form a second information vector from the information entropies corresponding to the at least two key content domains;
Calculate the distance value between the first information vector and the second information vector as the difference factor;
The abnormal website identification subunit may be specifically configured to:
If the distance value exceeds a set threshold, determine that the website to be verified is an abnormal website.
On the basis of the above embodiments, the data characteristics of the website to be verified may include at least one of the following:
The web page update frequency within a set time period, the number of newly added pages within a set time period, and the content topic.
On the basis of the above embodiments, the anomaly identification module may be specifically configured to:
If the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determine that the website to be verified is an abnormal website; or
If the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determine that the website to be verified is an abnormal website; or
If the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determine that the website to be verified is an abnormal website.
On the basis of the above embodiments, the anomaly identification module may be specifically configured to:
Use the information entropy calculation result as at least one information entropy feature value, combine the information entropy feature value with other abnormal website identification feature values, and perform anomaly identification on the website to be verified.
The website identification device provided by the embodiments of the present invention can be used to execute the website identification method provided by any embodiment of the present invention, has the corresponding functional modules, and achieves the same beneficial effects.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by the server described above. Optionally, the embodiments of the present invention can be implemented as a program executable by a computer device, so that the program can be stored in a storage device and executed by a processor. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc or the like. Alternatively, the modules or steps can each be made into an individual integrated circuit module, or multiple modules or steps among them can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (18)
1. A website identification method, characterized by comprising:
Obtaining, within a set time period, at least two historical update pages associated with a website to be verified;
Parsing the content of each historical update page to obtain at least one content domain corresponding to each historical update page;
Calculating the information entropy of each content domain according to the content changes in the same content domain across the historical update pages;
Performing anomaly identification on the website to be verified according to the information entropy calculation result;
Wherein calculating the information entropy of each content domain according to the content changes in the same content domain across the historical update pages comprises:
Extracting at least one comparison object from the same target content domain of each historical update page;
Calculating the occurrence probability of the comparison object according to its number of occurrences in the target content domain of each historical update page;
Calculating the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
2. The method according to claim 1, characterized in that the at least two historical update pages associated with the website to be verified comprise:
At least two historical update pages corresponding to the website domain name of the website to be verified; and/or
At least two historical update pages corresponding to the same web page address in the website to be verified.
3. The method according to claim 1 or 2, characterized in that obtaining, within a set time period, at least two historical update pages associated with the website to be verified comprises:
Within a set time period, crawling pages in the network that are newly generated and/or updated by means of a web crawler;
After clustering the crawled pages by website domain name, taking the website corresponding to a cluster as the website to be verified;
Obtaining, according to the pages included in the cluster, at least two historical update pages associated with the website to be verified.
4. The method according to claim 1, characterized in that the content domain comprises at least one of the following:
Text title domain, picture domain, picture title domain, picture description text domain.
5. The method according to claim 1, characterized in that:
If the content of the target content domain includes text, the comparison object comprises: the original text, a semantic signature or a semantic category;
If the content of the target content domain includes a picture, the comparison object comprises: the original picture or a picture category.
6. The method according to claim 1 or 5, characterized in that, before calculating the occurrence probability of the comparison object according to its number of occurrences in each target content domain, the method further comprises:
If the comparison object is determined to be a time-sensitive, simply repeated text, obtaining the body content associated with the comparison object in each historical update page;
If the body content corresponding to the same target comparison object differs between historical update pages, marking the target comparison object as different comparison objects.
7. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:
Obtaining, from a reliable website list, a reference website associated with the website to be verified according to the data characteristics of the website to be verified;
Obtaining the information entropy of at least one content domain corresponding to the reference website;
Selecting at least one key content domain in the website to be verified and the reference website;
Calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website;
If the difference factor satisfies a set threshold condition, determining that the website to be verified is an abnormal website.
8. The method according to claim 7, characterized in that calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website specifically comprises:
In the website to be verified and the reference website, obtaining the information entropy difference corresponding to the same key content domain as the difference factor;
Determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition specifically comprises:
If a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determining that the website to be verified is an abnormal website; or
If the accumulated difference value obtained by a weighted summation of at least two information entropy differences exceeds a set threshold, determining that the website to be verified is an abnormal website.
9. The method according to claim 7, characterized in that calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website specifically comprises:
In the website to be verified, forming a first information vector from the information entropies corresponding to at least two key content domains;
In the reference website, forming a second information vector from the information entropies corresponding to the at least two key content domains;
Calculating the distance value between the first information vector and the second information vector as the difference factor;
Determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition specifically comprises:
If the distance value exceeds a set threshold, determining that the website to be verified is an abnormal website.
10. The method according to any one of claims 7 to 9, characterized in that the data characteristics of the website to be verified comprise at least one of the following:
The web page update frequency within a set time period, the number of newly added pages within a set time period, and the content topic.
11. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:
If the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or
If the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or
If the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.
12. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:
Using the information entropy calculation result as at least one information entropy feature value, combining the information entropy feature value with other abnormal website identification feature values, and performing anomaly identification on the website to be verified.
13. A website identification device, characterized by comprising:
A historical update page acquisition module, configured to obtain, within a set time period, at least two historical update pages associated with a website to be verified;
A content domain acquisition module, configured to parse the content of each historical update page to obtain at least one content domain corresponding to each historical update page;
A content domain information entropy calculation module, configured to calculate the information entropy of each content domain according to the content changes in the same content domain across the historical update pages;
An anomaly identification module, configured to perform anomaly identification on the website to be verified according to the information entropy calculation result;
Wherein the content domain information entropy calculation module is specifically configured to:
Extract at least one comparison object from the same target content domain of each historical update page;
Calculate the occurrence probability of the comparison object according to its number of occurrences in the target content domain of each historical update page;
Calculate the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
14. device according to claim 13, which is characterized in that the history updates page acquisition module, is specifically used for:
In the set time period, it is grabbed by web crawlers newly generated in network, and/or has the page of update;
After the page of crawl is clustered according to website domain name, will website corresponding with clustering cluster as the net to be verified
It stands;
According to the page for including in the clustering cluster, obtain and the associated at least two history refresh page in website to be verified
Face.
15. The device according to claim 13, further comprising a body content association comparison module, configured to:
Before the occurrence probability of the comparison object is calculated according to its number of occurrences in
each target content domain, if the comparison object is determined to be simple, time-sensitive repeated text,
obtain, in each historical updated page, the body content associated with the comparison object;
If, in different historical updated pages, the body content corresponding to the same target comparison object is
not identical, mark the target comparison object as different comparison objects.
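One way to picture the relabelling in claim 15 is shown below. This is a sketch under stated assumptions: the predicate that decides whether a text is "simple, time-sensitive repeated text" is supplied by the caller and is hypothetical, and body content is compared by exact hash equality, which is only one possible reading of "not identical".

```python
import hashlib


def disambiguate(observations, is_simple_timely_text):
    """Relabel comparison objects so that identical short texts attached to
    different body content count as different objects.

    observations: list of (comparison_text, body_text), one per historical page.
    is_simple_timely_text: caller-supplied predicate (hypothetical).
    """
    labels = []
    for text, body in observations:
        if is_simple_timely_text(text):
            # Suffix the label with a digest of the associated body content.
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
            labels.append(f"{text}#{digest}")
        else:
            labels.append(text)
    return labels


# Hypothetical example: the same headline sits above changing body content.
observed = [
    ("Daily update", "body of day one"),
    ("Daily update", "body of day two"),
    ("Daily update", "body of day one"),
]
labels = disambiguate(observed, lambda text: len(text) < 20)
```

After relabelling, the first and third observations share a label while the second differs, so a legitimately updating site is not mistaken for one that repeats the same content.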
16. The device according to claim 13, wherein the anomaly identification module specifically comprises:
A reference site acquisition unit, configured to obtain, from a list of reliable websites and according to the data
characteristics of the website to be verified, a reference site associated with the website to be verified;
A reference site information entropy acquisition unit, configured to obtain the information entropy of at least one
content domain corresponding to the reference site;
A key content domain selection unit, configured to select at least one key content domain in the website to be
verified and the reference site;
A difference factor calculation unit, configured to calculate the difference factor between the website to be
verified and the reference site according to the information entropy corresponding to the key content domain in each of them;
An abnormal website identification subunit, configured to determine that the website to be verified is an abnormal
website if the difference factor meets a set threshold condition.
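The comparison against a reference site can be sketched as below. The claim does not specify how the difference factor is computed, so this example assumes the mean absolute entropy difference over the key content domains, and it assumes the threshold condition is "greater than"; both choices and all names are illustrative only.

```python
def difference_factor(site_entropy, reference_entropy, key_domains):
    """Mean absolute entropy gap over the key content domains.

    Assumption: the claim leaves the formula open; this is one simple choice.
    """
    if not key_domains:
        raise ValueError("at least one key content domain is required")
    gaps = [abs(site_entropy[d] - reference_entropy[d]) for d in key_domains]
    return sum(gaps) / len(gaps)


def is_abnormal_vs_reference(site_entropy, reference_entropy, key_domains, threshold):
    """Flag the site when the difference factor exceeds the threshold
    (assumed form of the 'set threshold condition')."""
    return difference_factor(site_entropy, reference_entropy, key_domains) > threshold


# Hypothetical entropies for a site to be verified and a reliable reference site.
suspect = {"title": 0.0, "body": 0.5}
reference = {"title": 3.0, "body": 3.5}
factor = difference_factor(suspect, reference, ["title", "body"])
flagged = is_abnormal_vs_reference(suspect, reference, ["title", "body"], threshold=1.0)
```

Comparing against a site of the same kind avoids penalising site types whose content is naturally stable.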
17. The device according to claim 13, wherein the anomaly identification module is specifically configured to:
If the sum of the information entropies of the content domains corresponding to the website to be verified is less
than a set first threshold, determine that the website to be verified is an abnormal website; or
If the information entropy of at least one target content domain corresponding to the website to be verified is
less than a set second threshold, determine that the website to be verified is an abnormal website; or
If the ratio between the information entropies of at least two target content domains corresponding to the website
to be verified is less than a set third threshold, determine that the website to be verified is an abnormal website.
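The three alternative rules of claim 17 can be combined as in the following sketch. The parameter names, the choice of which two domains form the ratio, and the decision to skip the ratio rule when the denominator is zero are all assumptions made for illustration.

```python
def is_abnormal_by_thresholds(entropies, target_domains, ratio_pair,
                              first_threshold, second_threshold, third_threshold):
    """Apply the three alternative threshold rules to per-domain entropies.

    entropies: dict mapping content domain name to information entropy.
    target_domains: domains checked individually against second_threshold.
    ratio_pair: (numerator_domain, denominator_domain) for the ratio rule.
    """
    # Rule 1: total entropy over all content domains is too low.
    if sum(entropies.values()) < first_threshold:
        return True
    # Rule 2: at least one target content domain has too little entropy.
    if any(entropies[d] < second_threshold for d in target_domains):
        return True
    # Rule 3: entropy ratio between two target domains is too low.
    numerator, denominator = ratio_pair
    if entropies[denominator] > 0:             # assumption: skip rule if undefined
        if entropies[numerator] / entropies[denominator] < third_threshold:
            return True
    return False


# Hypothetical per-domain entropies.
normal_site = {"title": 3.0, "body": 4.0, "links": 2.5}
spam_site = {"title": 0.1, "body": 4.0, "links": 2.5}
args = (["title"], ("title", "body"), 2.0, 0.5, 0.2)
```

With these sample values, `is_abnormal_by_thresholds(normal_site, *args)` is false, while `spam_site` is flagged by the second rule because its title field almost never changes.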
18. The device according to claim 13, wherein the anomaly identification module is specifically configured to:
Take the information entropy calculation result as at least one information entropy feature value, combine the
information entropy feature value with other abnormal website identification feature values, and perform anomaly
identification on the website to be verified.
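Claim 18 does not say how the features are combined. As a purely illustrative sketch, the entropy values could be concatenated with other feature values into one vector and passed to any scoring model; the feature name `outlink_ratio`, the weights, and the linear score below are all hypothetical.

```python
def build_feature_vector(entropies, domain_order, other_features, other_order):
    """Concatenate entropy feature values with other identification features.

    Missing content domains default to 0.0 (an assumption).
    """
    entropy_part = [entropies.get(d, 0.0) for d in domain_order]
    other_part = [other_features[name] for name in other_order]
    return entropy_part + other_part


def linear_score(features, weights, bias=0.0):
    """Toy linear model standing in for whatever classifier is actually used."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical inputs: two entropy features plus one unrelated feature.
vector = build_feature_vector(
    {"title": 0.1, "body": 2.0}, ["title", "body"],
    {"outlink_ratio": 0.9}, ["outlink_ratio"],
)
score = linear_score(vector, [-1.0, -1.0, 2.0], bias=1.0)
```

In practice the vector would feed a trained classifier; the fixed weights here only show the data flow.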
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610571258.6A CN106294535B (en) | 2016-07-19 | 2016-07-19 | The recognition methods of website and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610571258.6A CN106294535B (en) | 2016-07-19 | 2016-07-19 | The recognition methods of website and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106294535A (en) | 2017-01-04 |
CN106294535B (en) | 2019-06-25 |
Family
ID=57651792
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610571258.6A Active CN106294535B (en) | 2016-07-19 | 2016-07-19 | The recognition methods of website and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294535B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280110A (en) * | 2017-05-15 | 2018-07-13 | 广州市动景计算机科技有限公司 | Website contrast difference's method, apparatus and client |
CN107451180B (en) * | 2017-06-13 | 2021-02-19 | 百度在线网络技术(北京)有限公司 | Method, device, equipment and computer storage medium for identifying site homologous relation |
CN109150817B (en) * | 2017-11-24 | 2020-11-27 | 新华三信息安全技术有限公司 | Webpage request identification method and device |
CN109800378A (en) * | 2019-01-23 | 2019-05-24 | 北京字节跳动网络技术有限公司 | Content processing method, device and electronic equipment based on custom browser |
CN109818828A (en) * | 2019-02-20 | 2019-05-28 | 成都嗨翻屋科技有限公司 | A kind of distributed reptile system monitoring method and device |
WO2020237480A1 (en) * | 2019-05-27 | 2020-12-03 | 西门子股份公司 | Control method and device based on image recognition |
CN110716778B (en) * | 2019-09-10 | 2023-09-26 | 创新先进技术有限公司 | Application compatibility testing method, device and system |
CN111460763A (en) * | 2020-03-02 | 2020-07-28 | 南京南瑞继保电气有限公司 | Method, device and equipment for marking file differences and computer-readable storage medium |
CN113554131B (en) * | 2021-09-22 | 2021-12-03 | 四川大学华西医院 | Medical image processing and analyzing method, computer device, system and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100565523C (en) * | 2007-04-05 | 2009-12-02 | 中国科学院自动化研究所 | A kind of filtering sensitive web page method and system based on multiple Classifiers Combination |
CN101751458A (en) * | 2009-12-31 | 2010-06-23 | 暨南大学 | Network public sentiment monitoring system and method |
CN105205061B (en) * | 2014-06-12 | 2018-08-10 | 中国银联股份有限公司 | A kind of page info acquisition methods of electric business website |
Also Published As
Publication number | Publication date |
---|---|
CN106294535A (en) | 2017-01-04 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106294535B (en) | The recognition methods of website and device | |
Zhang et al. | Ad hoc table retrieval using semantic similarity | |
US11023513B2 (en) | Method and apparatus for searching using an active ontology | |
US9449271B2 (en) | Classifying resources using a deep network | |
US7680858B2 (en) | Techniques for clustering structurally similar web pages | |
US7676465B2 (en) | Techniques for clustering structurally similar web pages based on page features | |
CN103870973B (en) | Information push, searching method and the device of keyword extraction based on electronic information | |
US6735578B2 (en) | Indexing of knowledge base in multilayer self-organizing maps with hessian and perturbation induced fast learning | |
CN100565523C (en) | A kind of filtering sensitive web page method and system based on multiple Classifiers Combination | |
CN108965245A (en) | Detection method for phishing site and system based on the more disaggregated models of adaptive isomery | |
CN107220386A (en) | Information-pushing method and device | |
US20090125529A1 (en) | Extracting information based on document structure and characteristics of attributes | |
US20100211533A1 (en) | Extracting structured data from web forums | |
CN108763321A (en) | A kind of related entities recommendation method based on extensive related entities network | |
CN105843796A (en) | Microblog emotional tendency analysis method and device | |
CN113722478B (en) | Multi-dimensional feature fusion similar event calculation method and system and electronic equipment | |
CN104123321B (en) | A kind of determining method and device for recommending picture | |
CN114328913A (en) | Text classification method and device, computer equipment and storage medium | |
CN106776910A (en) | The display methods and device of a kind of Search Results | |
US20140236968A1 (en) | Discrete Wavelet Transform Method for Document Structure Similarity | |
CN113010771A (en) | Training method and device for personalized semantic vector model in search engine | |
CN116977701A (en) | Video classification model training method, video classification method and device | |
US9195940B2 (en) | Jabba-type override for correcting or improving output of a model | |
CN113435213B (en) | Method and device for returning answers to user questions and knowledge base | |
CN113806536B (en) | Text classification method and device, equipment, medium and product thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||