CN106294535B - Method and device for identifying a website - Google Patents


Publication number
CN106294535B
Authority
CN
China
Prior art keywords
website
verified
page
information entropy
history
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610571258.6A
Other languages
Chinese (zh)
Other versions
CN106294535A (en)
Inventor
邹红建
方高林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201610571258.6A
Publication of CN106294535A
Application granted
Publication of CN106294535B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Abstract

Embodiments of the invention disclose a method and a device for identifying a website. The method includes: obtaining, within a set time period, at least two historical updated pages associated with a website to be verified; parsing the content of each historical updated page to obtain at least one content domain corresponding to each page; calculating the information entropy of each content domain according to the content changes in the same content domain across the historical updated pages; and performing anomaly identification on the website to be verified according to the information entropy calculation results. The information-entropy features used are discriminative, simple to compute, and timely. They address the shortcomings of existing cheating-website identification techniques, namely low discrimination, poor real-time performance, and the need for additional manual labeling or data preparation, and thereby improve existing website identification techniques and the accuracy with which abnormal websites are identified.

Description

Method and device for identifying a website
Technical field
Embodiments of the present invention relate to computer processing technology, and in particular to a method and a device for identifying a website.
Background
Information retrieval is the process of finding required documents, or the required content within documents, in a collection of information resources. A search engine is an information retrieval tool for Internet information. Search engines have made it convenient to obtain information from massive resources, but their arrival also brought the problem of web page cheating. For economic or other gain, cheating websites mislead search engines by various means in order to raise the position of their pages in search results. Cheating websites are generally of low quality and usually contain advertisements, in particular pornographic and gambling advertisements, which seriously harms the user experience. Identifying cheating websites is therefore an important problem in information retrieval, and improving cheating-website identification technology matters greatly for search engine quality.
At present, the cheating methods used by cheating websites change frequently, but they fall broadly into two classes: content cheating and link cheating. Content cheating generally raises a page's rank in search results by stuffing the page with popular queries (Query). Link cheating mainly targets page-scoring algorithms (graph algorithms modeled on PageRank) and raises a website's weight by constructing link relationships; it also includes cheating by page redirection. Cheating-website identification has long been a research focus in industry. Machine learning methods including naive Bayes, logistic regression, SVM (Support Vector Machine), ensemble learning, and deep learning have all been applied, using features such as content features and link features. Some approaches also use external information such as user click behavior.
The main drawbacks of existing cheating-website identification techniques are as follows. Cheating pages whose structural features are not distinctive and whose text is not stuffed with cheating words are hard to identify in time. Graph-model algorithms that rely on link-relationship features are complex and struggle to meet real-time identification needs. Distinguishing emerging ordinary websites and relatively niche websites from emerging cheating websites is also difficult. In addition, a major challenge for cheating-website identification is that cheating websites update quickly, so existing identification schemes or models gradually lose effectiveness over time. Reinforcement learning and active learning can partly solve this problem, but they require additional manual labeling or data preparation.
Summary of the invention
In view of this, embodiments of the invention provide a method and a device for identifying a website, so as to improve existing website identification techniques and raise the accuracy with which abnormal websites are identified.
In a first aspect, an embodiment of the invention provides a method for identifying a website, comprising:
obtaining, within a set time period, at least two historical updated pages associated with a website to be verified;
parsing the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page;
calculating the information entropy of each content domain according to the content changes in the same content domain across the historical updated pages; and
performing anomaly identification on the website to be verified according to the information entropy calculation results.
In a second aspect, an embodiment of the invention also provides a device for identifying a website, comprising:
a historical updated page acquisition module, configured to obtain, within a set time period, at least two historical updated pages associated with a website to be verified;
a content domain acquisition module, configured to parse the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page;
a content domain information entropy calculation module, configured to calculate the information entropy of each content domain according to the content changes in the same content domain across the historical updated pages; and
an anomaly identification module, configured to perform anomaly identification on the website to be verified according to the information entropy calculation results.
In embodiments of the invention, at least two historical updated pages associated with a website to be verified are obtained within a set time period; each page is parsed to obtain at least one content domain; the information entropy of each content domain is calculated according to the content changes in the same content domain across the pages; and anomaly identification is performed on the website according to the information entropy calculation results. Because information-entropy features are discriminative, simple to compute, and timely, this addresses the low discrimination, poor real-time performance, and need for additional manual labeling or data preparation of existing cheating-website identification techniques, improving existing website identification techniques and the accuracy with which abnormal websites are identified.
Brief description of the drawings
Fig. 1 is a flow chart of a website identification method provided by Embodiment One of the present invention;
Fig. 2 is a flow chart of a website identification method provided by Embodiment Two of the present invention;
Fig. 3 is a flow chart of a website identification method provided by Embodiment Three of the present invention;
Fig. 4 is a structural diagram of a website identification device provided by Embodiment Four of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the full content. Before the exemplary embodiments are discussed in more detail, it should be mentioned that some of them are described as processes or methods depicted as flow charts. Although a flow chart describes the operations (or steps) as sequential, many of them can be performed in parallel, concurrently, or simultaneously, and their order can be rearranged. A process may terminate when its operations are complete, but it may also have additional steps not shown in the drawings. A process may correspond to a method, function, procedure, subroutine, subprogram, and so on.
For ease of understanding, the inventive concept is briefly introduced first.
The inventors found through research that, in terms of purpose, cheating websites seek higher rankings so that the advertising content embedded in them receives more visits. The advertising categories of cheating websites are generally concentrated; most are gambling, pornography, cosmetic medicine, firearms and equipment, and the like. The cheating behavior of cheating websites follows discernible patterns. To get indexed by search engines and obtain high ranking positions, cheating websites update page content often and add currently popular high-frequency queries to their pages. For cost reasons, cheating websites generally duplicate the same page content. To cope with search engines' anti-cheating strategies, the content, style, and URLs of cheating websites also need frequent updating.
The analysis above shows that cheating websites update frequently and contain advertising information, and that this advertising information is not updated frequently within a given period. In other words, cheating websites carry unreasonable redundancy in certain important positions, whereas normal websites, especially high-quality ones, have no need for such redundancy because it adds no valuable information.
Shannon, the founder of information theory, introduced the concept of entropy into information theory as a measure of the amount of information. The amount of information is related to the degree of uncertainty: the higher the entropy, the higher the uncertainty, and the more additional information is needed to describe the matter clearly.
That is, from the viewpoint of information theory, if a normal website updates frequently, it provides a large amount of information and its entropy is large; if it updates infrequently, it provides little information and its entropy is small. A cheating website updates often, so its entropy is expected to be large, yet certain content domains or objects contain advertising information that is updated slowly, which lowers their entropy. The actual entropy of those content domains therefore differs from the expected entropy. Calculating the entropies of the different content domains of a website, and the degree to which they differ, can help identify cheating websites effectively.
Based on this analysis, the inventors propose introducing the concept of information entropy into the identification of abnormal websites: anomaly identification is performed on a website by calculating the information entropy of one or more of its content domains.
Embodiment One
Fig. 1 is a flow chart of a website identification method provided by Embodiment One of the present invention. The method of this embodiment can be executed by a website identification device, which can be implemented in hardware and/or software and can generally be integrated into a server that implements abnormal-website identification. The method of this embodiment specifically includes:
110. Within a set time period, obtain at least two historical updated pages associated with a website to be verified.
In this embodiment, the website to be verified is a website on which anomaly identification needs to be performed. All websites indexed by a search engine could be treated as websites to be verified. However, abnormal websites (typically cheating websites) update their page content often in order to obtain higher positions in search results, so websites with newly generated or updated pages can be selected as websites to be verified, which also reduces the amount of computation.
As noted above, the core of the invention is to perform anomaly identification on a website to be verified by analyzing the information entropy of each of its content domains, and information entropy mainly measures the uncertainty of the content appearing in a content domain. It is therefore necessary to obtain at least two historical updated pages associated with the website within a set time period (for example, 1 hour, 1 day, or 1 week) and to determine the information entropy of each content domain of the website by analyzing the content updated in those pages.
The at least two historical updated pages associated with the website to be verified may include: at least two historical updated pages corresponding to the domain name of the website to be verified; and/or at least two historical updated pages corresponding to the same web page address within the website to be verified.
In a specific example, the domain name of a website to be verified is www.A.com, and all historical updated pages corresponding to that domain name within the set time period can be taken as the historical updated pages associated with the website. Furthermore, a website may contain several different types of subpage (for example, a news website may contain subpages such as "current affairs", "entertainment", and "sports"). For finer-grained analysis, all historical updated pages corresponding to the same web page address within the website (for example, www.A.com/B) can instead be taken as the historical updated pages associated with the website.
120. Parse the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page.
In general, a page contains different types of data content; in this embodiment, each such type of data content is defined as a domain, for example: text title, text body, picture title, picture, and precise picture description text. Through page parsing, that is, by analyzing the HTML (HyperText Markup Language) file of the page, a page can be divided into different domains, and the text, pictures, and other content in those domains can be extracted.
Considering the computational complexity of the subsequent entropy calculation, the content domains selected for entropy calculation in this embodiment may include at least one of the following: text title domain, picture domain, picture title domain, and picture description text domain.
Here, the text title domain is the page location of one or more text titles; the picture domain is the page location of one or more pictures; the picture title domain is the page location of one or more picture titles corresponding to pictures; and the picture description text domain is the page location of one or more precise description texts corresponding to pictures.
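As a rough illustration of the page parsing described above, the following sketch uses Python's standard `html.parser` module to split one page into four content domains. The mapping from HTML elements to domains (headings `h1` to `h3` as text titles; the `src`, `title`, and `alt` attributes of `img` as picture, picture title, and picture description text) and names such as `ContentDomainParser` are assumptions made for this example; the source does not specify how domains are located in the HTML.

```python
from html.parser import HTMLParser


class ContentDomainParser(HTMLParser):
    """Collects a few illustrative content domains from one HTML page."""

    HEADING_TAGS = {"h1", "h2", "h3"}  # assumed stand-in for "text titles"

    def __init__(self):
        super().__init__()
        self.domains = {
            "text_title": [],
            "picture": [],
            "picture_title": [],
            "picture_description": [],
        }
        self._in_heading = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.HEADING_TAGS:
            self._in_heading = True
            self._buffer = []
        elif tag == "img":
            # Assumed mapping: src -> picture, title -> picture title,
            # alt -> picture description text.
            if attrs.get("src"):
                self.domains["picture"].append(attrs["src"])
            if attrs.get("title"):
                self.domains["picture_title"].append(attrs["title"])
            if attrs.get("alt"):
                self.domains["picture_description"].append(attrs["alt"])

    def handle_data(self, data):
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._in_heading:
            text = "".join(self._buffer).strip()
            if text:
                self.domains["text_title"].append(text)
            self._in_heading = False


def parse_content_domains(html):
    """Returns a dict mapping domain name to the list of items found."""
    parser = ContentDomainParser()
    parser.feed(html)
    parser.close()
    return parser.domains
```

A production parser would likely need site-specific rules for locating titles and pictures; the fixed tag mapping here is only meant to show the shape of the output that the later entropy step consumes.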
130. Calculate the information entropy of each content domain according to the content changes in the same content domain across the historical updated pages.
From the concept of information entropy it follows that the more frequently the content of a content domain changes, the greater the uncertainty of that content and the larger the information entropy of the domain; conversely, the more fixed the content of a content domain, the smaller the uncertainty and the smaller the information entropy.
The information entropy is calculated as:
$$H(x) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$
where x takes n possible values x_1, …, x_i, …, x_n, with corresponding probabilities P(x_1), …, P(x_i), …, P(x_n).
Typically, the information entropy of each content domain can be calculated from the number of times each distinct piece of content in the domain occurs across the historical updated pages.
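The frequency-based calculation can be sketched as follows. This is a minimal example, assuming that each observed item in a content domain (for instance a title string) is hashable and that occurrence probabilities are estimated as relative frequencies; the function name `information_entropy` is hypothetical.

```python
import math
from collections import Counter


def information_entropy(observations):
    """Shannon entropy (base 2) of the items observed in one content domain.

    `observations` is the flat list of items seen in that domain across all
    historical updated pages; probabilities are relative frequencies.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: an empty domain is treated as zero entropy
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A domain whose content never changes yields an entropy of zero, while a domain in which every observation is different yields the maximum, the base-2 logarithm of the number of observations.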
140. Perform anomaly identification on the website to be verified according to the information entropy calculation results.
In one preferred implementation of this embodiment, the information entropy calculation results for each content domain of the website to be verified can be compared with the information entropy of each content domain of a reliable website, in order to perform anomaly identification on the website to be verified;
In another preferred implementation of this embodiment, the information entropies of different content domains within the website to be verified can be compared with one another, in order to perform anomaly identification on the website to be verified;
In yet another preferred implementation of this embodiment, the information entropy calculation results can be used as at least one information-entropy feature value, which is combined with other abnormal-website identification feature values to perform anomaly identification on the website to be verified.
In general, the prior art mainly uses a classifier to perform anomaly identification on a website to be verified, feeding one or more abnormal-website identification feature values (typically content features, link features, and so on) into the classifier. In this embodiment, besides identifying abnormal websites directly from information entropy, the information entropy calculation results of the content domains of the website to be verified can be used, on top of existing abnormal-website identification techniques, as one or more information-entropy feature values that are input into the classifier together with other abnormal-website identification feature values. Combining the approach with existing techniques in this way further improves the accuracy with which abnormal websites are identified.
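One way to picture the third option is to place the per-domain entropies in front of whatever features an existing classifier already uses. The sketch below only builds that combined feature vector; the classifier itself, the domain names, and the default of 0.0 for a missing domain are assumptions of this example, not details given in the source.

```python
def build_feature_vector(entropy_by_domain, other_features, domain_order):
    """Concatenates entropy features (in a fixed domain order) with other features.

    A domain missing from `entropy_by_domain` is encoded as 0.0, which is an
    assumption of this sketch; a real system might use a separate indicator.
    """
    entropy_features = [entropy_by_domain.get(name, 0.0) for name in domain_order]
    return entropy_features + list(other_features)
```

Keeping `domain_order` fixed matters because most classifiers expect each position of the feature vector to carry the same meaning for every website.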
In this embodiment of the invention, at least two historical updated pages associated with a website to be verified are obtained within a set time period; each page is parsed to obtain at least one content domain; the information entropy of each content domain is calculated according to the content changes in the same content domain across the pages; and anomaly identification is performed on the website according to the information entropy calculation results. Because information-entropy features are discriminative, simple to compute, and timely, this addresses the low discrimination, poor real-time performance, and need for additional manual labeling or data preparation of existing cheating-website identification techniques, improving existing website identification techniques and the accuracy with which abnormal websites are identified.
Embodiment Two
Fig. 2 is a flow chart of a website identification method provided by Embodiment Two of the present invention. This embodiment refines the embodiment above. In this embodiment, obtaining at least two historical updated pages associated with a website to be verified within a set time period is refined as: crawling, within the set time period, pages that are newly generated on the network and/or have been updated, using a web crawler; clustering the crawled pages by website domain name and taking the website corresponding to a cluster as the website to be verified; and obtaining, from the pages in the cluster, at least two historical updated pages associated with the website to be verified;
Meanwhile, calculating the information entropy of each content domain according to the content changes in the same content domain across the historical updated pages is refined as: extracting at least one comparison object from the same target content domain of each historical updated page; calculating the occurrence probability of each comparison object according to the number of times it occurs in the target content domain across the historical updated pages; and calculating the information entropy corresponding to the target content domain from the occurrence probabilities of the comparison objects.
Accordingly, the method of this embodiment specifically includes:
210. Within a set time period, use a web crawler to crawl pages that are newly generated on the network and/or have been updated.
In this embodiment, it is considered that abnormal websites, especially cheating websites, generally update relatively frequently. Therefore, newly generated and updated pages can first be crawled by a web crawler and then merged and clustered by website, which determines the corresponding websites to be verified.
220. After clustering the crawled pages by website domain name, take the website corresponding to a cluster as the website to be verified.
230. From the pages in the cluster, obtain at least two historical updated pages associated with the website to be verified.
If the at least two historical updated pages associated with the website to be verified are at least two historical updated pages corresponding to the domain name of the website to be verified, then obtaining them from the pages in the cluster may specifically include:
taking all pages in the cluster directly as the historical updated pages associated with the website to be verified;
If the at least two historical updated pages associated with the website to be verified are at least two historical updated pages corresponding to the same web page address within the website to be verified, then obtaining them from the pages in the cluster may specifically include:
grouping the pages in the cluster by URL (Uniform Resource Locator), where the pages in one group share the same URL; and taking the pages in one group as the historical updated pages associated with the website to be verified.
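Steps 220 and 230 can be sketched with the standard `urllib.parse` module. The example assumes that crawled pages arrive as `(url, html)` pairs and that "clustering by website domain name" means grouping by the host part of the URL; both the data layout and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_by_domain(pages):
    """Groups crawled (url, html) snapshots by host name (step 220)."""
    clusters = defaultdict(list)
    for url, html in pages:
        host = urlsplit(url).netloc.lower()  # host names are case-insensitive
        clusters[host].append((url, html))
    return dict(clusters)


def group_by_url(cluster):
    """Groups one cluster's snapshots by exact URL (second variant of step 230).

    Only URLs with at least two snapshots are kept, since the method needs
    at least two historical updated pages.
    """
    groups = defaultdict(list)
    for url, html in cluster:
        groups[url].append(html)
    return {url: snaps for url, snaps in groups.items() if len(snaps) >= 2}
```

For the first variant of step 230, the whole list stored under one host in the result of `cluster_by_domain` would be used directly.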
240. Parse the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page.
250. Extract at least one comparison object from the same target content domain of each historical updated page.
In this embodiment, if the content of the target content domain includes text, the comparison object may include: original text, semantic signature, or semantic category; if the content of the target content domain includes pictures, the comparison object may include: original picture or picture category.
The original text is the text content that appears directly in a content domain. For example, if the text in the text title domain is "On 2016.6.17, XX company went public in the United States", then that text is the original text;
The semantic signature is a refinement of the original text: semantic recognition and processing are applied to the original text, its core semantic content is retained and expressed as a combination of several core words, and this combination of core words is called the semantic signature. Continuing the example, for the original text "On 2016.6.17, XX company went public in the United States", the corresponding semantic signature is "XX company, United States, listing";
The semantic category is the semantic category of the original text content. Continuing the example, for the original text "On 2016.6.17, XX company went public in the United States", the corresponding semantic category is "finance".
It will be understood that original text, semantic signature, and semantic category represent information types of different granularity; accordingly, calculating the information entropy of these three information types yields measures of information content at different granularities. In practice, those skilled in the art can choose an information type of suitable granularity as the comparison object according to the required accuracy of abnormal-website identification.
Similarly, the original picture is the picture content that appears directly in a content domain, and the picture category is the category of the picture under a given classification system.
Of course, those skilled in the art will understand that other forms of comparison object can also be obtained from a content domain. In fact, any page section or type of page information that can be clearly defined and identified can serve as the comparison object; this embodiment places no limitation on this.
260. Calculate the occurrence probability of each comparison object according to the number of times it occurs in the target content domain across the historical updated pages.
In a specific example, within one day a website to be verified has three historical updated pages: historical updated page 1, historical updated page 2, and historical updated page 3. The selected target content domain is the text title domain, and the selected comparison object is the original text.
The original texts appearing in the text title domain of historical updated page 1 are text title 1, text title 2, and text title 3; those appearing in the text title domain of historical updated page 2 are text title 1, text title 3, and text title 4; and those appearing in the text title domain of historical updated page 3 are text title 3 and text title 5.
Accordingly, the three historical updated pages contain 8 text titles in total. Text title 1 appears 2 times in total, so its occurrence probability is 2/8; text title 2 appears once, so its occurrence probability is 1/8; text title 3 appears 3 times, so its occurrence probability is 3/8; text title 4 appears once, so its occurrence probability is 1/8; and text title 5 appears once, so its occurrence probability is 1/8.
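The counts in this example can be reproduced with a few lines of Python. The snapshot lists below simply restate the three historical updated pages from the example; the function name `occurrence_probabilities` and the use of `Fraction` to keep the probabilities exact are choices made for this sketch.

```python
from collections import Counter
from fractions import Fraction


def occurrence_probabilities(snapshots):
    """Maps each comparison object to its relative frequency across snapshots."""
    counts = Counter(item for snapshot in snapshots for item in snapshot)
    total = sum(counts.values())
    return {item: Fraction(count, total) for item, count in counts.items()}


# Text title domain of historical updated pages 1, 2, and 3 from the example.
example_snapshots = [
    ["text title 1", "text title 2", "text title 3"],
    ["text title 1", "text title 3", "text title 4"],
    ["text title 3", "text title 5"],
]
example_probabilities = occurrence_probabilities(example_snapshots)
```

`Fraction` reduces 2/8 to 1/4, which is the same probability written in lowest terms.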
270. Calculate the information entropy corresponding to the target content domain from the occurrence probabilities of the comparison objects.
According to the information entropy formula, the information entropy H corresponding to the target content domain is:
$$H = \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8 + \tfrac{3}{8}\log_2 \tfrac{8}{3} + \tfrac{1}{8}\log_2 8 + \tfrac{1}{8}\log_2 8 \approx 2.16$$
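The worked example can be checked numerically using the occurrence probabilities 2/8, 1/8, 3/8, 1/8, and 1/8 from step 260. The helper name `entropy_from_probabilities` is hypothetical; it evaluates the base-2 Shannon entropy, writing each term as P·log2(1/P).

```python
import math


def entropy_from_probabilities(probabilities):
    """Base-2 Shannon entropy of a probability distribution."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


worked_example = [2 / 8, 1 / 8, 3 / 8, 1 / 8, 1 / 8]
print(round(entropy_from_probabilities(worked_example), 4))  # about 2.1556
```

Each term is positive because log2(1/P) is non-negative for any probability P between 0 and 1, so the entropy of the text title domain in this example is roughly 2.16 bits.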
280. Perform anomaly identification on the website to be verified according to the information entropy calculation result.
By analyzing the characteristics of various cheating websites, the inventors found that if, across multiple historical update pages corresponding to the same website, the main pictures of the pages are largely repeated (the information entropy of the pictures is small) while the picture description text or text titles are rarely repeated (the information entropy of the picture description text or text titles is large), the website is more likely to be a cheating website. In addition, if there is a marked difference between the information entropy of the picture categories and the information entropy of the picture titles, the website is also more likely to be a cheating website.
Accordingly, in a preferred implementation of this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation result may include:
if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or
if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or
if the ratio of the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.
The first threshold, the second threshold, and the third threshold may be preset according to actual conditions, and this embodiment does not limit them.
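The three alternative rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `is_abnormal`, the domain names `picture` and `text_title`, and all threshold values are hypothetical, and leaving a threshold as `None` simply disables that rule, since the text presents the rules as alternatives rather than as a fixed combination.

```python
def is_abnormal(entropies, sum_threshold=None, field_threshold=None,
                ratio_threshold=None, ratio_pair=None):
    """entropies maps a content-domain name to its information entropy."""
    # Rule 1: the sum over all content domains is below the first threshold.
    if sum_threshold is not None and sum(entropies.values()) < sum_threshold:
        return True
    # Rule 2: at least one content domain is below the second threshold.
    if field_threshold is not None and any(h < field_threshold for h in entropies.values()):
        return True
    # Rule 3: the ratio between two chosen domains is below the third threshold.
    if ratio_threshold is not None and ratio_pair is not None:
        numerator, denominator = entropies[ratio_pair[0]], entropies[ratio_pair[1]]
        if denominator > 0 and numerator / denominator < ratio_threshold:
            return True
    return False


# Hypothetical site: pictures repeat heavily, titles almost never repeat.
site = {"picture": 0.2, "text_title": 2.5}
flag_ratio = is_abnormal(site, ratio_threshold=0.5, ratio_pair=("picture", "text_title"))
flag_sum = is_abnormal(site, sum_threshold=1.0)
```

With these made-up numbers the ratio rule fires (0.2 / 2.5 is below 0.5) while the sum rule does not, which matches the pattern described for cheating websites: small picture entropy alongside large title entropy.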
In the technical solution of this embodiment, pages newly generated or updated within a certain period are screened, pages from the same website are aggregated, and the websites to be verified are selected for anomaly identification according to the aggregation result. Compared with performing anomaly identification on all websites indexed by a search engine, this greatly reduces the amount of computation without significantly increasing the miss rate. In addition, because anomaly identification is performed according to the differences among the information entropies of the content domains within one website, no reference website needs to be introduced; abnormal websites can be identified simply and accurately based only on the information entropy differences among the different content domains of the website to be verified.
On the basis of the above embodiments, before calculating the occurrence probability of the comparison object according to the number of occurrences of the comparison object in each target content domain, the method may further include:
if the comparison object is determined to be time-sensitive simple repeated text, obtaining, in each historical update page, the body content associated with the comparison object; and if, in different historical update pages, the body content corresponding to the same target comparison object is different, marking the target comparison object as different comparison objects.
The reason for this arrangement is that, when calculating information entropy, time-sensitive text of identical form requires special handling. For example, news titles such as "Weekly News Flash" and "Domestic Briefs" correspond to different body content at different times, so the body content must be taken into account when calculating information entropy. That is, suppose the comparison object "Weekly News Flash" appears in both historical update page 1 and historical update page 2. If only the number of occurrences of "Weekly News Flash" is counted, the occurrence probability of this comparison object is 1. However, since "Weekly News Flash" is time-sensitive text, the body content corresponding to "Weekly News Flash" in historical update page 1 and in historical update page 2 must also be compared; if the two differ, the "Weekly News Flash" in historical update page 1 and the "Weekly News Flash" in historical update page 2 can be identified as different comparison objects, and the occurrence probability of each is then determined to be 1/2.
With the above arrangement, the accuracy of the information entropy calculation can be improved, which in turn improves the accuracy of abnormal website identification.
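The "Weekly News Flash" case can be sketched as below. This is only one possible reading: the text does not say how time-sensitive repeated text is detected, so the sketch assumes the caller supplies a set `recurring_titles`, and the names `comparison_keys` and `occurrence_probabilities` are hypothetical.

```python
from collections import Counter


def comparison_keys(occurrences, recurring_titles):
    """occurrences: (title, body) pairs, one per title occurrence in a historical update page."""
    keys = []
    for title, body in occurrences:
        if title in recurring_titles:
            # Time-sensitive recurring title: the body decides whether two
            # occurrences count as the same comparison object.
            keys.append((title, body))
        else:
            keys.append((title, None))
    return keys


def occurrence_probabilities(keys):
    counts = Counter(keys)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


occurrences = [
    ("Weekly News Flash", "body text from week 1"),
    ("Weekly News Flash", "body text from week 2"),
]
# Counting titles only: a single comparison object with probability 1.
naive = occurrence_probabilities(comparison_keys(occurrences, set()))
# Taking the body into account: two comparison objects with probability 1/2 each.
aware = occurrence_probabilities(comparison_keys(occurrences, {"Weekly News Flash"}))
```

In practice the body would more likely be compared through a hash or semantic signature than as raw text; that choice is left open here.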
Embodiment Three
Fig. 3 is a flow chart of a website identification method provided by Embodiment Three of the present invention. This embodiment is optimized on the basis of the above embodiments. In this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation result is specifically optimized as: obtaining, from a reliable website list and according to the data features of the website to be verified, a reference website associated with the website to be verified; obtaining the information entropy of at least one content domain corresponding to the reference website; selecting at least one key content domain in the website to be verified and the reference website; calculating a difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website; and, if the difference factor satisfies a set threshold condition, determining that the website to be verified is an abnormal website.
Correspondingly, the method of this embodiment specifically includes:
310. Within a set time period, obtain at least two historical update pages associated with the website to be verified.
320. Perform content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page.
330. Calculate the information entropy of each content domain according to the content changes in the same content domain across the historical update pages.
340. According to the data features of the website to be verified, obtain from a reliable website list a reference website associated with the website to be verified.
In this embodiment, the data features of the website to be verified may include at least one of the following: the website update frequency within a set time period, the number of newly added pages within the set time period, the content topic, and so on.
The reliable website list specifically refers to a batch of reliable websites determined by methods such as mining user behavior logs or manual curation.
In this embodiment, it is considered that reliable websites with similar update frequencies, similar numbers of newly added pages within the set time period, or similar content topics will also show a certain similarity in the information entropies of the content domains of their web pages. Therefore, by obtaining from the reliable website list a reference website whose data features are similar to those of the website to be verified, and comparing the information entropies of the domains in the reference website and in the website to be verified, abnormal websites can be identified.
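The text does not specify how similarity of data features is measured, so the following is only a hypothetical sketch of reference website selection. The `SiteFeatures` record, the `pick_reference` function, the preference for an identical topic, and the relative-gap score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteFeatures:
    name: str
    update_frequency: float  # updates per set time period
    new_page_count: int      # newly added pages per set time period
    topic: str


def pick_reference(target, reliable_sites):
    """Return the reliable website whose data features are closest to the target, or None."""
    # Assumption: prefer reliable websites sharing the target's topic when any exist.
    same_topic = [s for s in reliable_sites if s.topic == target.topic]
    candidates = same_topic or list(reliable_sites)
    if not candidates:
        return None

    def gap(site):
        # Sum of relative gaps in update frequency and new page count.
        freq_gap = abs(site.update_frequency - target.update_frequency) / max(target.update_frequency, 1e-9)
        page_gap = abs(site.new_page_count - target.new_page_count) / max(target.new_page_count, 1)
        return freq_gap + page_gap

    return min(candidates, key=gap)


target = SiteFeatures("to-verify", 10.0, 100, "news")
reliable = [
    SiteFeatures("a", 9.0, 110, "news"),
    SiteFeatures("b", 10.0, 100, "sports"),
    SiteFeatures("c", 1.0, 5, "news"),
]
reference = pick_reference(target, reliable)
```

Here website `a` is chosen: it shares the topic and has the smallest combined gap, whereas `b` matches the numbers but not the topic.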
350. Obtain the information entropy of at least one content domain corresponding to the reference website.
360. Select at least one key content domain in the website to be verified and the reference website.
All content domains contained in both the website to be verified and the reference website may be taken as the key content domains, or one or more important content domains contained in both (for example, the picture domain and the text title domain) may be taken as the key content domains; this embodiment does not limit this.
370. Calculate the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website.
In a preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website may specifically include:
obtaining, in the website to be verified and the reference website, the information entropy difference corresponding to the same key content domain as the difference factor.
For example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference website, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.
Then |A-C| and |B-D| can be taken as the difference factors, where | | denotes the absolute value.
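The per-domain difference factors |A-C| and |B-D| can be sketched as follows. The function name `entropy_differences`, the domain names, and the numeric entropies are illustrative assumptions.

```python
def entropy_differences(target, reference, key_domains):
    """Absolute information entropy difference for each key content domain."""
    return {d: abs(target[d] - reference[d]) for d in key_domains}


# A, B for the website to be verified; C, D for the reference website (made-up values).
target_entropy = {"picture": 0.3, "text_title": 2.4}
reference_entropy = {"picture": 1.8, "text_title": 2.1}
diffs = entropy_differences(target_entropy, reference_entropy, ["picture", "text_title"])
```

With these values the picture domain differs by about 1.5 bits and the text title domain by about 0.3 bits.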
In another preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website may specifically include:
in the website to be verified, forming a first information vector from the information entropies corresponding to at least two key content domains;
in the reference website, forming a second information vector from the information entropies corresponding to the at least two key content domains;
calculating the distance value between the first information vector and the second information vector as the difference factor.
Continuing the previous example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference website, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.
Then the first information vector corresponding to the website to be verified is [A, B], and the second information vector corresponding to the reference website is [C, D].
The distance value between the two vectors can be calculated in various ways, typically by calculating the cosine of the angle between them, and the calculated distance value is used as the difference factor.
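One way to turn the cosine of the angle into a distance value is the cosine distance, 1 minus the cosine similarity. The text does not fix the exact formula, so this choice and the name `cosine_distance` are assumptions of the sketch.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


first_vector = [0.3, 2.4]    # [A, B] for the website to be verified (made-up values)
second_vector = [1.8, 2.1]   # [C, D] for the reference website (made-up values)
distance = cosine_distance(first_vector, second_vector)
```

Cosine distance ignores vector length, so two websites whose entropies differ only by a common scale factor get a distance near 0; a Euclidean distance would be another option if absolute entropy levels should also count.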
380. Determine whether the difference factor satisfies the set threshold condition; if so, perform 390; otherwise, perform 3100.
If the difference factor is the information entropy difference, determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition may specifically include:
if a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determining that the website to be verified is an abnormal website; or
if the accumulated difference value obtained by weighted summation of at least two information entropy differences exceeds a set threshold, determining that the website to be verified is an abnormal website.
If the difference factor is the distance value, determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition may specifically include:
if the distance value exceeds a set threshold, determining that the website to be verified is an abnormal website.
390. Determine that the website to be verified is an abnormal website.
3100. Determine that the website to be verified is a normal website.
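The threshold conditions of step 380 for information entropy differences can be sketched as below. The function names, parameter names, and numbers are hypothetical; the "and/or" in the text is implemented here as a plain "or", which is only one of the combinations it allows.

```python
def abnormal_by_count(diffs, threshold, min_count, required_domain=None):
    """Abnormal if enough differences exceed the threshold, or a designated domain does."""
    exceeded = [d for d, value in diffs.items() if value > threshold]
    if len(exceeded) >= min_count:
        return True
    return required_domain is not None and required_domain in exceeded


def abnormal_by_weighted_sum(diffs, weights, threshold):
    """Abnormal if the weighted sum of the differences exceeds the threshold."""
    return sum(weights[d] * value for d, value in diffs.items()) > threshold


diffs = {"picture": 1.5, "text_title": 0.3}   # made-up difference factors
by_count = abnormal_by_count(diffs, threshold=1.0, min_count=2)
by_domain = abnormal_by_count(diffs, threshold=1.0, min_count=2, required_domain="picture")
by_sum = abnormal_by_weighted_sum(diffs, {"picture": 0.7, "text_title": 0.3}, threshold=1.0)
```

In this example only one difference exceeds 1.0, so the count rule alone does not fire, while the designated-domain rule and the weighted-sum rule (0.7 × 1.5 + 0.3 × 0.3 = 1.14) both do.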
In the technical solution of this embodiment, after the information entropy of each content domain in the website to be verified is calculated, the information entropy of each content domain in a reliable website whose data features are similar to those of the website to be verified is obtained, the difference factor between the two is calculated from their information entropies, and anomaly identification is then performed on the website to be verified. This achieves the technical effect of identifying abnormal websites simply and quickly from the information entropy differences between abnormal websites and reliable websites, with high identification accuracy and good real-time performance.
Embodiment Four
Fig. 4 is a structural diagram of a website identification device provided by Embodiment Four of the present invention. As shown in Fig. 4, the device includes: a historical update page acquisition module 41, a content domain acquisition module 42, a content domain information entropy calculation module 43, and an anomaly identification module 44, wherein:
The historical update page acquisition module 41 is configured to obtain, within a set time period, at least two historical update pages associated with the website to be verified.
The content domain acquisition module 42 is configured to perform content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page.
The content domain information entropy calculation module 43 is configured to calculate the information entropy of each content domain according to the content changes in the same content domain across the historical update pages.
The anomaly identification module 44 is configured to perform anomaly identification on the website to be verified according to the information entropy calculation result.
In this embodiment of the present invention, at least two historical update pages associated with the website to be verified are obtained within a set time period; content parsing is performed on each historical update page to obtain at least one content domain corresponding to each historical update page; the information entropy of each content domain is calculated according to the content changes in the same content domain across the historical update pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation result. Because information entropy features discriminate well, are simple to compute, and are timely, this solves the technical problems of existing cheating website identification techniques, namely a low identification rate, poor real-time performance, and the need for additional manual labeling or data preparation work. It thereby optimizes existing website identification techniques and improves the identification accuracy of abnormal websites.
On the basis of the above embodiments, the at least two historical update pages associated with the website to be verified may include:
at least two historical update pages corresponding to the website domain name of the website to be verified; and/or
at least two historical update pages corresponding to the same web page address in the website to be verified.
On the basis of the above embodiments, the historical update page acquisition module may be specifically configured to:
within a set time period, crawl pages that are newly generated and/or updated on the network by means of a web crawler;
after clustering the crawled pages according to website domain name, take the website corresponding to each cluster as the website to be verified;
obtain, from the pages contained in the cluster, at least two historical update pages associated with the website to be verified.
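The clustering step can be sketched as a simple grouping of crawled page URLs by host name. This is an assumption-laden illustration: grouping by `hostname` rather than by registered domain, the name `group_by_domain`, and the `example` URLs are all introduced here, and the crawler itself is outside the sketch.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(urls, min_pages=2):
    """Group crawled page URLs by host name; keep clusters with at least min_pages pages."""
    clusters = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname  # lower-cased host, or None if the URL has none
        if host:
            clusters[host].append(url)
    # Only clusters with at least two pages can supply "at least two historical update pages".
    return {host: pages for host, pages in clusters.items() if len(pages) >= min_pages}


crawled = [
    "http://a.example/news/1",
    "http://a.example/news/2",
    "http://b.example/index",
]
to_verify = group_by_domain(crawled)
```

Each key of `to_verify` would then be a candidate website to be verified, and its page list the source of the historical update pages.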
On the basis of the above embodiments, the content domain may include at least one of the following:
a text title domain, a picture domain, a picture title domain, and a picture description text domain.
On the basis of the above embodiments, the content domain information entropy calculation module may be specifically configured to:
extract at least one comparison object from the same target content domain of each historical update page;
calculate the occurrence probability of the comparison object according to the number of occurrences of the comparison object in the target content domain of each historical update page;
calculate the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
On the basis of the above embodiments, if the content in the target content domain includes text, the comparison object may include: the original text, a semantic signature, or a semantic category;
if the content in the target content domain includes a picture, the comparison object may include: the original picture or a picture category.
On the basis of the above embodiments, the device may further include a body content association comparison module, configured to:
before the occurrence probability of the comparison object is calculated according to the number of occurrences of the comparison object in each target content domain, if the comparison object is determined to be time-sensitive simple repeated text, obtain, in each historical update page, the body content associated with the comparison object;
if, in different historical update pages, the body content corresponding to the same target comparison object is different, mark the target comparison object as different comparison objects.
On the basis of the above embodiments, the anomaly identification module may specifically include:
a reference website acquisition unit, configured to obtain, from a reliable website list and according to the data features of the website to be verified, a reference website associated with the website to be verified;
a reference website information entropy acquisition unit, configured to obtain the information entropy of at least one content domain corresponding to the reference website;
a key content domain selection unit, configured to select at least one key content domain in the website to be verified and the reference website;
a difference factor calculation unit, configured to calculate the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website;
an abnormal website identification subunit, configured to determine that the website to be verified is an abnormal website if the difference factor satisfies a set threshold condition.
On the basis of the above embodiments, the difference factor calculation unit may be specifically configured to:
obtain, in the website to be verified and the reference website, the information entropy difference corresponding to the same key content domain as the difference factor;
and the abnormal website identification subunit may be specifically configured to:
if a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determine that the website to be verified is an abnormal website; or
if the accumulated difference value obtained by weighted summation of at least two information entropy differences exceeds a set threshold, determine that the website to be verified is an abnormal website.
On the basis of the above embodiments, the difference factor calculation unit may be specifically configured to:
in the website to be verified, form a first information vector from the information entropies corresponding to at least two key content domains;
in the reference website, form a second information vector from the information entropies corresponding to the at least two key content domains;
calculate the distance value between the first information vector and the second information vector as the difference factor;
and the abnormal website identification subunit may be specifically configured to:
if the distance value exceeds a set threshold, determine that the website to be verified is an abnormal website.
On the basis of the above embodiments, the data features of the website to be verified may include at least one of the following:
the website update frequency within a set time period, the number of newly added pages within the set time period, and the content topic.
On the basis of the above embodiments, the anomaly identification module may be specifically configured to:
if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determine that the website to be verified is an abnormal website; or
if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determine that the website to be verified is an abnormal website; or
if the ratio of the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determine that the website to be verified is an abnormal website.
On the basis of the above embodiments, the anomaly identification module may be specifically configured to:
take the information entropy calculation result as at least one information entropy feature value, combine the information entropy feature value with other abnormal website identification feature values, and perform anomaly identification on the website to be verified.
The website identification device provided by the embodiments of the present invention can be used to execute the website identification method provided by any embodiment of the present invention, has the corresponding functional modules, and achieves the same beneficial effects.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented by the server described above. Optionally, the embodiments of the present invention may be implemented by programs executable by a computer device, so that they may be stored in a storage device and executed by a processor. The programs may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc. Alternatively, the modules may each be made into an individual integrated circuit module, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (18)

1. A website identification method, characterized by comprising:
within a set time period, obtaining at least two historical update pages associated with a website to be verified;
performing content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page;
calculating the information entropy of each content domain according to the content changes in the same content domain across the historical update pages;
performing anomaly identification on the website to be verified according to the information entropy calculation result;
wherein calculating the information entropy of each content domain according to the content changes in the same content domain across the historical update pages comprises:
extracting at least one comparison object from the same target content domain of each historical update page;
calculating the occurrence probability of the comparison object according to the number of occurrences of the comparison object in the target content domain of each historical update page;
calculating the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
2. The method according to claim 1, characterized in that the at least two historical update pages associated with the website to be verified comprise:
at least two historical update pages corresponding to the website domain name of the website to be verified; and/or
at least two historical update pages corresponding to the same web page address in the website to be verified.
3. The method according to claim 1 or 2, characterized in that obtaining, within a set time period, at least two historical update pages associated with the website to be verified comprises:
within a set time period, crawling pages that are newly generated and/or updated on the network by means of a web crawler;
after clustering the crawled pages according to website domain name, taking the website corresponding to each cluster as the website to be verified;
obtaining, from the pages contained in the cluster, at least two historical update pages associated with the website to be verified.
4. The method according to claim 1, characterized in that the content domain comprises at least one of the following:
a text title domain, a picture domain, a picture title domain, and a picture description text domain.
5. The method according to claim 1, characterized in that:
if the content in the target content domain includes text, the comparison object comprises: the original text, a semantic signature, or a semantic category;
if the content in the target content domain includes a picture, the comparison object comprises: the original picture or a picture category.
6. The method according to claim 1 or 5, characterized in that, before calculating the occurrence probability of the comparison object according to the number of occurrences of the comparison object in each target content domain, the method further comprises:
if the comparison object is determined to be time-sensitive simple repeated text, obtaining, in each historical update page, the body content associated with the comparison object;
if, in different historical update pages, the body content corresponding to the same target comparison object is different, marking the target comparison object as different comparison objects.
7. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:
obtaining, from a reliable website list and according to the data features of the website to be verified, a reference website associated with the website to be verified;
obtaining the information entropy of at least one content domain corresponding to the reference website;
selecting at least one key content domain in the website to be verified and the reference website;
calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website;
if the difference factor satisfies a set threshold condition, determining that the website to be verified is an abnormal website.
8. The method according to claim 7, characterized in that calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website specifically comprises:
obtaining, in the website to be verified and the reference website, the information entropy difference corresponding to the same key content domain as the difference factor;
and determining that the website to be verified is an abnormal website if the difference factor satisfies a set threshold condition specifically comprises:
if a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determining that the website to be verified is an abnormal website; or
if the accumulated difference value obtained by weighted summation of at least two information entropy differences exceeds a set threshold, determining that the website to be verified is an abnormal website.
9. The method according to claim 7, characterized in that calculating the difference factor between the website to be verified and the reference website according to the information entropies corresponding to the key content domain in the website to be verified and in the reference website specifically comprises:
in the website to be verified, forming a first information vector from the information entropies corresponding to at least two key content domains;
in the reference website, forming a second information vector from the information entropies corresponding to the at least two key content domains;
calculating the distance value between the first information vector and the second information vector as the difference factor;
and determining that the website to be verified is an abnormal website if the difference factor satisfies a set threshold condition specifically comprises:
if the distance value exceeds a set threshold, determining that the website to be verified is an abnormal website.
10. The method according to any one of claims 7-9, characterized in that the data features of the website to be verified comprise at least one of the following:
the website update frequency within a set time period, the number of newly added pages within the set time period, and the content topic.
11. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:
if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or
if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or
if the ratio of the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.
12. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:
taking the information entropy calculation result as at least one information entropy feature value, combining the information entropy feature value with other abnormal website identification feature values, and performing anomaly identification on the website to be verified.
13. a kind of identification device of website characterized by comprising
History updates page acquisition module, in the set time period, acquisition to be gone through with website to be verified associated at least two History updates the page;
Content domain obtains module, carries out Context resolution for updating the page to each history, obtains and update with each history At least one corresponding content domain of the page;
Content domain comentropy computing module, for updating the content change in the page in identical content domain according to each history, Calculate the comentropy of each content domain;
Anomalous identification module, for carrying out anomalous identification to the website to be verified according to comentropy calculated result;
Wherein, the content domain comentropy computing module, is specifically used for:
Respectively in the same target content domain that each history updates the page, extracts at least one and compare object;
According to frequency of occurrence of the comparison object in the object content domain that each history updates the page, the comparison is calculated The probability of occurrence of object;
According to the probability of occurrence for comparing object, comentropy corresponding with the object content domain is calculated.
14. The device according to claim 13, wherein the historical update page acquisition module is specifically configured to:
crawl, by a web crawler within the set time period, pages that are newly generated on the network and/or pages that have been updated;
cluster the crawled pages by website domain name, and take the website corresponding to a cluster as the website to be verified;
obtain, from the pages included in the cluster, at least two historical update pages associated with the website to be verified.
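The clustering step of claim 14 might be sketched as follows. The crawler itself is out of scope here; the sketch assumes the crawled pages are already available as `(url, html)` tuples and treats the URL host name as the "website domain name", which is an assumption. The function name `cluster_pages_by_domain` is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_pages_by_domain(pages):
    """Group crawled pages by host name.

    pages: iterable of (url, html) tuples (hypothetical input format).
    Returns only clusters with at least two pages, since at least two
    historical update pages are needed for a website to be verified.
    """
    clusters = defaultdict(list)
    for url, html in pages:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host:
            clusters[host].append((url, html))
    return {host: items for host, items in clusters.items() if len(items) >= 2}
```

Each key of the returned dictionary then plays the role of a website to be verified, and its value holds the candidate historical update pages.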
15. The device according to claim 13, further comprising a body content association comparison module, configured to:
before the occurrence probability of the comparison object is calculated according to the number of times the comparison object occurs in each target content domain, if the comparison object is determined to be a time-sensitive simple repeated text, obtain, in each historical update page, the body content associated with the comparison object;
if the body content corresponding to the same target comparison object differs between historical update pages, mark the target comparison object as different comparison objects.
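One possible way to realize the relabeling in claim 15 is to append a digest of the associated body content to a time-sensitive comparison object, so that identical text with different body content yields different labels. This is only a sketch: the claim does not say how the marking is done, the `is_time_sensitive` predicate is supplied by the caller, and the name `label_comparison_objects` is hypothetical.

```python
import hashlib


def label_comparison_objects(observations, is_time_sensitive):
    """Relabel time-sensitive repeated text by its associated body content.

    observations: list of (object_text, body_text) pairs, one per
    historical update page (hypothetical input format).
    is_time_sensitive: caller-supplied predicate on object_text.
    """
    labels = []
    for text, body in observations:
        if is_time_sensitive(text):
            # Same text with different body content gets a different label.
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
            labels.append(f"{text}#{digest}")
        else:
            labels.append(text)
    return labels
```

The resulting labels can be counted in place of the raw comparison objects, so that a recurring heading such as a daily news title is not mistaken for unchanged content when the body behind it differs.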
16. The device according to claim 13, wherein the anomaly identification module specifically comprises:
a reference site acquisition unit, configured to obtain, from a list of reliable websites, a reference site associated with the website to be verified according to data features of the website to be verified;
a reference site information entropy acquisition unit, configured to obtain the information entropy of at least one content domain corresponding to the reference site;
a key content domain selection unit, configured to select at least one key content domain in the website to be verified and the reference site;
a difference factor calculation unit, configured to calculate a difference factor between the website to be verified and the reference site according to the information entropies corresponding to the key content domain in the website to be verified and in the reference site;
an abnormal website identification subunit, configured to determine that the website to be verified is an abnormal website if the difference factor satisfies a set threshold condition.
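Claim 16 does not give a formula for the difference factor or state the threshold condition. The sketch below assumes, purely for illustration, that the difference factor is the mean absolute difference between the two sites' entropies over the key content domains, and that the condition is "greater than a threshold". The function names and the dictionary inputs are hypothetical.

```python
def difference_factor(site_entropy, reference_entropy, key_domains):
    """Mean absolute entropy difference over key content domains (assumed formula)."""
    if not key_domains:
        raise ValueError("at least one key content domain is required")
    diffs = [abs(site_entropy[d] - reference_entropy[d]) for d in key_domains]
    return sum(diffs) / len(diffs)


def is_abnormal_by_reference(site_entropy, reference_entropy, key_domains, threshold):
    """Flag the site when the difference factor exceeds the threshold (assumed condition)."""
    return difference_factor(site_entropy, reference_entropy, key_domains) > threshold
```

A site whose title and body domains barely change, compared with a reliable reference site whose domains change regularly, would produce a large difference factor under this assumed definition.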
17. The device according to claim 13, wherein the anomaly identification module is specifically configured to:
if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determine that the website to be verified is an abnormal website; or
if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determine that the website to be verified is an abnormal website; or
if the ratio of the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determine that the website to be verified is an abnormal website.
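The three alternative rules of claim 17 can be sketched as one function. The claim presents them as alternatives; combining them with a logical OR, the choice of which domain is the numerator of the ratio, and skipping the ratio rule when the denominator entropy is zero are all assumptions of this sketch. The function and parameter names are hypothetical.

```python
def is_abnormal_by_thresholds(entropy_by_domain, target_domains, ratio_domains, t1, t2, t3):
    """Apply the three threshold rules to one website.

    entropy_by_domain: {content domain name: information entropy}
    target_domains: domains checked against the second threshold t2
    ratio_domains: (numerator_domain, denominator_domain) for the third rule
    """
    # Rule 1: the sum of all content domain entropies is below the first threshold.
    total_rule = sum(entropy_by_domain.values()) < t1
    # Rule 2: at least one target content domain is below the second threshold.
    target_rule = any(entropy_by_domain[d] < t2 for d in target_domains)
    # Rule 3: the entropy ratio of two target domains is below the third threshold.
    numerator, denominator = ratio_domains
    denom = entropy_by_domain[denominator]
    ratio_rule = denom > 0 and entropy_by_domain[numerator] / denom < t3
    return total_rule or target_rule or ratio_rule
```

In practice the three thresholds would be tuned on known normal and abnormal sites; the values used to exercise this sketch carry no meaning beyond illustration.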
18. The device according to claim 13, wherein the anomaly identification module is specifically configured to:
use the information entropy calculation result as at least one information entropy feature value, combine the information entropy feature value with other abnormal-website identification feature values, and perform anomaly identification on the website to be verified.
CN201610571258.6A 2016-07-19 2016-07-19 The recognition methods of website and device Active CN106294535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610571258.6A CN106294535B (en) 2016-07-19 2016-07-19 The recognition methods of website and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610571258.6A CN106294535B (en) 2016-07-19 2016-07-19 The recognition methods of website and device

Publications (2)

Publication Number Publication Date
CN106294535A CN106294535A (en) 2017-01-04
CN106294535B true CN106294535B (en) 2019-06-25

Family

ID=57651792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610571258.6A Active CN106294535B (en) 2016-07-19 2016-07-19 The recognition methods of website and device

Country Status (1)

Country Link
CN (1) CN106294535B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280110A (en) * 2017-05-15 2018-07-13 广州市动景计算机科技有限公司 Website contrast difference's method, apparatus and client
CN107451180B (en) * 2017-06-13 2021-02-19 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for identifying site homologous relation
CN109150817B (en) * 2017-11-24 2020-11-27 新华三信息安全技术有限公司 Webpage request identification method and device
CN109800378A (en) * 2019-01-23 2019-05-24 北京字节跳动网络技术有限公司 Content processing method, device and electronic equipment based on custom browser
CN109818828A (en) * 2019-02-20 2019-05-28 成都嗨翻屋科技有限公司 A kind of distributed reptile system monitoring method and device
WO2020237480A1 (en) * 2019-05-27 2020-12-03 西门子股份公司 Control method and device based on image recognition
CN110716778B (en) * 2019-09-10 2023-09-26 创新先进技术有限公司 Application compatibility testing method, device and system
CN111460763A (en) * 2020-03-02 2020-07-28 南京南瑞继保电气有限公司 Method, device and equipment for marking file differences and computer-readable storage medium
CN113554131B (en) * 2021-09-22 2021-12-03 四川大学华西医院 Medical image processing and analyzing method, computer device, system and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100565523C (en) * 2007-04-05 2009-12-02 中国科学院自动化研究所 A kind of filtering sensitive web page method and system based on multiple Classifiers Combination
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
CN105205061B (en) * 2014-06-12 2018-08-10 中国银联股份有限公司 A kind of page info acquisition methods of electric business website

Also Published As

Publication number Publication date
CN106294535A (en) 2017-01-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant