CN107168997B - Webpage originality assessment method and device based on artificial intelligence and storage medium - Google Patents

Webpage originality assessment method and device based on artificial intelligence and storage medium

Info

Publication number
CN107168997B
Authority
CN
China
Prior art keywords
webpage
sentence
original
weight
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710202081.7A
Other languages
Chinese (zh)
Other versions
CN107168997A (en
Inventor
马晋
程刚
张晋
周志奋
李田赫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710202081.7A
Publication of CN107168997A
Application granted
Publication of CN107168997B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an artificial-intelligence-based web page originality assessment method, apparatus and storage medium. The method comprises: for each sentence extracted from a web page to be processed, acquiring the weight of the sentence and identifying whether the sentence is an original sentence; and determining the originality authority of the web page to be processed according to the identification results and the acquired sentence weights. With the scheme of the invention, the originality authority of a web page can be effectively evaluated.

Description

Webpage originality assessment method and device based on artificial intelligence and storage medium
[ technical field ]
The invention relates to internet technology, and in particular to a web page originality assessment method and apparatus based on artificial intelligence, and a storage medium.
[ background of the invention ]
Artificial Intelligence, abbreviated as AI, is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. It is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Research in the field includes robotics, language recognition, image recognition, natural language processing, expert systems and the like.
With the explosive growth of internet data in recent years, search engine companies have come to retrieve web page resources on the scale of billions. Among these massive web page resources, a considerable number of site owners or resource producers reprint or even plagiarize high-quality original web pages from others, in order to reduce the cost of producing pages or to use others' high-quality pages to attract clicks and increase traffic to their own websites.
Although this phenomenon helps network resources spread quickly to a certain extent, the author of the original content has spent time and energy creating it, and reprinting or plagiarism reduces or even eliminates the value of the original author's creation. In addition, for a search engine or the like, collecting a large number of duplicate resources consumes additional storage, retrieval time and other costs.
Therefore, the originality authority of web pages needs to be evaluated, so that resources that are original and whose originality value is recognized can be shown to users in scenarios such as resource screening, retrieval-side resource recall and retrieval-side ranking strategies, thereby promoting the construction of the search content ecosystem.
However, the prior art offers no effective way to evaluate the originality authority of a web page.
[ summary of the invention ]
In view of the above, the present invention provides a method, an apparatus and a storage medium for webpage originality assessment based on artificial intelligence.
The specific technical scheme is as follows:
a webpage originality assessment method based on artificial intelligence comprises the following steps:
respectively acquiring the weight of each sentence extracted from a webpage to be processed, and identifying whether the sentence is an original sentence or not;
and determining the original authority of the webpage to be processed according to the recognition result and the obtained weight of the sentence.
An artificial intelligence-based webpage originality assessment apparatus, comprising: the device comprises a preprocessing module and an evaluation module;
the preprocessing module is used for respectively acquiring the weight of each sentence extracted from the webpage to be processed and identifying whether the sentence is an original sentence or not;
and the evaluation module is used for determining the original authority of the webpage to be processed according to the recognition result of the preprocessing module and the obtained weight of the sentence.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.
Based on the introduction, the scheme of the invention can respectively acquire the weight of each sentence extracted from the webpage to be processed, identify whether the sentence is the original sentence, and further determine the original authority of the webpage to be processed according to the identification result and the acquired weight of the sentence, thereby realizing effective evaluation on the original authority of the webpage.
[ description of the drawings ]
FIG. 1 is a flowchart of an embodiment of a web page originality assessment method based on artificial intelligence according to the present invention.
FIG. 2 is a flowchart of an embodiment of the method for evaluating the originality authority of a web page according to mode one of the present invention.
FIG. 3 is a schematic diagram illustrating a structure of an embodiment of an apparatus for evaluating webpage originality based on artificial intelligence according to the present invention.
FIG. 4 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.
[ detailed description ]
In order to make the technical solution of the present invention clearer and more obvious, the solution of the present invention is further described in detail below by referring to the drawings and examples.
Fig. 1 is a flowchart of an embodiment of a web page originality assessment method based on artificial intelligence, as shown in fig. 1, including the following specific implementation manners:
in 101, for each sentence extracted from a webpage to be processed, respectively obtaining a weight of the sentence, and identifying whether the sentence is an original sentence;
in 102, the original authority of the webpage to be processed is determined according to the recognition result and the obtained weight of the sentence.
Specific implementations of the above-described contents of each part are described in detail below.
One) Sentence extraction
For any web page, the title (title) and the body content (page field) of the web page can be acquired by page analysis and the like.
The obtained body content can be segmented into sentences, for example according to end marks that signify the end of a sentence in natural language and according to web page source code tags, and sentences that are too short can be filtered out. End marks that signify the end of a sentence in natural language may include ".", "?", "!" and the like.
Then the weight of each sentence can be calculated. Specifically, the following processing can be performed for each sentence: word segmentation at basic granularity and stop-word removal are performed on the sentence, and the weight of the sentence is then calculated from the result, for example by adding up the Inverse Document Frequency (IDF) values of the words (terms) obtained after the processing and taking the sum as the weight of the sentence. How to obtain the IDF values is prior art.
For each web page, the sentences segmented from its body content can be sorted in descending order of weight, and the top M sentences after sorting are selected, M being a positive integer greater than one; the selected sentences and the title of the web page are then taken as the sentences extracted from the web page.
The specific value of M may be determined according to actual needs, for example, 30, and the title may be reserved and identified as a special sentence.
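The extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the `min_len` cutoff, the whitespace tokenizer (a real system would use a word segmenter for Chinese) and the `idf` lookup table are all assumptions made for the example.

```python
import re
from typing import Dict, List, Set, Tuple

# Assumed terminator set; the text names ".", "?" and "!" as examples.
_SENTENCE_END = re.compile(r"[.?!。？！]+")


def split_sentences(text: str, min_len: int = 5) -> List[str]:
    """Split on end-of-sentence marks and drop sentences shorter than min_len."""
    parts = (p.strip() for p in _SENTENCE_END.split(text))
    return [p for p in parts if len(p) >= min_len]


def sentence_weight(sentence: str, idf: Dict[str, float], stop_words: Set[str]) -> float:
    """Sum the IDF values of the terms left after stop-word removal.

    Whitespace tokenization stands in for real word segmentation (assumption).
    """
    terms = [t for t in sentence.lower().split() if t not in stop_words]
    return sum(idf.get(t, 0.0) for t in terms)


def extract_sentences(
    title: str, body: str, idf: Dict[str, float], stop_words: Set[str], m: int = 30
) -> List[Tuple[str, float]]:
    """Return the title plus the top-m body sentences, each with its weight."""
    weighted = [(s, sentence_weight(s, idf, stop_words)) for s in split_sentences(body)]
    weighted.sort(key=lambda pair: pair[1], reverse=True)  # descending by weight
    return [(title, sentence_weight(title, idf, stop_words))] + weighted[:m]
```

The title is always kept as the first entry, mirroring the statement that it is reserved as a special sentence; the default `m=30` follows the example value given in the text.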
For each extracted sentence, a sentence signature, such as a simhash value, can be calculated on the basis of the word segmentation and stop-word removal; simhash is a commonly used string hashing algorithm.
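As an illustration of such a signature, the sketch below computes a 64-bit simhash over a sentence's terms. It is an unweighted variant that uses the first 8 bytes of an MD5 digest as the per-term hash; both choices are assumptions for the example, since the text does not specify the hash function or any term weighting.

```python
import hashlib
from typing import Iterable

_BITS = 64  # signature width assumed for this sketch


def simhash(terms: Iterable[str]) -> int:
    """Return a 64-bit simhash of the given terms (unweighted variant)."""
    counts = [0] * _BITS
    for term in terms:
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the term
        for i in range(_BITS):
            # Each term votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    signature = 0
    for i, count in enumerate(counts):
        if count > 0:
            signature |= 1 << i
    return signature
```

Sentences that share most of their terms tend to receive signatures that differ in only a few bits, which is what makes a Hamming-distance comparison of signatures meaningful.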
Two) Original sentence identification
In practical applications, in order to facilitate search by a search engine, a large number of web pages may be collected and stored in a database, and each web page has its own storage time.
A plurality of sentences may be extracted from each web page stored in the database, and a sentence-level original dictionary may then be generated from the extracted sentences. For example, for the same sentence appearing on different web pages, the storage times of those web pages are compared to distinguish on which web page the sentence is an original sentence and on which it is a non-original sentence; in theory, the occurrence with the earliest storage time is usually the original.
In this way, by looking up the original dictionary, it can be identified whether any sentence extracted from any web page stored in the database is an original sentence.
It should be noted that the above is only an example, and is not intended to limit the technical solution of the present invention, and besides the above, any other method that can be conceived by those skilled in the art may be adopted to identify whether the sentence is original or not.
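One possible reading of the earliest-storage-time rule is sketched below. The record layout `(page_id, storage_time, signature)` and the use of exact signature equality to decide that two sentences are "the same" are assumptions for illustration only.

```python
from typing import Dict, Hashable, Iterable, Tuple

Record = Tuple[str, float, Hashable]  # (page_id, storage_time, sentence_signature)


def build_original_dictionary(records: Iterable[Record]) -> Dict[Hashable, str]:
    """Map each sentence signature to the page with the earliest storage time."""
    earliest: Dict[Hashable, Tuple[float, str]] = {}
    for page_id, stored_at, signature in records:
        best = earliest.get(signature)
        if best is None or stored_at < best[0]:
            earliest[signature] = (stored_at, page_id)
    return {sig: page_id for sig, (_, page_id) in earliest.items()}


def is_original(page_id: str, signature: Hashable, original_dict: Dict[Hashable, str]) -> bool:
    """A sentence is original on a page if that page is its earliest holder."""
    return original_dict.get(signature) == page_id
```

For a non-original sentence, `original_dict[signature]` directly gives the page that holds the original sentence.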
Three) Evaluation of the originality authority of a web page
Originality authority is a feature that describes the original information of a web page from a brand-new angle. It is based on the following consideration: if original sentences in a web page are referenced by other web pages, the web page is given a quantitative index describing that it has some form of authority in terms of originality.
Theoretically, the quantitative description of the originality authority of a web page is expected to follow these numerical change rules:
1) the more the original sentences in a web page are quoted/reprinted by other web pages, the greater the originality authority of the web page;
2) the greater the originality authority of the web pages that reference the original sentences in a web page, the greater the originality authority of that web page.
Based on the above consideration, for the evaluation of the originality authority of the web page, the invention provides two implementation manners, which are introduced below.
Mode one
Fig. 2 is a flowchart of an embodiment of the method for evaluating the originality authority of a web page according to mode one of the present invention; as shown in fig. 2, it includes the following specific implementation.
In 201, each web page saved in the database is taken as a web page to be processed.
Namely, each webpage stored in the database is used as a webpage to be processed, and the original authority of each webpage is determined subsequently.
In 202, the reference relationship between the web pages is analyzed according to the identification result, and a series of directed edges are determined according to the analysis result, wherein each directed edge corresponds to two web pages respectively, and the direction is from one web page to the other web page.
The following processes can be performed for each web page stored in the database:
taking the web page as a referring web page, respectively determining the web page where the original sentence corresponding to each non-original sentence in the referring web page is located, deduplicating the determined web pages, and taking each remaining web page as a reference source web page corresponding to the referring web page;
and forming, from the referring web page and each reference source web page, a directed edge pointing from the referring web page to that reference source web page.
For example, for a webpage a, after identifying whether each sentence extracted from the webpage a is an original sentence, a webpage where the original sentence corresponding to each non-original sentence is located may be determined, that is, a reference source of the non-original sentence is determined, and accordingly, the determined webpage may be referred to as a reference source webpage.
Assuming that 30 sentences are extracted from the webpage a, 15 sentences are original sentences, and the other 15 non-original sentences, wherein 5 of the 15 non-original sentences are referenced from the webpage b, 5 are referenced from the webpage c, and 5 are referenced from the webpage d, so that the webpage b, the webpage c, and the webpage d are reference source webpages corresponding to the webpage a.
Accordingly, 3 directed edges are available, directed edge pointing from web page a to web page b, directed edge pointing from web page a to web page c, and directed edge pointing from web page a to web page d.
In this manner, a web-page-level weighted directed acyclic graph can be constructed over all the web pages stored in the database. If a reference relationship exists between two web pages, the direction of the edge is determined by their storage times, so the constructed graph is necessarily acyclic.
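The edge construction described above might look like the following sketch. The input shapes (a mapping from page id to the signatures of its extracted sentences, and a dictionary from signature to the page holding the original) are hypothetical and chosen only to keep the example small.

```python
from typing import Dict, Hashable, List, Set, Tuple


def build_edges(
    page_sentences: Dict[str, List[Hashable]],
    original_dict: Dict[Hashable, str],
) -> Set[Tuple[str, str]]:
    """Return directed edges (referring_page, reference_source_page)."""
    edges: Set[Tuple[str, str]] = set()
    for page, signatures in page_sentences.items():
        for signature in signatures:
            source = original_dict.get(signature, page)
            if source != page:  # the sentence is non-original on this page
                edges.add((page, source))  # the set deduplicates source pages
    return edges
```

With the example from the text, a page a that borrows sentences from pages b, c and d yields exactly the three edges a→b, a→c and a→d, no matter how many sentences were borrowed from each source.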
In 203, the weight of each directed edge is determined according to the recognition result and the obtained weight of the sentence.
For each directed edge, the weight is determined by the number of referenced sentences and their weights: the more sentences are referenced and the greater their weights, the greater the weight of the directed edge.
Accordingly, for each directed edge, the following processes can be performed:
non-original sentences which meet the requirements are screened from the non-original sentences in the referring webpage corresponding to the directed edges, wherein the meeting requirements are as follows: the corresponding original sentence is positioned in the reference source webpage corresponding to the directed edge;
calculating the sum of the weights of the screened non-original sentences to obtain a first addition result;
calculating the sum of the weights of the non-original sentences in the reference webpage corresponding to the directed edge to obtain a second addition result;
and dividing the first addition result by the second addition result, and taking the calculation result as the weight of the directed edge.
That is, for any directed edge, its weight is

$$\mathrm{weight}_{i \to j} = \frac{\sum_{s \in S_j} (1 - I_s)\, w(s)}{\sum_{s' \in S} (1 - I_{s'})\, w(s')} \qquad (1)$$

where the referring web page corresponding to the directed edge is web page i, and the corresponding reference source web page is web page j;
S_j denotes the set of non-original sentences in web page i whose corresponding original sentences are located in web page j;
s denotes a sentence in the set S_j, w(s) denotes the weight of the sentence, and I_s takes the value 1 when the sentence is an original sentence and 0 otherwise;
S denotes the set of all sentences extracted from web page i; obviously, the sentences in the set S_j are also in the set S;
s' denotes a sentence in the set S, w(s') denotes the weight of the sentence, and I_{s'} takes the value 1 when the sentence is an original sentence and 0 otherwise; since (1 - 1) × w(s') = 0 for an original sentence, the denominator $\sum_{s' \in S} (1 - I_{s'})\, w(s')$ is the sum of the weights of the non-original sentences in web page i.
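A small sketch of the edge weight computation follows. The tuple layout `(weight, is_original, origin_page)` for the sentences of the referring page is an assumption for the example, and returning 0.0 when the page has no non-original sentences is a defensive choice not discussed in the text.

```python
from typing import List, Tuple

Sentence = Tuple[float, bool, str]  # (weight, is_original, page holding the original)


def edge_weight(sentences: List[Sentence], source_page: str) -> float:
    """Weight of the edge from the referring page to source_page.

    Numerator: weights of non-original sentences whose original is on source_page.
    Denominator: weights of all non-original sentences of the referring page.
    """
    numerator = sum(
        w for w, original, origin in sentences if not original and origin == source_page
    )
    denominator = sum(w for w, original, _ in sentences if not original)
    return numerator / denominator if denominator else 0.0
```

For a page with one original sentence and two borrowed ones of weight 1.0 (from b) and 3.0 (from c), the edges to b and c get weights 0.25 and 0.75.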
In 204, the original authority of each web page is determined according to the weight values of all the directed edges.
The obtained weight of each directed edge may be normalized, and the result is used as a transition probability of the iterative algorithm.
For any web page a, assume that 3 directed edges point from web page a to other web pages, namely web page b, web page c and web page d. The weight of each directed edge can be calculated respectively; assuming the weights are weight b, weight c and weight d, the 3 weights can be normalized as follows:
weight b' = weight b / (weight b + weight c + weight d);
weight c' = weight c / (weight b + weight c + weight d);
weight d' = weight d / (weight b + weight c + weight d);
Weight b', weight c' and weight d' are the 3 transition probabilities obtained after normalization.
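The normalization step can be written in a few lines. The function name and the dictionary keyed by target page are illustrative assumptions, as is the all-zero result when the weights sum to zero.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Divide each outgoing edge weight of a page by the sum of its outgoing weights."""
    total = sum(weights.values())
    if total == 0:
        return {target: 0.0 for target in weights}
    return {target: w / total for target, w in weights.items()}
```

For example, outgoing weights of 1.0, 1.0 and 2.0 toward pages b, c and d become transition probabilities 0.25, 0.25 and 0.5.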
A P × P transition probability matrix is formed from all the transition probabilities, where P is a positive integer equal to the number of web pages stored in the database.
Assuming that 10 web pages (actually far more) are stored in the database, a 10 × 10 transition probability matrix is obtained, where each element is a calculated transition probability, for example, the element with coordinate position (2,3) represents the transition probability corresponding to the directed edge pointing from the web page 2 to the web page 3.
According to the transition probability matrix, the original authority of each webpage can be determined simultaneously through an iterative algorithm.
Specifically, a P-dimensional all-ones column vector e may be set first.
Thereafter, an iterative operation may be performed, including: calculating the product of the originality authority vector and the transition probability matrix, and adding e to the product, wherein e is used as the originality authority vector in the first iteration;
and determining whether iterative convergence has been reached; if not, taking the resulting sum as the originality authority vector and repeating the iterative operation; if so, taking each element of the originality authority vector as the originality authority score of one web page.
That is:

$$v_{i+1} = W v_i + e \qquad (2)$$

where v denotes the originality authority vector, with v equal to e in the first iteration;
W denotes the transition probability matrix.
The finally obtained v_{i+1} is a P-dimensional column vector, each element of which is the originality authority score of one web page stored in the database.
The physical meaning of the iterative process is that the originality authority of a web page is the accumulation of its initial originality authority (e) and the originality authority transferred from other web pages. For any web page x, the more web pages reference the original sentences in web page x and the greater the originality authority of those referring web pages, the greater the originality authority of web page x after the iterative operation, which is consistent with the expected numerical change rule. In addition, because a weighted directed acyclic graph is used, convergence of the iteration is guaranteed.
How to determine whether iteration convergence is reached is prior art.
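The iteration in equation (2) can be sketched in plain Python as below. Two points are assumptions of this sketch rather than statements from the text: the matrix is indexed as `transition[i][j]` for the edge from referring page i to source page j (matching the (2,3) example), so that authority flows from referrer to source and W in equation (2) corresponds to the transpose of `transition`; and convergence is tested with a simple maximum-difference tolerance.

```python
from typing import List


def originality_authority(
    transition: List[List[float]], tol: float = 1e-9, max_iter: int = 1000
) -> List[float]:
    """Iterate v <- (authority passed along edges) + e until v stops changing.

    transition[i][j] is the transition probability of the edge from
    referring page i to reference source page j (assumed indexing).
    """
    p = len(transition)
    if p == 0:
        return []
    e = [1.0] * p
    v = e[:]  # v equals e in the first iteration
    for _ in range(max_iter):
        nxt = [
            e[j] + sum(transition[i][j] * v[i] for i in range(p))
            for j in range(p)
        ]
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v
```

On a three-page graph where page 0 references pages 1 and 2 equally and page 1 references page 2, the scores settle after a few iterations because the graph is acyclic, and page 2 ends up with the highest score.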
Mode two
In order to obtain the original authority of the webpage, besides the first mode, the second mode can be adopted.
In this mode, any web page stored in the database can be used as the web page to be processed; that is, the originality authority of a single web page can be determined independently, and it is not necessary to determine the originality authority of all web pages simultaneously as in mode one.
For each original sentence in the web page to be processed, the product of the weight of the original sentence and the length of the inverted posting list (the "inverted zipper") corresponding to the original sentence can be calculated respectively.
And then, products corresponding to the original sentences can be added, and the added sum is used as the original authority score of the webpage to be processed.
That is:

$$\mathrm{org\_auth}(u_i) = \sum_{j=1}^{n} I_j \cdot f_j \cdot w(j) \qquad (3)$$

where org_auth(u_i) denotes the originality authority score of the web page to be processed;
n denotes the number of sentences extracted from the web page to be processed; for any sentence j, I_j takes the value 1 if the sentence is an original sentence and 0 otherwise;
w(j) denotes the weight of the sentence, and f_j denotes the length of the inverted posting list corresponding to the sentence;
since I_j × f_j × w(j) = 0 for non-original sentences, equation (3) adds up only the products corresponding to the original sentences.
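Equation (3) reduces to a single weighted sum, sketched below. The tuple layout `(is_original, weight, posting_list_length)` is an assumption made for the example.

```python
from typing import List, Tuple

ScoredSentence = Tuple[bool, float, int]  # (is_original, weight, posting_list_length)


def org_auth(sentences: List[ScoredSentence]) -> float:
    """Sum weight * posting-list length over the original sentences only."""
    return sum(w * f for original, w, f in sentences if original)
```

Non-original sentences contribute nothing, so a page consisting only of borrowed sentences scores 0.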
The length of the inverted posting list (inverted zipper) of a sentence may refer to: the number of web pages in the database that contain the sentence, or the number of web pages in the database that contain the sentence or a neighboring sentence of it, where a neighboring sentence is a sentence whose sentence signature has a Hamming distance from the sentence signature of the sentence smaller than a predetermined threshold; the specific value of the threshold can be determined according to actual needs.
For example, if the Hamming distance between the sentence signature of sentence A and the sentence signature of sentence B is smaller than the threshold, sentence B is a neighboring sentence of sentence A and, likewise, sentence A is a neighboring sentence of sentence B. How to calculate the Hamming distance is prior art.
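A naive sketch of the neighbor-aware count is shown below. The in-memory `index` mapping signature to page ids, the linear scan over all signatures and the default threshold of 3 are assumptions for illustration; a production index would need a faster neighbor lookup.

```python
from typing import Dict, Set


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer signatures differ."""
    return bin(a ^ b).count("1")


def posting_list_length(signature: int, index: Dict[int, Set[str]], threshold: int = 3) -> int:
    """Count distinct pages containing the sentence or one of its neighbors.

    A neighbor is any signature whose Hamming distance to `signature`
    is smaller than `threshold`.
    """
    pages: Set[str] = set()
    for other, page_ids in index.items():
        if hamming_distance(other, signature) < threshold:
            pages |= page_ids
    return len(pages)
```

Setting `threshold=1` restricts the count to exact signature matches, which corresponds to the first of the two definitions given in the text.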
Compared with mode one, mode two requires no iterative operation, and therefore saves computing resources and the like.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
Fig. 3 is a schematic structural diagram of a web page originality assessment apparatus based on artificial intelligence according to an embodiment of the present invention, as shown in fig. 3, including: a preprocessing module 301 and an evaluation module 302.
The preprocessing module 301 is configured to, for each sentence extracted from the webpage to be processed, respectively obtain a weight of the sentence, and identify whether the sentence is an original sentence.
And the evaluation module 302 is configured to determine the authority of the original of the webpage to be processed according to the recognition result of the preprocessing module 301 and the obtained weight of the sentence.
Specifically, the preprocessing module 301 may extract sentences from the web page to be processed as follows:
acquiring a title and text content of a webpage to be processed;
segmenting the text content into sentences, and respectively calculating the weight of each segmented sentence;
sequencing each sentence cut out according to the sequence of the weight values from large to small;
and selecting the sentences which are at the top M positions after sorting, wherein M is a positive integer larger than one, and taking the selected sentences and the titles as the extracted sentences.
The preprocessing module 301 may perform word segmentation and stop-word removal on each segmented sentence, add up the IDF values of the words obtained after the processing, and use the sum as the weight of the sentence.
As shown in fig. 3, evaluation module 302 may include: a first evaluation unit 3021.
The first evaluation unit 3021 may further include: a first determining subunit 30211 and a second determining subunit 30212.
A first determining subunit 30211, configured to use each web page stored in the database as a web page to be processed; analyzing the reference relationship among the webpages according to the identification result, and determining a series of directed edges according to the analysis result, wherein each directed edge corresponds to two webpages respectively, and the direction is that one webpage points to the other webpage; respectively determining the weight of each directed edge according to the recognition result and the obtained weight of the sentence;
a second determining subunit 30212, configured to determine the authority of the original of each web page according to the weights of all the directed edges.
Specifically, the first determining subunit 30211 may perform the following processing for each web page respectively:
taking the web page as a referring web page, respectively determining the web page where the original sentence corresponding to each non-original sentence in the referring web page is located, deduplicating the determined web pages, and taking each remaining web page as a reference source web page corresponding to the referring web page;
and forming, from the referring web page and each reference source web page, a directed edge pointing from the referring web page to that reference source web page.
The first determining subunit 30211 may, for each directed edge, respectively screen a non-original sentence that meets the requirement from the non-original sentences in the reference webpage corresponding to the directed edge, where the meeting requirement is: the corresponding original sentence is positioned in the reference source webpage corresponding to the directed edge; calculating the sum of the weights of the screened non-original sentences to obtain a first addition result; calculating the sum of the weights of the non-original sentences in the reference webpage corresponding to the directed edge to obtain a second addition result; and dividing the first addition result by the second addition result, and taking the calculation result as the weight of the directed edge.
The second determining subunit 30212 may normalize the weight of each directed edge respectively, and use the result as a transition probability of the iterative algorithm; form a P × P transition probability matrix from all the transition probabilities, where P is a positive integer equal to the number of web pages stored in the database; and, according to the transition probability matrix, determine the originality authority of each web page simultaneously through the iterative algorithm.
Specifically, the second determining subunit 30212 may set a P-dimensional all-ones column vector e;
performing an iterative operation comprising: calculating a product of the original authority vector and the transition probability matrix, and adding the product and e, wherein e is used as the original authority vector during first iteration;
and determining whether iterative convergence is achieved, if not, taking the added sum as an original authority vector, and repeatedly executing the iterative operation, and if so, taking each element in the original authority vector as an original authority score of a webpage respectively.
As shown in fig. 3, the evaluation module 302 may further include: a second evaluation unit 3022.
A second evaluation unit 3022 configured to take any web page stored in the database as a web page to be processed; for each original sentence in the webpage to be processed, respectively calculating the product of the weight of the original sentence and the length of the inverted zipper corresponding to the original sentence; and adding the products corresponding to the original sentences, and taking the added sum as the original authority score of the webpage to be processed.
Wherein the inverted posting list length is: the number of web pages in the database that contain the original sentence, or the number of web pages in the database that contain the original sentence or a neighboring sentence of it, where a neighboring sentence is a sentence whose sentence signature has a Hamming distance from the sentence signature of the original sentence smaller than a preset threshold.
For a specific work flow of the apparatus embodiment shown in fig. 3, please refer to the corresponding description in the foregoing method embodiment, which is not repeated.
FIG. 4 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 4 is only one example and should not be taken to limit the scope of use or functionality of embodiments of the present invention.
As shown in FIG. 4, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing modules) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 4, network adapter 20 communicates with the other modules of computer system/server 12 via bus 18. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing modules, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 16 executes various functional applications and data processing by running the program stored in the memory 28, for example, implementing the method in the embodiment shown in fig. 1. That is, for each sentence extracted from the webpage to be processed, it obtains the weight of the sentence and identifies whether the sentence is an original sentence, and it then determines the original authority of the webpage to be processed according to the identification result and the obtained weight of the sentence.
Specifically, there may be at least two implementations, namely a first implementation and a second implementation; for details, please refer to the related description in the method embodiment shown in fig. 1, which is not repeated here.
The invention also discloses a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, will carry out the method as in the embodiment shown in fig. 1.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method can be implemented in other ways. For example, the above-described device embodiments are merely illustrative: the division of the units is only one logical functional division, and other divisions may be used in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (18)

1. A webpage originality assessment method based on artificial intelligence is characterized by comprising the following steps:
respectively acquiring the weight of each sentence extracted from a webpage to be processed, and identifying whether the sentence is an original sentence or not;
determining the original authority of the webpage to be processed according to the recognition result and the obtained weight of the sentence, wherein the method comprises the following steps: all the web pages stored in the database are used as web pages to be processed; analyzing the reference relationship among the webpages according to the identification result, and determining a series of directed edges according to the analysis result, wherein each directed edge corresponds to two webpages respectively, and the direction is that one webpage points to the other webpage; respectively determining the weight of each directed edge according to the recognition result and the obtained weight of the sentence; and simultaneously determining the original authority of each webpage according to the weights of all the directed edges.
2. The method of claim 1,
the sentence extraction of the webpage to be processed comprises the following steps:
acquiring the title and the text content of the webpage to be processed;
carrying out sentence segmentation on the text content, and respectively calculating the weight of each segmented sentence;
sorting the segmented sentences in descending order of weight;
and selecting the top M sentences after sorting, wherein M is a positive integer larger than one, and taking the selected sentences and the title as the extracted sentences.
3. The method of claim 2,
the respectively calculating the weight of each sentence includes:
and for each segmented sentence, performing word segmentation and stop-word removal on the sentence, adding the inverse document frequency (IDF) values of the words obtained after processing, and taking the sum as the weight of the sentence.
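The weighting and selection steps of claims 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the IDF table and stop-word set are supplied by the caller, whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated one), and the function names `sentence_weight` and `top_m_sentences` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def sentence_weight(tokens: Iterable[str],
                    idf: Dict[str, float],
                    stop_words: Set[str]) -> float:
    """Sum the IDF values of the tokens that remain after stop-word removal.
    Tokens missing from the IDF table contribute 0 (an assumption)."""
    return sum(idf.get(token, 0.0) for token in tokens
               if token not in stop_words)


def top_m_sentences(sentences: List[str],
                    idf: Dict[str, float],
                    stop_words: Set[str],
                    m: int) -> List[Tuple[str, float]]:
    """Weight every sentence, sort in descending order of weight,
    and keep the top M as (sentence, weight) pairs."""
    weighted = [(s, sentence_weight(s.split(), idf, stop_words))
                for s in sentences]
    weighted.sort(key=lambda pair: pair[1], reverse=True)
    return weighted[:m]
```

The extracted sentences would then be the returned top M sentences together with the page title, which is handled outside this sketch.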
4. The method of claim 1,
the step of analyzing the reference relationship among the webpages according to the identification result and determining a series of directed edges according to the analysis result comprises the following steps:
for each webpage, the following processing is respectively carried out:
taking the webpage as a reference webpage, respectively determining the webpage where an original sentence corresponding to each non-original sentence in the reference webpage is located, performing deduplication processing on the determined webpage, and respectively taking each processed webpage as a reference source webpage corresponding to the reference webpage;
and respectively utilizing the reference webpage and the reference source webpage to form a directed edge pointing to the reference source webpage from the reference webpage.
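The edge construction of claim 4 can be sketched as below. The sketch assumes that the identification step has already recorded, for every non-original sentence, the identifier of the page holding the corresponding original sentence; the `Sentence` type and the `build_edges` function are hypothetical names introduced for illustration.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Sentence(NamedTuple):
    weight: float
    # Page holding the original sentence; None means this sentence is original.
    source_page: Optional[str]


def build_edges(pages: Dict[str, List[Sentence]]) -> Set[Tuple[str, str]]:
    """Return directed edges (reference page, reference source page)."""
    edges: Set[Tuple[str, str]] = set()
    for page_id, sentences in pages.items():
        # Collecting sources into a set performs the deduplication step.
        sources = {s.source_page for s in sentences
                   if s.source_page is not None}
        # Defensive assumption: a page is never its own reference source.
        sources.discard(page_id)
        for source in sources:
            edges.add((page_id, source))
    return edges
```

Each edge points from the page that reuses content to the page where that content originated, so authority later flows toward the sources.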
5. The method of claim 4,
the determining the weight of each directed edge according to the recognition result and the obtained weight of the sentence respectively comprises:
for each directed edge, respectively screening out non-original sentences meeting requirements from the non-original sentences in the reference webpage corresponding to the directed edge, wherein the meeting requirements are as follows: the corresponding original sentence is positioned in the reference source webpage corresponding to the directed edge;
calculating the sum of the weights of the screened non-original sentences to obtain a first addition result;
calculating the sum of the weights of the non-original sentences in the reference webpage corresponding to the directed edge to obtain a second addition result;
and dividing the first addition result by the second addition result, and taking the calculation result as the weight of the directed edge.
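The edge weight of claim 5 is a ratio of two sums, which a short Python sketch can make concrete. The representation of a sentence as a `(weight, source_page)` pair, with `None` marking an original sentence, and the name `edge_weight` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (weight, page holding the original sentence or None if the sentence is original)
Sentence = Tuple[float, Optional[str]]


def edge_weight(reference_page_sentences: List[Sentence],
                source_page: str) -> float:
    """Weight of the edge from a reference page to one reference source page."""
    # First addition result: non-original sentences whose original
    # sentence is located in this particular source page.
    first = sum(w for w, src in reference_page_sentences
                if src == source_page)
    # Second addition result: all non-original sentences of the reference page.
    second = sum(w for w, src in reference_page_sentences
                 if src is not None)
    # Returning 0.0 for a page with no non-original sentences is an assumption;
    # such a page would not have outgoing edges in the first place.
    return first / second if second > 0 else 0.0
```

For example, a page whose non-original sentences have weights 2.0 and 1.0 (both from page B) and 1.0 (from page C) gives the edge to B a weight of 3.0 / 4.0 = 0.75 and the edge to C a weight of 0.25.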
6. The method of claim 4,
the step of simultaneously determining the original authority of each webpage according to the weights of all the directed edges comprises the following steps:
respectively normalizing the weight of each directed edge, and taking the normalized result as a transition probability of an iterative algorithm;
forming a P x P transition probability matrix from all the transition probabilities, wherein P is a positive integer equal to the number of webpages stored in the database;
and according to the transition probability matrix, determining the original authority of each webpage simultaneously through an iterative algorithm.
7. The method of claim 6,
the simultaneously determining the originality authority of each webpage comprises the following steps:
setting a P-dimensional all-1 longitudinal vector e;
performing an iterative operation comprising: calculating a product of an original authority vector and the transition probability matrix, and adding the product and the e, wherein the e is used as the original authority vector during first iteration;
and determining whether iterative convergence is achieved, if not, taking the added sum as the original authority vector, and repeatedly executing the iterative operation, and if so, taking each element in the original authority vector as the original authority score of one webpage respectively.
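The iteration of claims 6 and 7 can be sketched in Python under several assumptions that the claims do not state. The step that turns edge weights into transition probabilities is read here as dividing each weight by the total outgoing weight of its page. The `damping` factor is borrowed from PageRank-style algorithms and is not part of the claims; with `damping=1.0` the update is literally "vector times matrix, plus e", which settles when the reference graph has no cycles but need not converge otherwise. All names are hypothetical.

```python
from typing import Dict, List, Tuple


def transition_matrix(edges: Dict[Tuple[int, int], float],
                      p: int) -> List[List[float]]:
    """Build a P x P matrix; edges maps (citing page, cited page) to a weight."""
    out_sum = [0.0] * p
    for (citing, _cited), weight in edges.items():
        out_sum[citing] += weight
    matrix = [[0.0] * p for _ in range(p)]
    for (citing, cited), weight in edges.items():
        if out_sum[citing] > 0:
            matrix[citing][cited] = weight / out_sum[citing]
    return matrix


def authority_scores(matrix: List[List[float]],
                     damping: float = 0.85,
                     tol: float = 1e-9,
                     max_iter: int = 1000) -> List[float]:
    """Iterate r <- damping * (r x matrix) + e, starting from r = e."""
    p = len(matrix)
    e = [1.0] * p          # P-dimensional all-ones vector
    r = e[:]               # e is the authority vector in the first iteration
    for _ in range(max_iter):
        nxt = [damping * sum(r[i] * matrix[i][j] for i in range(p)) + e[j]
               for j in range(p)]
        # Convergence test: largest per-page change below the tolerance.
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r
```

For two pages where page 0 reuses content from page 1, `authority_scores(transition_matrix({(0, 1): 0.5}, 2), damping=1.0)` yields `[1.0, 2.0]`: the source page ends up with the higher score.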
8. The method of claim 1,
the determining the original authority of the webpage to be processed according to the recognition result and the obtained weight of the sentence further comprises:
taking any webpage stored in the database as a webpage to be processed;
respectively calculating the product of the weight of the original sentence and the length of the inverted zipper corresponding to the original sentence aiming at each original sentence in the webpage to be processed;
adding the products corresponding to the original sentences, and taking the added sum as the original authority score of the webpage to be processed; the corresponding product is the product of the weight of the original sentence and the length of the inverted zipper corresponding to the original sentence;
wherein the inverted zipper length comprises: the number of webpages in the database that contain the original sentence, or the number of webpages in the database that contain the original sentence or a sentence adjacent to the original sentence, wherein the adjacent sentence is a sentence whose sentence signature has a Hamming distance from the sentence signature of the original sentence that is less than a preset threshold.
9. A webpage originality assessment apparatus based on artificial intelligence, comprising: the device comprises a preprocessing module and an evaluation module;
the preprocessing module is used for respectively acquiring the weight of each sentence extracted from the webpage to be processed and identifying whether the sentence is an original sentence or not;
the evaluation module is used for determining the original authority of the webpage to be processed according to the recognition result of the preprocessing module and the obtained weight of the sentence;
wherein, the evaluation module comprises: a first evaluation unit;
the first evaluation unit further comprises: a first determining subunit and a second determining subunit;
the first determining subunit is used for taking each webpage stored in the database as a webpage to be processed; analyzing the reference relationship among the webpages according to the identification result, and determining a series of directed edges according to the analysis result, wherein each directed edge corresponds to two webpages respectively, and the direction is that one webpage points to the other webpage; respectively determining the weight of each directed edge according to the recognition result and the obtained weight of the sentence;
and the second determining subunit is used for simultaneously determining the original authority of each webpage according to the weights of all the directed edges.
10. The apparatus of claim 9,
the preprocessing module extracts sentences of the webpage to be processed according to the following modes:
acquiring the title and the text content of the webpage to be processed;
carrying out sentence segmentation on the text content, and respectively calculating the weight of each segmented sentence;
sorting the segmented sentences in descending order of weight;
and selecting the top M sentences after sorting, wherein M is a positive integer larger than one, and taking the selected sentences and the title as the extracted sentences.
11. The apparatus of claim 10,
the preprocessing module is used for performing word segmentation and stop-word removal on each segmented sentence, adding the inverse document frequency (IDF) values of the words obtained after processing, and taking the sum as the weight of the sentence.
12. The apparatus of claim 9,
the first determining subunit performs the following processing for each web page respectively:
taking the webpage as a reference webpage, respectively determining the webpage where an original sentence corresponding to each non-original sentence in the reference webpage is located, performing deduplication processing on the determined webpage, and respectively taking each processed webpage as a reference source webpage corresponding to the reference webpage;
and respectively utilizing the reference webpage and the reference source webpage to form a directed edge pointing to the reference source webpage from the reference webpage.
13. The apparatus of claim 12,
the first determining subunit is used for screening out non-original sentences meeting requirements from the non-original sentences in the reference webpage corresponding to the directed edges respectively aiming at each directed edge, and the meeting requirements are as follows: the corresponding original sentence is positioned in the reference source webpage corresponding to the directed edge; calculating the sum of the weights of the screened non-original sentences to obtain a first addition result; calculating the sum of the weights of the non-original sentences in the reference webpage corresponding to the directed edge to obtain a second addition result; and dividing the first addition result by the second addition result, and taking the calculation result as the weight of the directed edge.
14. The apparatus of claim 12,
the second determining subunit normalizes the weight of each directed edge and takes the normalized result as a transition probability of an iterative algorithm; forms a P x P transition probability matrix from all the transition probabilities, wherein P is a positive integer equal to the number of webpages stored in the database; and, according to the transition probability matrix, determines the original authority of each webpage simultaneously through the iterative algorithm.
15. The apparatus of claim 14,
the second determining subunit sets a P-dimensional all-1 longitudinal vector e;
performing an iterative operation comprising: calculating a product of an original authority vector and the transition probability matrix, and adding the product and the e, wherein the e is used as the original authority vector during first iteration;
and determining whether iterative convergence is achieved, if not, taking the added sum as the original authority vector, and repeatedly executing the iterative operation, and if so, taking each element in the original authority vector as the original authority score of one webpage respectively.
16. The apparatus of claim 9,
the evaluation module comprises: a second evaluation unit;
the second evaluation unit is used for taking any webpage stored in the database as a webpage to be processed; respectively calculating the product of the weight of the original sentence and the length of the inverted zipper corresponding to the original sentence aiming at each original sentence in the webpage to be processed; adding the products corresponding to the original sentences, and taking the added sum as the original authority score of the webpage to be processed; the corresponding product is the product of the weight of the original sentence and the length of the inverted zipper corresponding to the original sentence;
wherein the inverted zipper length comprises: the number of webpages in the database that contain the original sentence, or the number of webpages in the database that contain the original sentence or a sentence adjacent to the original sentence, wherein the adjacent sentence is a sentence whose sentence signature has a Hamming distance from the sentence signature of the original sentence that is less than a preset threshold.
17. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method of any one of claims 1 to 8.
18. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN201710202081.7A 2017-03-30 2017-03-30 Webpage originality assessment method and device based on artificial intelligence and storage medium Active CN107168997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710202081.7A CN107168997B (en) 2017-03-30 2017-03-30 Webpage originality assessment method and device based on artificial intelligence and storage medium

Publications (2)

Publication Number Publication Date
CN107168997A CN107168997A (en) 2017-09-15
CN107168997B true CN107168997B (en) 2021-07-20

Family

ID=59848997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710202081.7A Active CN107168997B (en) 2017-03-30 2017-03-30 Webpage originality assessment method and device based on artificial intelligence and storage medium

Country Status (1)

Country Link
CN (1) CN107168997B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595439B (en) * 2018-05-04 2022-04-12 北京中科闻歌科技股份有限公司 Method and system for analyzing character propagation path

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093485A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for filtering out repeated contents on web page
CN101499098A (en) * 2009-03-04 2009-08-05 阿里巴巴集团控股有限公司 Web page assessed value confirming and employing method and system
CN101539923A (en) * 2008-03-18 2009-09-23 北京搜狗科技发展有限公司 Method and device for extracting text segment from file
CN102799647A (en) * 2012-06-30 2012-11-28 华为技术有限公司 Method and device for webpage reduplication deletion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
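网络原创文章优先的搜索引擎排序算法研究 [Research on a search engine ranking algorithm that prioritizes original web articles];郝金隆 [Hao Jinlong];《中国优秀硕士学位论文全文数据库信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology];20080515;第I138-819页 *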

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant