CN106649871A - Detection method, apparatus and computing equipment for repetition degree of articles - Google Patents

Detection method, apparatus and computing equipment for repetition degree of articles

Info

Publication number
CN106649871A
Authority
CN
China
Prior art keywords
section
similarity
article
search
multiplicity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710002050.7A
Other languages
Chinese (zh)
Other versions
CN106649871B (en)
Inventor
潘庆翔
黄海澄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou I9Game Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou I9Game Information Technology Co Ltd
Priority to CN201710002050.7A
Publication of CN106649871A
Application granted
Publication of CN106649871B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a detection method, apparatus and computing equipment for the repetition degree of articles. The method comprises: splitting the article to be detected to obtain a plurality of slices; performing search operation on at least part of the slices of the plurality of slices to obtain a search result corresponding to each slice in the part of the slices, wherein the search result comprises one or more result pages; computing the similarity between each slice and the corresponding result page; and determining the repetition degree of the article to be detected based on the computed similarity. Thus, the method is implemented by splitting the article to be detected, computing the similarity of the plurality of slices, and determining the repetition degree of the article to be detected according to the computed similarity of the plurality of slices.

Description

Method, apparatus and computing device for detecting the repetition degree of an article
Technical field
The present invention relates to the Internet field, and in particular to a method, an apparatus and a computing device for detecting the repetition degree of an article.
Background
Many of the information articles published on websites today are promotional soft articles submitted by users or companies, and the same article is often submitted to several sites. When publishing an article, a website needs to give priority to whether the content is exclusive and original (that is, whether the content to be published is already carried by other websites), because search engines prefer to index exclusive original content and to show it first in search results in order to improve the user's search experience.
From the search engine's point of view, if a website lacks exclusive original content, or most of what it publishes is duplicated, the search engine demotes the website and lowers its score, which reduces the exposure of the website's content in that search engine. A website therefore needs to determine, when publishing an article, whether the article is already carried by a number of other websites.
A scheme is therefore needed that can detect the degree to which an article is already carried by other websites, that is, its repetition degree.
Summary of the invention
The main object of the present invention is to provide a method, an apparatus and a computing device for detecting the repetition degree of an article, which can accurately detect the degree to which an article to be detected is already carried by other websites.
According to one aspect of the invention, a method for detecting the repetition degree of an article is provided, comprising: cutting the article to be detected to obtain a plurality of slices; performing a search operation on at least some of the plurality of slices to obtain a search result corresponding to each of those slices; calculating the similarity between each slice and its corresponding search result; and determining the repetition degree of the article to be detected from the calculated similarities.
Accordingly, an article to be detected can be cut into slices, each slice can be searched, and the similarity of each slice can be calculated; the repetition degree of the article is then obtained from the similarities of the slices.
Preferably, the step of calculating the similarity between each slice and its corresponding search result may include: segmenting the slice into words to obtain a first word segmentation result; segmenting the matching content in the result page into words to obtain a second word segmentation result; calculating the term frequencies of the first and second word segmentation results to obtain a first term-frequency vector and a second term-frequency vector; and calculating the cosine similarity of the first and second term-frequency vectors as the similarity between the slice and its corresponding search result.
Thus, cosine similarity can be used to calculate the similarity between a slice and a result page.
Preferably, the step of determining the repetition degree of the article from the calculated similarities may include: calculating the ratio of the number of similarities greater than a first predetermined threshold to the total number of similarities, the ratio being the repetition degree of the article to be detected.
Preferably, the step of performing a search operation on at least some of the plurality of slices includes: using a search engine to search a database for each of the at least some slices.
Thus, search results can be retrieved from the corresponding databases using one or more search engines.
Preferably, when the number of slices is greater than a second predetermined threshold, the search operation is performed on only some of the slices, and when the number of slices is less than the second predetermined threshold, the search operation is performed on every slice.
Thus, when cutting the article to be detected yields a large number of slices, a subset of the slices can be chosen for searching.
Preferably, the method may also include: extracting keywords from the article to be detected; and assigning a weight to each of the at least some slices according to how the keywords appear in that slice.
Thus, the keywords of the article to be detected can be obtained in advance, and each slice can then be given a corresponding weight according to how the keywords appear in it. This can improve the accuracy of repetition-degree detection.
According to another aspect of the invention, an apparatus for detecting the repetition degree of an article is also provided, comprising: a cutting unit for cutting the article to be detected to obtain a plurality of slices; a search unit for performing a search operation on at least some of the plurality of slices to obtain a search result corresponding to each of those slices; a similarity calculation unit for calculating the similarity between each slice and its corresponding search result; and a repetition degree determining unit for determining the repetition degree of the article to be detected from the calculated similarities.
Preferably, the similarity calculation unit may include: a first word segmentation module for segmenting the slice into words to obtain a first word segmentation result; a second word segmentation module for segmenting the matching content in the search result into words to obtain a second word segmentation result; and a term-frequency calculation module for calculating the term frequencies of the first and second word segmentation results to obtain a first term-frequency vector and a second term-frequency vector. The similarity calculation unit calculates the cosine similarity of the first and second term-frequency vectors as the similarity between the slice and its corresponding result page.
Preferably, the repetition degree determining unit may calculate the ratio of the number of similarities greater than the first predetermined threshold to the total number of similarities, the ratio being the repetition degree of the article to be detected.
Preferably, the search unit may use a search engine to search a database for each of the at least some slices.
Preferably, when the number of slices is greater than the second predetermined threshold, the search unit performs the search operation on only some of the slices, and when the number of slices is less than the second predetermined threshold, the search unit performs the search operation on every slice.
Preferably, the detection apparatus may also include: a keyword extraction unit for extracting keywords from the article to be detected; and a weight assignment unit for assigning a weight to each of the at least some slices according to how the keywords appear in that slice.
According to another aspect of the invention, a computing device is also provided, comprising: a network interface that enables the computing device to communicate via one or more networks; a memory in which network resources loaded through the network interface are cached; and a processor connected to the network interface and the memory, the processor being configured to perform the following operations: cutting the article to be detected to obtain a plurality of slices; performing a search operation on at least some of the plurality of slices to obtain a search result corresponding to each of those slices, wherein the search result includes one or more result pages; calculating the similarity between each slice and each of its corresponding result pages; and determining the repetition degree of the article to be detected from the calculated similarities.
In summary, the method, apparatus and computing device of the invention cut the article to be detected, calculate the similarities of the resulting slices, and determine the repetition degree of the article from those similarities. The calculated repetition degree indicates how widely the article has already spread on the Internet, and therefore whether the article is still worth publishing.
Description of the drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments with reference to the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 shows a structural block diagram of a computing device according to an embodiment of the invention.
Fig. 2 shows a schematic flowchart of a method for detecting the repetition degree of an article according to an embodiment of the invention.
Fig. 3 shows an example of the results obtained when a slice is used as the search query.
Fig. 4 shows a structural block diagram of an apparatus for detecting the repetition degree of an article according to an embodiment of the invention.
Fig. 5 shows a structural block diagram of the sub-modules that the similarity calculation unit may include.
Detailed description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
As mentioned above, if a website lacks original exclusive content, or most of what it publishes is duplicated from the search engine's point of view, the search engine demotes the website and lowers its score, which reduces the exposure of the website's content in that search engine. When publishing content, a website therefore needs to determine how the content has already been indexed by search engines (for example, whether it has been indexed and how many times), in order to decide whether the content to be published is exclusive and original with respect to the search engine.
To this end, the invention proposes a scheme that can detect the repetition degree of an article. The "repetition degree" referred to here characterizes how widely an article has spread on the Internet. For example, if article A is carried by only a few websites, its repetition degree can be considered low and the article is worth publishing; if article A is carried by many websites, its repetition degree can be considered high and the article is no longer worth publishing.
Embodiments of the invention are described in detail below with reference to Figs. 1 to 5.
Fig. 1 shows a structural block diagram of a computing device 100 according to an embodiment of the invention. The components of computing device 100 include, but are not limited to, a network interface 110, a memory 120 and one or more processors 130. Processor 130 is connected to network interface 110 and memory 120. In one embodiment of the invention, the above components of computing device 100 and other components not shown in Fig. 1 may also be connected to one another, for example via a bus. It should be understood that the structural block diagram shown in Fig. 1 is for illustration only and does not limit the scope of the invention. Those skilled in the art can add or replace components as needed.
Computing device 100 may preferably be any type of mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smart watch, smart glasses, etc.) or another type of mobile device.
Network interface 110 enables computing device 100 to communicate via one or more networks. Examples of these networks include a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. Network interface 110 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a near-field communication (NFC) interface, and so on.
Network data accessed through network interface 110 is cached in memory 120. Memory 120 may include one or more of any type of storage device that stores content in file form or other forms, including magnetic hard disk drives, solid-state drives, semiconductor memory devices, flash memory, or any other computer-readable writable storage medium capable of storing program instructions or digital information.
Processor 130 can read the network data cached in memory 120 and is configured to determine the repetition degree of the article to be detected.
The detailed process by which processor 130 determines the repetition degree of an article is shown in Fig. 2, which is a flowchart of a method for detecting the repetition degree of an article according to an embodiment of the invention.
Referring to Fig. 2, the process of determining the repetition degree may begin at step S210, in which the article to be detected is cut to obtain a plurality of slices. The article can be cut in various ways: for example, it can be cut according to punctuation marks, or according to semantics. Many implementations are possible and they are not described further here.
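The punctuation-based cutting mentioned for step S210 could look roughly like the following. This is a minimal sketch rather than the patented implementation: the function name `split_into_slices`, the `min_len` parameter and the particular set of punctuation marks are all assumptions made for illustration.

```python
import re

# Sentence-level punctuation (Chinese and Western) used as cut points.
# The exact set is an assumption; note that "." would also cut decimals.
_CUT_POINTS = re.compile(r"[。！？；!?;.]+")


def split_into_slices(article: str, min_len: int = 1) -> list:
    """Cut an article into slices at punctuation marks.

    Fragments shorter than `min_len` characters (including empty ones
    produced by trailing punctuation) are dropped.
    """
    parts = (p.strip() for p in _CUT_POINTS.split(article))
    return [p for p in parts if len(p) >= min_len]
```

Cutting by semantics, the other option named in the text, would need a language-specific model and is not sketched here.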
In step S220, a search operation is performed on at least some of the plurality of slices to obtain a search result corresponding to each of those slices.
When the number of slices obtained by cutting the article in step S210 is greater than a predetermined threshold (called the second predetermined threshold here for ease of distinction; its value can be set according to actual conditions), some of the slices can be selected to take part in the search (for example, half of the slices can be selected at random). When the number of slices obtained in step S210 is less than the second predetermined threshold, all slices can be selected to take part in the search.
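The selection rule for step S220 can be sketched as below. The function name and the `rng` parameter are hypothetical. The text does not say what happens when the number of slices equals the second predetermined threshold; this sketch assumes all slices are searched in that case.

```python
import random


def select_slices_to_search(slices, second_threshold, rng=None):
    """Choose which slices take part in the search.

    Above the threshold, a random half of the slices is used (the example
    given in the text); otherwise every slice is used. Treating the
    "equal to threshold" case as "use every slice" is an assumption.
    """
    if rng is None:
        rng = random.Random()
    if len(slices) > second_threshold:
        # Keep at least one slice so very short inputs are still searched.
        return rng.sample(list(slices), max(1, len(slices) // 2))
    return list(slices)
```

Passing a seeded `random.Random` makes the selection reproducible, which helps when the same article is checked twice.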
A search engine (such as Google, Baidu, Sogou or Haosou) may be used here to perform the search operation on a slice. The search engine can search for the whole slice, or the slice can first be segmented into words and the search performed on the segmentation result. For example, suppose that one slice obtained by cutting the article to be detected is "the essence of paid content is to create closedness and scarcity". The whole slice can be submitted to the search engine as the query, or the slice can be segmented into the keywords "paid content", "essence", "create closedness" and "scarcity", and the search performed with those keywords.
A search on a slice may return many results, or it may return none or only a few. A predetermined number of search results (the number can be set as needed) can therefore be taken to participate in the similarity calculation in step S230. When the number of search results obtained in step S220 is less than the predetermined number, all of the returned search results can participate in the similarity calculation in step S230.
In step S230, the similarity between each slice and its corresponding search result is calculated.
The similarity between a slice and its corresponding search result can be calculated using cosine similarity. Specifically, the slice and the matching content in the search result can each be segmented into words to obtain a first word segmentation result and a second word segmentation result.
The matching content in a search result is the content that corresponds to the slice. For example, when a search engine is used to search for a slice, each search result shown on the results page generally consists of a title, the content below the title, and a URL. The content below the title is typically the content similar to the search term (here, the slice), so it can be used directly as the matching content. There is then no need to open each result page and compare its content with the slice in order to obtain the matching content.
As an example, suppose that a slice obtained by cutting the article to be detected is "Although we have been in contact online many times, I have met Brother Rong Xue, nicknamed Daqu, only once". The results page obtained when a search engine is queried with this slice is shown in Fig. 3. For the first or second search result, the underlined content below the title can serve as the matching content of the slice.
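Taking the text below each result title as matching content, capped at a predetermined number of results, might be modelled as follows. The `SearchResult` type and its field names are hypothetical and do not correspond to any real search engine API; how the results are fetched is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """One entry on a search results page (field names are hypothetical)."""
    title: str
    snippet: str  # the text shown below the title
    url: str


def matching_contents(results, predetermined_count):
    """Return the snippets of at most `predetermined_count` top results.

    If fewer results are available, all of them are used, as the text
    describes for the case where a search returns only a few hits.
    """
    return [r.snippet for r in results[:predetermined_count]]
```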
Once the first and second word segmentation results have been obtained, their term frequencies can be calculated to obtain a first term-frequency vector and a second term-frequency vector.
The cosine similarity of the first and second term-frequency vectors is then calculated and used as the similarity between the slice and its corresponding result page.
The detailed process of calculating similarity with cosine similarity is known to those skilled in the art and is not repeated here.
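For readers who want the calculation spelled out, the cosine similarity of two term-frequency vectors is their dot product divided by the product of their Euclidean norms. A minimal sketch follows; it assumes word segmentation has already been done, so each argument is a list of tokens, and the function name is chosen here for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty token list is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Because term frequencies are non-negative, the result always lies between 0 (no shared terms) and 1 (identical term distributions).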
In step S240, the repetition degree of the article to be detected is determined from the calculated similarities.
Here, the ratio of the number of similarities greater than the first predetermined threshold to the total number of similarities can be calculated, and this ratio can serve as the repetition degree of the article to be detected.
Specifically, for each similarity between a slice and its corresponding search result calculated in step S230, the slice is considered similar to the search result when the similarity is greater than the first predetermined threshold. The number of such similar pairs can then be counted, and its ratio to the total number of calculated similarities is the repetition degree of the article to be detected.
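The ratio described for step S240 reduces to a few lines. In this sketch the function name is hypothetical, and returning 0.0 when no similarities were calculated is an assumption, since the text does not cover that case.

```python
def repetition_degree(similarities, first_threshold):
    """Fraction of calculated similarities that exceed the first threshold."""
    if not similarities:
        return 0.0  # assumption: nothing was compared, so nothing repeats
    similar = sum(1 for s in similarities if s > first_threshold)
    return similar / len(similarities)
```

For instance, with four calculated similarities of 0.9, 0.2, 0.85 and 0.1 and a first threshold of 0.8, two of the four exceed the threshold, giving a repetition degree of 0.5.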
Grading criteria can be formulated for the calculated repetition degree. For example, according to its value, an article can be classed as recommended (repetition degree ≤ 20%), modification suggested (40% ≤ repetition degree < 60%), or discarded (repetition degree ≥ 60%).
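The example grading bands can be written as a small lookup. The sketch below uses only the bands given in the text; values strictly between 20% and 40% are not assigned a grade there, so the function reports them as "unspecified" rather than guessing. The function name and the returned labels are assumptions.

```python
def grade(degree):
    """Map a repetition degree in [0, 1] to the example grades from the text."""
    if degree <= 0.20:
        return "recommend"
    if degree >= 0.60:
        return "discard"
    if degree >= 0.40:
        return "suggest modification"
    # The range from 20% to 40% is not covered by the stated example bands.
    return "unspecified"
```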
As an alternative embodiment of the invention, before step S210 is performed, the keywords (one or more) of the article to be detected may also be extracted; each of the at least some slices is then given a weight according to how the keywords appear in it (the number of keywords that appear, their frequency, and so on).
When the similarity between a slice and its corresponding search results is calculated, the number of search results that take part in the calculation can be chosen according to the weight of the slice. For example, for a slice with a high weight, a larger number of top-ranked search results from its search (step S220) can be chosen to take part in the similarity calculation; for a slice with a lower weight, a smaller number of top-ranked search results can be chosen.
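One way to turn keyword occurrences into a weight, and a weight into a result count, is sketched below. The text fixes neither formula, so both are assumptions: the weight is taken to be the total number of keyword occurrences in the slice, and the result count grows linearly with the weight up to a cap. All names and default values are hypothetical.

```python
def slice_weight(slice_text, keywords):
    """Weight of a slice: total occurrences of the article's keywords in it."""
    # Empty keywords are skipped because str.count("") is len(text) + 1.
    return sum(slice_text.count(k) for k in keywords if k)


def results_to_compare(weight, base_count=3, extra_per_weight=2, max_count=10):
    """Number of top-ranked search results to compare for a slice.

    Heavier slices are compared against more results, up to `max_count`.
    """
    return min(max_count, base_count + extra_per_weight * weight)
```

Keyword extraction itself (for example by term frequency over the whole article) is outside the scope of this sketch.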
Fig. 4 shows a functional block diagram of an apparatus 400 for detecting the repetition degree of an article according to an embodiment of the invention. The functional modules of detection apparatus 400 may be implemented by hardware, software, or a combination of hardware and software that carries out the principles of the invention, for example by the one or more processors 130 in computing device 100 shown in Fig. 1. Those skilled in the art will understand that the functional modules described in Fig. 4 may be combined or divided into sub-modules to implement the principles of the invention described above. The description here therefore supports any possible combination, division or further definition of the functional modules described here.
Referring to Fig. 4, detection apparatus 400 includes a cutting unit 410, a search unit 420, a similarity calculation unit 430 and a repetition degree determining unit 440.
Cutting unit 410 cuts the article to be detected to obtain a plurality of slices. Search unit 420 performs a search operation on at least some of the plurality of slices to obtain a search result corresponding to each of those slices, wherein the search result includes one or more result pages.
Search unit 420 may use a search engine to search a database for each of the at least some slices. When the number of slices is greater than the second predetermined threshold, search unit 420 may perform the search operation on only some of the slices; when the number of slices is less than the second predetermined threshold, search unit 420 may perform the search operation on every slice.
Similarity calculation unit 430 calculates the similarity between each slice and each of its corresponding result pages. Repetition degree determining unit 440 determines the repetition degree of the article to be detected from the calculated similarities. Here, repetition degree determining unit 440 may calculate the ratio of the number of similarities greater than the first predetermined threshold to the total number of similarities, and this ratio can serve as the repetition degree of the article to be detected.
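The way the four units of Fig. 4 hand data to one another can be sketched by passing each unit in as a callable. This is an illustration under stated assumptions, not the apparatus itself: the function and parameter names are hypothetical, the search unit is expected to return the matching content of each result, and no real search engine is called.

```python
def detect_repetition_degree(article, cut, search, similarity, first_threshold):
    """Chain the cutting, search, similarity and determining steps.

    cut(article)           -> list of slices             (cutting unit)
    search(slice)          -> list of matching contents  (search unit)
    similarity(slice, mc)  -> float in [0, 1]            (similarity unit)
    The final ratio plays the role of the repetition degree determining unit.
    """
    similarities = []
    for piece in cut(article):
        for content in search(piece):
            similarities.append(similarity(piece, content))
    if not similarities:
        return 0.0  # assumption: no search results means no detected repetition
    similar = sum(1 for s in similarities if s > first_threshold)
    return similar / len(similarities)
```

Injecting the units as functions keeps the flow testable with a stub search, and mirrors the statement that the modules may be combined or divided freely.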
Fig. 5 shows a schematic block diagram of the functional modules that similarity calculation unit 430 may have.
Referring to Fig. 5, similarity calculation unit 430 may include a first word segmentation module 4310, a second word segmentation module 4320, a term-frequency calculation module 4330 and a similarity calculation module 4340.
First word segmentation module 4310 segments the slice into words to obtain a first word segmentation result. Second word segmentation module 4320 segments the matching content in the result page into words to obtain a second word segmentation result. Term-frequency calculation module 4330 calculates the term frequencies of the first and second word segmentation results to obtain a first term-frequency vector and a second term-frequency vector. Similarity calculation module 4340 calculates the cosine similarity of the first and second term-frequency vectors as the similarity between the slice and its corresponding result page.
Returning to Fig. 4, detection apparatus 400 may also include a keyword extraction unit 450 and a weight assignment unit 460.
Keyword extraction unit 450 extracts the keywords (one or more) of the article to be detected. Weight assignment unit 460 assigns a weight to each of the at least some slices according to how the keywords appear in that slice (the number of keywords that appear, their frequency, and so on).
When calculating the similarity between a slice and its corresponding search results, similarity calculation unit 430 can choose the number of search results that take part in the calculation according to the weight of the slice. For example, for a slice with a high weight, a larger number of top-ranked search results returned when search unit 420 searches for it can be chosen to take part in the similarity calculation; for a slice with a lower weight, a smaller number of top-ranked search results can be chosen.
Above describe detection method, device and the meter of article multiplicity of the invention in detail by reference to accompanying drawing Calculation equipment.
To sum up, following beneficial effect can be realized based on the present invention:The quick detection of the original degree of article, and need not enter and search Index holds up result of page searching carries out artificial contrast;Search engine be more willing to include and preferentially show in Search Results it is original solely The data content of family, by the program exclusive content original relative to search engine can be filtered out, and improve search engine pair The scoring of website, increases the exposure displaying amount of web site contents;The utilization rate of premium content is improved, the overall matter of web site contents is improved Amount, improves the SEO of website;The standards of grading of the original degree of article, there is provided whether also have the value issued to operation reference;To user There is provided the rare content of high-quality, lift user and the cognition degree of website is worth.
Additionally, the method according to the invention is also implemented as a kind of computer program, the computer program include for Perform the computer program code instruction of the above steps limited in the said method of the present invention.Or, it is of the invention Method is also implemented as a kind of computer program, and the computer program includes computer-readable medium, in the meter The computer program of the above-mentioned functions being stored with calculation machine computer-readable recording medium for limiting in the said method for performing the present invention.Ability Field technique personnel will also understand is that, the various illustrative logical blocks, module, circuit and algorithm with reference to described by disclosure herein Step may be implemented as the combination of electronic hardware, computer software or both.
Flow chart and block diagram in accompanying drawing shows the possibility reality of the system and method for multiple embodiments of the invention Existing architectural framework, function and operation.At this point, each square frame in flow chart or block diagram can represent module, a journey A part for sequence section or code a, part for the module, program segment or code is used to realize regulation comprising one or more The executable instruction of logic function.It should also be noted that in some are as the realization replaced, the function of being marked in square frame also may be used With with different from the order marked in accompanying drawing generation.For example, two continuous square frames can essentially be performed substantially in parallel, They can also be performed in the opposite order sometimes, and this is depending on involved function.It is also noted that block diagram and/or stream The combination of each square frame and block diagram and/or the square frame in flow chart in journey figure, can be with the function or operation for performing regulation Special hardware based system realizing, or can be realized with the combination of computer instruction with specialized hardware.
Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A computing device, comprising:
a network interface that enables the computing device to communicate via one or more networks;
a memory in which network resources loaded through the network interface are cached; and
a processor connected to the network interface and the memory, the processor being configured to perform the following operations:
segmenting an article to be detected to obtain a plurality of slices;
performing a search operation on at least some of the plurality of slices to obtain a search result corresponding to each of those slices;
calculating the similarity between each slice and its corresponding search result;
determining the repetition degree of the article to be detected according to the calculated similarities.
2. An apparatus for detecting the repetition degree of an article, comprising:
a segmentation unit, configured to segment an article to be detected to obtain a plurality of slices;
a search unit, configured to perform a search operation on at least some of the plurality of slices to obtain a search result corresponding to each of those slices;
a similarity calculation unit, configured to calculate the similarity between each slice and its corresponding search result;
a repetition degree determination unit, configured to determine the repetition degree of the article to be detected according to the calculated similarities.
3. The detection apparatus according to claim 2, wherein the similarity calculation unit comprises:
a first word segmentation module, configured to perform word segmentation on the slice to obtain a first word segmentation result;
a second word segmentation module, configured to perform word segmentation on the matching content in the search result to obtain a second word segmentation result;
a term frequency calculation module, configured to calculate the term frequencies of the first word segmentation result and of the second word segmentation result, respectively, to obtain a first term frequency vector and a second term frequency vector;
a similarity calculation module, configured to calculate, by cosine similarity, the similarity between the first term frequency vector and the second term frequency vector, as the similarity between the slice and its corresponding search result.
4. The detection apparatus according to claim 2, wherein the repetition degree determination unit calculates the ratio of the number of similarities greater than a first predetermined threshold to the total number of similarities, the ratio being the repetition degree of the article to be detected.
5. The detection apparatus according to claim 2, wherein the search unit uses a search engine to search a database for each of the at least some slices of the plurality of slices.
6. The detection apparatus according to claim 2, wherein,
when the number of the plurality of slices is greater than a second predetermined threshold, the search unit performs the search operation on some of the plurality of slices, and
when the number of the plurality of slices is less than the second predetermined threshold, the search unit performs the search operation on each of the plurality of slices.
7. The detection apparatus according to claim 2, further comprising:
a keyword extraction unit, configured to extract keywords from the article to be detected;
a weight assignment unit, configured to assign a weight to each of the at least some slices according to the occurrence of the keywords in the slices.
8. A method for detecting the repetition degree of an article, comprising:
segmenting an article to be detected to obtain a plurality of slices;
performing a search operation on at least some of the plurality of slices to obtain a search result corresponding to each of those slices;
calculating the similarity between each slice and its corresponding search result;
determining the repetition degree of the article to be detected according to the calculated similarities.
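For illustration only, the first three steps of claim 8 might be sketched in Python as follows. This is a minimal sketch, not the claimed implementation: the fixed-length cutting rule, the function names `split_into_slices` and `slice_similarities`, the default slice length, and the convention that a slice with no search hit scores 0.0 are all assumptions. The search backend and the similarity measure are passed in as callables because the claim does not fix either of them.

```python
from typing import Callable, List, Optional


def split_into_slices(article: str, slice_len: int = 50) -> List[str]:
    """Cut the article into consecutive fixed-length slices (assumed rule)."""
    text = article.strip()
    return [text[i:i + slice_len] for i in range(0, len(text), slice_len)]


def slice_similarities(
    article: str,
    search: Callable[[str], Optional[str]],
    similarity: Callable[[str, str], float],
    slice_len: int = 50,
) -> List[float]:
    """Search each slice and score it against its search result."""
    scores = []
    for piece in split_into_slices(article, slice_len):
        # `search` is a hypothetical backend; None means nothing was found.
        result = search(piece)
        # Assumption: a slice without any search result counts as similarity 0.0.
        scores.append(similarity(piece, result) if result is not None else 0.0)
    return scores
```

The returned list holds one similarity per slice; how those values are turned into a single repetition degree is left to the caller.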
9. The detection method according to claim 8, wherein the step of calculating the similarity between each slice and its corresponding search result comprises:
performing word segmentation on the slice to obtain a first word segmentation result;
performing word segmentation on the matching content in the search result to obtain a second word segmentation result;
calculating the term frequencies of the first word segmentation result and of the second word segmentation result, respectively, to obtain a first term frequency vector and a second term frequency vector;
calculating, by cosine similarity, the similarity between the first term frequency vector and the second term frequency vector, as the similarity between the slice and its corresponding search result.
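As a hedged illustration of the steps in claim 9, the sketch below builds two term-frequency vectors and compares them by cosine similarity. The `tokenize` helper is an assumed stand-in for a word segmenter: it simply extracts lowercase word tokens with a regular expression, which is adequate for space-delimited text but would treat an unbroken run of Chinese characters as one token, so a real deployment would substitute a proper segmenter. The function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Assumed stand-in for word segmentation: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(slice_text: str, matched_text: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts."""
    tf_a = Counter(tokenize(slice_text))    # first term frequency vector
    tf_b = Counter(tokenize(matched_text))  # second term frequency vector

    # Only terms present in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Because term frequencies are non-negative, the result always lies between 0 (no shared terms) and 1 (identical term distributions).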
10. The detection method according to claim 8, wherein the step of determining the repetition degree of the article according to the calculated similarities comprises:
calculating the ratio of the number of similarities greater than a first predetermined threshold to the total number of similarities, the ratio being the repetition degree of the article to be detected.
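The ratio in claim 10 can be written out directly. In this minimal sketch the function name `repetition_degree` and the default threshold of 0.8 are assumptions (the claim only speaks of a "first predetermined threshold"), and returning 0.0 for an empty list is a convention chosen here to avoid division by zero.

```python
from typing import Sequence


def repetition_degree(similarities: Sequence[float], threshold: float = 0.8) -> float:
    """Share of similarities strictly greater than the threshold."""
    if not similarities:
        return 0.0  # assumed convention when no slice was scored
    above = sum(1 for s in similarities if s > threshold)
    return above / len(similarities)
```

For example, with similarities 0.9, 0.5, 0.85 and 0.2 and a threshold of 0.8, two of the four values exceed the threshold, so the repetition degree is 0.5.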
11. The detection method according to claim 8, wherein the step of performing a search operation on at least some of the plurality of slices comprises:
using a search engine to search a database for each of the at least some slices of the plurality of slices.
12. The detection method according to claim 8, wherein,
when the number of the plurality of slices is greater than a second predetermined threshold, the search operation is performed on some of the plurality of slices, and
when the number of the plurality of slices is less than the second predetermined threshold, the search operation is performed on each of the plurality of slices.
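One possible reading of claim 12 is sketched below. Several details are assumptions rather than part of the claim: the subset is drawn by uniform random sampling, its size equals the threshold itself, and the case where the number of slices exactly equals the threshold (which the claim leaves open) is treated as "search every slice". The name `select_slices` and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Optional


def select_slices(slices: List[str], max_slices: int = 20,
                  seed: Optional[int] = None) -> List[str]:
    """Return all slices, or a sampled subset when there are too many."""
    if len(slices) <= max_slices:
        return list(slices)  # few enough slices: search each one

    # Too many slices: sample a subset (assumed strategy) and keep
    # the chosen slices in their original article order.
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(slices)), max_slices))
    return [slices[i] for i in chosen]
```

Capping the number of searched slices bounds the number of search requests for long articles, at the cost of estimating the repetition degree from a sample.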
13. The detection method according to claim 8, further comprising:
extracting keywords from the article to be detected;
assigning a weight to each of the at least some slices according to the occurrence of the keywords in the slices.
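The keyword weighting of claim 13 might look like the following sketch. The claim does not say how keywords are extracted or how the occurrence of keywords maps to a weight, so both rules here are assumptions: keywords are taken to be the most frequent word tokens of the article, and a slice's weight is the number of distinct keywords it contains. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import List


def extract_keywords(article: str, top_n: int = 5) -> List[str]:
    """Assumed extraction rule: the top_n most frequent word tokens."""
    tokens = re.findall(r"\w+", article.lower())
    return [word for word, _ in Counter(tokens).most_common(top_n)]


def slice_weights(slices: List[str], keywords: List[str]) -> List[int]:
    """Weight each slice by how many distinct keywords occur in it."""
    weights = []
    for piece in slices:
        tokens = set(re.findall(r"\w+", piece.lower()))
        weights.append(sum(1 for k in keywords if k in tokens))
    return weights
```

With keywords "apple" and "banana", the slices "apple pie", "banana split" and "apple banana" receive weights 1, 1 and 2 respectively.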
CN201710002050.7A 2017-01-03 2017-01-03 Detection method, apparatus and computing equipment for repetition degree of articles Active CN106649871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710002050.7A CN106649871B (en) 2017-01-03 2017-01-03 Detection method, device and the calculating equipment of article multiplicity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710002050.7A CN106649871B (en) 2017-01-03 2017-01-03 Detection method, device and the calculating equipment of article multiplicity

Publications (2)

Publication Number Publication Date
CN106649871A true CN106649871A (en) 2017-05-10
CN106649871B CN106649871B (en) 2019-10-25

Family

ID=58838303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710002050.7A Active CN106649871B (en) 2017-01-03 2017-01-03 Detection method, apparatus and computing equipment for repetition degree of articles

Country Status (1)

Country Link
CN (1) CN106649871B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957809A (en) * 2010-10-14 2011-01-26 传神联合(北京)信息技术有限公司 Anti-plagiarism method
CN103049467A (en) * 2011-10-12 2013-04-17 杨纯青 Chinese digital anti-plagiarism detection and comparison system and method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920617A (en) * 2018-06-28 2018-11-30 中译语通科技股份有限公司 Data acquisition judging system and method and information data processing terminal
CN108920617B (en) * 2018-06-28 2022-07-12 中译语通科技股份有限公司 Data acquisition judging system and method and information data processing terminal
CN109255018A (en) * 2018-08-31 2019-01-22 沈文策 Method and apparatus for identifying similar articles
WO2020063437A1 (en) * 2018-09-27 2020-04-02 北京字节跳动网络技术有限公司 Keyword recommendation method and apparatus, storage medium, and electronic device
CN109446425A (en) * 2018-10-30 2019-03-08 郑州市景安网络科技股份有限公司 Network information collection and publishing method and system
CN110348539A (en) * 2019-07-19 2019-10-18 知者信息技术服务成都有限公司 Short text relevance discrimination method
CN111737966A (en) * 2020-06-11 2020-10-02 北京百度网讯科技有限公司 Document repetition degree detection method, device, equipment and readable storage medium
CN111737966B (en) * 2020-06-11 2024-03-01 北京百度网讯科技有限公司 Document repetition detection method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN106649871B (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN106649871A (en) Detection method, apparatus and computing equipment for repetition degree of articles
US9087108B2 (en) Determination of category information using multiple stages
CN103198057B (en) Method and apparatus for automatically adding tags to a document
CN105893478A (en) Tag extraction method and equipment
CN106777282B (en) Method and device for sorting related searches
EP3279806A1 (en) Data processing method and apparatus
JP2017142796A (en) Identification and extraction of information
CN109189990A (en) Search term generation method, device and electronic equipment
CN108304509A (en) Spam comment filtering method based on mutual learning of multi-vector text representations
CN107330592A (en) Screening method, device and computing device for target enterprise objects
CN104348871A (en) Similar account expanding method and device
TWI556128B (en) Forensic system, forensic method and evidence collection program
CN108319628A (en) Method and device for determining user interest
CN106257449B (en) Information determination method and apparatus
CN103593397B (en) Method and apparatus for acquiring microblog content
CN106997340A (en) Dictionary generation, and document classification method and device using the dictionary
CN105608183B (en) Method and apparatus for providing aggregated answers
CN104765864B (en) Search result output method, client and server
CN106844488A (en) Stock UGC data recommendation method and device combined with search
CN103902687B (en) Search result generation method and device
CN104063422B (en) Iterative update method and device for domain feature dictionaries in social networks
CN106294765A (en) Method and device for processing news data
CN110427492A (en) Method, apparatus and electronic equipment for generating a keyword library
JP5292336B2 (en) Knowledge amount estimation device, knowledge amount estimation method, and knowledge amount estimation program for each field of search system users
CN106547906A (en) Page content generation method, device and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200810

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 510665 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping radio square B tower 13 floor 02 unit self

Patentee before: Guangzhou Aijiuyou Information Technology Co.,Ltd.
