CN111444450A - Method and device for determining reprinted data - Google Patents


Info

Publication number
CN111444450A
Authority
CN
China
Prior art keywords
data
original data
determining
original
residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910039237.3A
Other languages
Chinese (zh)
Inventor
任广永
魏兵锋
张丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN201910039237.3A
Publication of CN111444450A

Abstract

Embodiments of the invention provide a method and a device for determining reprinted data. The method comprises: determining original data from data acquired from the Internet according to a pre-stored original data identifier; acquiring the remaining data, other than the original data, from the data acquired from the Internet, and performing feature extraction on the remaining data; acquiring features of pre-stored original data from an original data pool; and determining the reprinted data in the remaining data according to the extracted features of the remaining data and the features of the pre-stored original data. The method enables effective reprint analysis of Internet data and improves the accuracy of the statistical results; the results can further be used to analyze the propagation of data assets, to support copyright tracking, and to serve as a basis for editorial assessment, thereby meeting practical needs.

Description

Method and device for determining reprinted data
Technical Field
Embodiments of the invention relate to the field of Internet technology, and in particular to a method and a device for determining reprinted data.
Background
With the rapid development of the Internet, the rise of self-media and social media, the transformation of the news and newspaper industry, and especially growing copyright awareness, the demand for tracking the reprint status, reprint path, and source of Internet data is becoming stronger, and reprint statistics have also become an important assessment criterion for the Internet articles of news media organizations.
The reprint count of a piece of data is the number of times it has been reprinted by other websites after publication. Currently, reprint statistics schemes mainly rely on search engines to find the relevant data.
However, search engines lack effective reprint analysis of Internet data. For example, search results may be mixed with a large amount of irrelevant content that has to be identified and removed manually, so the accuracy of the statistical results is low.
Disclosure of Invention
Embodiments of the invention provide a method and a device for determining reprinted data, aiming to solve the problem that existing search engines lack effective reprint analysis of Internet data and that the accuracy of the statistical results is low.
In a first aspect, an embodiment of the present invention provides a method for determining reprinted data, including:
determining original data from data acquired from the Internet according to a pre-stored original data identifier;
acquiring residual data except the original data from data acquired from the Internet, and performing feature extraction on the residual data;
acquiring characteristics of pre-stored original data from an original data pool;
and determining the reprinted data in the residual data according to the extracted characteristics of the residual data and the characteristics of the pre-stored original data.
In a possible design, the determining the reprinted data in the remaining data according to the extracted features of the remaining data and the features of the pre-stored original data includes:
comparing the extracted features of the residual data with the features of the pre-stored original data;
and if the similarity of the characteristics of the target data and the characteristics of any original data reaches a preset similarity threshold, judging that the target data is the reprinted data, and the target data is any one of the rest data.
In one possible design, the performing feature extraction on the residual data includes:
extracting text pinyin of the residual data;
determining the number of occurrences of each identical pinyin letter in the text pinyin;
the determining the reprinted data in the residual data according to the extracted features of the residual data and the features of the pre-stored original data includes:
if the difference between the number of occurrences of a target identical pinyin letter in the target data and its number of occurrences in any piece of original data is within a preset threshold range, determining that the target data is reprinted data.
In one possible design, the method further includes:
performing feature extraction on the original data;
updating the original data pool according to the original data and the extracted characteristics of the original data;
and taking the updated original data pool as a new original data pool, and executing the step of acquiring the characteristics of the pre-stored original data from the original data pool.
In a possible design, before determining the original data from the data acquired from the internet according to the pre-stored original data identifier, the method further includes:
capturing data from the Internet in real time by means of web crawler technology, collecting data from the Internet by means of meta search technology, and taking the captured data and the collected data as the data acquired from the Internet.
In one possible design, after the determining the reprinted data in the remaining data according to the extracted features of the remaining data and the features of the pre-stored original data, the method further includes:
and storing the reprinted data, and generating one or more of a chart, a report and sharing according to a storage result.
In a second aspect, an embodiment of the present invention provides a device for determining reprinted data, including:
the original data determining module is used for determining original data from data acquired from the Internet according to a pre-stored original data identifier;
the first feature extraction module is used for acquiring residual data except the original data from data acquired from the Internet and extracting features of the residual data;
the data characteristic acquisition module is used for acquiring the characteristics of the pre-stored original data from the original data pool;
and the reprint analysis module is used for determining the reprint data in the residual data according to the extracted characteristics of the residual data and the characteristics of the pre-stored original data.
In one possible design, the reprint analysis module includes:
the characteristic comparison unit is used for comparing the extracted characteristics of the residual data with the characteristics of the pre-stored original data;
and the reprint judging unit is used for judging that the target data is the reprint data if the similarity between the characteristics of the target data and the characteristics of any original data reaches a preset similarity threshold value, and the target data is any one of the residual data.
In one possible design, the first feature extraction module performs feature extraction on the residual data, including:
extracting text pinyin of the residual data;
determining the number of occurrences of each identical pinyin letter in the text pinyin;
the reprint analysis module is further used for determining that the target data is reprinted data if the difference between the number of occurrences of a target identical pinyin letter in the target data and its number of occurrences in any piece of original data is within a preset threshold range.
In one possible design, the above apparatus further includes:
the second feature extraction module is used for extracting features of the original data;
the data pool updating module is used for updating the original data pool according to the original data and the extracted characteristics of the original data;
the data characteristic obtaining module is further configured to use the updated original data pool as a new original data pool, and execute the step of obtaining the characteristics of the pre-stored original data from the original data pool.
In one possible design, the above apparatus further includes:
and the data acquisition module is used for capturing data from the Internet in real time through a web crawler technology before the original data determination module determines the original data from the data acquired from the Internet according to the pre-stored original data identifier, acquiring the data from the Internet through a meta search technology, and taking the captured data and the acquired data as the data acquired from the Internet.
In one possible design, the above apparatus further includes:
and the storage processing module is used for storing the reprinted data after the reprinted analysis module determines the reprinted data in the residual data according to the extracted characteristics of the residual data and the characteristics of the pre-stored original data, and generating one or more of a chart, a report and sharing according to a storage result.
In a third aspect, an embodiment of the present invention provides a device for determining reprinted data, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of determining reprinted data as set forth in the first aspect and various possible designs of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer-executable instruction is stored in the computer-readable storage medium, and when a processor executes the computer-executable instruction, the method for determining reprinted data according to the first aspect and various possible designs of the first aspect is implemented.
According to the method and the device for determining reprinted data provided by the embodiments, original data are determined from data acquired from the Internet by means of a pre-stored original data identifier; the remaining data, other than the original data, are acquired from the data acquired from the Internet, and feature extraction is performed on them; features of pre-stored original data are acquired from the original data pool; and finally the reprinted data in the remaining data are determined according to the extracted features of the remaining data and the features of the pre-stored original data. Internet data can thus be subjected to effective reprint analysis and the accuracy of the statistical results is improved; the reprint analysis results can further be used to analyze the propagation of data assets, to support copyright tracking, and to serve as a basis for editorial assessment, thereby meeting practical needs.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a system for determining reprinted data according to an embodiment of the present invention;
Fig. 2 is a first schematic flowchart of a method for determining reprinted data according to an embodiment of the present invention;
Fig. 3 is a second schematic flowchart of a method for determining reprinted data according to an embodiment of the present invention;
Fig. 4 is a first schematic structural diagram of a device for determining reprinted data according to an embodiment of the present invention;
Fig. 5 is a second schematic structural diagram of a device for determining reprinted data according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a hardware structure of the device for determining reprinted data according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the rapid development of the Internet, the rise of self-media and social media, the transformation of the news and newspaper industry, and especially growing copyright awareness, the demand for tracking the reprint status, reprint path, and source of Internet data is becoming stronger, and reprint statistics have also become an important assessment criterion for the Internet articles of news media organizations. The reprint count of a piece of data is the number of times it has been reprinted by other websites after publication. Currently, reprint statistics schemes mainly rely on search engines to find the relevant data. However, search engines lack effective reprint analysis of Internet data, and the accuracy of the statistical results is low.
Therefore, in view of the above problems, the present invention provides a method for determining reprinted data, which determines original data from data acquired from the internet by using a pre-stored original data identifier, acquires remaining data other than the original data from the data acquired from the internet, performs feature extraction on the remaining data, acquires features of the pre-stored original data from an original data pool, and finally determines the reprinted data in the remaining data according to the features of the extracted remaining data and the features of the pre-stored original data, so as to perform effective reprint analysis on the internet data, improve the accuracy of statistical results, and further use the reprinted analysis results to analyze the propagation condition of data assets, analyze copyright tracking, and serve as an evaluation basis for editing, thereby meeting actual requirements.
Fig. 1 is an application scenario diagram of a method for determining reprinted data according to the present invention. As shown in fig. 1, the terminal device 101 may determine original data from data acquired from the internet 102 according to a pre-stored original data identifier, may perform feature extraction on the data, may acquire features of the pre-stored original data from the original data pool 103, and may also determine, according to the extracted features and the acquired features, reprinted data and the like in the data acquired from the internet 102.
It is to be understood that the terminal devices to which the present invention relates may also be referred to as user equipment, mobile stations, mobile terminals, etc. The terminal device may be a mobile phone, a tablet computer, a computer with a wireless transceiving function, and the like, and the present invention is not limited specifically.
Fig. 2 is a flowchart of a method for determining reprinted data according to an embodiment of the present invention, where an execution subject of this embodiment may be a terminal device in the embodiment shown in fig. 1, or may be a server, and this embodiment is not limited herein. As shown in fig. 2, the method includes:
S201, determining original data from data acquired from the Internet according to the pre-stored original data identification.
Here, each piece of data acquired from the internet carries a data identifier, and the determining of the original data from the data acquired from the internet according to the pre-stored original data identifier may include:
comparing the data identifier of each piece of data acquired from the Internet with the pre-stored original data identifiers; if the data identifier of a piece of acquired data is consistent with the identifier of a piece of original data, that data is determined to be original data. The data identifier of data acquired from the Internet can be customized per site, for example as the source of the acquired data, such as the site name. The original data identifier can likewise be customized per site. In addition, the original data identifier can also be a special mark indicating whether the data is original.
Each piece of data acquired from the internet can also include information such as the title, author, source, text content and the like of the corresponding article. The data obtained from the internet can be obtained by capturing data from the internet in real time.
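Step S201 can be sketched as a simple partition of the acquired data. The following is a minimal illustration only: the `Item` record, the `split_original` helper, and the use of a site name as the identifier are assumptions made for the example and are not specified by the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Item:
    identifier: str  # assumed here to be the name of the source site
    title: str
    text: str


def split_original(items, original_identifiers):
    """Partition acquired items into (original, remaining) by identifier match."""
    original, remaining = [], []
    for item in items:
        if item.identifier in original_identifiers:
            original.append(item)
        else:
            remaining.append(item)
    return original, remaining


items = [
    Item("site-a", "Title 1", "body one"),
    Item("site-b", "Title 1", "body one"),
]
# "site-a" plays the role of a pre-stored original data identifier.
original, remaining = split_original(items, {"site-a"})
```

The `remaining` list is what the later steps would treat as candidates for reprint analysis.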
S202, obtaining the residual data except the original data from the data obtained from the Internet, and performing feature extraction on the residual data.
Optionally, the performing feature extraction on the remaining data includes:
extracting text pinyin of the residual data;
and determining the number of occurrences of each identical pinyin letter in the text pinyin.
The correspondence between each pinyin letter and its number of occurrences can be recorded. For example, if the pinyin letter le appears three times in the text pinyin, the count recorded for le is three.
Specifically, extracting the text pinyin of the residual data may include: determining whether the text content of the residual data is in simplified Chinese; if the text content contains traditional Chinese characters, converting them into simplified characters; and then converting the text content of the residual data into pinyin to obtain the text pinyin of the residual data.
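The feature extraction described above can be sketched as follows. This is an illustration under stated assumptions: the two lookup tables are tiny, hypothetical stand-ins for the full traditional-to-simplified and character-to-pinyin conversions that a real system would need, and the function name `pinyin_counts` is invented for the example.

```python
from collections import Counter

# Tiny illustrative tables (hypothetical and far from complete).
TRADITIONAL_TO_SIMPLIFIED = {"號": "号", "聲": "声"}
CHAR_TO_PINYIN = {"了": "le", "好": "hao", "号": "hao", "叫": "jiao", "声": "sheng"}


def pinyin_counts(text):
    """Return how many times each pinyin appears in the text."""
    # Convert any traditional characters to simplified ones first.
    simplified = "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in text)
    # Convert each known character to pinyin; unknown characters are skipped.
    pinyin = [CHAR_TO_PINYIN[ch] for ch in simplified if ch in CHAR_TO_PINYIN]
    return Counter(pinyin)


features = pinyin_counts("好了叫聲號了")
```

Here the traditional characters 聲 and 號 are first mapped to 声 and 号, so they contribute to the counts for sheng and hao.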
After feature extraction is performed on the remaining data, the remaining data and their features may be stored in an Internet data message queue. Specifically, the remaining data may be stored in the queue according to data type (microblog, WeChat, website, self-media, etc.), and Kafka may be used as the message queue middleware.
S203, acquiring the characteristics of the pre-stored original data from the original data pool.
The original data pool is used for storing original data and their characteristics; each piece of original data can contain information such as the title, author, source, and text content of the corresponding article.
Specifically, the original data to be analyzed can be stored in a memory database such as Redis, MongoDB, or the like, so that the data can be loaded quickly.
Optionally, in addition to performing feature extraction on the remaining data, feature extraction may be performed on the original data;
updating the original data pool according to the original data and the extracted characteristics of the original data;
and taking the updated original data pool as a new original data pool, and executing the step of acquiring the characteristics of the pre-stored original data from the original data pool.
The original data and the characteristics thereof determined from the data acquired from the Internet are stored in the original data pool, and the data updating is carried out on the original data pool, so that the original data stored in the original data pool can be more complete, and the accuracy of the subsequent reprint analysis is improved.
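The pool update can be sketched with an in-process dictionary standing in for the database. This is an assumption made to keep the example self-contained; `extract_features` is a placeholder extractor that counts whitespace-separated tokens, not the pinyin-based extraction of the embodiment.

```python
from collections import Counter


def extract_features(text):
    # Placeholder extractor (assumption): whitespace token counts.
    return Counter(text.split())


def update_pool(pool, originals):
    """Add newly identified original data and their features to the pool."""
    for data_id, text in originals.items():
        pool[data_id] = {"text": text, "features": extract_features(text)}
    return pool


# A pool that already holds one piece of original data.
pool = {"a1": {"text": "old story", "features": extract_features("old story")}}
# Original data newly determined in S201 are merged into the pool.
update_pool(pool, {"a2": "new story new"})
```

After the update, the pool serves as the new original data pool from which features are read in S203.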
S204, determining the reprinted data in the residual data according to the extracted features of the residual data and the features of the pre-stored original data.
Taking the feature extraction described above as an example (extracting the text pinyin of the residual data and determining the number of occurrences of each identical pinyin letter in the text pinyin), determining the reprinted data in the residual data according to the extracted characteristics of the residual data and the characteristics of the pre-stored original data may include:
If the difference between the number of occurrences of a target identical pinyin letter in the target data and its number of occurrences in any piece of original data is within a preset threshold range, the target data is determined to be reprinted data. Here, the target data is any piece of the remaining data, and the target identical pinyin letters are any one or more identical pinyin letters in the text pinyin of the remaining data. For example, if le appears three times, hao five times, jiao ten times, and sheng nine times in the target data, and in a piece of original data le likewise appears three times, hao five times, jiao ten times, and sheng nine times, the target data can be determined to be reprinted data.
Specifically, the text length of the remaining data may first be compared with the text length of the pre-stored original data. If the difference between the text length of the target data and the text length of any piece of original data is within a preset range, it is then determined whether the difference between the number of occurrences of the target identical pinyin letters in the target data and in that original data is within the preset threshold range; if so, the target data is determined to be reprinted data.
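The two-stage check described above (text length first, then pinyin counts) can be sketched as follows. The threshold values and the choice to compare every pinyin that appears in either text are assumptions for illustration; the embodiment leaves both the thresholds and the selection of target pinyin letters open.

```python
from collections import Counter


def is_reprint(target_len, target_counts, orig_len, orig_counts,
               max_len_diff=20, max_count_diff=1):
    """Length pre-filter followed by a per-pinyin count comparison."""
    # Stage 1: discard pairs whose text lengths differ too much.
    if abs(target_len - orig_len) > max_len_diff:
        return False
    # Stage 2: every pinyin count must agree within the threshold.
    keys = set(target_counts) | set(orig_counts)
    return all(
        abs(target_counts.get(k, 0) - orig_counts.get(k, 0)) <= max_count_diff
        for k in keys
    )


# Counts taken from the example in the text.
target = Counter(le=3, hao=5, jiao=10, sheng=9)
original = Counter(le=3, hao=5, jiao=10, sheng=9)

same_article = is_reprint(1000, target, 1010, original)
too_long = is_reprint(1000, target, 5000, original)
different = is_reprint(1000, target, 1000, Counter(le=1, hao=1))
```

The length check is cheap, so running it first avoids the per-pinyin comparison for most unrelated pairs.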
Here, if the remaining data are stored in the Internet data message queue and the original data determined from the data acquired from the Internet are stored in the original data pool, determining the reprinted data in the remaining data according to the extracted features of the remaining data and the pre-stored features of the original data may include: loading the data in the original data pool, consuming the Internet data message queue, comparing the features one by one, and taking the data whose feature similarity reaches a specified value as the reprinted data in the remaining data.
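The load, consume, and compare loop can be sketched as follows. A `collections.deque` stands in for the message queue and a plain dictionary for the original data pool; both are assumptions made so the example runs on its own, and the `matches` predicate is a placeholder for whichever feature comparison is configured.

```python
from collections import deque


def find_reprints(pool, queue, matches):
    """Consume the queue and compare each item against every pooled original."""
    reprints = []
    while queue:
        item = queue.popleft()  # consume one message
        for orig_id, orig_features in pool.items():
            if matches(item["features"], orig_features):
                reprints.append((item["id"], orig_id))
                break  # one matching original is enough
    return reprints


pool = {"o1": {"le": 3, "hao": 5}}
queue = deque([
    {"id": "r1", "features": {"le": 3, "hao": 5}},
    {"id": "r2", "features": {"le": 1}},
])
# Exact equality is used here only as the simplest possible matcher.
reprints = find_reprints(pool, queue, lambda a, b: a == b)
```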
According to the method for determining the reprinted data, the original data are determined from the data acquired from the Internet through the pre-stored original data identification, the residual data except the original data are acquired from the data acquired from the Internet, the characteristics of the residual data are extracted, the characteristics of the pre-stored original data are acquired from the original data pool, and finally the reprinted data in the residual data are determined according to the extracted characteristics of the residual data and the characteristics of the pre-stored original data, so that the Internet data can be subjected to effective reprint analysis, the accuracy of statistical results is improved, and the reprint analysis results can be further used for analyzing the propagation condition of data assets, analyzing copyright tracking and being used as an editing assessment basis to meet actual requirements.
Fig. 3 is a second schematic flowchart of a method for determining reprinted data according to an embodiment of the present invention. On the basis of the embodiment of fig. 2, this embodiment describes a specific implementation process in detail. As shown in fig. 3, the method includes:
S301, capturing data from the Internet in real time through a web crawler technology, collecting data from the Internet through a meta search technology, and taking the captured data and the collected data as data acquired from the Internet.
Here, a web crawler (also referred to as a web spider, a web robot, etc.) is a program or script that automatically captures internet data according to a certain rule.
The meta search technology performs searches through a meta search engine. A meta search engine, also called a multi-search engine, provides a unified user interface that helps the user select and use a suitable search engine (or even several search engines at once) from among multiple search engines to carry out a search operation; it is a global control mechanism over the various search tools distributed in the network.
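Combining the two acquisition channels in S301 can be sketched as a merge of two record lists. The record layout and the de-duplication by URL are assumptions added for illustration; the embodiment only states that the captured data and the collected data together form the data acquired from the Internet.

```python
def merge_sources(crawled, searched):
    """Merge crawler and meta search results, keeping the first record per URL."""
    merged = {}
    for record in list(crawled) + list(searched):
        merged.setdefault(record["url"], record)
    return list(merged.values())


crawled = [{"url": "http://example.com/a", "title": "A"}]
searched = [
    {"url": "http://example.com/a", "title": "A (duplicate)"},
    {"url": "http://example.com/b", "title": "B"},
]
acquired = merge_sources(crawled, searched)
```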
S302, according to the pre-stored original data identification, original data are determined from the data acquired from the Internet.
The pre-stored original data identifier may be adjusted according to actual conditions, for example, corresponding original data identifiers are added or deleted.
S303, acquiring residual data except the original data from the data acquired from the Internet, and performing feature extraction on the residual data.
S304, acquiring the characteristics of the pre-stored original data from the original data pool.
S305, comparing the extracted features of the residual data with the features of the pre-stored original data.
Here, the characteristics of each data in the remaining data may be compared with the characteristics of pre-stored original data one by one, or the remaining data and the pre-stored original data may be grouped first, and the grouped data of each group are compared correspondingly, so that how to compare the data can be set according to actual situations, and the requirements of various application scenarios are met.
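The grouped comparison mentioned above can be sketched as follows. The grouping key, a text-length bucket of 100 characters, is purely an assumption, since the embodiment does not say how groups are formed; `candidate_pairs` only yields the pairs that would then go through the feature comparison.

```python
from collections import defaultdict


def group_by(records, key):
    """Group records by the value of key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return groups


def candidate_pairs(remaining, originals, key):
    """Yield (remaining, original) pairs that fall into the same group."""
    original_groups = group_by(originals, key)
    for record in remaining:
        for original in original_groups.get(key(record), []):
            yield record, original


def length_bucket(record):
    # Hypothetical grouping key: texts of similar length share a bucket.
    return len(record["text"]) // 100


remaining = [{"id": "r1", "text": "a" * 120}, {"id": "r2", "text": "b" * 480}]
originals = [{"id": "o1", "text": "c" * 150}, {"id": "o2", "text": "d" * 900}]
pairs = [(r["id"], o["id"]) for r, o in candidate_pairs(remaining, originals, length_bucket)]
```

Grouping reduces the number of comparisons, at the cost that two similar texts falling on either side of a bucket boundary are never compared; one-by-one comparison avoids that at higher cost.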
S306, if the similarity between the characteristics of the target data and the characteristics of any original data reaches a preset similarity threshold, determining that the target data is the reprinted data, and the target data is any one of the residual data.
Specifically, the data comparison operation is repeatedly executed until all the data of the remaining data are compared, and the reprinted data in the remaining data is determined.
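Steps S305 and S306 can be sketched with a concrete similarity measure. Cosine similarity over pinyin count vectors and the threshold of 0.9 are assumptions chosen for illustration; the embodiment only requires that some similarity reach a preset threshold.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def reprints_by_similarity(remaining, originals, threshold=0.9):
    """Return ids of remaining data similar enough to any original."""
    found = []
    for data_id, features in remaining.items():
        if any(cosine_similarity(features, o) >= threshold
               for o in originals.values()):
            found.append(data_id)
    return found


originals = {"o1": Counter(le=3, hao=5, jiao=10, sheng=9)}
remaining = {
    "r1": Counter(le=3, hao=5, jiao=10, sheng=8),  # near copy of o1
    "r2": Counter(wo=4, ni=2),                     # unrelated text
}
found = reprints_by_similarity(remaining, originals)
```

The loop visits every piece of remaining data once, which corresponds to repeating the comparison until all remaining data have been compared.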
S307, storing the reprinted data, and generating one or more of a chart, a report and sharing according to a storage result.
Here, the determined reprinted data may be stored in a message queue of the result of the propagation analysis, and then further stored in a result storage server to provide data for copyright tracking, assessment data, propagation analysis, and the like.
The determined reprinted data can also be directly stored in a search engine such as Solr and the like so as to facilitate quick search and analysis and directly provide results for applications.
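Turning stored results into a report can be sketched as a count of reprints per original article, which corresponds to the reprint count described in the background. The pair layout `(reprinting_site, original_id)` is an assumption for illustration.

```python
from collections import Counter


def reprint_report(reprints):
    """Count reprints per original, most reprinted first."""
    counts = Counter(original_id for _, original_id in reprints)
    return counts.most_common()


stored = [("site-b", "o1"), ("site-c", "o1"), ("site-d", "o2")]
report = reprint_report(stored)
```

A list of this shape could feed a chart or a tabular report directly.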
The method for determining the reprinted data, provided by the embodiment, combines a web crawler technology and a meta search technology to acquire data from the internet, so that the acquired data are more comprehensive and are used for analyzing the propagation condition of data assets, analyzing copyright tracking and being used as an editing assessment basis, and the reprinted analysis result can be finally provided for a client in various forms such as a chart, a report and sharing, so that the user can conveniently check the reprinted analysis result.
Fig. 4 is a first schematic structural diagram of a device for determining reprinted data according to an embodiment of the present invention. As shown in fig. 4, the reprint data determination device 40 includes: an original data determining module 401, a first feature extraction module 402, a data characteristic obtaining module 403 and a reprint analysis module 404.
The original data determining module 401 is configured to determine original data from data acquired from the internet according to a pre-stored original data identifier.
A first feature extraction module 402, configured to obtain remaining data, except the original data, from data obtained from the internet, and perform feature extraction on the remaining data.
A data characteristic obtaining module 403, configured to obtain characteristics of pre-stored original data from the original data pool.
And a reprint analysis module 404, configured to determine the reprinted data in the remaining data according to the extracted features of the remaining data and the features of the pre-stored original data.
The reprint analysis module 404 may include one or more analysis service units, and is configured to compare features of the remaining data with features of pre-stored original data, and determine reprint data in the remaining data, that is, multiple sets of reprint analysis services may be deployed to implement distributed computation, so as to improve computation efficiency.
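Splitting the analysis across several service units can be sketched with a thread pool. Threads in one process are only a stand-in (an assumption) for separately deployed analysis services, and exact feature equality is used as the simplest possible comparison.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_chunk(chunk, pool_features):
    """One analysis unit: return ids whose features match a pooled original."""
    return [data_id for data_id, features in chunk if features in pool_features]


def analyze_distributed(remaining, pool_features, workers=2):
    """Split the remaining data into chunks and analyze them in parallel."""
    chunks = [remaining[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as executor:
        parts = list(executor.map(analyze_chunk, chunks, [pool_features] * workers))
    return sorted(data_id for part in parts for data_id in part)


# Features are stored as hashable tuples so they can be placed in a set.
pool_features = {(("hao", 5), ("le", 3))}
remaining = [
    ("r1", (("hao", 5), ("le", 3))),
    ("r2", (("wo", 1),)),
    ("r3", (("hao", 5), ("le", 3))),
]
reprint_ids = analyze_distributed(remaining, pool_features)
```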
The device provided in this embodiment may be used to implement the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 5 is a schematic structural diagram of a second device for determining reprinted data according to an embodiment of the present invention. As shown in fig. 5, this embodiment further includes, on the basis of the embodiment in fig. 4: a second feature extraction module 405, a data pool update module 406, a data acquisition module 407, and a save processing module 408.
In one possible design, the reprint analysis module 404 includes a feature comparison unit 4041 and a reprint determination unit 4042.
The feature comparison unit 4041 is configured to compare the extracted features of the remaining data with the features of the pre-stored original data.
The reprint determining unit 4042 is configured to determine that the target data is reprinted data if the similarity between the feature of the target data and the feature of any original data reaches a preset similarity threshold, where the target data is any one of the remaining data.
In one possible design, the first feature extraction module 402 performs feature extraction on the residual data, including:
extracting text pinyin of the residual data;
and determining the number of occurrences of each identical pinyin letter in the text pinyin.
The reprint analysis module 404 is further configured to determine that the target data is reprinted data if the difference between the number of occurrences of a target identical pinyin letter in the target data and its number of occurrences in any piece of original data is within a preset threshold range.
In one possible design, the second feature extraction module 405 is configured to perform feature extraction on the original data.
And the data pool updating module 406 is configured to update the original data pool according to the original data and the extracted features of the original data.
The data characteristic obtaining module 403 is further configured to use the updated original data pool as a new original data pool, and execute the step of obtaining the characteristics of the pre-stored original data from the original data pool.
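The interplay of modules 405, 406, and 403 can be sketched as a small in-memory pool. The class `OriginalDataPool`, its method names, and the use of a dictionary keyed by an original-data identifier are assumptions made for illustration; the embodiment does not specify a storage layout.

```python
class OriginalDataPool:
    """Hypothetical in-memory pool mapping an original-data identifier
    to the features extracted from that original data."""

    def __init__(self):
        self._features = {}

    def update(self, data_id, text, extract):
        """Extract features from newly determined original data and store
        them, replacing any earlier entry with the same identifier."""
        self._features[data_id] = extract(text)

    def features(self):
        """Return the features of all pre-stored original data."""
        return list(self._features.values())
```

After each call to `update`, a later call to `features` reflects the updated pool, which corresponds to using the updated pool as the new original data pool.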
In a possible design, the data obtaining module 407 is configured to, before the original data determining module 401 determines original data from data obtained from the internet according to a pre-stored original data identifier, capture data from the internet in real time through a web crawler technology, collect data from the internet through a meta search technology, and use the captured data and the collected data as the data obtained from the internet.
In one possible design, the saving processing module 408 is configured to, after the reprint analysis module 404 determines the reprinted data in the remaining data according to the extracted features of the remaining data and the features of the pre-stored original data, save the reprinted data, and generate one or more of a chart, a report, and a share according to a saving result.
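As a rough sketch of report generation from the saved results, the following assumes that each saved record is a pair of a reprint URL and the identifier of the original data it matched, and that the report is a tab-separated table of reprint counts per original. The record layout and the function name `build_report` are hypothetical.

```python
from collections import Counter


def build_report(reprint_records):
    """Build a plain-text report of reprint counts per original.

    `reprint_records` is an iterable of (reprint_url, original_id) pairs.
    """
    per_original = Counter(original_id for _, original_id in reprint_records)
    # most_common() lists the most frequently reprinted originals first.
    rows = [f"{original_id}\t{count}" for original_id, count in per_original.most_common()]
    return "\n".join(["original_id\treprint_count", *rows])
```

The same per-original counts could equally feed a chart or a sharing function; only the text report is shown here.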
The device provided in this embodiment may be used to implement the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
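The overall flow of the modules in fig. 5 can be sketched end to end as follows. The sketch assumes that every acquired item is a dictionary with an `id` and a `text` field, that the pre-stored original data identifiers form a set of such ids, and that the feature extractor and similarity function are supplied by the caller. All names are hypothetical.

```python
def split_by_identifier(items, original_ids):
    """Partition acquired data into original data (identifier is pre-stored)
    and remaining data (everything else)."""
    originals = [it for it in items if it["id"] in original_ids]
    remaining = [it for it in items if it["id"] not in original_ids]
    return originals, remaining


def determine_reprints(items, original_ids, pool_features,
                       extract, similarity, threshold):
    """Return the ids of remaining items judged to be reprinted data."""
    originals, remaining = split_by_identifier(items, original_ids)
    # Features already in the pool plus features of newly found originals.
    features = list(pool_features) + [extract(it["text"]) for it in originals]
    return [
        it["id"]
        for it in remaining
        if any(similarity(extract(it["text"]), f) >= threshold for f in features)
    ]
```

Keeping the identifier-based split separate from the feature comparison means that known original data is never compared against itself, and only the remaining data incurs the comparison cost.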
Fig. 6 is a schematic diagram of the hardware structure of the device for determining reprinted data according to an embodiment of the present invention. As shown in fig. 6, the reprint data determination device 60 of this embodiment includes a processor 601 and a memory 602, wherein:
the memory 602 is configured to store computer-executable instructions; and
the processor 601 is configured to execute the computer-executable instructions stored in the memory to implement the steps of the method for determining reprinted data in the above embodiments. For details, reference may be made to the description of the method embodiments above.
Optionally, the memory 602 may be separate from the processor 601 or integrated with it.
When the memory 602 is provided separately, the reprint data determination device further includes a bus 603 for connecting the memory 602 and the processor 601.
The embodiment of the present invention further provides a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the method for determining reprinted data described above is implemented.
In the embodiments provided in the present invention, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into modules is only one logical division, and other divisions are possible in practice: a plurality of modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or in another form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The unit formed by the modules may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise high-speed RAM and may further comprise non-volatile memory (NVM), such as at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be performed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (14)

1. A method for determining reprinted data, comprising:
determining original data from data acquired from the Internet according to a pre-stored original data identifier;
acquiring residual data except the original data from data acquired from the Internet, and performing feature extraction on the residual data;
acquiring characteristics of pre-stored original data from an original data pool;
and determining the reprinted data in the residual data according to the extracted characteristics of the residual data and the characteristics of the pre-stored original data.
2. The method according to claim 1, wherein the determining the reprinted data in the remaining data according to the extracted features of the remaining data and the features of the pre-stored original data comprises:
comparing the extracted features of the residual data with the features of the pre-stored original data;
and if the similarity between the characteristics of target data and the characteristics of any original data reaches a preset similarity threshold, judging that the target data is reprinted data, wherein the target data is any one of the residual data.
3. The method of claim 1, wherein the feature extracting the remaining data comprises:
extracting text pinyin of the residual data;
determining the pinyin number of the same pinyin letters in the text pinyin;
determining the reprinted data in the residual data according to the extracted features of the residual data and the features of the pre-stored original data, wherein the determining comprises the following steps:
and if the difference value between the pinyin quantity of the target identical pinyin letters of the target data and the pinyin quantity of the target identical pinyin letters of any original data is within a preset threshold range, judging that the target data is reprinted data.
4. The method of claim 1, further comprising:
performing feature extraction on the original data;
updating the original data pool according to the original data and the extracted characteristics of the original data;
and taking the updated original data pool as a new original data pool, and executing the step of acquiring the characteristics of the pre-stored original data from the original data pool.
5. The method of claim 1, prior to determining the original data from the data obtained from the internet based on the pre-stored original data identifier, further comprising:
capturing data from the Internet in real time through a web crawler technology, collecting data from the Internet through a meta search technology, and taking the captured data and the collected data as the data acquired from the Internet.
6. The method according to any one of claims 1 to 5, further comprising, after determining the reprinted data in the residual data according to the extracted features of the residual data and the features of the pre-stored original data:
and storing the reprinted data, and generating one or more of a chart, a report and sharing according to a storage result.
7. A reprint data determination device, characterized by comprising:
the original data determining module is used for determining original data from data acquired from the Internet according to a pre-stored original data identifier;
the first feature extraction module is used for acquiring residual data except the original data from data acquired from the Internet and extracting features of the residual data;
the data characteristic acquisition module is used for acquiring the characteristics of the pre-stored original data from the original data pool;
and the reprint analysis module is used for determining the reprint data in the residual data according to the extracted characteristics of the residual data and the characteristics of the pre-stored original data.
8. The apparatus of claim 7, wherein the reprint analysis module comprises:
the characteristic comparison unit is used for comparing the extracted characteristics of the residual data with the characteristics of the pre-stored original data;
and the reprint judging unit is used for judging that the target data is the reprint data if the similarity between the characteristics of the target data and the characteristics of any original data reaches a preset similarity threshold value, and the target data is any one of the residual data.
9. The apparatus of claim 7, wherein the first feature extraction module performs feature extraction on the residual data, comprising:
extracting text pinyin of the residual data;
determining the pinyin number of the same pinyin letters in the text pinyin;
the reprint analysis module is further used for judging that the target data is reprinted data if the difference value between the pinyin quantity of the target identical pinyin letters of the target data and the pinyin quantity of the target identical pinyin letters of any original data is within a preset threshold range.
10. The apparatus of claim 7, further comprising:
the second feature extraction module is used for extracting features of the original data;
the data pool updating module is used for updating the original data pool according to the original data and the extracted characteristics of the original data;
the data characteristic obtaining module is further configured to use the updated original data pool as a new original data pool, and execute the step of obtaining the characteristics of the pre-stored original data from the original data pool.
11. The apparatus of claim 7, further comprising:
and the data acquisition module is used for capturing data from the Internet in real time through a web crawler technology before the original data determination module determines the original data from the data acquired from the Internet according to the pre-stored original data identifier, acquiring the data from the Internet through a meta search technology, and taking the captured data and the acquired data as the data acquired from the Internet.
12. The apparatus of any one of claims 7 to 11, further comprising:
and the storage processing module is used for storing the reprinted data after the reprinted analysis module determines the reprinted data in the residual data according to the extracted characteristics of the residual data and the characteristics of the pre-stored original data, and generating one or more of a chart, a report and sharing according to a storage result.
13. A reprint data determination device, characterized by comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of determining reprinted data as claimed in any one of claims 1 to 6.
14. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, implement the reprint data determination method according to any one of claims 1 to 6.
CN201910039237.3A 2019-01-16 2019-01-16 Method and device for determining reprinted data Pending CN111444450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910039237.3A CN111444450A (en) 2019-01-16 2019-01-16 Method and device for determining reprinted data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910039237.3A CN111444450A (en) 2019-01-16 2019-01-16 Method and device for determining reprinted data

Publications (1)

Publication Number Publication Date
CN111444450A true CN111444450A (en) 2020-07-24

Family

ID=71650483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910039237.3A Pending CN111444450A (en) 2019-01-16 2019-01-16 Method and device for determining reprinted data

Country Status (1)

Country Link
CN (1) CN111444450A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184169A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
US20160283582A1 (en) * 2013-11-04 2016-09-29 Beijing Qihoo Technology Company Limited Device and method for detecting similar text, and application
CN106095737A (en) * 2016-06-07 2016-11-09 杭州凡闻科技有限公司 Documents Similarity computational methods and similar document the whole network retrieval tracking
CN106776609A (en) * 2015-11-19 2017-05-31 北京国双科技有限公司 Reprint the statistical method and device of quantity in website
CN106815197A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 The determination method and apparatus of text similarity
CN108009599A (en) * 2017-12-27 2018-05-08 福建中金在线信息科技有限公司 A kind of original document determination methods, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230710

Address after: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New District, Zhuhai City, Guangdong Province, 519030

Applicant after: New founder holdings development Co.,Ltd.

Applicant after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Applicant before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Applicant before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200724