CN108846117A - Deduplication screening method and device for business news flashes - Google Patents

Publication number
CN108846117A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810675645.3A
Other languages
Chinese (zh)
Inventor
朱迪
柳超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dike Technology Co Ltd
Original Assignee
Beijing Dike Technology Co Ltd

Abstract

The present invention provides a deduplication screening method and device for business news flashes. The method includes: obtaining a to-be-detected business news flash text; computing the simhash fingerprint of the to-be-detected text with the simhash algorithm; according to the company name in the to-be-detected text, extracting from a business news flash deduplication database the simhash fingerprint set of target business news flash texts that contain the company name within a preset time period; computing the Hamming distance between the simhash fingerprint of the to-be-detected text and each simhash fingerprint in the set; and, if every Hamming distance is greater than a preset value, determining that the to-be-detected text is a non-duplicate business news flash text and inserting its information into the deduplication database. The method greatly reduces the computation required for deduplication screening and improves its efficiency, alleviating the technical problem that existing deduplication screening methods are computationally intensive and inefficient.

Description

Deduplication screening method and device for business news flashes
Technical field
The present invention relates to the technical field of text information processing, and in particular to a deduplication screening method and device for business news flashes.
Background
As network access becomes ever more widespread, a massive amount of text (news, microblog posts, articles and so on) is published on the internet every day. An unavoidable problem is that this text is flooded with duplicate information: according to statistics, duplicate documents account for about 25%-35% of the documents on the network.
In the era of big data, obtaining valuable business data is tantamount to seizing the initiative. Business news flash data has broad coverage and is timely and accurate. How to extract valuable business news flash data from the internet in real time, and to identify highly similar texts quickly and accurately so that duplicate data does not distort enterprise data analysis, has therefore become a major issue for enterprises seeking valuable data.
Traditional deduplication screening methods first detect text similarity and then deduplicate according to the detection result. To compare two texts, the texts are usually segmented into words and converted into feature vectors, and a distance measure between the vectors (commonly Euclidean distance, Hamming distance, or cosine of the included angle) gives the similarity result. When deduplicating massive amounts of text, however, this approach requires a very large number of comparisons, so the computation is heavy and deduplication screening is inefficient.
In summary, existing deduplication screening methods are computationally intensive and inefficient.
Summary of the invention
In view of this, the purpose of the present invention is to provide a deduplication screening method and device for business news flashes, so as to alleviate the technical problem that existing deduplication screening methods are computationally intensive and inefficient.
In a first aspect, an embodiment of the present invention provides a deduplication screening method for business news flashes, the method including:
obtaining a to-be-detected business news flash text, wherein the to-be-detected business news flash text contains a company name;
computing the to-be-detected business news flash text with the simhash algorithm to obtain the simhash fingerprint of the to-be-detected business news flash text;
according to the company name in the to-be-detected business news flash text, extracting from a business news flash deduplication database the simhash fingerprint set of target business news flash texts that contain the company name within a preset time period;
computing the Hamming distance between the simhash fingerprint of the to-be-detected business news flash text and each simhash fingerprint in the simhash fingerprint set;
if every one of the Hamming distances is greater than a preset value, determining that the to-be-detected business news flash text is a non-duplicate business news flash text, and inserting the information of the to-be-detected business news flash text into the business news flash deduplication database.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, wherein the method further includes:
if at least one of the Hamming distances is not greater than the preset value, determining that the to-be-detected business news flash text is a duplicate business news flash text, and discarding the to-be-detected business news flash text.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, wherein the method further includes:
if the simhash fingerprint set does not exist in the business news flash deduplication database, determining that the to-be-detected business news flash text is a non-duplicate business news flash text, and inserting the information of the to-be-detected business news flash text into the business news flash deduplication database.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, wherein obtaining the to-be-detected business news flash text includes:
crawling initial to-be-detected texts from the internet in real time with a crawler;
performing pre-detection on the initial to-be-detected texts to obtain detected texts;
preprocessing the detected text to obtain a processed text;
extracting the mentioned company name from the processed text according to a preset enterprise name library;
if a company name is extracted, taking the processed text as the to-be-detected business news flash text.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, wherein the method further includes:
if no company name is extracted, determining that the processed text is not a to-be-detected business news flash text, and discarding the processed text.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, wherein performing pre-detection on the initial to-be-detected texts to obtain detected texts includes:
obtaining a stop-word list;
discarding, from the initial to-be-detected texts, those that contain text from the stop-word list, to obtain the remaining initial to-be-detected texts;
taking the remaining initial to-be-detected texts as the detected texts.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation of the first aspect, wherein preprocessing the detected text to obtain a processed text includes:
performing removal processing and/or conversion processing on the detected text;
wherein the removal processing includes: removing HTML tags, and removing the content enclosed in preset tags;
the conversion processing converts all uppercase letters to lowercase.
In a second aspect, an embodiment of the present invention further provides a deduplication screening device for business news flashes, the device including:
an obtaining module, configured to obtain a to-be-detected business news flash text, wherein the to-be-detected business news flash text contains a company name;
a first computing module, configured to compute the to-be-detected business news flash text with the simhash algorithm to obtain the simhash fingerprint of the to-be-detected business news flash text;
an extraction module, configured to extract, according to the company name in the to-be-detected business news flash text, from a business news flash deduplication database the simhash fingerprint set of target business news flash texts that contain the company name within a preset time period;
a second computing module, configured to compute the Hamming distance between the simhash fingerprint of the to-be-detected business news flash text and each simhash fingerprint in the simhash fingerprint set;
a first determining module, configured to, if every one of the Hamming distances is greater than a preset value, determine that the to-be-detected business news flash text is a non-duplicate business news flash text, and insert the information of the to-be-detected business news flash text into the business news flash deduplication database.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation of the second aspect, wherein the device further includes:
a second determining module, configured to, if at least one of the Hamming distances is not greater than the preset value, determine that the to-be-detected business news flash text is a duplicate business news flash text, and discard the to-be-detected business news flash text.
With reference to the second aspect, an embodiment of the present invention provides a second possible implementation of the second aspect, wherein the device further includes:
a third determining module, configured to, if the simhash fingerprint set does not exist in the business news flash deduplication database, determine that the to-be-detected business news flash text is a non-duplicate business news flash text, and insert the information of the to-be-detected business news flash text into the business news flash deduplication database.
The embodiments of the present invention bring the following beneficial effects:
Existing deduplication screening methods are computationally intensive and inefficient. In contrast, the deduplication screening method for business news flashes of the present invention first obtains a to-be-detected business news flash text and then computes its simhash fingerprint with the simhash algorithm. Next, according to the company name in the to-be-detected text, it extracts from the business news flash deduplication database the simhash fingerprint set of target business news flash texts that contain the same company name within a preset time period, and computes the Hamming distance between the fingerprint of the to-be-detected text and each fingerprint in the set. If every Hamming distance is greater than a preset value, the to-be-detected text is determined to be a non-duplicate business news flash text and its information is inserted into the deduplication database. Because the method retrieves only the fingerprints of target texts that mention the same company name within the preset time period, the number of samples to compare is greatly reduced before any Hamming distance is computed to determine similarity. The method therefore greatly reduces the computation required for deduplication screening, improves its efficiency, and alleviates the technical problem that existing deduplication screening methods are computationally intensive and inefficient.
Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention are realized and attained by the structure particularly pointed out in the description, the claims and the drawings.
To make the above objects, features and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a deduplication screening method for business news flashes provided by an embodiment of the present invention;
Fig. 2 is a flow chart of another deduplication screening method for business news flashes provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the method for obtaining a to-be-detected business news flash text provided by an embodiment of the present invention;
Fig. 4 is a flow chart of the method for performing pre-detection on initial to-be-detected texts provided by an embodiment of the present invention;
Fig. 5 is a functional block diagram of a deduplication screening device for business news flashes provided by an embodiment of the present invention.
Reference numerals:
11 - obtaining module; 12 - first computing module; 13 - extraction module; 14 - second computing module; 15 - first determining module.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To facilitate understanding of this embodiment, a deduplication screening method for business news flashes disclosed in an embodiment of the present invention is first described in detail.
Embodiment one:
A deduplication screening method for business news flashes. With reference to Fig. 1, the method includes:
S102, obtaining a to-be-detected business news flash text, wherein the to-be-detected business news flash text contains a company name;
In this embodiment of the present invention, the to-be-detected business news flash text is first obtained in real time. Specifically, the to-be-detected business news flash text contains a company name. The process of obtaining the to-be-detected business news flash text is described in detail later and is not repeated here.
S104, computing the to-be-detected business news flash text with the simhash algorithm to obtain the simhash fingerprint of the to-be-detected business news flash text;
After the to-be-detected business news flash text is obtained, it is computed with the simhash algorithm to obtain its simhash fingerprint.
Specifically, simhash is the algorithm Google uses to deduplicate massive amounts of text. It converts a text into a 64-bit fingerprint, for example:
1000010010101101111111100000101011010001001111100001001011001011
In effect, it reduces the dimensionality of a text by hashing it to a number, and the degree of overlap between two such numbers directly reflects the similarity of the corresponding texts. Because comparing two numbers is computationally cheap, simhash-based text deduplication screening is very efficient.
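As a rough illustration of how a text might be turned into a 64-bit fingerprint, the following is a minimal sketch of a simhash computation. It is not taken from the patent: the whitespace tokenizer, the term-frequency weights, and the use of the first 8 bytes of an MD5 digest as the per-token hash are all assumptions made for the example (Chinese text would need a word segmenter instead of `split()`), and the function name `simhash64` is hypothetical.

```python
import hashlib
from collections import Counter


def simhash64(text: str) -> int:
    """Return a 64-bit simhash fingerprint of `text`.

    Assumptions for this sketch: tokens are whitespace-separated words,
    each token is weighted by its frequency, and the per-token hash is
    the first 8 bytes of its MD5 digest.
    """
    weights = Counter(text.split())
    totals = [0] * 64  # one signed accumulator per fingerprint bit
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            if (token_hash >> bit) & 1:
                totals[bit] += weight  # token votes for a 1 in this position
            else:
                totals[bit] -= weight  # token votes for a 0 in this position
    fingerprint = 0
    for bit in range(64):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


# Print the fingerprint as a 64-character binary string.
print(format(simhash64("acme corp raises new funding round"), "064b"))
```

Because each bit is decided by a weighted vote over all tokens, texts that share most of their tokens tend to agree on most bits, which is what makes the fingerprints comparable by Hamming distance.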
S106, according to the company name in the to-be-detected business news flash text, extracting from the business news flash deduplication database the simhash fingerprint set of target business news flash texts that contain the company name within a preset time period;
After the simhash fingerprint of the to-be-detected business news flash text is obtained, the simhash fingerprint set of target business news flash texts that contain the same company name within the preset time period is extracted from the business news flash deduplication database according to the company name in the to-be-detected text. The simhash fingerprint set contains at least one simhash fingerprint.
Specifically, the preset time period here means a preset period immediately before the current time (that is, the recent past). Its length can be set according to actual needs and is generally not too long. Business news flashes are updated very quickly; if the preset time period were too long, the older news flashes would have lost their value, and using them as deduplication samples would only increase the computation without any practical benefit.
In this embodiment of the present invention, the business news flash deduplication database is designed as follows:
Field 1 | Field 2 | Field 3 | Field 4 | Field 5
Business news flash ID | Full name of mentioned company | News flash publication time | Fingerprint | News flash full content
(1) Business news flash ID: an ID assigned in the order in which business news flashes are stored;
(2) Full name of mentioned company: the full name of the company mentioned in the business news flash;
(3) News flash publication time: the date on which the business news flash was published;
(4) Fingerprint: the fingerprint information generated by simhash;
(5) News flash full content: the complete content of the news flash.
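The five fields above, together with the lookup by company name within the preset time period, could be sketched with Python's built-in `sqlite3` module as follows. The table name, column names, helper function names, the choice of SQLite, and the storage of the fingerprint as 16 hexadecimal digits (to sidestep SQLite's signed 64-bit integer limit) are all assumptions of this sketch; the patent does not specify a database engine or column types.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical schema mirroring the five fields described in the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS news_flash (
    flash_id     INTEGER PRIMARY KEY AUTOINCREMENT,  -- field 1: ID in storage order
    company_name TEXT NOT NULL,                      -- field 2: full company name
    publish_date TEXT NOT NULL,                      -- field 3: ISO date string
    fingerprint  TEXT NOT NULL,                      -- field 4: simhash as 16 hex digits
    content      TEXT NOT NULL                       -- field 5: full news flash content
);
CREATE INDEX IF NOT EXISTS idx_company_date
    ON news_flash (company_name, publish_date);
"""


def open_db(path=":memory:"):
    """Open (or create) the deduplication database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def insert_flash(conn, company, publish_date, fingerprint, content):
    """Insert one news flash record; `fingerprint` is a 64-bit integer."""
    conn.execute(
        "INSERT INTO news_flash (company_name, publish_date, fingerprint, content) "
        "VALUES (?, ?, ?, ?)",
        (company, publish_date.isoformat(), format(fingerprint, "016x"), content),
    )


def recent_fingerprints(conn, company, today, window_days):
    """Return fingerprints of flashes mentioning `company` in the last `window_days` days."""
    cutoff = (today - timedelta(days=window_days)).isoformat()
    rows = conn.execute(
        "SELECT fingerprint FROM news_flash "
        "WHERE company_name = ? AND publish_date >= ?",
        (company, cutoff),
    )
    return [int(fp, 16) for (fp,) in rows]
```

Indexing on company name and publication date is one way to keep the candidate lookup cheap; an empty result corresponds to the case where no simhash fingerprint set exists for that company.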
S108, computing the Hamming distance between the simhash fingerprint of the to-be-detected business news flash text and each simhash fingerprint in the simhash fingerprint set;
After the simhash fingerprint of the to-be-detected business news flash text and the simhash fingerprint set are obtained, the Hamming distance between the simhash fingerprint of the to-be-detected text and each simhash fingerprint in the set is computed.
S110, if every one of the Hamming distances is greater than a preset value, determining that the to-be-detected business news flash text is a non-duplicate business news flash text, and inserting the information of the to-be-detected business news flash text into the business news flash deduplication database.
In this embodiment of the present invention, the preset value is 3. If every one of the Hamming distances is greater than 3, the to-be-detected business news flash text is a non-duplicate business news flash text, and its information is inserted into the business news flash deduplication database in the format of that database.
The present invention performs deduplication screening by combining the company name mentioned in the to-be-detected business news flash text with the simhash algorithm. Using the two together greatly improves detection efficiency while ensuring the accuracy of deduplication screening.
Existing deduplication screening methods are computationally intensive and inefficient. In contrast, the deduplication screening method for business news flashes of the present invention first obtains a to-be-detected business news flash text and then computes its simhash fingerprint with the simhash algorithm. Next, according to the company name in the to-be-detected text, it extracts from the business news flash deduplication database the simhash fingerprint set of target business news flash texts that contain the same company name within a preset time period, and computes the Hamming distance between the fingerprint of the to-be-detected text and each fingerprint in the set. If every Hamming distance is greater than a preset value, the to-be-detected text is determined to be a non-duplicate business news flash text and its information is inserted into the deduplication database. Because the method retrieves only the fingerprints of target texts that mention the same company name within the preset time period, the number of samples to compare is greatly reduced before any Hamming distance is computed to determine similarity. The method therefore greatly reduces the computation required for deduplication screening, improves its efficiency, and alleviates the technical problem that existing deduplication screening methods are computationally intensive and inefficient.
The above describes only the case in which the to-be-detected business news flash text is a non-duplicate business news flash text. The other cases are introduced below.
In an optional embodiment, with reference to Fig. 2, the method further includes:
S112, if at least one of the Hamming distances is not greater than the preset value, determining that the to-be-detected business news flash text is a duplicate business news flash text, and discarding the to-be-detected business news flash text.
In an optional embodiment, the method further includes:
if the simhash fingerprint set does not exist in the business news flash deduplication database, determining that the to-be-detected business news flash text is a non-duplicate business news flash text, and inserting the information of the to-be-detected business news flash text into the business news flash deduplication database.
Specifically, if no target business news flash text exists in the business news flash deduplication database (that is, the database holds no recent business news flash text with the same company name as the to-be-detected text), the corresponding simhash fingerprint set does not exist. The to-be-detected business news flash text is then a non-duplicate business news flash text, and its information is inserted into the business news flash deduplication database.
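The three cases described so far (every distance greater than the preset value, at least one distance not greater than it, and no fingerprint set at all) can be sketched as a single decision function. The threshold of 3 comes from the embodiment; the function names and the "insert"/"discard" return values are hypothetical choices for this illustration.

```python
PRESET_VALUE = 3  # preset Hamming-distance value given in the embodiment


def is_duplicate(fingerprint: int, fingerprint_set, preset_value: int = PRESET_VALUE) -> bool:
    """True if any stored fingerprint is within `preset_value` bits of `fingerprint`.

    An empty fingerprint set yields False, which matches the case where no
    target business news flash text exists in the database.
    """
    return any(
        bin(fingerprint ^ stored).count("1") <= preset_value
        for stored in fingerprint_set
    )


def screen(fingerprint: int, fingerprint_set, preset_value: int = PRESET_VALUE) -> str:
    """Return "discard" for a duplicate, "insert" for a non-duplicate."""
    return "discard" if is_duplicate(fingerprint, fingerprint_set, preset_value) else "insert"
```

Because `any()` stops at the first match, a duplicate is detected without computing the remaining distances.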
The above gives an overall introduction to the deduplication screening method for business news flashes of the present invention. The specific details involved are described below.
In an optional embodiment, with reference to Fig. 3, obtaining the to-be-detected business news flash text includes:
S301, crawling initial to-be-detected texts from the internet in real time with a crawler;
We live in an era of information explosion, and data comes from many sources. For information related to business news flashes, a crawler can crawl initial to-be-detected texts in real time from internet sources such as the real-time news feeds of business news media pages, real-time microblog posts, and corporate financing information platforms.
S302, performing pre-detection on the initial to-be-detected texts to obtain detected texts;
After the initial to-be-detected texts are obtained, pre-detection is performed on them. With reference to Fig. 4, this includes the following steps:
S401, obtaining a stop-word list;
Specifically, the stop-word list contains preset stop-word texts, which may include stop-word texts for various kinds of advertisements, stop-word texts for unhealthy content, and so on. The embodiment of the present invention places no specific restriction on them.
S402, discarding, from the initial to-be-detected texts, those that contain text from the stop-word list, to obtain the remaining initial to-be-detected texts;
S403, taking the remaining initial to-be-detected texts as the detected texts.
This pre-detection process reduces the computation required for the later deduplication screening.
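Steps S401 to S403 amount to filtering out any crawled text that contains an entry from the stop-word list. A minimal sketch follows; the function name and the example stop words are hypothetical, and simple substring matching is assumed because the patent does not say how containment is tested.

```python
def pre_detect(initial_texts, stop_words):
    """Keep only the texts that contain none of the stop-word entries (S402-S403)."""
    return [
        text for text in initial_texts
        if not any(word in text for word in stop_words)
    ]


# Hypothetical stop-word list and crawled texts, for illustration only.
stop_words = ["limited-time offer", "click to win"]
crawled = [
    "Acme Corp completes Series B financing",
    "Limited stock! click to win a free phone",
]
print(pre_detect(crawled, stop_words))
```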
S303, preprocessing the detected text to obtain a processed text;
Specifically, removal processing and/or conversion processing is performed on the detected text;
wherein the removal processing includes: removing HTML tags, and removing the content enclosed in preset tags;
the conversion processing converts all uppercase letters to lowercase.
The purpose of this preprocessing is to remove irrelevant or meaningless information from the detected text and to minimize the influence of such information on deduplication screening.
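The removal and conversion processing could look roughly like the sketch below, which uses regular expressions from Python's standard `re` module. The choice of `script` and `style` as the preset tags, the whitespace normalization, and the function name are assumptions of this example; the patent does not name the preset tags, and a production system would more likely use a real HTML parser.

```python
import re

# Hypothetical preset tags whose enclosed content is dropped entirely.
PRESET_TAGS = ("script", "style")


def preprocess(text: str, preset_tags=PRESET_TAGS) -> str:
    """Remove preset-tag content and HTML tags, then lowercase the text."""
    for tag in preset_tags:
        # Drop the tag together with everything it encloses.
        pattern = rf"<{tag}\b[^>]*>.*?</{tag}\s*>"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)      # strip the remaining HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # tidy whitespace (extra step, not in the patent)
    return text.lower()                        # convert uppercase letters to lowercase


print(preprocess("<p>ACME Corp <b>Raises</b> $5M</p><script>var X=1;</script>"))
```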
S304, extracting the mentioned company name from the processed text according to a preset enterprise name library;
Specifically, a common feature of business news flashes is that their content mentions some related company (by full name or abbreviation). The present invention therefore includes a preset enterprise name library, which contains the full names or abbreviations of all companies registered on industrial and commercial registry query websites.
Whether the processed text contains a company name is determined based on the preset enterprise name library.
S305, if a company name is extracted, the processed text is taken as the to-be-detected business news flash text.
S306, if no company name is extracted, the processed text is not a to-be-detected business news flash text, and the processed text is discarded.
The deduplication screening method for business news flashes of the present invention can efficiently pick out business news flash data from massive data and reject highly similar business news flash data, which avoids the influence of duplicate data on enterprise data analysis and supports enterprises in obtaining valuable business data.
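Steps S304 to S306 can be sketched as a lookup against the name library followed by a keep-or-discard decision. In this illustration the library is modeled as a dictionary that maps each alias (full name or abbreviation) to the company's full name, and matching is a case-insensitive substring test; both are assumptions, as are the function names and the example companies.

```python
def extract_company_names(processed_text: str, name_library: dict) -> list:
    """Return the full names of all companies whose alias appears in the text.

    `name_library` maps each alias (full name or abbreviation) to a full name.
    Matching is case-insensitive because preprocessing lowercases the text.
    """
    lowered = processed_text.lower()
    return sorted({
        full_name
        for alias, full_name in name_library.items()
        if alias.lower() in lowered
    })


def to_be_detected(processed_text: str, name_library: dict):
    """Return (text, company names) if a company is mentioned (S305), else None (S306)."""
    names = extract_company_names(processed_text, name_library)
    return (processed_text, names) if names else None


# Hypothetical name library, for illustration only.
library = {
    "Acme Corporation": "Acme Corporation",
    "Acme": "Acme Corporation",
    "Globex": "Globex Inc",
}
print(to_be_detected("acme raises new funding", library))
```

Plain substring matching can produce false positives for short abbreviations, so a real name library would probably need additional disambiguation.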
Embodiment two:
A deduplication screening device for business news flashes. With reference to Fig. 5, the device includes:
an obtaining module 11, configured to obtain a to-be-detected business news flash text, wherein the to-be-detected business news flash text contains a company name;
a first computing module 12, configured to compute the to-be-detected business news flash text with the simhash algorithm to obtain the simhash fingerprint of the to-be-detected business news flash text;
an extraction module 13, configured to extract, according to the company name in the to-be-detected business news flash text, from the business news flash deduplication database the simhash fingerprint set of target business news flash texts that contain the company name within a preset time period;
a second computing module 14, configured to compute the Hamming distance between the simhash fingerprint of the to-be-detected business news flash text and each simhash fingerprint in the simhash fingerprint set;
a first determining module 15, configured to, if every one of the Hamming distances is greater than a preset value, determine that the to-be-detected business news flash text is a non-duplicate business news flash text, and insert the information of the to-be-detected business news flash text into the business news flash deduplication database.
The deduplication screening device for business news flashes of the present invention first obtains a to-be-detected business news flash text and then computes its simhash fingerprint with the simhash algorithm. Next, according to the company name in the to-be-detected text, it extracts from the business news flash deduplication database the simhash fingerprint set of target business news flash texts that contain the same company name within a preset time period, and computes the Hamming distance between the fingerprint of the to-be-detected text and each fingerprint in the set. If every Hamming distance is greater than a preset value, the to-be-detected text is determined to be a non-duplicate business news flash text and its information is inserted into the deduplication database. Because the device retrieves only the fingerprints of target texts that mention the same company name within the preset time period, the number of samples to compare is greatly reduced before any Hamming distance is computed to determine similarity. The device therefore greatly reduces the computation required for deduplication screening, improves its efficiency, and alleviates the technical problem that existing deduplication screening methods are computationally intensive and inefficient.
Optionally, the device further includes:
Second determining module, configured to, if at least one of the Hamming distances is not greater than the preset value, determine that the business news flash text to be detected is a duplicate business news flash text, and discard the business news flash text to be detected.
Optionally, the device further includes:
Third determining module, configured to, if the simhash fingerprint set does not exist in the business news flash duplicate removal database, determine that the business news flash text to be detected is a non-duplicate business news flash text, and insert the information of the business news flash text to be detected into the business news flash duplicate removal database.
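The three determining modules together amount to a single decision over the fingerprints retrieved for one company. The sketch below illustrates that decision under stated assumptions: the in-memory `DedupStore` stands in for the business news flash duplicate removal database, and the 24-hour window and threshold of 3 are hypothetical placeholders for the "preset time" and "preset value", which this passage does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DedupStore:
    """Hypothetical in-memory stand-in for the duplicate removal database."""
    # company name -> list of (timestamp in seconds, simhash fingerprint)
    records: Dict[str, List[Tuple[float, int]]] = field(default_factory=dict)

    def recent_fingerprints(self, company: str, now: float, window: float) -> List[int]:
        return [fp for ts, fp in self.records.get(company, []) if now - ts <= window]

    def insert(self, company: str, now: float, fingerprint: int) -> None:
        self.records.setdefault(company, []).append((now, fingerprint))


def screen(store: DedupStore, company: str, fingerprint: int, now: float,
           window: float = 86400.0, threshold: int = 3) -> bool:
    """Return True and store the fingerprint if the text is non-duplicate, else False."""
    candidates = store.recent_fingerprints(company, now, window)
    if not candidates:
        # Third determining module: no fingerprint set exists for this company.
        store.insert(company, now, fingerprint)
        return True
    if all(bin(fingerprint ^ fp).count("1") > threshold for fp in candidates):
        # First determining module: every distance exceeds the preset value.
        store.insert(company, now, fingerprint)
        return True
    # Second determining module: at least one distance is within the preset value.
    return False
```

Restricting the comparison to one company and one time window is what keeps the candidate list short; the Hamming distance check itself is unchanged.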
Optionally, the obtaining module includes:
Crawling unit, for crawling initial texts to be detected from the Internet in real time by means of a crawler;
Pre-detection unit, for performing pre-detection on the initial texts to be detected to obtain the text after detection;
Preprocessing unit, for preprocessing the text after detection to obtain the processed text;
Extraction unit, for extracting the company names mentioned in the processed text according to a preset enterprise name library;
First setting unit, configured to, if a company name is extracted, take the processed text as the business news flash text to be detected.
Optionally, the obtaining module further includes:
Second setting unit, configured to, if no company name is extracted, determine that the processed text is not a business news flash text to be detected, and discard the processed text.
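The extraction unit and the two setting units can be illustrated with a simple substring lookup against the enterprise name library. This is only a sketch: the passage does not say how matching is performed, so case-insensitive substring matching is an assumption, and `extract_company_names` and `select_candidate` are hypothetical names.

```python
from typing import Iterable, List, Optional, Tuple


def extract_company_names(text: str, name_library: Iterable[str]) -> List[str]:
    """Return the library names that occur in `text` (assumed: case-insensitive substring match)."""
    lowered = text.lower()
    return [name for name in name_library if name.lower() in lowered]


def select_candidate(text: str, name_library: Iterable[str]) -> Optional[Tuple[str, List[str]]]:
    """Keep the text as a business news flash to be detected only if it mentions a company."""
    names = extract_company_names(text, name_library)
    if not names:
        # Second setting unit: no company name found, so the text is discarded.
        return None
    # First setting unit: the text becomes a business news flash text to be detected.
    return text, names
```

For a large enterprise name library, a linear scan like this would likely be replaced by a multi-pattern matcher; the sketch keeps the scan only for readability.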
Optionally, the pre-detection unit includes:
Obtaining subunit, for obtaining a stop-word list;
Discarding subunit, for discarding those initial texts to be detected that contain text from the stop-word list, to obtain the remaining initial texts to be detected;
Setting subunit, for taking the remaining initial texts to be detected as the text after detection.
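The pre-detection step described by these subunits is a filter: any initial text that contains an entry from the stop-word list is dropped. A minimal sketch follows, assuming plain substring containment; the function name `pre_detect` and the sample entries are hypothetical.

```python
from typing import Iterable, List


def pre_detect(initial_texts: Iterable[str], stop_words: Iterable[str]) -> List[str]:
    """Discard every initial text that contains any entry of the stop-word list."""
    stop_list = list(stop_words)
    return [
        text for text in initial_texts
        if not any(word in text for word in stop_list)
    ]
```

The texts that survive this filter are the "text after detection" passed on to preprocessing.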
Optionally, the preprocessing unit is further configured to:
Perform removal processing and/or conversion processing on the text after detection;
Wherein the removal processing includes: removing HTML tags, and removing the content contained in preset tags;
The conversion processing converts all uppercase letters to lowercase.
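The removal and conversion steps can be sketched with Python's standard `html.parser` module. The choice of `script` and `style` as the preset tags is an assumption, as the passage does not list them, and the whitespace collapsing is an added convenience rather than something the text requires.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class _TextExtractor(HTMLParser):
    """Collect text outside the preset tags, dropping all markup."""

    def __init__(self, drop_tags: Iterable[str]):
        super().__init__(convert_charrefs=True)
        self.drop_tags = {tag.lower() for tag in drop_tags}
        self.depth = 0  # nesting depth inside preset tags
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.drop_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def preprocess(html_text: str, drop_tags: Iterable[str] = ("script", "style")) -> str:
    """Strip HTML tags, drop content inside preset tags, and lowercase the result."""
    parser = _TextExtractor(drop_tags)
    parser.feed(html_text)
    parser.close()
    text = "".join(parser.parts)
    # Collapsing whitespace is an extra step, not part of the described processing.
    return " ".join(text.split()).lower()
```

Normalizing the text this way before fingerprinting means that markup differences between two copies of the same news flash do not change the simhash fingerprint.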
For the specific content of Embodiment 2, reference may be made to the description in Embodiment 1 above, which is not repeated here.
The computer program product of the duplicate removal screening method and device for business news flashes provided by the embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the methods described in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not repeated here.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and device described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In addition, in the description of the embodiments of the present invention, unless otherwise expressly specified and limited, the terms "installed", "connected", and "connection" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or an internal connection between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In the description of the present invention, it should be noted that the orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore shall not be construed as limiting the present invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance.
Finally, it should be noted that the embodiments described above are merely specific implementations of the present invention, intended to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of variations, or make equivalent replacements of some of the technical features. Such modifications, variations, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A duplicate removal screening method for business news flashes, characterized in that the method comprises:
obtaining a business news flash text to be detected, wherein the business news flash text to be detected contains a company name;
computing the business news flash text to be detected with a simhash algorithm to obtain a simhash fingerprint of the business news flash text to be detected;
extracting, from a business news flash duplicate removal database and according to the company name in the business news flash text to be detected, a simhash fingerprint set of target business news flash texts that contain the company name within a preset time;
calculating the Hamming distance between the simhash fingerprint of the business news flash text to be detected and each simhash fingerprint in the simhash fingerprint set;
if every one of the Hamming distances is greater than a preset value, determining that the business news flash text to be detected is a non-duplicate business news flash text, and inserting the information of the business news flash text to be detected into the business news flash duplicate removal database.
2. The method according to claim 1, characterized in that the method further comprises:
if at least one of the Hamming distances is not greater than the preset value, determining that the business news flash text to be detected is a duplicate business news flash text, and discarding the business news flash text to be detected.
3. The method according to claim 1, characterized in that the method further comprises:
if the simhash fingerprint set does not exist in the business news flash duplicate removal database, determining that the business news flash text to be detected is a non-duplicate business news flash text, and inserting the information of the business news flash text to be detected into the business news flash duplicate removal database.
4. The method according to claim 1, characterized in that obtaining the business news flash text to be detected comprises:
crawling initial texts to be detected from the Internet in real time by means of a crawler;
performing pre-detection on the initial texts to be detected to obtain the text after detection;
preprocessing the text after detection to obtain the processed text;
extracting the company names mentioned in the processed text according to a preset enterprise name library;
if a company name is extracted, taking the processed text as the business news flash text to be detected.
5. The method according to claim 4, characterized in that the method further comprises:
if no company name is extracted, determining that the processed text is not the business news flash text to be detected, and discarding the processed text.
6. The method according to claim 4, characterized in that performing pre-detection on the initial texts to be detected to obtain the text after detection comprises:
obtaining a stop-word list;
discarding those initial texts to be detected that contain text from the stop-word list, to obtain the remaining initial texts to be detected;
taking the remaining initial texts to be detected as the text after detection.
7. The method according to claim 4, characterized in that preprocessing the text after detection to obtain the processed text comprises:
performing removal processing and/or conversion processing on the text after detection;
wherein the removal processing includes: removing HTML tags, and removing the content contained in preset tags;
the conversion processing converts all uppercase letters to lowercase.
8. A duplicate removal screening device for business news flashes, characterized in that the device comprises:
an obtaining module, for obtaining a business news flash text to be detected, wherein the business news flash text to be detected contains a company name;
a first computing module, for computing the business news flash text to be detected with a simhash algorithm to obtain a simhash fingerprint of the business news flash text to be detected;
an extraction module, for extracting, from a business news flash duplicate removal database and according to the company name in the business news flash text to be detected, a simhash fingerprint set of target business news flash texts that contain the company name within a preset time;
a second computing module, for calculating the Hamming distance between the simhash fingerprint of the business news flash text to be detected and each simhash fingerprint in the simhash fingerprint set;
a first determining module, configured to, if every one of the Hamming distances is greater than a preset value, determine that the business news flash text to be detected is a non-duplicate business news flash text, and insert the information of the business news flash text to be detected into the business news flash duplicate removal database.
9. The device according to claim 8, characterized in that the device further comprises:
a second determining module, configured to, if at least one of the Hamming distances is not greater than the preset value, determine that the business news flash text to be detected is a duplicate business news flash text, and discard the business news flash text to be detected.
10. The device according to claim 8, characterized in that the device further comprises:
a third determining module, configured to, if the simhash fingerprint set does not exist in the business news flash duplicate removal database, determine that the business news flash text to be detected is a non-duplicate business news flash text, and insert the information of the business news flash text to be detected into the business news flash duplicate removal database.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810675645.3A CN108846117A (en) 2018-06-26 2018-06-26 The duplicate removal screening technique and device of business news flash

Publications (1)

Publication Number Publication Date
CN108846117A true CN108846117A (en) 2018-11-20

Family

ID=64202810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810675645.3A Pending CN108846117A (en) 2018-06-26 2018-06-26 The duplicate removal screening technique and device of business news flash

Country Status (1)

Country Link
CN (1) CN108846117A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000079426A1 (en) * 1999-06-18 2000-12-28 The Trustees Of Columbia University In The City Of New York System and method for detecting text similarity over short passages
WO2010019209A1 (en) * 2008-08-11 2010-02-18 Collective Media, Inc. Method and system for classifying text
US20120163707A1 (en) * 2010-12-28 2012-06-28 Microsoft Corporation Matching text to images
CN102799647A (en) * 2012-06-30 2012-11-28 华为技术有限公司 Method and device for webpage reduplication deletion
CN104572787A (en) * 2013-10-29 2015-04-29 腾讯科技(深圳)有限公司 Method and device for recognizing pseudo original website
CN106202561A (en) * 2016-07-29 2016-12-07 北京联创众升科技有限公司 Digitized contingency management case library construction methods based on the big data of text and device
CN106446148A (en) * 2016-09-21 2017-02-22 中国运载火箭技术研究院 Cluster-based text duplicate checking method
CN107562824A (en) * 2017-08-21 2018-01-09 昆明理工大学 A kind of text similarity detection method
CN107609106A (en) * 2017-09-12 2018-01-19 马上消费金融股份有限公司 A kind of similar article lookup method, device, equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859063A (en) * 2019-04-30 2020-10-30 北京智慧星光信息技术有限公司 Control method and device for monitoring transfer of seal information in Internet
CN111859063B (en) * 2019-04-30 2023-11-03 北京智慧星光信息技术有限公司 Control method and device for monitoring transfer seal information in Internet
CN110837555A (en) * 2019-11-11 2020-02-25 苏州朗动网络科技有限公司 Method, equipment and storage medium for removing duplicate and screening of massive texts
CN110955751A (en) * 2019-11-13 2020-04-03 广州供电局有限公司 Method, device and system for removing duplication of work ticket text and computer storage medium
CN111382233A (en) * 2020-03-18 2020-07-07 深圳市随金科技有限公司 Similar text detection method and device, electronic equipment and storage medium
CN114519110A (en) * 2022-01-26 2022-05-20 北京金堤科技有限公司 Public opinion text display method and device
CN114528375A (en) * 2022-01-26 2022-05-24 北京金堤科技有限公司 Similar public opinion text recognition method and device

Similar Documents

Publication Publication Date Title
CN108846117A (en) The duplicate removal screening technique and device of business news flash
CN105138652B (en) A kind of enterprise's incidence relation recognition methods and system
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
CN104899508B (en) A kind of multistage detection method for phishing site and system
CN109145216A (en) Network public-opinion monitoring method, device and storage medium
CN110602045B (en) Malicious webpage identification method based on feature fusion and machine learning
CN106776567B (en) Internet big data analysis and extraction method and system
CN111797239B (en) Application program classification method and device and terminal equipment
CN110737821B (en) Similar event query method, device, storage medium and terminal equipment
CN110309251B (en) Text data processing method, device and computer readable storage medium
CN105468744A (en) Big data platform for realizing tax public opinion analysis and full text retrieval
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN106446124B (en) A kind of Website classification method based on cyberrelationship figure
CN111460803B (en) Equipment identification method based on Web management page of industrial Internet of things equipment
CN109033203A (en) A kind of feature extraction method for parallel processing towards big data
CN105095381A (en) Method and device for new word identification
CN112328936A (en) Website identification method, device and equipment and computer readable storage medium
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN113282754A (en) Public opinion detection method, device, equipment and storage medium for news events
CN112364014A (en) Data query method, device, server and storage medium
CN110580337A (en) professional entity disambiguation implementation method based on entity similarity calculation
CN107085603B (en) Data processing method and device
CN102929948B (en) list page identification system and method
CN108595453B (en) URL (Uniform resource locator) identifier mapping obtaining method and device
CN109739840A (en) Data processing empty value method, apparatus and terminal device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181120