CN107066623A - Article merging method and device - Google Patents

Article merging method and device

Info

Publication number
CN107066623A
Authority
CN
China
Prior art keywords
article
target
word
hash codes
default
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710335322.5A
Other languages
Chinese (zh)
Inventor
赵海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Zingrow Information Technology Co Ltd
Original Assignee
Hunan Zingrow Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Zingrow Information Technology Co Ltd filed Critical Hunan Zingrow Information Technology Co Ltd
Priority to CN201710335322.5A priority Critical patent/CN107066623A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

An embodiment of the invention discloses an article merging method. First, a plurality of articles to be merged are segmented into words according to a preset part-of-speech library and a professional word database, and a target word set is extracted for each article. A hash code is then computed for each target word set using a preset algorithm, and a first preset function is used to calculate, in turn, the distance between the hash codes of the target articles that satisfy a preset time condition. When the distance between target articles is determined to be no greater than a preset distance threshold, the corresponding target articles are merged. The method accurately extracts the keywords of similar articles and improves the accuracy of article similarity judgment. It also effectively avoids wrongly merging articles with different content merely because they use the same template, which helps improve both the accuracy and the speed of article merging. In addition, an embodiment of the invention provides a corresponding implementing device, which makes the method more practical; the device has the corresponding advantages.

Description

Article merging method and device
Technical field
Embodiments of the present invention relate to the technical field of text information processing, and in particular to an article merging method and device.
Background
With the development of computer and Internet technology, users increasingly rely on the network: reading news, learning new knowledge, and acquiring new skills are all done with resources obtained online. The volume of documents on the network keeps growing, and their sources are becoming ever more varied. The same article may be forwarded repeatedly by many people, or may be slightly modified to produce another article, and so on. Such similar articles not only occupy a large amount of network space but also cause multiple identical resources to appear in a user's search results, which inconveniences the user.
In the prior art, although methods for merging similar articles exist, they either extract the core of an article inaccurately or are not targeted, so merging errors occur, accuracy is low, and different articles are mistaken for similar articles and merged. For example, for articles with a fixed format, such as news and issued announcements, the prior art often merges different articles that use the same template. Consider several news items on the same theme in which only the time of the event differs: the prior art treats them as similar articles by default and merges them, with the result that news about events from some years can no longer be found on the network.
Summary of the invention
The purpose of the embodiments of the present invention is to provide an article merging method and device that improve the accuracy of article merging.
To solve the above technical problem, the embodiments of the present invention provide the following technical solutions:
In one aspect, an embodiment of the present invention provides an article merging method, including:
Obtaining a plurality of articles to be merged;
Performing word segmentation on the articles according to a preset part-of-speech library and a professional word database to obtain a target word set for each article, where the preset part-of-speech library specifies the parts of speech of the target words in the target word set, and the professional word database includes user business-requirement phrases and/or phrases obtained by inverse-document-frequency word extraction from various types of articles;
Computing a hash code for each target word set using a preset algorithm, selecting the target articles that satisfy a preset time condition, and using a first preset function to calculate, in turn, the distance between the target articles from their corresponding hash codes;
When the distance between target articles is determined to be no greater than a preset distance threshold, merging the corresponding target articles.
Optionally, after the hash code is computed for each target word set using the preset algorithm, the method further includes:
Performing dimension-raising and dimension-reduction processing on each hash code according to the professional word database.
Optionally, computing a hash code for each target word set using the preset algorithm is:
Calling Simhash(test, 64) to compute a 64-bit hash code for each target word set.
Optionally, performing word segmentation on the articles according to the preset part-of-speech library and the professional word database to obtain a target word set for each article is:
Extracting the target phrases in each article according to the preset part-of-speech library and the professional word database;
Normalizing the target phrases according to the industry type of each article to generate the corresponding target word set.
Optionally, merging the corresponding target articles when the distance between target articles is no greater than the preset distance threshold includes:
Selecting the target articles whose mutual distance is no greater than the preset distance threshold and displaying them to the user;
Receiving an instruction in which the user judges the similarity of the selected target articles, and merging the target articles according to the instruction.
Optionally, the first preset function is the getDis function.
Optionally, after the hash code is computed for each target word set using the preset algorithm, the method further includes:
Saving each hash code to a hash server.
Optionally, selecting the target articles that satisfy the preset time condition is:
Obtaining the publication time of each target article;
When the publication times of two target articles are determined to differ by no more than 15 days, selecting the two articles.
Optionally, the preset distance threshold is 3.5.
In another aspect, an embodiment of the present invention provides an article merging device, including:
An acquisition module, configured to obtain a plurality of articles to be merged;
A word segmentation module, configured to perform word segmentation on the articles according to a preset part-of-speech library and a professional word database to obtain a target word set for each article, where the preset part-of-speech library specifies the parts of speech of the target words in the target word set, and the professional word database includes user business-requirement phrases and/or phrases obtained by inverse-document-frequency word extraction from various types of articles;
A computing module, configured to compute a hash code for each target word set using a preset algorithm, select the target articles that satisfy a preset time condition, and use a first preset function to calculate, in turn, the distance between the target articles from their corresponding hash codes;
A merging module, configured to merge the corresponding target articles when the distance between target articles is determined to be no greater than a preset distance threshold.
An embodiment of the invention provides an article merging method. First, a plurality of articles to be merged are segmented into words according to a preset part-of-speech library and a professional word database, and a target word set is extracted for each article. A hash code is then computed for each target word set using a preset algorithm, and a first preset function is used to calculate, in turn, the distance between the hash codes of the target articles that satisfy a preset time condition. When the distance between target articles is determined to be no greater than a preset distance threshold, the corresponding target articles are merged.
The advantage of the technical solution provided by the present application is that segmenting the articles to be merged with a preset dictionary and a professional dictionary helps extract the keywords of similar articles accurately and improves the accuracy of article similarity judgment. In addition, only target articles that satisfy the preset time condition are selected from the articles to be merged for similarity comparison, which effectively avoids wrongly merging articles with different content merely because they use the same template, and helps improve both the accuracy and the speed of article merging.
In addition, an embodiment of the invention provides a corresponding implementing device for the article merging method, which makes the method more practical; the device has the corresponding advantages.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an article merging method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another article merging method provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of yet another article merging method provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of one embodiment of an article merging device provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of another embodiment of an article merging device provided by an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the invention is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
The terms "first", "second", "third", "fourth", and so on in the description, the claims, and the above drawings of this application are used to distinguish different objects rather than to describe a specific order. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but may include steps or units that are not listed.
Having described the technical solution of the embodiments of the present invention, various non-limiting embodiments of the present application are described in detail below.
Referring first to Fig. 1, which is a schematic flowchart of an article merging method provided by an embodiment of the present invention, the embodiment may include the following:
S101: Obtain a plurality of articles to be merged.
A web crawler automatically obtains content of all kinds from web pages, and ETL (Extraction Transformation Loading, data extraction) is then used to collect the corresponding text data from the obtained content. Finally, the strToken module can call the stringToken(text) function to obtain the articles to be merged.
S102: Perform word segmentation on the articles according to the preset part-of-speech library and the professional word database to obtain a target word set for each article.
Word segmentation splits the text of an article into individual words.
The preset part-of-speech library may specify the parts of speech of the target words in the target word set. It may include, for example, nouns, verbs, adverbial verbs, nominal verbs, directional verbs, formal verbs, intransitive verbs, idiomatic expressions, verbal morphemes, names of organizations and groups, other proper nouns, new words, and so on. That is, when an article is segmented, the phrases whose part of speech appears in the preset library are retained and placed in the target word set.
The professional word database may include user business-requirement phrases and/or phrases obtained by inverse-document-frequency word extraction from various types of articles. User business-requirement phrases are common business terms determined for different users, or for different businesses of the same user; they help improve the accuracy of extracting the core keywords of articles in a given field. Because articles are generally obtained from existing news, announcements, microblogs, and WeChat, inverse-document-frequency words can be extracted from the data in those sources. The professional word database is a user-defined dictionary; it may include both user business-requirement phrases and inverse-document-frequency phrases, or only one of the two, and this does not affect the implementation of the present application.
In one specific embodiment, the articles to be merged may first be segmented according to the preset part-of-speech library to obtain a primary target word set, and the professional word database may then be used to screen the primary target word set and remove some of its words, yielding the final target word set.
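The two-stage screening just described can be sketched as follows. This is only a minimal illustration, not the patent's implementation: it assumes a segmenter has already produced (word, part-of-speech tag) pairs, the tag names in KEPT_POS are hypothetical, and it assumes that "screening" means keeping only words found in the professional word database, which the text does not state explicitly.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical part-of-speech tags standing in for the preset part-of-speech library.
KEPT_POS: Set[str] = {"n", "v", "nt", "nz"}


def primary_word_set(tagged: Iterable[Tuple[str, str]],
                     kept_pos: Set[str] = KEPT_POS) -> List[str]:
    """Stage 1: keep the words whose part of speech is in the preset library."""
    return [word for word, pos in tagged if pos in kept_pos]


def final_word_set(primary: Iterable[str],
                   professional_words: Set[str]) -> List[str]:
    """Stage 2 (assumed rule): keep only words present in the professional word database."""
    return [word for word in primary if word in professional_words]


# Usage with made-up data.
tagged_article = [("shareholders", "n"), ("the", "u"), ("meeting", "n"), ("held", "v")]
professional = {"shareholders", "meeting"}
primary = primary_word_set(tagged_article)
target_words = final_word_set(primary, professional)
```

Keeping the two stages as separate functions makes it easy to swap the second rule, for example to remove rather than keep dictionary words, if the intended screening direction is different.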
To further improve the accuracy of the extracted core keywords, that is, of the phrases in the target word set, in other embodiments S102 may specifically include:
Extracting the target phrases in each article according to the preset part-of-speech library and the professional word database;
Normalizing the target phrases according to the industry type of each article to generate the corresponding target word set.
The industry type of an article may be, for example, entertainment news, financial commentary, sports, or education. In different industry fields, words with the same meaning often differ, so to improve the accuracy of the similarity judgment the extracted phrases can be normalized. Normalization here means replacing the non-professional words in the extracted target word set with the proprietary terms of the same industry field.
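As an illustration of this normalization, the sketch below replaces lay terms with a field's proprietary term through a per-industry lookup table. The table contents, the industry key, and the function name normalize are all hypothetical; the patent does not specify how the mapping is stored.

```python
from typing import Dict, List


def normalize(words: List[str], industry: str,
              synonym_tables: Dict[str, Dict[str, str]]) -> List[str]:
    """Replace non-professional words with the industry's proprietary term.

    Words without an entry, or industries without a table, are left unchanged.
    """
    table = synonym_tables.get(industry, {})
    return [table.get(word, word) for word in words]


# Hypothetical per-industry synonym tables.
tables = {"finance": {"shares": "equity", "stock market": "equity market"}}
normalized = normalize(["shares", "meeting"], "finance", tables)
```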
S103: Compute a hash code for each target word set using a preset algorithm, and select the target articles that satisfy a preset time condition; use a first preset function to calculate, in turn, the distance between the target articles from their corresponding hash codes.
Simhash(test, 64) may be called to compute a 64-bit hash code for each target word set. Of course, other algorithms may also be used; this does not affect the implementation of the present application.
SimHash is an algorithm for deduplicating massive amounts of text. It converts a document into a 64-bit value, called a feature word; to decide whether two documents are similar, it is only necessary to check whether the distance between their feature words is less than n (empirically, n is generally 3).
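The comparison of two feature words can be sketched as below. This is a minimal illustration that assumes the fingerprints are held as Python integers and that the distance is the Hamming distance (the number of differing bits), which is the distance normally used with SimHash; the function names are hypothetical and are not taken from the patent.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_similar(a: int, b: int, n: int = 3) -> bool:
    """Treat two documents as similar when the distance is below n."""
    return hamming_distance(a, b) < n
```

For instance, the fingerprints 0b1011 and 0b0010 differ in two bit positions, so they count as similar under n = 3.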
Consider articles with a fixed format, such as news, notices, and announcements. For example, news item A reports that the 2015 general meeting of shareholders was held, and news item B reports that the 2016 general meeting of shareholders was held. Because both are listed-company announcements, they share a basic template; the author has merely changed the time and part of the content. If keywords are extracted improperly, the two articles will be merged. Articles that use the same template are generally time-limited (news, for instance, is time-sensitive). In view of this, optionally, before the distances between hash codes are compared, the method may further include:
Obtaining the publication time of each target article;
When the publication times of two target articles are determined to differ by no more than 15 days, selecting the two articles.
If the publication times of two articles differ by more than 15 days, the articles are not compared. Optionally, articles published within 7 days of each other may be compared first. The time restriction not only improves the accuracy of the similarity judgment but also speeds up the similarity comparison.
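The time condition can be sketched as a pre-filter that yields only the article pairs worth comparing. This is a hedged illustration: the article identifiers, the dictionary layout, and the function name candidate_pairs are assumptions, and a production system would likely avoid the all-pairs loop for large collections.

```python
from datetime import date
from itertools import combinations
from typing import Dict, List, Tuple


def candidate_pairs(published: Dict[str, date],
                    max_days: int = 15) -> List[Tuple[str, str]]:
    """Return the pairs of article ids published at most max_days apart."""
    pairs = []
    for (id_a, day_a), (id_b, day_b) in combinations(sorted(published.items()), 2):
        if abs((day_a - day_b).days) <= max_days:
            pairs.append((id_a, id_b))
    return pairs


# Usage with made-up publication dates: only "a" and "b" fall within 15 days.
published = {
    "a": date(2017, 5, 1),
    "b": date(2017, 5, 10),
    "c": date(2016, 5, 1),
}
pairs = candidate_pairs(published)
```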
The first preset function may be the getDis function. The function getDis(String hashCode, int diffDay) performs the comparison on data within a window of 7 to 15 days, which on the one hand improves comparison performance and on the other hand improves accuracy. Of course, other functions may also be used; this does not affect the implementation of the present application.
S104: When the distance between target articles is determined to be no greater than the preset distance threshold, merge the corresponding target articles.
The preset distance threshold may be 3.5. Of course, it may also take other values; this does not affect the implementation of the present application.
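One way to turn the pairwise threshold test into merged groups is sketched below. It assumes integer fingerprints compared by Hamming distance (so a threshold of 3.5 admits distances of 0 to 3) and assumes that merging is transitive, which is implemented here with a small union-find structure; the patent does not say how groups are formed, and all names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def merge_groups(fingerprints: Dict[str, int],
                 pairs: Iterable[Tuple[str, str]],
                 threshold: float = 3.5) -> List[List[str]]:
    """Group article ids whose fingerprint distance is no greater than threshold."""
    parent = {key: key for key in fingerprints}

    def find(x: str) -> str:
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        distance = bin(fingerprints[a] ^ fingerprints[b]).count("1")
        if distance <= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for key in fingerprints:
        groups.setdefault(find(key), []).append(key)
    return sorted(sorted(group) for group in groups.values())


# Usage with made-up 8-bit fingerprints: "a" and "b" differ in one bit.
fingerprints = {"a": 0b00001111, "b": 0b00001110, "c": 0b11110000}
groups = merge_groups(fingerprints, [("a", "b"), ("a", "c"), ("b", "c")])
```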
In the technical solution provided by this embodiment of the invention, segmenting the articles to be merged with a preset dictionary and a professional dictionary helps extract the keywords of similar articles accurately and improves the accuracy of article similarity judgment. In addition, only target articles that satisfy the preset time condition are selected from the articles to be merged for similarity comparison, which effectively avoids wrongly merging articles with different content merely because they use the same template, and helps improve both the accuracy and the speed of article merging.
In one specific embodiment, when there are many articles to be merged, to speed up the comparison of hash codes, based on the above embodiment the method may further include, after the hash code is computed for each target word set using the preset algorithm:
Performing dimension-raising and dimension-reduction processing on each hash code according to the professional word database.
SimHash is a fingerprint generation (fingerprint extraction) algorithm that is widely used for deduplicating web pages at the scale of hundreds of millions of pages; its main idea is dimensionality reduction. For example, after SimHash dimensionality reduction, a text of considerable length may be reduced to a binary string of 0s and 1s that is only 32 or 64 bits long, much like an identity card number for the text. In this way the SimHash algorithm simplifies a complex object by reducing its dimensions. SimHash works as follows: prepare a text; filter and clean it and extract n feature keywords; weight the features; hash each keyword into a signature of 0s and 1s (6 bits in this example); then weight the vectors: for each bit of each 6-bit signature, if the bit is 1 the weight is counted positively, and if the bit is 0 the weight is counted negatively, which yields a weighted vector for each feature; add all feature vectors together to obtain one final vector; finally reduce it bit by bit, so that each component greater than 0 becomes 1 and every other component becomes 0. The result is the final SimHash fingerprint signature.
Performing dimension-reduction processing on the hash codes helps improve the merging speed.
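The SimHash steps described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the per-word hash is taken from the first bytes of an MD5 digest, which is a choice made here for illustration and is not specified in the patent, and the keyword weights are supplied by the caller.

```python
import hashlib
from typing import Dict


def simhash(weighted_words: Dict[str, int], bits: int = 64) -> int:
    """Compute a SimHash fingerprint from keyword weights (bits must be a multiple of 8, at most 128)."""
    totals = [0] * bits
    for word, weight in weighted_words.items():
        # Hash the keyword to a bits-wide integer (MD5 is an illustrative choice).
        digest = hashlib.md5(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A 1 bit adds the weight, a 0 bit subtracts it.
            if (word_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i in range(bits):
        # Reduce each accumulated component to a single bit.
        if totals[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

With a single keyword of weight 1 the fingerprint equals that keyword's own hash, and a keyword whose weight exceeds the sum of all other weights determines every bit of the result, which is a quick way to sanity-check an implementation.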
When the system merges articles automatically, articles with different content may be merged by mistake. In view of this, the present application provides another embodiment which, referring to Fig. 2, may include:
S201-S203: These steps are the same as S101-S103 of the above embodiment and are not described again here.
S204: Select the target articles whose mutual distance is no greater than the preset distance threshold, and display them to the user;
S205: Receive an instruction in which the user judges the similarity of the selected target articles, and merge the target articles according to the instruction.
The system selects the suspected similar articles, that is, the articles whose mutual distance is no greater than the preset distance threshold, and may generate a list that is displayed to the user. The user makes a further judgment on the suspected articles and sends a merge instruction for the similar articles he or she has selected. The system merges those similar articles according to the instruction and does not merge the other suspected articles that it identified automatically.
This further confirmation by the user improves the accuracy of article merging.
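The confirmation flow of S204 and S205 can be sketched as two small steps: build the list of suspected pairs, then keep only those the user approves. The data layout (a dictionary from article-id pairs to distances, and a set of approved pairs standing in for the user's instruction) and the function names are assumptions made for this illustration.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def pending_review(distances: Dict[Pair, int], threshold: float = 3.5) -> List[Pair]:
    """S204 sketch: list the suspected similar pairs to display to the user."""
    return sorted(pair for pair, d in distances.items() if d <= threshold)


def apply_decisions(candidates: List[Pair], approved: Set[Pair]) -> List[Pair]:
    """S205 sketch: keep only the pairs the user's instruction says to merge."""
    return [pair for pair in candidates if pair in approved]


# Usage with made-up distances and a made-up user decision.
distances = {("a", "b"): 2, ("a", "c"): 9, ("b", "d"): 3}
shown_to_user = pending_review(distances)
to_merge = apply_decisions(shown_to_user, approved={("a", "b")})
```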
A sudden power failure or another equipment fault may cause the computed hash codes to be lost. To avoid repeating the computation, based on the above embodiment and referring to Fig. 3, the method may further include:
S206: Save each hash code to a hash server.
Optionally, the computed hash codes may be managed by the hash server, and the hash codes to be compared may be stored in the hash server in the manner of a first-in, last-out queue.
Saving the hash codes improves the stability and reliability of the system and helps improve the speed of article merging.
An embodiment of the invention also provides a corresponding implementing device for the article merging method, which makes the method more practical. The article merging device provided by the embodiment is introduced below; the device described below and the article merging method described above may be referred to in correspondence with each other.
Referring to Fig. 4, which is a structural diagram of one embodiment of the article merging device provided by an embodiment of the present invention, the device may include:
An acquisition module 401, configured to obtain a plurality of articles to be merged.
A word segmentation module 402, configured to perform word segmentation on the articles according to a preset part-of-speech library and a professional word database to obtain a target word set for each article, where the preset part-of-speech library specifies the parts of speech of the target words in the target word set, and the professional word database includes user business-requirement phrases and/or phrases obtained by inverse-document-frequency word extraction from various types of articles.
A computing module 403, configured to compute a hash code for each target word set using a preset algorithm, select the target articles that satisfy a preset time condition, and use a first preset function to calculate, in turn, the distance between the target articles from their corresponding hash codes.
A merging module 404, configured to merge the corresponding target articles when the distance between target articles is determined to be no greater than a preset distance threshold.
Optionally, in some implementations of this embodiment, referring to Fig. 5, the device may further include, for example:
A hash code processing module 405, configured to perform dimension-raising and dimension-reduction processing on each hash code according to the professional word database.
In some specific implementations, the word segmentation module 402 may be a module that extracts the target phrases in each article according to the preset part-of-speech library and the professional word database, and normalizes the target phrases according to the industry type of each article to generate the corresponding target word set.
In other implementations, the merging module 404 may be a module that selects the target articles whose mutual distance is no greater than the preset distance threshold and displays them to the user, receives an instruction in which the user judges the similarity of the selected target articles, and merges the target articles according to the instruction.
Optionally, in other implementations of this embodiment, referring to Fig. 5, the device may further include, for example:
A storage module 406, configured to save each hash code to a hash server.
In some implementations of this embodiment, the computing module 403 may be a module that obtains the publication time of each target article and, when the publication times of two target articles are determined to differ by no more than 15 days, selects the two articles.
The functions of the functional modules of the article merging device described in this embodiment may be implemented according to the method in the above method embodiment. For the specific implementation process, refer to the related description of the method embodiment, which is not repeated here.
As can be seen from the above, the embodiments of the present invention segment the articles to be merged with a preset dictionary and a professional dictionary, which helps extract the keywords of similar articles accurately and improves the accuracy of article similarity judgment. In addition, only target articles that satisfy the preset time condition are selected from the articles to be merged for similarity comparison, which effectively avoids wrongly merging articles with different content merely because they use the same template, and helps improve both the accuracy and the speed of article merging.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The article merging method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help readers understand the method of the invention and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to the invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims of the invention.

Claims (10)

1. An article merging method, characterized by comprising:
obtaining a plurality of articles to be merged;
performing word segmentation on the plurality of articles according to a preset part-of-speech library and a professional word database, to obtain a respective target word set for each article; wherein the preset part-of-speech library defines the part of speech of each target word in the target word set, and the professional word database includes phrases reflecting user business requirements and/or phrases obtained by inverse-document-frequency word extraction from articles of various types;
computing a hash code for each target word set using a preset algorithm, and selecting target articles that satisfy a preset time condition; using a first preset function to calculate, in turn, the distance between the target articles from the hash code corresponding to each target article;
when the distance between target articles is determined to be not greater than a preset distance threshold, merging the corresponding target articles.
2. The method according to claim 1, characterized in that, after computing a hash code for each target word set using the preset algorithm, the method further comprises:
performing dimension addition and dimension reduction on each hash code according to the professional word database.
3. The method according to claim 2, characterized in that computing a hash code for each target word set using the preset algorithm is:
calling Simhash (test, 64) to compute a 64-bit hash code for each target word set.
4. The method according to any one of claims 1 to 3, characterized in that performing word segmentation on the plurality of articles according to the preset part-of-speech library and the professional word database to obtain the respective target word sets is:
extracting target phrases from each article according to the preset part-of-speech library and the professional word database;
normalizing the target phrases according to the industry type corresponding to each article, to generate the respective target word sets.
5. The method according to any one of claims 1 to 3, characterized in that merging the corresponding target articles when the distance between target articles is not greater than the preset distance threshold comprises:
selecting the target articles whose mutual distance is not greater than the preset distance threshold, and displaying them to a user;
receiving an instruction in which the user judges the similarity of the selected target articles, and merging the target articles according to the instruction.
6. The method according to claim 5, characterized in that the first preset function is a getDis function.
7. The method according to claim 6, characterized in that, after computing a hash code for each target word set using the preset algorithm, the method further comprises:
saving each hash code to a hash server.
8. The method according to claim 7, characterized in that selecting the target articles that satisfy the preset time condition is:
obtaining the publication time of each target article;
when the publication times of two target articles are determined to be no more than 15 days apart, selecting those articles.
9. The method according to claim 8, characterized in that the preset distance threshold is 3.5.
10. An article merging device, characterized by comprising:
an acquisition module, configured to obtain a plurality of articles to be merged;
a word segmentation module, configured to perform word segmentation on the plurality of articles according to a preset part-of-speech library and a professional word database, to obtain a respective target word set for each article; wherein the preset part-of-speech library defines the part of speech of each target word in the target word set, and the professional word database includes phrases reflecting user business requirements and/or phrases obtained by inverse-document-frequency word extraction from articles of various types;
a computing module, configured to compute a hash code for each target word set using a preset algorithm, to select target articles that satisfy a preset time condition, and to use a first preset function to calculate, in turn, the distance between the target articles from the hash code corresponding to each target article;
a merging module, configured to merge the corresponding target articles when the distance between target articles is determined to be not greater than a preset distance threshold.
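The selection and merging steps of claims 1, 8 and 9 can be sketched in Python as follows. This is an illustration under stated assumptions. The source does not define `getDis`, so a Hamming distance between hash codes stands in for it. Grouping matched pairs transitively with a union-find structure is also an assumption, since the claims say only that the corresponding target articles are merged. The names `candidate_pairs` and `merge_groups` are hypothetical.

```python
from datetime import datetime, timedelta
from itertools import combinations

MAX_GAP = timedelta(days=15)   # claim 8: publication times at most 15 days apart
MAX_DISTANCE = 3.5             # claim 9: preset distance threshold


def hamming_distance(a, b):
    """Assumed stand-in for getDis: number of differing bits."""
    return bin(a ^ b).count("1")


def candidate_pairs(articles):
    """articles: list of (article_id, published_at, hash_code) tuples."""
    pairs = []
    for (id_a, t_a, h_a), (id_b, t_b, h_b) in combinations(articles, 2):
        if abs(t_a - t_b) > MAX_GAP:
            continue  # fails the preset time condition
        if hamming_distance(h_a, h_b) <= MAX_DISTANCE:
            pairs.append((id_a, id_b))
    return pairs


def merge_groups(article_ids, pairs):
    """Group articles so that every matched pair shares a group."""
    parent = {a: a for a in article_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for a in article_ids:
        groups.setdefault(find(a), []).append(a)
    return list(groups.values())


articles = [
    ("a", datetime(2017, 5, 1), 0b1111),
    ("b", datetime(2017, 5, 10), 0b1110),
    ("c", datetime(2017, 6, 30), 0b1111),
    ("d", datetime(2017, 5, 2), 0b11110000),
]
pairs = candidate_pairs(articles)
groups = merge_groups([a[0] for a in articles], pairs)
```

With an integer-valued distance such as the one assumed here, the threshold 3.5 admits distances of 0 to 3; the source does not say whether `getDis` can return non-integer values. Under claim 5, the matched pairs would be shown to a user for confirmation before `merge_groups` is applied.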
CN201710335322.5A 2017-05-12 2017-05-12 A kind of article merging method and device Pending CN107066623A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710335322.5A CN107066623A (en) 2017-05-12 2017-05-12 A kind of article merging method and device

Publications (1)

Publication Number Publication Date
CN107066623A (en) 2017-08-18

Family

ID=59597366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710335322.5A Pending CN107066623A (en) 2017-05-12 2017-05-12 A kind of article merging method and device

Country Status (1)

Country Link
CN (1) CN107066623A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064304A1 (en) * 2002-07-03 2004-04-01 Word Data Corp Text representation and method
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering
CN104252445A (en) * 2013-06-26 2014-12-31 华为技术有限公司 Document similarity calculation method and near-duplicate document detection method and device
CN106569989A (en) * 2016-10-20 2017-04-19 北京智能管家科技有限公司 De-weighting method and apparatus for short text

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555198A (en) * 2018-05-31 2019-12-10 北京百度网讯科技有限公司 method, apparatus, device and computer-readable storage medium for generating article
CN110555198B (en) * 2018-05-31 2023-05-23 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for generating articles
WO2022105497A1 (en) * 2020-11-19 2022-05-27 深圳壹账通智能科技有限公司 Text screening method and apparatus, device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170818
