CN107066623A - Article merging method and device - Google Patents
- Publication number
- CN107066623A CN201710335322.5A
- Authority
- CN
- China
- Prior art keywords
- article
- target
- word
- hash codes
- default
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
An embodiment of the invention discloses an article merging method. Multiple articles to be merged are first segmented into words according to a preset part-of-speech library and a professional word database, and a target word set is extracted for each article. A hash code is then computed for each target word set using a preset algorithm, and a first preset function is used to calculate, in turn, the distances between the hash codes of the target articles that satisfy a preset time condition. When the distance between target articles is judged to be not greater than a preset distance threshold, the corresponding target articles are merged. The method extracts the key words of similar articles accurately, which improves the accuracy of article similarity judgment; it effectively avoids wrongly merging articles with different content merely because they use the same template, which improves the accuracy of article merging and also helps increase merging speed. In addition, an embodiment of the invention provides a corresponding implementation device, which makes the method more practical; the device has the corresponding advantages.
Description
Technical field
The embodiments of the present invention relate to the technical field of text information processing, and in particular to an article merging method and device.
Background art
With the development of computer and Internet technology, users increasingly depend on the network: reading news, learning new knowledge, and acquiring new skills all rely on resources obtained online. The volume of documents on the network keeps growing, and their sources are increasingly diverse. The same article may be forwarded many times by many people, or slightly modified to produce another article, and so on. Such similar articles not only occupy a large amount of network space but also cause multiple identical resources to appear when a user searches, which inconveniences the user.
In the prior art, methods for collapsing similar articles exist, but because they extract the core content of an article inaccurately or lack specificity, merging errors occur and accuracy is low: different articles are mistaken for similar articles and merged. For example, for articles with a fixed format, such as news and announcements, the prior art often merges different articles that use the same type of template. For multiple news items on the same theme in which only the time of the event differs, the prior art treats them as similar articles by default and merges them, with the result that news about events in some years can no longer be found by searching the network.
Summary of the invention
The purpose of the embodiments of the present invention is to provide an article merging method and device that improve the accuracy of article merging.
To solve the above technical problem, the embodiments of the present invention provide the following technical solutions:
In one aspect, an embodiment of the present invention provides an article merging method, including:
Obtaining multiple articles to be merged;
Performing word segmentation on the multiple articles according to a preset part-of-speech library and a professional word database to obtain a target word set for each article, where the preset part-of-speech library specifies the parts of speech of the target words in the target word sets, and the professional word database includes user business requirement phrases and/or phrases obtained by inverse document frequency word extraction from various types of articles;
Computing a hash code for each target word set using a preset algorithm, and selecting the target articles that satisfy a preset time condition; using a first preset function to calculate, in turn, the distances between the target articles from the hash codes corresponding to the target articles;
When the distance between target articles is judged to be not greater than a preset distance threshold, merging the corresponding target articles.
Optionally, after computing a hash code for each target word set using the preset algorithm, the method further includes:
Performing dimension addition and dimension reduction on each hash code according to the professional word database.
Optionally, computing a hash code for each target word set using the preset algorithm is:
Calling Simhash (test, 64) to compute a 64-bit hash code for each target word set.
Optionally, performing word segmentation on the multiple articles according to the preset part-of-speech library and the professional word database to obtain a target word set for each article is:
Extracting the target phrases in each article according to the preset part-of-speech library and the professional word database;
Normalizing the target phrases according to the industry type of each article to generate the corresponding target word sets.
Optionally, merging the corresponding target articles when the distance between target articles is not greater than the preset distance threshold includes:
Selecting the target articles whose mutual distance is not greater than the preset distance threshold and displaying them to the user;
Receiving the user's instruction judging the similarity of the selected target articles, and merging the target articles according to the instruction.
Optionally, the first preset function is the getDis function.
Optionally, after computing a hash code for each target word set using the preset algorithm, the method further includes:
Saving each hash code to a hash server.
Optionally, selecting the target articles that satisfy the preset time condition is:
Obtaining the publication time of each target article;
When the publication times of two target articles are judged to differ by no more than 15 days, selecting the two target articles.
Optionally, the preset distance threshold is 3.5.
In another aspect, an embodiment of the present invention provides an article merging device, including:
An acquisition module, for obtaining multiple articles to be merged;
A word segmentation module, for performing word segmentation on the multiple articles according to a preset part-of-speech library and a professional word database to obtain a target word set for each article, where the preset part-of-speech library specifies the parts of speech of the target words in the target word sets, and the professional word database includes user business requirement phrases and/or phrases obtained by inverse document frequency word extraction from various types of articles;
A computing module, for computing a hash code for each target word set using a preset algorithm, selecting the target articles that satisfy a preset time condition, and using a first preset function to calculate, in turn, the distances between the target articles from the hash codes corresponding to the target articles;
A merging module, for merging the corresponding target articles when the distance between target articles is judged to be not greater than a preset distance threshold.
The embodiments of the present invention provide an article merging method. Multiple articles to be merged are first segmented according to a preset part-of-speech library and a professional word database, and a target word set is extracted for each article. A hash code is then computed for each target word set using a preset algorithm, and a first preset function is used to calculate, in turn, the distances between the hash codes of the target articles that satisfy a preset time condition. When the distance between target articles is judged to be not greater than a preset distance threshold, the corresponding target articles are merged.
The advantage of the technical solution provided by this application is that segmenting the articles to be merged with a preset dictionary and a professional dictionary helps extract the key words of similar articles accurately, which improves the accuracy of article similarity judgment. In addition, only the target articles that satisfy the preset time condition are selected from the articles to be merged for similarity comparison, which effectively avoids wrongly merging articles with different content merely because they use the same template, improves the accuracy of article merging, and also helps increase merging speed.
In addition, an embodiment of the present invention provides a corresponding implementation device for the article merging method, which makes the method more practical; the device has the corresponding advantages.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an article merging method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another article merging method provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of yet another article merging method provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of one embodiment of the article merging device provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of another embodiment of the article merging device provided by an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the invention is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The terms "first", "second", "third", "fourth", and the like in the description, claims, and drawings of this application are used to distinguish different objects, not to describe a particular order. The terms "include" and "have", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, and may include steps or units that are not listed.
Having described the technical solution of the embodiments of the present invention, various non-limiting embodiments of this application are described in detail below.
Referring first to Fig. 1, which is a schematic flowchart of an article merging method provided by an embodiment of the present invention, the embodiment may include the following:
S101: Obtain multiple articles to be merged.
A crawler automatically obtains content of various kinds from web pages, and ETL (Extraction-Transformation-Loading, data extraction) is then used to collect the corresponding text data from the obtained content. Finally, the strToken module may call the stringToken (text) function to obtain the articles to be merged.
S102: Perform word segmentation on the multiple articles according to the preset part-of-speech library and the professional word database to obtain a target word set for each article.
Word segmentation splits the text of an article into individual words.
The preset part-of-speech library may hold the parts of speech of the target words in the target word set, and may include, for example, nouns, verbs, adverbial verbs, nominal verbs, directional verbs, formal verbs, intransitive verbs, verbal idioms, verbal morphemes, organization names, other proper names, and new words. That is, when an article is segmented, the words whose part of speech is in the preset part-of-speech library are retained and placed in the target word set.
The professional word database may include user business requirement phrases and/or phrases obtained by inverse document frequency word extraction from various types of articles. User business requirement phrases are common business terms determined for different users, or for different businesses of the same user; they help improve the accuracy of extracting the core keywords of articles in a given field. Because articles are generally obtained from existing news, announcements, microblogs, and WeChat, inverse document frequency words can be extracted from the data in news, announcements, microblogs, and WeChat. The professional word database is a user-defined dictionary; it may include both user business requirement phrases and inverse document frequency phrases, or only one of them, neither of which affects the implementation of this application.
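The construction of such a professional word database can be sketched as follows. This is a minimal illustration only: the function names `idf_scores` and `build_professional_words`, the plain `log(N / df)` formula, and the `min_idf` cutoff are assumptions made for the example, since the text does not specify how the inverse document frequency words are selected.

```python
import math
from collections import Counter


def idf_scores(tokenized_docs):
    """Return {word: idf} with idf = log(N / df) over a tokenized corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def build_professional_words(tokenized_docs, business_words, min_idf):
    """Union of user business words and words whose IDF reaches min_idf.

    Keeping high-IDF words is an assumed selection rule, not one stated
    in the source text.
    """
    idf = idf_scores(tokenized_docs)
    rare_words = {word for word, score in idf.items() if score >= min_idf}
    return set(business_words) | rare_words
```

With a three-document corpus, a word that appears in every document gets an IDF of 0 and is dropped, while a word that appears in only one document gets `log(3)` and is kept when `min_idf` is 1.0.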
In one specific embodiment, the articles to be merged may first be segmented according to the preset part-of-speech library to obtain a primary target word set; the professional word database is then used to screen the primary target word set, removing some of the words, and the result is taken as the final target word set.
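The two-stage screening can be sketched as below. The sketch assumes the article has already been segmented and tagged into `(word, part_of_speech)` pairs by some external segmenter, and it assumes that "screening" means keeping only words present in the professional word database; the text does not state the exact rule, and the name `extract_target_words` and the tag strings are hypothetical.

```python
def extract_target_words(tagged_tokens, preset_pos, professional_words):
    """Filter (word, pos) pairs in two stages and return the target word list."""
    # Stage 1: keep words whose part of speech is in the preset library.
    primary = [word for word, pos in tagged_tokens if pos in preset_pos]
    # Stage 2 (assumed rule): keep only words found in the professional database.
    return [word for word in primary if word in professional_words]
```

For example, with `preset_pos = {"n", "v"}`, function words are dropped in the first stage, and general verbs that are not in the professional word database are dropped in the second.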
To further improve the accuracy of the extracted core keywords, that is, the phrases in the target word set, in other embodiments S102 may specifically include:
Extracting the target phrases in each article according to the preset part-of-speech library and the professional word database;
Normalizing the target phrases according to the industry type of each article to generate the corresponding target word sets.
For example, the industry type of an article may be entertainment news, financial commentary, sports coverage, educational articles, and so on. Across industries, the same meaning is often expressed with different words, so the extracted phrases can be normalized to improve the accuracy of the similarity judgment. Normalization here means replacing the non-professional words in the extracted target word set with the professional terms of the corresponding industry field.
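A minimal sketch of this normalization step follows, assuming a per-industry lookup table that maps non-professional words to the professional term of that field. The table contents, the industry keys, and the name `normalize_words` are illustrative assumptions.

```python
def normalize_words(words, industry, synonym_tables):
    """Replace each word with its professional term for the given industry.

    synonym_tables maps an industry name to {non_professional: professional}.
    Words without an entry, and industries without a table, pass through unchanged.
    """
    table = synonym_tables.get(industry, {})
    return [table.get(word, word) for word in words]
```

Because unknown words pass through unchanged, the step is safe to apply to every article regardless of whether a table exists for its industry.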
S103: Compute a hash code for each target word set using a preset algorithm, and select the target articles that satisfy the preset time condition; use the first preset function to calculate, in turn, the distances between the target articles from their corresponding hash codes.
Simhash (test, 64) may be called to compute a 64-bit hash code for each target word set; other algorithms may of course be used, which does not affect the implementation of this application.
Simhash is an algorithm for deduplicating massive amounts of text. Simhash converts a document into a 64-bit value, called its feature word. To decide whether two documents are similar, it is only necessary to check whether the distance between their feature words is less than n (empirically, n is generally 3).
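The comparison of two feature words can be sketched as follows. The text does not name the distance measure; the sketch assumes Hamming distance (the number of differing bits), which is the measure conventionally used with Simhash. The function names are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def is_similar(a: int, b: int, n: int = 3) -> bool:
    """Two documents are treated as similar when the distance is less than n."""
    return hamming_distance(a, b) < n
```

Since XOR sets a bit exactly where the two fingerprints disagree, counting the set bits of `a ^ b` gives the distance directly.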
Consider articles that use a fixed format, such as news and announcements. For example, news item A reports that the 2015 general meeting of shareholders was held, and news item B reports that the 2016 general meeting of shareholders was held. Because both are listed-company announcements, the articles share a basic template, and the author has changed only the time and part of the content; if keywords are extracted improperly, the two articles will be merged. Articles that share a template of this kind are generally time-limited (news, for instance, is time-sensitive). In view of this, optionally, before the distances between hash codes are compared, the method may further include:
Obtaining the publication time of each target article;
When the publication times of two target articles are judged to differ by no more than 15 days, selecting the two target articles.
If the publication times of two articles differ by more than 15 days, they are not compared. Optionally, articles published within 7 days of each other may be compared first. The time restriction not only improves the accuracy of the similarity judgment but also speeds up the similarity comparison.
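The time condition can be sketched as a pre-filter that produces the pairs worth comparing. The name `candidate_pairs` and the `(article_id, publication_date)` input shape are assumptions for the example; sorting by gap is one simple way to compare the closest articles (for instance those within 7 days) first.

```python
from datetime import date
from itertools import combinations


def candidate_pairs(articles, max_days=15):
    """Return ID pairs published within max_days of each other, closest first.

    articles is a list of (article_id, publication_date) tuples.
    """
    pairs = []
    for (id_a, day_a), (id_b, day_b) in combinations(articles, 2):
        gap = abs((day_a - day_b).days)
        if gap <= max_days:
            pairs.append((gap, id_a, id_b))
    pairs.sort()  # smallest publication gap first
    return [(id_a, id_b) for _, id_a, id_b in pairs]
```

Pairs outside the window never reach the hash comparison, which is how the 2015 and 2016 shareholder meeting announcements in the example above are kept apart.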
The first preset function may be the getDis function. The getDis (String hashCode, int diffDay) function compares against data within 7 to 15 days, which on the one hand improves comparison performance and on the other hand improves accuracy. Other functions may of course be used, which does not affect the implementation of this application.
S104: When the distance between target articles is judged to be not greater than the preset distance threshold, merge the corresponding target articles.
The preset distance threshold may be 3.5; it may of course take other values, which does not affect the implementation of this application.
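The merge step can be sketched as grouping articles whose pairwise distance does not exceed the threshold. The sketch uses a small union-find structure and therefore merges transitively (if A is close to B and B is close to C, all three end up in one group); that transitive behavior, the name `merge_groups`, and the input shape are assumptions, since the text does not describe how merged groups are formed.

```python
def merge_groups(article_ids, pair_distances, threshold=3.5):
    """Group article IDs whose pairwise distance is not greater than threshold.

    pair_distances maps (id_a, id_b) to the distance between their hash codes.
    """
    parent = {article_id: article_id for article_id in article_ids}

    def find(x):
        # Follow parent links to the group representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (id_a, id_b), distance in pair_distances.items():
        if distance <= threshold:
            parent[find(id_a)] = find(id_b)

    groups = {}
    for article_id in article_ids:
        groups.setdefault(find(article_id), []).append(article_id)
    return sorted(sorted(group) for group in groups.values())
```

If the distance is an integer bit count, a threshold of 3.5 admits distances of 0 to 3.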
In the technical solution provided by the embodiment of the present invention, segmenting the articles to be merged with a preset dictionary and a professional dictionary helps extract the key words of similar articles accurately, which improves the accuracy of article similarity judgment. In addition, only the target articles that satisfy the preset time condition are selected from the articles to be merged for similarity comparison, which effectively avoids wrongly merging articles with different content merely because they use the same template, improves the accuracy of article merging, and also helps increase merging speed.
In one specific embodiment, when there are many articles to be merged, in order to speed up hash code comparison, on the basis of the above embodiment, after a hash code is computed for each target word set using the preset algorithm, the method may further include:
Performing dimension addition and dimension reduction on each hash code according to the professional word database.
SimHash is a fingerprint generation algorithm, also called a fingerprint extraction algorithm, widely used for deduplicating web pages at the scale of hundreds of millions; its main idea is dimensionality reduction. For example, a text of some length is reduced by simhash to a binary string of 0s and 1s with a length of 32 or 64 bits, much like an identity card number. Through dimensionality reduction, the SimHash algorithm makes complex things simpler. SimHash works as follows: prepare a text; filter and clean it, and extract n feature keywords; assign a weight to each feature; hash each keyword down to a signature made of 0s and 1s (6 bits in this example); then weight the vector: for each bit of each 6-bit signature, if the bit is 1 the weight is taken as positive, and if the bit is 0 the weight is taken as negative, which yields a vector for each feature; add all the feature vectors to obtain a final vector; then reduce each component of the final vector to 1 if it is greater than 0 and to 0 otherwise. The result is the final simhash fingerprint signature.
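The working principle described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation behind the Simhash (test, 64) call mentioned in the text: the per-keyword hash is taken from MD5 purely for convenience, the keyword weights are supplied by the caller, and the function name `simhash` and its signature are hypothetical.

```python
import hashlib


def simhash(weighted_words, bits=64):
    """Compute a Simhash fingerprint from (keyword, weight) pairs."""
    totals = [0.0] * bits
    mask = (1 << bits) - 1
    for word, weight in weighted_words:
        # Hash the keyword to a bit signature (MD5 is an assumed choice).
        digest = hashlib.md5(word.encode("utf-8")).digest()
        signature = int.from_bytes(digest, "big") & mask
        # A 1 bit contributes +weight, a 0 bit contributes -weight.
        for i in range(bits):
            if (signature >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    # Reduce each summed component to 1 if positive, otherwise 0.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint
```

A heavily weighted keyword dominates the sign of every component, so small changes in low-weight words flip only a few bits of the fingerprint; this is what makes the distance between fingerprints a usable similarity measure.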
Performing dimensionality reduction on the hash codes helps improve merging speed.
When the system merges articles automatically, articles with different content may be merged by mistake. In view of this, the application provides another embodiment; referring to Fig. 2, it may include:
S201-S203: These are the same as S101-S103 of the above embodiment and are not repeated here.
S204: Select the target articles whose mutual distance is not greater than the preset distance threshold, and display them to the user;
S205: Receive the user's instruction judging the similarity of the selected target articles, and merge the target articles according to the instruction.
The system selects the suspected similar articles, that is, the articles whose mutual distance is not greater than the preset distance threshold, and may generate a list to display to the user. The user makes a further judgment on the suspected articles and sends a merge instruction for the similar articles selected; the system merges the similar articles according to the instruction and does not merge the other suspected articles that it identified automatically.
This further confirmation by the user improves the accuracy of article merging.
A sudden power failure or other equipment fault may cause the computed hash codes to be lost. To avoid repeating the computation, on the basis of the above embodiment and referring to Fig. 3, the method may further include:
S206: Save each hash code to a hash server.
Optionally, the computed hash codes may be managed by the hash server, and the hash codes that have been compared may then be stored in the hash server in a first-in, last-out queue.
Saving the hash codes improves the stability and reliability of the system and helps increase the speed of article merging.
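The first-in, last-out storage order can be illustrated with a small in-memory stand-in. The class name `HashStore` and its methods are hypothetical; the Hash server in the text is a separate service whose interface is not described, and a real deployment would need durable storage for the hash codes to survive a power failure.

```python
class HashStore:
    """In-memory stand-in for a hash store with first-in, last-out order."""

    def __init__(self):
        self._items = []

    def push(self, article_id, hash_code):
        """Save the hash code computed for an article."""
        self._items.append((article_id, hash_code))

    def pop(self):
        """Return the most recently saved (article_id, hash_code) pair."""
        return self._items.pop()

    def __len__(self):
        return len(self._items)
```

With this ordering, the most recently stored hash codes are retrieved first.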
An embodiment of the present invention also provides a corresponding implementation device for the article merging method, which makes the method more practical. The article merging device provided by the embodiment is introduced below; the device described below and the method described above may be referred to in correspondence with each other.
Referring to Fig. 4, which is a structural diagram of one embodiment of the article merging device provided by an embodiment of the present invention, the device may include:
Acquisition module 401, for obtaining multiple articles to be merged.
Word segmentation module 402, for performing word segmentation on the multiple articles according to a preset part-of-speech library and a professional word database to obtain a target word set for each article, where the preset part-of-speech library specifies the parts of speech of the target words in the target word sets, and the professional word database includes user business requirement phrases and/or phrases obtained by inverse document frequency word extraction from various types of articles.
Computing module 403, for computing a hash code for each target word set using a preset algorithm, selecting the target articles that satisfy a preset time condition, and using a first preset function to calculate, in turn, the distances between the target articles from the hash codes corresponding to the target articles.
Merging module 404, for merging the corresponding target articles when the distance between target articles is judged to be not greater than a preset distance threshold.
Optionally, in some implementations of this embodiment, referring to Fig. 5, the device may further include:
Hash code processing module 405, for performing dimension addition and dimension reduction on each hash code according to the professional word database.
In some specific implementations, the word segmentation module 402 may be a module that extracts the target phrases in each article according to the preset part-of-speech library and the professional word database, and normalizes the target phrases according to the industry type of each article to generate the corresponding target word sets.
In other implementations, the merging module 404 may be a module that selects the target articles whose mutual distance is not greater than the preset distance threshold and displays them to the user, receives the user's instruction judging the similarity of the selected target articles, and merges the target articles according to the instruction.
Optionally, in other implementations of this embodiment, referring to Fig. 5, the device may further include:
Storage module 406, for saving each hash code to a hash server.
In some implementations of this embodiment, the computing module 403 may be a module that obtains the publication time of each target article and, when the publication times of two target articles are judged to differ by no more than 15 days, selects the two target articles.
The functions of the functional modules of the article merging device described in the embodiment of the present invention may be implemented according to the method in the above method embodiment; for the specific implementation process, refer to the related description of the method embodiment, which is not repeated here.
As can be seen from the above, the embodiment of the present invention segments the articles to be merged with a preset dictionary and a professional dictionary, which helps extract the key words of similar articles accurately and improves the accuracy of article similarity judgment. In addition, only the target articles that satisfy the preset time condition are selected from the articles to be merged for similarity comparison, which effectively avoids wrongly merging articles with different content merely because they use the same template, improves the accuracy of article merging, and also helps increase merging speed.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant parts, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to function. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The article merging method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.
Claims (10)
1. a kind of article merging method, it is characterised in that including:
Obtain many articles to be combined;
Participle is carried out to many articles according to default part of speech storehouse and professional word database, to obtain respective target word
Collection;The default part of speech storehouse is the part of speech for each target word that the target word is concentrated, and the professional word database includes user's industry
Business demand phrase and/or the phrase that inverse document frequency word extraction is carried out in all kinds of article's styles;
Hash codes are asked for using default algorithm to each target word set, the target article for meeting preset time condition is chosen;
Using the first preset function successively the distance each target article of each corresponding Hash codes calculating of target article;
When the distance between judging each target article is not more than pre-determined distance threshold value, then corresponding target article is closed
And.
2. according to the method described in claim 1, it is characterised in that default algorithm is utilized to each target word set described
Ask for also including after Hash codes:
Each Hash codes are carried out according to the professional word database plus dimension dimensionality reduction.
3. The method according to claim 2, characterized in that computing a hash code for each target word set using the preset algorithm is:
calling Simhash (test, 64) to compute a 64-bit hash code for each target word set.
4. The method according to any one of claims 1 to 3, characterized in that performing word segmentation on the plurality of articles according to the preset part-of-speech library and the professional word database to obtain the respective target word sets is:
extracting the target phrases in each article according to the preset part-of-speech library and the professional word database;
normalizing the target phrases according to the industry type of each article, to generate the corresponding target word set for each article.
5. The method according to any one of claims 1 to 3, characterized in that, when the distance between target articles is not greater than the preset distance threshold, merging the corresponding target articles comprises:
selecting the target articles whose mutual distance is not greater than the preset distance threshold and displaying them to the user;
receiving an instruction in which the user judges the similarity of the selected target articles, and merging the target articles according to the instruction.
6. The method according to claim 5, characterized in that the first preset function is the getDis function.
7. The method according to claim 6, characterized in that, after computing a hash code for each target word set using the preset algorithm, the method further comprises:
saving each hash code to a hash server.
8. The method according to claim 7, characterized in that selecting the target articles that satisfy the preset time condition is:
obtaining the publication time of each target article;
when the publication times of two target articles are judged to differ by no more than 15 days, selecting those two target articles.
9. The method according to claim 8, characterized in that the preset distance threshold is 3.5.
10. An article merging device, characterized by comprising:
an acquisition module, configured to obtain a plurality of articles to be merged;
a word segmentation module, configured to perform word segmentation on the plurality of articles according to a preset part-of-speech library and a professional word database, to obtain a respective target word set for each article; wherein the preset part-of-speech library contains the parts of speech of the target words in the target word sets, and the professional word database includes phrases reflecting user business requirements and/or phrases obtained by inverse-document-frequency word extraction over the various article types;
a computing module, configured to compute a hash code for each target word set using a preset algorithm, select the target articles that satisfy a preset time condition, and calculate, in turn, the distance between the target articles from their corresponding hash codes using a first preset function;
a merging module, configured to merge the corresponding target articles when the distance between target articles is judged to be not greater than a preset distance threshold.
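The flow recited in the claims (hash each target word set, keep article pairs published close together, measure hash distance, flag pairs under the threshold) might be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the claims do not define Simhash or getDis, so a plain unweighted SimHash built on MD5 and a Hamming distance are assumed here; word segmentation is assumed to have been done already; the names `simhash`, `get_dis`, `merge_candidates` and the dictionary keys `words` and `published` are hypothetical.

```python
import hashlib
from datetime import date
from itertools import combinations


def simhash(words, bits=64):
    """Fold per-word hashes into one fingerprint of `bits` bits (bits <= 64)."""
    counts = [0] * bits
    for word in words:
        # Assumption: MD5 truncated to 8 bytes serves as the per-word hash.
        h = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")
        h &= (1 << bits) - 1
        for i in range(bits):
            # Each word votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the votes are positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def get_dis(a, b):
    """Hamming distance between two hash codes (assumed reading of getDis)."""
    return bin(a ^ b).count("1")


def merge_candidates(articles, max_days=15, max_distance=3.5):
    """Return index pairs that satisfy both the time and distance conditions."""
    codes = [simhash(article["words"]) for article in articles]
    pairs = []
    for i, j in combinations(range(len(articles)), 2):
        gap = abs((articles[i]["published"] - articles[j]["published"]).days)
        if gap <= max_days and get_dis(codes[i], codes[j]) <= max_distance:
            pairs.append((i, j))
    return pairs
```

Under the Hamming-distance assumption the distance is an integer, so a threshold of 3.5 admits pairs whose 64-bit codes differ in at most 3 bits. The returned pairs could either be merged directly or first shown to the user for confirmation, as in claim 5.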
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710335322.5A CN107066623A (en) | 2017-05-12 | 2017-05-12 | A kind of article merging method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710335322.5A CN107066623A (en) | 2017-05-12 | 2017-05-12 | A kind of article merging method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107066623A (en) | 2017-08-18 |
Family
ID=59597366
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710335322.5A Pending CN107066623A (en) | 2017-05-12 | 2017-05-12 | A kind of article merging method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107066623A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040064304A1 (en) * | 2002-07-03 | 2004-04-01 | Word Data Corp | Text representation and method |
CN101727500A (en) * | 2010-01-15 | 2010-06-09 | 清华大学 | Text classification method of Chinese web page based on steam clustering |
CN104252445A (en) * | 2013-06-26 | 2014-12-31 | 华为技术有限公司 | Document similarity calculation method and near-duplicate document detection method and device |
CN106569989A (en) * | 2016-10-20 | 2017-04-19 | 北京智能管家科技有限公司 | De-weighting method and apparatus for short text |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555198A (en) * | 2018-05-31 | 2019-12-10 | 北京百度网讯科技有限公司 | method, apparatus, device and computer-readable storage medium for generating article |
CN110555198B (en) * | 2018-05-31 | 2023-05-23 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer readable storage medium for generating articles |
WO2022105497A1 (en) * | 2020-11-19 | 2022-05-27 | 深圳壹账通智能科技有限公司 | Text screening method and apparatus, device, and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103336766B (en) | Short text garbage identification and modeling method and device | |
CN110348214B (en) | Method and system for detecting malicious codes | |
CN109582704B (en) | Recruitment information and the matched method of job seeker resume | |
CN103577989B (en) | A kind of information classification approach and information classifying system based on product identification | |
CN111597817B (en) | Event information extraction method and device | |
CN108549723B (en) | Text concept classification method and device and server | |
CN110516251B (en) | Method, device, equipment and medium for constructing electronic commerce entity identification model | |
CN107357777B (en) | Method and device for extracting label information | |
CN101308512B (en) | Mutual translation pair extraction method and device based on web page | |
CN107368489A (en) | A kind of information data processing method and device | |
CN110020430B (en) | Malicious information identification method, device, equipment and storage medium | |
CN105095203B (en) | Determination, searching method and the server of synonym | |
CN110347806A (en) | Original text discriminating method, device, equipment and computer readable storage medium | |
CN107066623A (en) | A kind of article merging method and device | |
CN108388556B (en) | Method and system for mining homogeneous entity | |
CN106503152A (en) | Title treating method and apparatus | |
CN108701126A (en) | Theme estimating device, theme presumption method and storage medium | |
CN107665443B (en) | Obtain the method and device of target user | |
CN107590163B (en) | The methods, devices and systems of text feature selection | |
CN111488452A (en) | Webpage tampering detection method, detection system and related equipment | |
CN115408997A (en) | Text generation method, text generation device and readable storage medium | |
CN104281692A (en) | Method and system for realizing paragraph dimensionalized description | |
JP5824429B2 (en) | Spam account score calculation apparatus, spam account score calculation method, and program | |
CN110619212B (en) | Character string-based malicious software identification method, system and related device | |
CN106547822A (en) | A kind of text relevant determines method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170818 |