CN106202057A - Method and device for recognizing similar news information - Google Patents

Method and device for recognizing similar news information

Info

Publication number
CN106202057A
Authority
CN
China
Prior art keywords
headline
news information
similarity
news
similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610765203.9A
Other languages
Chinese (zh)
Other versions
CN106202057B (en)
Inventor
麦涛
王磊
张旭
朱志华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201610765203.9A
Publication of CN106202057A
Application granted
Publication of CN106202057B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and a device for recognizing similar news information. The method comprises the following steps: obtaining any two pieces of news information to be recognized, each piece of news information including a news headline; judging whether the similarity of the two news headlines satisfies a first preset condition; and, when the similarity of the two news headlines satisfies the first preset condition, determining that the two pieces of news information are similar news information. Embodiments of the invention can recognize similar information accurately and quickly from a large amount of information, thereby providing a basis for information deduplication and comparison.

Description

Method and device for recognizing similar news information
Technical field
The present invention relates to the field of information technology, and in particular to a method and a device for recognizing similar news information.
Background
With the development of the Internet, the volume of network information such as online news and articles has grown rapidly. Because many websites publish news, the same news item often appears on multiple websites. A news aggregation system can gather news from different websites into one place, but the large number of duplicate items makes reading inconvenient for users. How to recognize duplicate news information has therefore become a problem that urgently needs to be solved.
Summary of the invention
The present invention aims to solve the above technical problem at least to some extent.
To this end, a first object of the present invention is to propose a method for recognizing similar news information that can recognize similar information accurately and quickly from a large amount of information.
A second object of the present invention is to propose a device for recognizing similar news information.
To achieve the above objects, an embodiment of a first aspect of the present invention proposes a method for recognizing similar news information, comprising the following steps: obtaining any two pieces of news information to be recognized, each piece of news information including a news headline; judging whether the similarity of the two news headlines satisfies a first preset condition; and, when the similarity of the two news headlines satisfies the first preset condition, determining that the two pieces of news information are similar news information.
In addition, the method for recognizing similar news information according to the present invention may also have the following additional technical features:
In one embodiment of the invention, the method further includes:
when the similarity of the two news headlines does not satisfy the first preset condition, judging whether the news information contains a corresponding original news headline;
when the news information contains an original news headline, judging whether the similarity between one news headline and the other original news headline satisfies a second preset condition;
when the similarity between the one news headline and the other original news headline satisfies the second preset condition, determining that the two pieces of news information are similar news information.
In one embodiment of the invention, the news information further includes a news body, and the method further includes:
when the news information does not contain a corresponding original news headline, or when the similarity between one news headline and the other original news headline does not satisfy the second preset condition, judging whether the similarity of the two news bodies satisfies a third preset condition;
when the similarity of the two news bodies satisfies the third preset condition, determining that the two pieces of news information are similar news information.
In one embodiment of the invention, judging whether the similarity of the two news headlines satisfies the first preset condition includes:
judging whether the two news headlines are identical;
correspondingly, determining that the two pieces of news information are similar news information when the similarity of the two news headlines satisfies the first preset condition includes:
when the two news headlines are identical, determining that the two pieces of news information are similar news information.
In one embodiment of the invention, the method further includes:
when the two news headlines are not identical, obtaining a word set for each news headline by word segmentation, the word set including the words obtained by segmenting the headline;
judging whether one word set contains all the words in the other word set;
correspondingly, determining that the two pieces of news information are similar news information when the similarity of the two news headlines satisfies the first preset condition includes:
when one word set contains all the words in the other word set, determining that the two pieces of news information are similar news information.
In one embodiment of the invention, the method further includes:
when the one word set does not contain all the words in the other word set, judging whether a first matching degree between the words in the one word set and the words in the other word set is greater than a first threshold; and judging, according to dependency grammar relations, whether a second matching degree between the words in the one word set and the words in the other word set is greater than a second threshold;
correspondingly, determining that the two pieces of news information are similar news information when the similarity of the two news headlines satisfies the first preset condition includes:
when the first matching degree is greater than the first threshold and the second matching degree is greater than the second threshold, determining that the two pieces of news information are similar news information.
In one embodiment of the invention, judging whether the similarity between one news headline and the other original news headline satisfies the second preset condition includes:
judging whether the one news headline and the other original news headline are identical;
correspondingly, determining that the two pieces of news information are similar news information when the similarity between the one news headline and the other original news headline satisfies the second preset condition includes:
when the one news headline is identical to the other original news headline, determining that the two pieces of news information are similar news information.
In one embodiment of the invention, judging whether the similarity of the two news bodies satisfies the third preset condition includes:
extracting the keywords of the news bodies corresponding to the two news headlines, to obtain a first keyword set and a second keyword set;
determining the weight of each keyword in the first keyword set and the second keyword set;
determining the identical keywords in the first keyword set and the second keyword set according to the name and weight of each keyword;
determining the keyword repetition rate of the first keyword set and the second keyword set according to the identical keywords;
judging whether the repetition rate is greater than a preset probability;
correspondingly, determining that the two pieces of news information are similar news information when the similarity of the two news bodies satisfies the third preset condition includes:
when the repetition rate is greater than the preset probability, determining that the two pieces of news information are similar news information.
An embodiment of a second aspect of the present invention proposes a device for recognizing similar news information, including:
an acquisition module, configured to obtain any two pieces of news information to be recognized, each piece of news information including a news headline;
a first judging module, configured to judge whether the similarity of the two news headlines satisfies a first preset condition;
a first determining module, configured to determine, when the similarity of the two news headlines satisfies the first preset condition, that the two pieces of news information are similar news information.
In addition, the device for recognizing similar news information according to the present invention may also have the following additional technical features:
In one embodiment of the invention, the device further includes:
a second judging module, configured to judge, when the similarity of the two news headlines does not satisfy the first preset condition, whether the news information contains a corresponding original news headline;
a third judging module, configured to judge, when the news information contains an original news headline, whether the similarity between one news headline and the other original news headline satisfies a second preset condition;
a second determining module, configured to determine, when the similarity between the one news headline and the other original news headline satisfies the second preset condition, that the two pieces of news information are similar news information.
In one embodiment of the invention, the news information further includes a news body, and the device further includes:
a fourth judging module, configured to judge whether the similarity of the two news bodies satisfies a third preset condition when the news information does not contain a corresponding original news headline, or when the similarity between one news headline and the other original news headline does not satisfy the second preset condition;
a third determining module, configured to determine, when the similarity of the two news bodies satisfies the third preset condition, that the two pieces of news information are similar news information.
In one embodiment of the invention, the first judging module is configured to:
judge whether the two news headlines are identical;
correspondingly, determining that the two pieces of news information are similar news information when the similarity of the two news headlines satisfies the first preset condition includes:
when the two news headlines are identical, determining that the two pieces of news information are similar news information.
In one embodiment of the invention, the first judging module is configured to:
when the two news headlines are not identical, obtain a word set for each news headline by word segmentation, the word set including the words obtained by segmenting the headline;
judge whether one word set contains all the words in the other word set;
correspondingly, the first determining module is configured to:
when one word set contains all the words in the other word set, determine that the two pieces of news information are similar news information.
In one embodiment of the invention, the first judging module is configured to:
when the one word set does not contain all the words in the other word set, judge whether a first matching degree between the words in the one word set and the words in the other word set is greater than a first threshold; and judge, according to dependency grammar relations, whether a second matching degree between the words in the one word set and the words in the other word set is greater than a second threshold;
correspondingly, the first determining module is configured to:
when the first matching degree is greater than the first threshold and the second matching degree is greater than the second threshold, determine that the two pieces of news information are similar news information.
In one embodiment of the invention, the third judging module is configured to:
judge whether the one news headline and the other original news headline are identical;
correspondingly, the second determining module is configured to:
when the one news headline is identical to the other original news headline, determine that the two pieces of news information are similar news information.
In one embodiment of the invention, the fourth judging module is configured to:
extract the keywords of the news bodies corresponding to the two news headlines, to obtain a first keyword set and a second keyword set;
determine the weight of each keyword in the first keyword set and the second keyword set;
determine the identical keywords in the first keyword set and the second keyword set according to the name and weight of each keyword;
determine the keyword repetition rate of the first keyword set and the second keyword set according to the identical keywords;
judge whether the repetition rate is greater than a preset probability;
correspondingly, the third determining module is configured to:
when the repetition rate is greater than the preset probability, determine that the two pieces of news information are similar news information.
The method and device for recognizing similar news information according to embodiments of the present invention obtain any two pieces of news information to be recognized and, when the similarity of the news headlines of the two pieces of news information is judged to satisfy the first preset condition, determine that the two pieces of news information are similar news information. Similar news information can thus be recognized accurately and quickly from a large amount of news information, providing a basis for news deduplication and comparison.
Additional aspects and advantages of the present invention will be given in part in the following description; in part they will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of a method for recognizing similar news information according to one embodiment of the invention;
Fig. 2 is a flow chart of a method for recognizing similar news information according to another embodiment of the invention;
Fig. 3 is a flow chart of a method for recognizing similar news information according to another embodiment of the invention;
Fig. 4a is a schematic diagram of the analysis result for title one according to one embodiment of the invention;
Fig. 4b is a schematic diagram of the analysis result for title two according to one embodiment of the invention;
Fig. 5 is a flow chart of a method for recognizing similar news information according to another embodiment of the invention;
Fig. 6a is a schematic diagram of the part-of-speech analysis result for a news body according to one embodiment of the invention;
Fig. 6b is a schematic diagram of the entity class recognition result for a news body according to one embodiment of the invention;
Fig. 7 is a schematic diagram of a keyword extraction result according to one embodiment of the invention;
Fig. 8 is a structural diagram of a device for recognizing similar news information according to one embodiment of the invention;
Fig. 9 is a structural diagram of a device for recognizing similar news information according to another embodiment of the invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended only to explain the present invention and are not to be construed as limiting it.
In the description of the present invention, it should be understood that the term "multiple" means two or more, and that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.
The method and device for recognizing similar news information according to embodiments of the present invention are described below with reference to the drawings.
Embodiments of the present invention can recognize similar news information, so that similar news can subsequently be deduplicated.
Fig. 1 is a flow chart of a method for recognizing similar news information according to one embodiment of the invention.
As shown in Fig. 1, the method for recognizing similar news information according to an embodiment of the present invention comprises the following steps.
S101: obtain any two pieces of news information to be recognized, each piece of news information including a news headline.
Embodiments of the present invention can be applied to different scenarios, and in each scenario the multiple pieces of news information to be recognized can be obtained in a corresponding way. When judging the similarity of multiple pieces of news information, this embodiment first judges the similarity of any two pieces obtained; after that judgment is complete, another piece of news information is obtained for the next judgment. The ways of obtaining multiple pieces of news information to be recognized are illustrated below through the following application scenarios.
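The pairwise procedure described above can be sketched as follows. This is a minimal illustration in Python rather than the patented implementation: `group_similar` and the `is_similar` callback are hypothetical names, and the greedy grouping strategy is an assumption about how pairwise judgments might be chained for deduplication.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def group_similar(items: List[T], is_similar: Callable[[T, T], bool]) -> List[List[T]]:
    """Greedily group items: each new item is compared pairwise with one
    representative (the first member) of every existing group."""
    groups: List[List[T]] = []
    for item in items:
        for group in groups:
            if is_similar(group[0], item):
                group.append(item)
                break
        else:
            # No existing group matched, so the item starts a new group.
            groups.append([item])
    return groups


# Toy usage with a stand-in similarity test (case-insensitive equality).
news = ["Flood hits city", "flood hits city", "Team wins final"]
groups = group_similar(news, lambda a, b: a.lower() == b.lower())
```

Any pairwise test with the same signature could be passed as `is_similar`; the stand-in used here is only for demonstration.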
Scenario one
A user sends a news search request to the server through a news client, and the server obtains the corresponding pieces of news information based on the user's search request.
Scenario two
When pushing news information to the user of a client, the server can obtain multiple pieces of news information according to preset rules, such as hot news or news in areas the user follows.
Scenario three
When a user browsing news wants to recognize or filter similar news among several pieces of news information, the user can send a similar-news recognition request to the server and submit the identifiers of those pieces of news information; the server can then obtain the news information according to the identifiers received.
It should be noted that the above scenarios are only exemplary and should not be construed as limiting the present invention. Embodiments of the present invention can also be applied to other scenarios, which are not listed one by one here.
S102: judge whether the similarity of the two news headlines satisfies the first preset condition.
The similarity of the two news headlines can be judged to satisfy the first preset condition when the two headlines meet at least one of the following conditions:
the two news headlines are identical;
or, the word set obtained by segmenting one of the news headlines contains all the words in the word set obtained by segmenting the other news headline;
or, the matching degrees of the words in the two news headlines and of their dependency grammar relations satisfy preset conditions.
S103: when the similarity of the two news headlines satisfies the first preset condition, determine that the two pieces of news information are similar news information.
In one embodiment of the invention, the embodiment shown in Fig. 2 can be used to judge whether the similarity of the two news headlines satisfies the first preset condition and thereby determine whether the two pieces of news information are similar news information. Specifically, as shown in Fig. 2, this includes steps S201-S207.
S201: judge whether the two news headlines are identical.
S202: when the two news headlines are identical, determine that the two pieces of news information are similar news information.
S203: when the two news headlines are not identical, obtain a word set for each news headline by word segmentation, the word set including the words obtained by segmenting the headline.
When the two news headlines are not identical, each headline can be segmented to obtain the word set corresponding to that headline. Each word set includes the words obtained by segmenting the corresponding news headline.
S204: judge whether one word set contains all the words in the other word set.
S205: when one word set contains all the words in the other word set, determine that the two pieces of news information are similar news information.
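Steps S201-S205 can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, word segmentation is assumed to have been done already by some external segmenter, and checking containment in both directions is an assumption about how "one word set contains the other" is applied.

```python
from typing import Iterable


def titles_identical(title_a: str, title_b: str) -> bool:
    """S201/S202: exact string equality of the two headlines."""
    return title_a == title_b


def one_contains_other(words_a: Iterable[str], words_b: Iterable[str]) -> bool:
    """S204/S205: true when either word set is a superset of the other."""
    set_a, set_b = set(words_a), set(words_b)
    return set_a >= set_b or set_b >= set_a


# Hypothetical, already-segmented headlines.
words_one = ["storm", "closes", "airport"]
words_two = ["breaking", "storm", "closes", "airport"]
contained = one_contains_other(words_one, words_two)
```

Using sets ignores word order and repeated words, which matches the description of a word set but discards information a stricter comparison might keep.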
S206: when the one word set does not contain all the words in the other word set, judge whether a first matching degree between the words in the one word set and the words in the other word set is greater than a first threshold; and judge, according to dependency grammar relations, whether a second matching degree between the words in the one word set and the words in the other word set is greater than a second threshold.
The first matching degree may be a quantitative measure of the number of identical words in the two word sets. For example, the first matching degree may be the ratio of the number of identical words in the two word sets to the number of words in one of the word sets (for example, the word set with fewer words).
The second matching degree may be a quantitative measure of the similarity of the dependency grammar relations of the words in the two word sets. For example, the similarity of the dependency grammar relations is the proportion of the dependency grammar relations in the shorter news headline that also occur in the other news headline, relative to the total number of dependency grammar relations in the shorter news headline.
The first threshold and the second threshold are preset values. For example, the first threshold may be set to 90% and the second threshold to 80%.
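The two matching degrees can be sketched as simple ratios. The following Python is a hedged illustration under several assumptions: dependency relations are represented as hypothetical `(head, label, dependent)` triples produced by some external parser, the "shorter" headline is taken to be the one with fewer words, and all function names are invented for this example.

```python
from typing import Set, Tuple

# Assumed representation of one dependency relation: (head word, label, dependent word).
Relation = Tuple[str, str, str]


def first_matching_degree(words_a: Set[str], words_b: Set[str]) -> float:
    """Shared words divided by the size of the smaller word set."""
    smaller = min(len(words_a), len(words_b))
    if smaller == 0:
        return 0.0
    return len(words_a & words_b) / smaller


def second_matching_degree(rels_short: Set[Relation], rels_other: Set[Relation]) -> float:
    """Share of the shorter headline's relations that also occur in the other headline."""
    if not rels_short:
        return 0.0
    return len(rels_short & rels_other) / len(rels_short)


def titles_match(words_a: Set[str], words_b: Set[str],
                 rels_a: Set[Relation], rels_b: Set[Relation],
                 first_threshold: float = 0.9, second_threshold: float = 0.8) -> bool:
    """S206/S207: both matching degrees must exceed their thresholds."""
    # Assumption: the headline with fewer words is the "shorter" one.
    if len(words_a) <= len(words_b):
        rels_short, rels_other = rels_a, rels_b
    else:
        rels_short, rels_other = rels_b, rels_a
    first = first_matching_degree(words_a, words_b)
    second = second_matching_degree(rels_short, rels_other)
    return first > first_threshold and second > second_threshold


words_a = {"storm", "closes", "airport", "today"}
words_b = {"storm", "shuts", "airport", "today", "again"}
rels_a = {("closes", "nsubj", "storm"), ("closes", "obj", "airport")}
rels_b = {("shuts", "nsubj", "storm"), ("closes", "obj", "airport")}
degree_1 = first_matching_degree(words_a, words_b)   # 3 shared words / 4
degree_2 = second_matching_degree(rels_a, rels_b)    # 1 shared relation / 2
```

The default thresholds mirror the 90% and 80% example values given in the text; they are parameters so that they can be tuned.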
S207: when the first matching degree is greater than the first threshold and the second matching degree is greater than the second threshold, determine that the two pieces of news information are similar news information.
For example, consider the following two news headlines:
Title one: Visiting the Yangtze River body retrievers: up to 70 bodies salvaged in one week (from news source one)
Title two: French media visit the Yangtze River body retrievers: up to 70 bodies salvaged in one week (from news source two)
Performing word segmentation and dependency grammar analysis on the two news headlines yields the results shown in Fig. 4a and Fig. 4b respectively, where Fig. 4a is a schematic diagram of the analysis result for title one and Fig. 4b is a schematic diagram of the analysis result for title two.
The analysis results show that title one differs from title two and that the two titles also do not meet the condition that one word set contains all the words in the other word set. The word sets corresponding to title one and title two can therefore be matched to obtain the first matching degree and the second matching degree. If the first matching degree of the two titles is greater than 90% and the second matching degree is greater than 80%, the news information corresponding to title one and title two can be determined to be similar news information.
The method for recognizing similar news information according to embodiments of the present invention obtains any two pieces of news information to be recognized and, when the similarity of the news headlines of the two pieces of news information is judged to satisfy the first preset condition, determines that the two pieces of news information are similar news information. Similar news information can thus be recognized accurately and quickly from a large amount of news information, providing a basis for news deduplication and comparison.
In one embodiment of the invention, the news information may include a news headline, a news body, a news summary, a news source website and so on, and similar news information can be recognized according to one or more of these items.
Recognizing similar news information according to one or more of the above items is described below with reference to the embodiment shown in Fig. 3. As shown in Fig. 3, this may include steps S301-S308.
S301-S303 are the same as S101-S103 in the embodiment shown in Fig. 1; refer to that embodiment for details.
S304: when the similarity of the two news headlines does not satisfy the first preset condition, judge whether the news information contains a corresponding original news headline.
Some websites reprint information from other websites, and during reprinting the original title may be slightly adjusted or quoted. The title of the reprinted information is then not exactly the same as the original title, although the information is substantially the same or very close. To recognize this kind of similar information, in embodiments of the present application, when the similarity of the two news headlines does not satisfy the first preset condition, it can be further judged whether the two news headlines contain corresponding original news headlines. If a corresponding original news headline is contained, S305 can be performed.
A news headline that contains an original news headline usually has a quotation structure or a keyword that indicates quotation. Therefore, in some embodiments of the present invention, whether a news headline contains a corresponding original title can be judged from the structure of the headline, from keywords (for example, the keyword "×× website:"), and so on.
For example, the news headline
Xinhua News Agency: The cause of innovation calls for innovative talent
has the quotation structure "Xinhua News Agency:", so it can be judged that this news headline contains the corresponding original news headline "The cause of innovation calls for innovative talent".
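One way to detect such a quotation structure is a prefix pattern. The sketch below is an assumption-laden illustration, not the method claimed by the source: the regular expression, the 20-character limit on the source name, and the function name `extract_original_headline` are all hypothetical choices.

```python
import re
from typing import Optional

# Assumed pattern: a short source name, an ASCII or full-width colon,
# then the original headline. The 20-character limit is an arbitrary guard.
_QUOTE_PATTERN = re.compile(r"^\s*(?P<source>[^:：]{1,20})[:：]\s*(?P<original>.+)$")


def extract_original_headline(headline: str) -> Optional[str]:
    """Return the quoted original headline, or None if no quotation prefix is found."""
    match = _QUOTE_PATTERN.match(headline)
    if match is None:
        return None
    return match.group("original").strip()


original = extract_original_headline(
    "Xinhua News Agency: The cause of innovation calls for innovative talent"
)
```

A pattern this simple also fires on any headline that happens to contain a colon near the start, so a real system would likely check the prefix against a list of known sources; that list is outside what the source describes.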
S305: when the news information contains an original news headline, judge whether the similarity between one news headline and the other original news headline satisfies the second preset condition.
S306: when the similarity between the one news headline and the other original news headline satisfies the second preset condition, determine that the two pieces of news information are similar news information.
In one embodiment of the invention, step S305 may include: judging whether the one news headline and the other original news headline are identical.
Specifically, if both news headlines have original news headlines, it can be judged whether the two original news headlines are identical. If, of the two headlines, A has an original title and B does not, the original news headline of A can be matched against news headline B to judge whether the original title of headline A is identical to headline B.
Correspondingly, step S306 may include:
when the one news headline is identical to the other original news headline, determining that the two pieces of news information are similar news information.
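The case analysis in S305-S306 can be sketched as follows. The function name and parameters are hypothetical; the original headlines are assumed to have been extracted beforehand and are passed as `None` when a headline has none.

```python
from typing import Optional


def matches_via_original(title_a: str, original_a: Optional[str],
                         title_b: str, original_b: Optional[str]) -> bool:
    """Apply the second preset condition as exact equality."""
    if original_a is not None and original_b is not None:
        # Both headlines quote an original: compare the two originals.
        return original_a == original_b
    if original_a is not None:
        # Only A quotes an original: compare it with headline B.
        return original_a == title_b
    if original_b is not None:
        # Only B quotes an original: compare it with headline A.
        return original_b == title_a
    # Neither headline has an original, so this check cannot succeed.
    return False


same = matches_via_original("Agency: Storm closes airport", "Storm closes airport",
                            "Storm closes airport", None)
```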
S307: when the news information does not contain a corresponding original news headline, or when the similarity between one news headline and the other original news headline does not satisfy the second preset condition, judge whether the similarity of the two news bodies satisfies the third preset condition.
S308: when the similarity of the two news bodies satisfies the third preset condition, determine that the two pieces of news information are similar news information.
Specifically, in one embodiment of the invention, S307 may include steps S501-S505 shown in Fig. 5. Correspondingly, S308 may include step S506.
S501: extract the keywords of the news bodies corresponding to the two news headlines, to obtain a first keyword set and a second keyword set.
Specifically, the news bodies corresponding to the two news headlines can each be segmented, and keyword extraction can be performed on each segmentation result to obtain the keyword sets of the two news bodies.
Specifically, after each news body is segmented, part-of-speech analysis can be performed on each word. Words that can represent (or contain) the subject of an action, such as nouns, person names and proper nouns, are then marked as candidate keywords. In addition, the entity class of the words in the body can be recognized according to the characteristics of the information. When a word is found to be a product name, time, place, organization name, person name or position, it is also taken as a candidate keyword.
For example, Fig. 6a is a schematic diagram of the part-of-speech analysis result for a news body according to one embodiment of the invention, and Fig. 6b is a schematic diagram of the entity class recognition result for a news body according to one embodiment of the invention.
After the candidate keywords are extracted from the results of the part-of-speech analysis and entity class analysis, the frequency of each candidate keyword in the news body can be counted, the candidate keywords can be sorted in descending order of frequency, and the lower-ranked candidates can be kept or discarded according to the length of the news body. For example, if the news body contains 200 words, the top 50 candidate keywords can be chosen as keywords; if the news body contains 100 words, the top 30 candidate keywords can be chosen as keywords.
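The frequency ranking and length-based cutoff can be sketched like this. The source gives only two data points (200 words gives 50 keywords, 100 words gives 30), so the step function in `keyword_budget` is an assumption, and both function names are hypothetical. The candidate list is assumed to contain one entry per occurrence of a candidate keyword in the body.

```python
from collections import Counter
from typing import List


def keyword_budget(body_length: int) -> int:
    """Assumed cutoff rule built from the two examples in the text;
    behaviour for other lengths is a guess."""
    return 50 if body_length >= 200 else 30


def select_keywords(candidates: List[str], body_length: int) -> List[str]:
    """Rank candidate keywords by frequency (descending) and keep the top ones."""
    counts = Counter(candidates)
    ranked = [word for word, _ in counts.most_common()]
    return ranked[:keyword_budget(body_length)]


# Hypothetical candidate occurrences taken from a 100-word body.
candidates = ["river", "city", "river", "mayor", "river", "city"]
keywords = select_keywords(candidates, body_length=100)
```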
S502: determine the weight of each keyword in the first keyword set and the second keyword set.
Specifically, the weight of each chosen keyword can be calculated from its word frequency.
For example, Fig. 7 is a schematic diagram of a keyword extraction result according to one embodiment of the invention, which includes a list of keywords and the weight corresponding to each keyword.
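The source states only that weights are calculated from word frequency and does not give a formula. The sketch below therefore assumes one plausible choice, relative frequency among the selected keywords; both this formula and the name `keyword_weights` are assumptions.

```python
from collections import Counter
from typing import Dict, List


def keyword_weights(body_words: List[str], keywords: List[str]) -> Dict[str, float]:
    """Assumed weighting: each keyword's count divided by the total count
    of all selected keywords in the body."""
    counts = Counter(body_words)
    total = sum(counts[k] for k in keywords)
    if total == 0:
        return {k: 0.0 for k in keywords}
    return {k: counts[k] / total for k in keywords}


body = ["river", "city", "river", "the", "river", "city"]
weights = keyword_weights(body, ["river", "city"])
```

With this normalization the weights of one body sum to 1, which makes weights from bodies of different lengths roughly comparable.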
S503: determine, according to the name and weight of each keyword, the identical keywords in the first keyword set and the second keyword set.
In an embodiment of the invention, if keyword M in the first keyword set and keyword N in the second keyword set meet the following condition, keyword M and keyword N can be determined to be identical:
The name of keyword M is the same as the name of keyword N, and (weight of keyword M / weight of keyword N) × 100% is greater than a percentage threshold. The percentage threshold is a preset value that can be adjusted according to the actual situation. For example, the percentage threshold can be 70%.
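The condition for treating two keywords as identical can be written as a small predicate. The sketch below follows the condition literally; the function name and the zero-weight guard are assumptions. Note that the ratio as stated is not symmetric: it only rejects pairs in which the weight of keyword M is much smaller than that of keyword N.

```python
def is_same_keyword(name_m, weight_m, name_n, weight_n,
                    percentage_threshold=70.0):
    """Return True if keywords M and N count as identical.

    The names must match and (weight_m / weight_n) * 100 must exceed
    the percentage threshold (70 is the example value from the text).
    """
    if name_m != name_n:
        return False
    if weight_n == 0:
        # Assumption: a zero weight cannot satisfy the ratio test.
        return False
    return (weight_m / weight_n) * 100 > percentage_threshold
```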
S504: determine the keyword repetition rate of the first keyword set and the second keyword set according to the identical keywords.
Here, the keyword repetition rate of the first keyword set and the second keyword set is the ratio of the number of keywords that are identical in the two sets to the total number of keywords in whichever set has fewer keywords.
S505: judge whether the repetition rate is greater than a preset probability.
The preset probability can be adjusted according to the actual situation. For example, the preset probability can be 80%.
S506: when the repetition rate is greater than the preset probability, determine that the two pieces of news information are similar news information.
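Steps S503 to S506 can be sketched together as follows. This is an illustrative outline only: each keyword set is assumed to be a dictionary mapping keyword name to weight, the function names are hypothetical, and the thresholds are the example values given in the text.

```python
def repetition_rate(first, second, percentage_threshold=70.0):
    """Share of identical keywords relative to the smaller keyword set.

    `first` and `second` map keyword name -> weight.
    """
    if not first or not second:
        return 0.0
    same = 0
    for name, weight_m in first.items():
        weight_n = second.get(name)
        # Identical: same name and weight ratio above the threshold.
        if weight_n and (weight_m / weight_n) * 100 > percentage_threshold:
            same += 1
    return same / min(len(first), len(second))


def bodies_similar(first, second, preset_probability=0.8):
    """S505/S506: similar when the repetition rate exceeds the preset value."""
    return repetition_rate(first, second) > preset_probability
```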
It should be understood that the above embodiment uses the news title and the news body as matching conditions for information similarity matching. In other embodiments of the invention, the news summary, the source website, and the like can also be used as auxiliary similarity matching conditions to improve the precision of similarity matching.
It should be noted that, across the news title matching, the matching of the original title of a news title, and the news body matching described above, as soon as any stage determines that the two pieces of news information are similar, the subsequent matching stages are skipped, which effectively improves recognition efficiency.
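The early termination described above amounts to running the matching stages in order and stopping at the first success. A minimal sketch follows, assuming each stage is supplied as a function of the two news items; the name `is_similar_news` and the stage interface are hypothetical.

```python
def is_similar_news(news_a, news_b, checks):
    """Run matching stages in order; stop at the first that succeeds.

    `checks` is an ordered list of functions such as title matching,
    original-title matching, and body matching, each returning a bool.
    """
    for check in checks:
        if check(news_a, news_b):
            return True  # later, more expensive stages are skipped
    return False
```

Ordering the stages from cheapest (title comparison) to most expensive (body keyword extraction) is what makes the early exit pay off.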
According to the method for recognizing similar news information of the embodiments of the invention, any two pieces of news information to be recognized are obtained, and when the similarity of the news titles of the two pieces of news information is judged to meet the first preset condition, the two pieces of news information are determined to be similar news information. Similar news information can thus be recognized accurately and quickly from a large volume of news information, providing a basis for deduplicating and comparing news information.
Further, after similar news information is recognized, it can be deduplicated, and the deduplicated news information can be provided to the user. Similar news is thereby removed from a large volume of news before it is provided to the user, which speeds up reading and improves the efficiency with which the user obtains information.
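The deduplication step is not detailed in the text. One simple way to realize it, assumed here for illustration, is to keep a news item only if it is not similar to any item already kept; the function name `deduplicate` and the pairwise predicate `is_similar` are hypothetical.

```python
def deduplicate(news_items, is_similar):
    """Keep the first item of each group of similar news items.

    `is_similar(a, b)` is any pairwise similarity test, for example
    one built from the title and body checks described above.
    """
    kept = []
    for item in news_items:
        if not any(is_similar(item, other) for other in kept):
            kept.append(item)
    return kept
```

This pairwise approach is quadratic in the number of items, which is acceptable for a sketch but may need an index or clustering step at scale.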
Corresponding to the above embodiments of the method for recognizing similar news information, the invention also proposes a device for recognizing similar news information.
Fig. 8 is a structural schematic diagram of a device for recognizing similar news information according to one embodiment of the invention.
As shown in Fig. 8, the device for recognizing similar news information according to an embodiment of the invention includes: an acquisition module 10, a first judging module 20, and a first determining module 30.
Specifically, the acquisition module 10 is configured to obtain any two pieces of news information to be recognized, the news information including a news title.
Embodiments of the invention can be applied to different scenarios, and multiple pieces of news information to be recognized can be obtained in a corresponding way in each scenario. When judging the similarity of multiple pieces of news information, this embodiment first performs the similarity judgment on any two obtained pieces of news information to be recognized; after that judgment is complete, another piece of news information is obtained for the next judgment. The ways of obtaining multiple pieces of news information to be recognized are illustrated by the following application scenarios.
Scenario one
A user sends a news search request to the server through a news client, and the server obtains the corresponding pieces of news information based on the user's search request.
Scenario two
When pushing news information to the user of a client, the server can obtain multiple pieces of news information according to preset rules, such as trending news or news in areas the user follows.
Scenario three
When browsing news, if the user wants to identify or filter similar news among certain pieces of news information, the user can send a similar-news recognition request to the server and submit the identifiers of those pieces of news information to the server; the server can then obtain the news information according to the received identifiers.
It should be noted that the above scenarios are merely exemplary and should not be construed as limiting the invention. Embodiments of the invention can also be applied to other scenarios, which are not enumerated here.
The first judging module 20 is configured to judge whether the similarity of the two news titles meets the first preset condition.
When the two news titles meet at least one of the following conditions, the first judging module 20 can judge that the similarity of the two news titles meets the first preset condition:
the two news titles are identical;
or, the vocabulary set obtained by segmenting one of the news titles contains all the words in the vocabulary set obtained by segmenting the other news title;
or, the matching degree of the words in the two news titles and of their dependency grammar relations meets a preset condition.
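The first two alternatives of the first preset condition can be sketched as below. The sketch assumes the titles have already been segmented into word lists by some external tokenizer; the function name is hypothetical, and the third alternative (word and dependency matching degrees) is deliberately left out here.

```python
def titles_identical_or_contained(title_a, title_b, tokens_a, tokens_b):
    """Check the first two alternatives of the first preset condition.

    `tokens_a` and `tokens_b` are the word lists obtained by segmenting
    the two titles (the segmenter itself is outside this sketch).
    """
    if title_a == title_b:
        return True
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        # Assumption: an empty vocabulary set never counts as contained.
        return False
    # One vocabulary set contains all words of the other.
    return set_a <= set_b or set_b <= set_a
```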
The first determining module 30 is configured to determine that the two pieces of news information are similar news information when the similarity of the two news titles meets the first preset condition.
In one embodiment of the invention, the first judging module 20 can be configured to judge whether the two news titles are identical. Correspondingly, the first determining module 30 is configured to determine that the two pieces of news information are similar news information when the two news titles are identical.
Further, the first judging module 20 can also be configured to: when the two news titles are not identical, obtain the vocabulary set produced by segmenting each news title, the vocabulary set including the words obtained by segmenting the news title; and judge whether one vocabulary set contains all the words in the other vocabulary set. Correspondingly, the first determining module 30 can be configured to determine that the two pieces of news information are similar news information when one vocabulary set contains all the words in the other vocabulary set.
When the two news titles are not identical, the first judging module 20 can segment each of the two news titles to obtain the vocabulary set corresponding to each title. Each vocabulary set includes the words obtained by segmenting the corresponding news title.
Further, the first judging module 20 can also be configured to: when one vocabulary set does not contain all the words in the other vocabulary set, judge whether a first matching degree between the words in the one vocabulary set and the words in the other vocabulary set is greater than a first threshold; and judge, according to dependency grammar relations, whether a second matching degree between the words in the one vocabulary set and the words in the other vocabulary set is greater than a second threshold. Correspondingly, the first determining module 30 can be configured to determine that the two pieces of news information are similar news information when the first matching degree is greater than the first threshold and the second matching degree is greater than the second threshold.
The first matching degree can be a quantified measure of the number of identical words in the two vocabulary sets. For example, the first matching degree can be the ratio of the number of identical words in the two vocabulary sets to the number of words in one of the vocabulary sets (for example, the one with fewer words).
The second matching degree can be a quantified measure of the similarity of the dependency grammar relations of the words in the two vocabulary sets. For example, the similarity of the dependency grammar relations is the ratio of the number of dependency grammar relations in the shorter news title that coincide with dependency grammar relations in the other news title to the total number of dependency grammar relations in the shorter news title.
The first threshold and the second threshold are preset values. For example, the first threshold can be set to 90% and the second threshold to 80%.
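The two matching degrees and their thresholds can be sketched as follows. Several points are assumptions made for illustration: a dependency grammar relation is represented as a `(head, relation, dependent)` triple produced by an external parser, the "shorter" title is approximated by the one with fewer relations, and all function names are hypothetical. The thresholds are the example values from the text.

```python
def first_matching_degree(tokens_a, tokens_b):
    """Identical words divided by the size of the smaller vocabulary set."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0
    return len(set_a & set_b) / smaller


def second_matching_degree(relations_a, relations_b):
    """Coinciding dependency relations divided by those of the shorter title.

    Each relation is a (head, relation, dependent) triple; this
    representation is an assumption, not taken from the text.
    """
    set_a, set_b = set(relations_a), set(relations_b)
    shorter = set_a if len(set_a) <= len(set_b) else set_b
    if not shorter:
        return 0.0
    return len(set_a & set_b) / len(shorter)


def titles_match(tokens_a, tokens_b, relations_a, relations_b,
                 first_threshold=0.9, second_threshold=0.8):
    """Both matching degrees must exceed their thresholds."""
    return (first_matching_degree(tokens_a, tokens_b) > first_threshold
            and second_matching_degree(relations_a, relations_b) > second_threshold)
```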
For example, consider the news titles of the following two news items:
Title one: Visit to the Yangtze River corpse salvagers: as many as 70 corpses salvaged in one week (from news source one)
Title two: French media visit the Yangtze River corpse salvagers: as many as 70 corpses salvaged in one week (from news source two)
Performing word segmentation and dependency grammar analysis on the two news titles yields the results shown in Fig. 4a and Fig. 4b, respectively, where Fig. 4a is a schematic diagram of the analysis result for title one and Fig. 4b is a schematic diagram of the analysis result for title two.
From the above analysis results it can be seen that title one and title two are not identical and do not meet the condition that one vocabulary set contains all the words in the other vocabulary set. The vocabulary sets corresponding to title one and title two can therefore be matched to obtain the first matching degree and the second matching degree. If the first matching degree of the two titles is greater than 90% and the second matching degree is greater than 80%, the news information corresponding to title one and title two can be determined to be similar news information.
According to the device for recognizing similar news information of the embodiments of the invention, any two pieces of news information to be recognized are obtained, and when the similarity of the news titles of the two pieces of news information is judged to meet the first preset condition, the two pieces of news information are determined to be similar news information. Similar news information can thus be recognized accurately and quickly from a large volume of news information, providing a basis for deduplicating and comparing news information.
In one embodiment of the invention, the news information can include a news title, a news body, a news summary, a news source website, and the like, and similar news information can be recognized according to one or more of these.
Fig. 9 is a structural schematic diagram of a device for recognizing similar news information according to another embodiment of the invention.
As shown in Fig. 9, the device for recognizing similar news information according to an embodiment of the invention includes: an acquisition module 10, a first judging module 20, a first determining module 30, a second judging module 40, a third judging module 50, a second determining module 60, a fourth judging module 70, and a third determining module 80.
The acquisition module 10, the first judging module 20, and the first determining module 30 are the same as in the embodiment shown in Fig. 8; refer to the embodiment described with Fig. 8.
The second judging module 40 is configured to judge, when the similarity of the two news titles does not meet the first preset condition, whether the news information includes a corresponding original news title.
Some websites take their information from other websites and, in doing so, make minor adjustments to the original title or quote it, so that the title of the reposted information is not exactly the same as the original title even though the information is essentially the same or close. Therefore, to recognize such similar information, in embodiments of the present application the second judging module 40 can, when the similarity of the two news titles does not meet the first preset condition, further judge whether the two news titles include corresponding original news titles.
A news title that includes an original news title usually has a quoting structure or a keyword that indicates quoting. Therefore, in some embodiments of the invention, whether a news title includes a corresponding original title can be judged from the structure of the news title, keywords (for example, the keyword "×× website:"), and the like.
For example, the news title
Xinhua News Agency: The cause of innovation calls for innovative talent
has the quoting structure "Xinhua News Agency:", so it can be judged that this news title includes the corresponding original news title "The cause of innovation calls for innovative talent".
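Detecting a quoting structure of the form "source: title" can be sketched with a regular expression. This is a simplified assumption rather than the rule used by the invention: the pattern accepts both the ASCII and the full-width colon, the 20-character limit on the source name is arbitrary, and the function name is hypothetical. Titles that contain a colon for other reasons would also match, so a list of known source names would be needed in practice.

```python
import re

# Source name (no colon, at most 20 characters), a colon, then the title.
_QUOTE_PATTERN = re.compile(r"^\s*([^:：]{1,20})[:：]\s*(.+)$")


def extract_original_title(title):
    """Return the quoted original title, or None if no quoting structure."""
    match = _QUOTE_PATTERN.match(title)
    if match is None:
        return None
    return match.group(2).strip()
```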
The third judging module 50 is configured to judge, when the news information includes an original news title, whether the similarity between one news title and the other original news title meets a second preset condition.
The second determining module 60 is configured to determine that the two pieces of news information are similar news information when the similarity between the one news title and the other original news title meets the second preset condition.
In one embodiment of the invention, the third judging module 50 can be configured to judge whether the one news title and the other original news title are identical. Correspondingly, the second determining module 60 can be configured to determine that the two pieces of news information are similar news information when the one news title and the other original news title are identical.
Specifically, if both news titles have original news titles, the third judging module 50 can judge whether the two original news titles are identical. If, of the two news titles, A has an original title and B does not, the original news title of A can be matched against news title B to judge whether the original title of news title A is identical to news title B.
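The case analysis above can be sketched as a single function. It assumes that the original title of each item has already been extracted and is `None` when the item has none; the function name is hypothetical, and the symmetric case in which only B has an original title is added by analogy.

```python
def original_titles_match(title_a, original_a, title_b, original_b):
    """Compare original titles, falling back to the plain title.

    `original_a` / `original_b` are None when the corresponding news
    title does not include an original title.
    """
    if original_a is not None and original_b is not None:
        return original_a == original_b
    if original_a is not None:
        return original_a == title_b
    if original_b is not None:
        # Symmetric case, added by analogy with the A-only case.
        return title_a == original_b
    return False
```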
The fourth judging module 70 is configured to judge whether the similarity of the two news bodies meets a third preset condition when the news information does not include a corresponding original news title, or when the similarity between one news title and the other original news title does not meet the second preset condition.
The third determining module 80 is configured to determine that the two pieces of news information are similar news information when the similarity of the two news bodies meets the third preset condition.
In one embodiment of the invention, the fourth judging module 70 can be configured to perform steps S501 to S505 of the embodiment described with Fig. 5. Correspondingly, the third determining module 80 can be configured to perform step S506 of the embodiment described with Fig. 5. For details, refer to the embodiment shown in Fig. 5.
It should be understood that the above embodiment uses the news title and the news body as matching conditions for information similarity matching. In other embodiments of the invention, the news summary, the source website, and the like can also be used as auxiliary similarity matching conditions to improve the precision of similarity matching.
It should be noted that, across the news title matching, the matching of the original title of a news title, and the news body matching described above, as soon as any stage determines that the two pieces of news information are similar, the subsequent matching stages are skipped, which effectively improves recognition efficiency.
According to the device for recognizing similar news information of the embodiments of the invention, any two pieces of news information to be recognized are obtained, and when the similarity of the news titles of the two pieces of news information is judged to meet the first preset condition, the two pieces of news information are determined to be similar news information. Similar news information can thus be recognized accurately and quickly from a large volume of news information, providing a basis for deduplicating and comparing news information.
Further, after similar news information is recognized, it can be deduplicated, and the deduplicated news information can be provided to the user. Similar news is thereby removed from a large volume of news before it is provided to the user, which speeds up reading and improves the efficiency with which the user obtains information.
In this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described can be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art can combine the different embodiments or examples described in this specification and the features of those different embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the invention, "multiple" means two or more, unless specifically defined otherwise.
Any process or method description in a flowchart or otherwise described herein can be understood as representing a module, fragment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein can, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the invention can be integrated in one processing module, each unit can exist physically on its own, or two or more units can be integrated in one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
The storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the invention; those of ordinary skill in the art can change, modify, replace, and vary the above embodiments within the scope of the invention.

Claims (10)

1. A method for recognizing similar news information, characterized by comprising the following steps:
obtaining any two pieces of news information to be recognized, the news information including a news title;
judging whether the similarity of the two news titles meets a first preset condition;
when the similarity of the two news titles meets the first preset condition, determining that the two pieces of news information are similar news information.
2. The method of claim 1, characterized in that the method further comprises:
when the similarity of the two news titles does not meet the first preset condition, judging whether the news information includes a corresponding original news title;
when the news information includes an original news title, judging whether the similarity between one news title and the other original news title meets a second preset condition;
when the similarity between the one news title and the other original news title meets the second preset condition, determining that the two pieces of news information are similar news information.
3. The method of claim 2, characterized in that the news information further includes a news body, and the method further comprises:
when the news information does not include a corresponding original news title, or when the similarity between one news title and the other original news title does not meet the second preset condition, judging whether the similarity of the two news bodies meets a third preset condition;
when the similarity of the two news bodies meets the third preset condition, determining that the two pieces of news information are similar news information.
4. The method of claim 1, characterized in that judging whether the similarity of the two news titles meets the first preset condition comprises:
judging whether the two news titles are identical;
correspondingly, determining that the two pieces of news information are similar news information when the similarity of the two news titles meets the first preset condition comprises:
when the two news titles are identical, determining that the two pieces of news information are similar news information.
5. The method of claim 4, characterized by further comprising:
when the two news titles are not identical, obtaining the vocabulary set produced by segmenting each news title, the vocabulary set including the words obtained by segmenting the news title;
judging whether one vocabulary set contains all the words in the other vocabulary set;
correspondingly, determining that the two pieces of news information are similar news information when the similarity of the two news titles meets the first preset condition comprises:
when the one vocabulary set contains all the words in the other vocabulary set, determining that the two pieces of news information are similar news information.
6. The method of claim 5, characterized by further comprising:
when the one vocabulary set does not contain all the words in the other vocabulary set, judging whether a first matching degree between the words in the one vocabulary set and the words in the other vocabulary set is greater than a first threshold; and judging, according to dependency grammar relations, whether a second matching degree between the words in the one vocabulary set and the words in the other vocabulary set is greater than a second threshold;
correspondingly, determining that the two pieces of news information are similar news information when the similarity of the two news titles meets the first preset condition comprises:
when the first matching degree is greater than the first threshold and the second matching degree is greater than the second threshold, determining that the two pieces of news information are similar news information.
7. The method of claim 2, characterized in that judging whether the similarity between one news title and the other original news title meets the second preset condition comprises:
judging whether the one news title and the other original news title are identical;
correspondingly, determining that the two pieces of news information are similar news information when the similarity between the one news title and the other original news title meets the second preset condition comprises:
when the one news title and the other original news title are identical, determining that the two pieces of news information are similar news information.
8. The method of claim 3, characterized in that judging whether the similarity of the two news bodies meets the third preset condition comprises:
extracting the keywords of the news bodies corresponding to the two news titles, respectively, to obtain a first keyword set and a second keyword set;
determining the weight of each keyword in the first keyword set and the second keyword set;
determining, according to the name and weight of each keyword, the identical keywords in the first keyword set and the second keyword set;
determining the keyword repetition rate of the first keyword set and the second keyword set according to the identical keywords;
judging whether the repetition rate is greater than a preset probability;
correspondingly, determining that the two pieces of news information are similar news information when the similarity of the two news bodies meets the third preset condition comprises:
when the repetition rate is greater than the preset probability, determining that the two pieces of news information are similar news information.
9. A device for recognizing similar news information, characterized by comprising:
an acquisition module, configured to obtain any two pieces of news information to be recognized, the news information including a news title;
a first judging module, configured to judge whether the similarity of the two news titles meets a first preset condition;
a first determining module, configured to determine that the two pieces of news information are similar news information when the similarity of the two news titles meets the first preset condition.
10. The device of claim 9, characterized in that the device further comprises:
a second judging module, configured to judge, when the similarity of the two news titles does not meet the first preset condition, whether the news information includes a corresponding original news title;
a third judging module, configured to judge, when the news information includes an original news title, whether the similarity between one news title and the other original news title meets a second preset condition;
a second determining module, configured to determine that the two pieces of news information are similar news information when the similarity between the one news title and the other original news title meets the second preset condition.
CN201610765203.9A 2016-08-30 2016-08-30 The recognition methods of similar news information and device Active CN106202057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610765203.9A CN106202057B (en) 2016-08-30 2016-08-30 The recognition methods of similar news information and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610765203.9A CN106202057B (en) 2016-08-30 2016-08-30 The recognition methods of similar news information and device

Publications (2)

Publication Number Publication Date
CN106202057A true CN106202057A (en) 2016-12-07
CN106202057B CN106202057B (en) 2019-07-12

Family

ID=58089017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610765203.9A Active CN106202057B (en) 2016-08-30 2016-08-30 The recognition methods of similar news information and device

Country Status (1)

Country Link
CN (1) CN106202057B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609106A (en) * 2017-09-12 2018-01-19 马上消费金融股份有限公司 Similar article searching method, device, equipment and storage medium
CN108304425A (en) * 2017-04-21 2018-07-20 腾讯科技(深圳)有限公司 A kind of graph text information recommends method, apparatus and system
CN108595464A (en) * 2018-01-31 2018-09-28 深圳市富途网络科技有限公司 A kind of method and system for realizing the similar news duplicate removal of multi-source
CN110245275A (en) * 2019-06-18 2019-09-17 中电科大数据研究院有限公司 A kind of extensive similar quick method for normalizing of headline
CN112926298A (en) * 2021-03-02 2021-06-08 北京百度网讯科技有限公司 News content identification method, related device and computer program product
CN113660541A (en) * 2021-07-16 2021-11-16 北京百度网讯科技有限公司 News video abstract generation method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350032A (en) * 2008-09-23 2009-01-21 胡辉 Method for judging whether web page content is identical or not
US20100017390A1 (en) * 2008-07-16 2010-01-21 Kabushiki Kaisha Toshiba Apparatus, method and program product for presenting next search keyword
CN102184256A (en) * 2011-06-02 2011-09-14 北京邮电大学 Clustering method and system aiming at massive similar short texts
CN103678275A (en) * 2013-04-15 2014-03-26 南京邮电大学 Two-level text similarity calculation method based on subjective and objective semantics
CN105630929A (en) * 2015-12-22 2016-06-01 北京奇虎科技有限公司 Comment based news recommendation weight determination method and apparatus
CN105760526A (en) * 2016-03-01 2016-07-13 网易(杭州)网络有限公司 News classification method and device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304425A (en) * 2017-04-21 2018-07-20 腾讯科技(深圳)有限公司 Image-text information recommendation method, device and system
CN108304425B (en) * 2017-04-21 2021-01-08 腾讯科技(深圳)有限公司 Image-text information recommendation method, device and system
CN107609106A (en) * 2017-09-12 2018-01-19 马上消费金融股份有限公司 Similar article searching method, device, equipment and storage medium
CN108595464A (en) * 2018-01-31 2018-09-28 深圳市富途网络科技有限公司 Method and system for deduplicating similar multi-source news
CN110245275A (en) * 2019-06-18 2019-09-17 中电科大数据研究院有限公司 Large-scale similar news headline rapid normalization method
CN110245275B (en) * 2019-06-18 2023-09-01 中电科大数据研究院有限公司 Large-scale similar news headline rapid normalization method
CN112926298A (en) * 2021-03-02 2021-06-08 北京百度网讯科技有限公司 News content identification method, related device and computer program product
CN113660541A (en) * 2021-07-16 2021-11-16 北京百度网讯科技有限公司 News video abstract generation method and device
CN113660541B (en) * 2021-07-16 2023-10-13 北京百度网讯科技有限公司 Method and device for generating abstract of news video

Also Published As

Publication number Publication date
CN106202057B (en) 2019-07-12

Similar Documents

Publication Publication Date Title
CN106202057A (en) Method and device for identifying similar news information
CN106649818B (en) Application search intention identification method and device, application search method and server
US11853879B2 (en) Generating vector representations of documents
US10346879B2 (en) Method and system for identifying web documents for advertisements
CN106571139B (en) Voice search result processing method and device based on artificial intelligence
CN108733779A (en) Method and apparatus for matching pictures to text
US20220172247A1 (en) Method, apparatus and program for classifying subject matter of content in a webpage
JP2019504410A (en) Travel guide generation method and system
CN109144954A (en) Resource recommendation method, device and electronic equipment for document editing
CN106204156A (en) Advertisement placement method and device for network forums
CN106339510A (en) Click prediction method and device based on artificial intelligence
CN106682170B (en) Application search method and device
CN109766550B (en) Text brand recognition method, recognition device and storage medium
CN102043843A (en) Method and obtaining device for obtaining target entry based on target application
CN104268192B (en) Webpage information extraction method, device and terminal
CN111212303A (en) Video recommendation method, server and computer-readable storage medium
CN111506794A (en) Rumor management method and device based on machine learning
CN111291177A (en) Information processing method and device and computer storage medium
KR102428666B1 (en) A device that executes an algorithm that stores and manages real estate data based on big data
CN107168546A (en) Input reminding method and device
CN109726289A (en) Event detecting method and device
CN107122492A (en) Lyric generation method and device based on picture content
CN111414757A (en) Text recognition method and device
CN110909120A (en) Resume searching/delivering method, device and system and electronic equipment
CN106528676A (en) Entity semantic retrieval processing method and device based on artificial intelligence

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant