CN108491512A - Summarization method and device for news titles - Google Patents

Summarization method and device for news titles

Info

Publication number
CN108491512A
Authority
CN
China
Prior art keywords
news
title
original header
headline
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810247766.8A
Other languages
Chinese (zh)
Inventor
邬小鹏
余晓龙
张华泉
王浩
张向征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd
Priority to CN201810247766.8A
Publication of CN108491512A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

The present invention provides a summarization method and device for news titles. The method includes: obtaining the original title of a news item and performing lexical and syntactic analysis on it to obtain an analysis result; extracting, based on the analysis result, the sentence trunk of the original title and using the extracted trunk as a candidate news title; and assessing the quality of the candidate news title with a summary quality evaluation strategy for news titles, then determining the news summary title according to the assessment result. Embodiments of the invention compress and summarize news titles by means of lexical and syntactic analysis, so that the trunk of the title is extracted while the key information of the original title is retained as far as possible, yielding a more accurate and more rigorous news title.

Description

Summarization method and device for news titles
Technical field
The present invention relates to the technical field of Internet applications, and in particular to a summarization method and device for news titles.
Background
On today's Internet, with its huge volume of information, users who search for news with a search engine generally pick out what they need from the title and description of each result and then click on it. How well a news title generalizes, accurately reflects, and covers the key information of the corresponding news therefore largely determines the user's experience of the search engine.
Current search engine products, news search in particular, mostly use the original title of the news directly as the title of the displayed search result. To attract attention and increase clicks, however, original news titles are often filled with redundant information, or even overemphasize one aspect at the expense of the whole, which makes the title imprecise and inaccurate and may mislead users. In products that actively push news, such titles prevent users from quickly grasping the key information of the news, degrade the user experience, reduce users' desire to obtain information from the pushed content, and reduce their stickiness to the push product.
Therefore, removing redundant information from the original title of a news item to obtain a more accurate and more rigorous news title has become a technical problem that urgently needs to be solved.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a summarization method and device for news titles that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a summarization method for news titles is provided, including:
obtaining the original title of a news item, and performing lexical and syntactic analysis on the original title to obtain an analysis result;
extracting, based on the analysis result, the sentence trunk of the original title, and using the extracted sentence trunk as a candidate news title;
assessing the quality of the candidate news title with a summary quality evaluation strategy for news titles, and then determining the news summary title according to the assessment result.
Optionally, obtaining the original title of the news item includes:
obtaining a crawl log about news resources captured by a web crawler;
extracting the original title of the news item from the crawl log.
Optionally, extracting the original title of the news item from the crawl log includes:
for each record about a news resource in the crawl log, extracting the field value of a specific field of the record as the original title of the news item.
Optionally, performing lexical and syntactic analysis on the original title to obtain an analysis result includes:
performing word segmentation on the original title to obtain multiple segmented words;
performing part-of-speech tagging and entity category tagging on each of the segmented words;
performing dependency parsing on the original title based on the part-of-speech tag and entity category tag of each segmented word, and identifying the dependency head index and dependency type of each segmented word.
Optionally, the method for performing word segmentation on the original title includes at least one of the following:
a segmentation method based on string matching;
a segmentation method based on semantic understanding;
a segmentation method based on statistics.
Optionally, performing entity category tagging on each of the segmented words includes:
using a sequence labeling model to identify the entity words among the segmented words and tag their entity categories.
Optionally, the entity category includes any one of the following:
person name, place name, organization name, brand name, software name.
Optionally, performing dependency parsing on the original title based on the part-of-speech tag and entity category tag of each segmented word, and identifying the dependency head index and dependency type of each segmented word, includes:
identifying the grammatical components of the original title from the part-of-speech tag and entity category tag of each segmented word;
analyzing the dependency relations between the identified grammatical components to obtain the dependency head index and dependency type of each segmented word.
Optionally, extracting the sentence trunk of the original title based on the analysis result includes:
generating a syntax tree from the part-of-speech tag, entity category tag, dependency head index and dependency type of each segmented word, and then generating the sentence trunk of the original title by filtering and pruning the syntax tree.
Optionally, generating a syntax tree from the part-of-speech tag, entity category tag, dependency head index and dependency type of each segmented word, and then generating the sentence trunk of the original title by filtering and pruning the syntax tree, includes:
selecting the head node corresponding to the core relation among the dependency types as the trunk predicate;
if the part of speech of the head node's word is a noun, merging all shallower dependent nouns of specific types into it to update the predicate;
if the part of speech of the head node's word is a verb, setting the head node as the predicate verb;
identifying negation words and merging them into the predicate.
Optionally, the method further includes:
identifying the subject-predicate relation node, merging the nodes surrounding the subject according to subject rules, keeping the noun parts of coordinated nodes and pruning the remaining nodes, and setting the subject node.
Optionally, the method further includes:
identifying the object according to its type: if it is a noun, removing all coordinated nodes, and setting the object node.
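The predicate, subject and object rules above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation described by the patent: the `Token` class, the relation labels `HED`, `SBV`, `VOB` and `ADV` (in the style of common Chinese dependency treebanks), and the negation word list are all hypothetical, and the merging of coordinated or modifier nouns and the noun-predicate case are left out.

```python
from dataclasses import dataclass

@dataclass
class Token:
    idx: int   # 1-based position in the title
    word: str
    pos: str   # part-of-speech tag, e.g. "n", "v", "d" (assumed tag set)
    head: int  # index of the head token, 0 for the root
    rel: str   # dependency type (hypothetical labels)

NEGATIONS = {"不", "没", "未"}  # assumed negation words

def extract_trunk(tokens):
    """Return the words of a subject-predicate-object trunk, in title order."""
    # The head node of the core relation becomes the trunk predicate.
    root = next(t for t in tokens if t.rel == "HED")
    keep = {root.idx}
    for t in tokens:
        if t.head != root.idx:
            continue  # prune everything that is not a direct child of the predicate
        if t.rel == "SBV" and t.pos.startswith("n"):
            keep.add(t.idx)  # subject node
        elif t.rel == "VOB" and t.pos.startswith("n"):
            keep.add(t.idx)  # noun object node
        elif t.rel == "ADV" and t.word in NEGATIONS:
            keep.add(t.idx)  # negation word merged into the predicate
    return [t.word for t in tokens if t.idx in keep]
```

In this simplified form, adverbials and attributive modifiers are dropped while a negation attached to the predicate survives, which matters because removing it would invert the meaning of the title.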
Optionally, assessing the quality of the candidate news title with the summary quality evaluation strategy for news titles includes:
compressing the original title with a neural machine translation model to obtain a rewritten news title;
using a language model to compute, for the rewritten news title and the candidate news title, the quality score of each sentence under that language model;
taking the computed quality scores as the assessment result for the quality of the candidate news title.
Optionally, determining the news summary title according to the assessment result includes:
among the rewritten news title and the candidate news title, determining the title with the highest computed quality score as the title to be selected;
if the quality score of the title to be selected is greater than a quality score threshold, judging whether the title to be selected meets preset audit conditions, and if so, determining the title to be selected as the news summary title.
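The selection logic described above can be sketched as follows. This is only an illustration: the function name, the dictionary of precomputed scores, and the `passes_audit` callback are assumptions introduced for the example, and the language-model scoring itself is taken as given.

```python
def choose_summary_title(scored_titles, score_threshold, passes_audit):
    """Pick the news summary title, or None if nothing qualifies.

    scored_titles maps each title (rewritten or candidate) to its
    language-model quality score; passes_audit is a caller-supplied
    function implementing the preset audit conditions.
    """
    if not scored_titles:
        return None
    # The highest-scoring title becomes the title to be selected.
    best_title, best_score = max(scored_titles.items(), key=lambda kv: kv[1])
    # It is accepted only above the threshold and after passing the audit.
    if best_score > score_threshold and passes_audit(best_title):
        return best_title
    return None
```

Returning `None` leaves room for a fallback, such as keeping the original title or sending the item to manual review; the source does not say which fallback is used.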
Optionally, whether the title to be selected meets the preset audit conditions includes at least one of the following:
whether the title to be selected has a subject-predicate structure;
whether the title to be selected has a subject-predicate structure and its predicate contains a verb component;
whether the edit distance between the title to be selected and the original title is less than an edit distance threshold;
whether the semantic distance between the title to be selected and the original title is less than a semantic distance threshold.
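The edit distance condition can be illustrated with the standard Levenshtein distance. The source does not say which edit distance variant or threshold is used, so both the character-level Levenshtein definition and the `max_distance` parameter below are assumptions made for this sketch.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def within_edit_threshold(selected_title, original_title, max_distance):
    """Audit check: the selected title must stay close to the original."""
    return edit_distance(selected_title, original_title) < max_distance
```

A check of this kind guards against a compressed title drifting too far from the wording of the original.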
Optionally, after determining the news summary title according to the assessment result, the method further includes:
providing the news summary title to a real-time trending product module, so that the real-time trending product module displays the news summary title as a real-time trending item.
According to another aspect of the present invention, a summarization device for news titles is also provided, including:
an acquisition module, adapted to obtain the original title of a news item;
an analysis module, adapted to perform lexical and syntactic analysis on the original title to obtain an analysis result;
an extraction module, adapted to extract, based on the analysis result, the sentence trunk of the original title and use the extracted sentence trunk as a candidate news title;
a determining module, adapted to assess the quality of the candidate news title with a summary quality evaluation strategy for news titles, and then determine the news summary title according to the assessment result.
Optionally, the acquisition module is further adapted to:
obtain a crawl log about news resources captured by a web crawler;
extract the original title of the news item from the crawl log.
Optionally, the acquisition module is further adapted to:
for each record about a news resource in the crawl log, extract the field value of a specific field of the record as the original title of the news item.
Optionally, the analysis module includes:
a segmentation unit, adapted to perform word segmentation on the original title to obtain multiple segmented words;
a tagging unit, adapted to perform part-of-speech tagging and entity category tagging on each of the segmented words;
a recognition unit, adapted to perform dependency parsing on the original title based on the part-of-speech tag and entity category tag of each segmented word, and identify the dependency head index and dependency type of each segmented word.
Optionally, the method for performing word segmentation on the original title includes at least one of the following:
a segmentation method based on string matching;
a segmentation method based on semantic understanding;
a segmentation method based on statistics.
Optionally, the tagging unit is further adapted to:
use a sequence labeling model to identify the entity words among the segmented words and tag their entity categories.
Optionally, the entity category includes any one of the following:
person name, place name, organization name, brand name, software name.
Optionally, the recognition unit is further adapted to:
identify the grammatical components of the original title from the part-of-speech tag and entity category tag of each segmented word;
analyze the dependency relations between the identified grammatical components to obtain the dependency head index and dependency type of each segmented word.
Optionally, the extraction module is further adapted to:
generate a syntax tree from the part-of-speech tag, entity category tag, dependency head index and dependency type of each segmented word, and then generate the sentence trunk of the original title by filtering and pruning the syntax tree.
Optionally, the extraction module is further adapted to:
select the head node corresponding to the core relation among the dependency types as the trunk predicate;
if the part of speech of the head node's word is a noun, merge all shallower dependent nouns of specific types into it to update the predicate;
if the part of speech of the head node's word is a verb, set the head node as the predicate verb;
identify negation words and merge them into the predicate.
Optionally, the extraction module is further adapted to:
identify the subject-predicate relation node, merge the nodes surrounding the subject according to subject rules, keep the noun parts of coordinated nodes and prune the remaining nodes, and set the subject node.
Optionally, the extraction module is further adapted to:
identify the object according to its type: if it is a noun, remove all coordinated nodes, and set the object node.
Optionally, the determining module is further adapted to:
compress the original title with a neural machine translation model to obtain a rewritten news title;
use a language model to compute, for the rewritten news title and the candidate news title, the quality score of each sentence under that language model;
take the computed quality scores as the assessment result for the quality of the candidate news title.
Optionally, the determining module is further adapted to:
among the rewritten news title and the candidate news title, determine the title with the highest computed quality score as the title to be selected;
if the quality score of the title to be selected is greater than a quality score threshold, judge whether the title to be selected meets preset audit conditions, and if so, determine the title to be selected as the news summary title.
Optionally, whether the title to be selected meets the preset audit conditions includes at least one of the following:
whether the title to be selected has a subject-predicate structure;
whether the title to be selected has a subject-predicate structure and its predicate contains a verb component;
whether the edit distance between the title to be selected and the original title is less than an edit distance threshold;
whether the semantic distance between the title to be selected and the original title is less than a semantic distance threshold.
Optionally, the device further includes: a providing module, adapted to provide the news summary title to a real-time trending product module after the determining module determines the news summary title according to the assessment result, so that the real-time trending product module displays the news summary title as a real-time trending item.
According to another aspect of the present invention, a computer storage medium is also provided. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to execute the summarization method for news titles described above.
According to yet another aspect of the present invention, a computing device is also provided, including: a processor; and a memory storing computer program code. When the computer program code is run by the processor, it causes the computing device to execute the summarization method for news titles described above.
An embodiment of the present invention provides a summarization method for news titles: the original title of a news item is first obtained and lexical and syntactic analysis is performed on it to obtain an analysis result; then, based on the analysis result, the sentence trunk of the original title is extracted and used as a candidate news title; finally, the quality of the candidate news title is assessed with the summary quality evaluation strategy for news titles, and the news summary title is determined according to the assessment result. It can be seen that embodiments of the invention compress and summarize news titles by means of lexical and syntactic analysis, so that the trunk of the title is extracted while the key information of the original title is retained as far as possible, yielding a more accurate and more rigorous news title. At the same time, a summary quality evaluation strategy is introduced to assess the quality of the candidate news title, and results of good summary quality are approved automatically, which reduces the cost of manual review and significantly reduces the push delay that manual review causes.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
From the following detailed description of specific embodiments of the invention in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages and features of the invention.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a summarization method for news titles according to an embodiment of the present invention;
Fig. 2 shows a flowchart of a method for performing lexical and syntactic analysis on the original title of a news item according to an embodiment of the present invention;
Fig. 3 shows a flowchart of a method for extracting the sentence trunk of the original title of a news item according to an embodiment of the present invention;
Fig. 4 shows a flowchart of a method for assessing the quality of a candidate news title according to an embodiment of the present invention;
Fig. 5 shows a flowchart of a method for determining the news summary title according to the assessment result, according to an embodiment of the present invention;
Fig. 6 shows a news summary title displayed on a search results page according to an embodiment of the present invention;
Fig. 7 shows a flowchart of a summarization method for news titles according to another embodiment of the present invention;
Fig. 8 shows a structural diagram of a summarization device for news titles according to an embodiment of the present invention; and
Fig. 9 shows a structural diagram of a summarization device for news titles according to another embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
In the related art, the main approaches to sentence compression are deleting words from the sentence and replacing, reordering, or inserting words in the sentence. Of these, word deletion has become the mainstream approach because of its lower complexity; the techniques used mainly include the noisy channel model, structured discriminative models, tree-to-tree transformation, and integer linear programming. In terms of overall effect, the current mainstream techniques are limited by the number of words deleted from the sentence, and the compression effect is not pronounced, as in the following example:
Original sentence: But they are still continuing to search the area to try and see if there were, in fact, any further shooting incidents.
Compressed sentence: They are continuing to search the area to see if there were any further incidents.
With the related-art approaches mentioned above, which are based on deleting, replacing, reordering, or inserting words in a sentence, it is on the one hand difficult to capture the content and information of every title, and on the other hand the titles revised by these approaches are generally too long. In both accuracy and revised title length, they therefore struggle to meet users' needs and expectations for a product. In addition, given the effectiveness and current state of the related-art solutions, the summarized results have to be manually reviewed and are pushed online only after passing review, in order to meet the high precision requirements of consumer products. These solutions therefore still carry a sizeable manual operation cost, and the manual workflow leaves the summarized results with low coverage and poor timeliness.
To solve the above technical problems, an embodiment of the present invention provides a summarization method for news titles. As shown in Fig. 1, the method may include the following steps S102 to S106.
Step S102: obtain the original title of a news item, and perform lexical and syntactic analysis on the original title to obtain an analysis result.
Step S104: based on the analysis result, extract the sentence trunk of the original title, and use the extracted sentence trunk as a candidate news title.
Step S106: assess the quality of the candidate news title with the summary quality evaluation strategy for news titles, and then determine the news summary title according to the assessment result.
An embodiment of the present invention provides a summarization method for news titles: the original title of a news item is first obtained and lexical and syntactic analysis is performed on it to obtain an analysis result; then, based on the analysis result, the sentence trunk of the original title is extracted and used as a candidate news title; finally, the quality of the candidate news title is assessed with the summary quality evaluation strategy for news titles, and the news summary title is determined according to the assessment result. It can be seen that embodiments of the invention compress and summarize news titles by means of lexical and syntactic analysis, so that the trunk of the title is extracted while the key information of the original title is retained as far as possible, yielding a more accurate and more rigorous news title. At the same time, a summary quality evaluation strategy is introduced to assess the quality of the candidate news title, and results of good summary quality are approved automatically, which reduces the cost of manual review and significantly reduces the push delay that manual review causes.
For obtaining the original title of the news item in step S102 above, an embodiment of the present invention provides an optional scheme in which a crawl log about news resources captured by a web crawler is obtained, and the original title of the news item is then extracted from the crawl log.
Here, a web crawler is a program or script that automatically fetches web information according to certain rules. When downloading Internet resources, a web crawler may start, for example, from the home page of a portal website: it first downloads the home page and then, by analyzing that page, finds all the hyperlinks in it, which amounts to knowing every page the home page links to directly, such as mail, finance, and news. It next visits, downloads, and analyzes those pages, such as the portal's mail page, and finds the further pages they link to. If the computer keeps doing this without stopping, the entire Internet can be downloaded. The crawler must of course also record which pages have already been downloaded, to avoid fetching them again. In a web crawler, a structure called a "hash table", rather than a notepad, is used to record whether a web page has been downloaded.
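The bookkeeping described above can be sketched as a breadth-first traversal that uses a hash-based set to remember visited pages. This is a simplified illustration: `fetch_links` is a hypothetical caller-supplied function that returns the links found on a page, so the sketch performs no real network access, and the `max_pages` limit is an assumption added to keep the traversal bounded.

```python
from collections import deque

def crawl(start_url, fetch_links, max_pages=100):
    """Visit pages breadth-first, never downloading the same URL twice."""
    visited = {start_url}        # hash-based set: O(1) average membership test
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)        # stands in for downloading and parsing the page
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order
```

Marking a URL as visited when it is queued, rather than when it is fetched, prevents the same page from being queued several times by different referring pages.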
As for extracting the original title from the crawl log, this may specifically be done as follows: for each record about a news resource in the crawl log, the field value of a specific field of the record is extracted as the original title. For example, if the record format for news resources in the web crawler's crawl log is url_id + \t + url_title + \t + crawl_time, the field value of url_title is extracted as the original title. It should be noted that this example is only illustrative and does not limit the embodiments of the present invention.
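A minimal sketch of this field extraction is given below, assuming that each record is one line with exactly three tab-separated fields in the order url_id, url_title, crawl_time. The function name and the decision to skip malformed lines are choices made for the example, not details from the source.

```python
def extract_titles(log_lines):
    """Return the url_title field of every well-formed crawl log record."""
    titles = []
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip records that do not match the assumed format
        url_id, url_title, crawl_time = fields
        if url_title:
            titles.append(url_title)
    return titles
```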
Further, for performing lexical and syntactic analysis on the original title in step S102 to obtain an analysis result, an embodiment of the present invention provides an optional scheme. Fig. 2 shows a flowchart of a method for performing lexical and syntactic analysis on the original title of a news item according to an embodiment of the present invention. As shown in Fig. 2, the method may include the following steps S202 to S206.
Step S202: perform word segmentation on the original title to obtain multiple segmented words.
Step S204: perform part-of-speech tagging and entity category tagging on each of the segmented words.
Step S206: based on the part-of-speech tag and entity category tag of each segmented word, perform dependency parsing on the original title and identify the dependency head index and dependency type of each segmented word.
In step S202, the method for performing word segmentation on the original title may include a segmentation method based on string matching, a segmentation method based on semantic understanding, a segmentation method based on statistics, and so on; the embodiments of the present invention place no limit on this.
The segmentation method based on string matching, also called mechanical segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching segmentation can be divided into forward matching and reverse matching; by which length is matched first, it can be divided into maximum (longest) matching and minimum (shortest) matching. Several commonly used mechanical segmentation methods are as follows:
1) forward maximum matching (scanning from left to right);
2) reverse maximum matching (scanning from right to left);
3) minimum segmentation (minimizing the number of words cut out of each sentence);
4) bidirectional maximum matching (two scans, one from left to right and one from right to left).
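Forward and reverse maximum matching from the list above can be sketched as follows. The dictionary contents, the `max_len` window, and the fallback of emitting an unmatched single character as its own word are assumptions made for this illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # single character as fallback
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # restore left-to-right order
```

On an ambiguous string such as 研究生命的起源 with a dictionary containing both 研究生 and 生命, the two scans can disagree, and comparing their outputs is the basis of bidirectional matching.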
In actual segmentation, the above methods can also be combined; for example, forward maximum matching and reverse maximum matching can be combined to form a bidirectional matching method. Because of the way Chinese characters form words, forward minimum matching and reverse minimum matching are rarely used. Generally speaking, reverse matching segments slightly more accurately than forward matching and encounters fewer ambiguities. Statistics show that the error rate of forward maximum matching used alone is 1/169, and that of reverse maximum matching used alone is 1/245. Even this accuracy falls far short of practical needs. Segmentation systems in actual use treat mechanical segmentation as a preliminary segmentation step and further improve segmentation accuracy by drawing on various other kinds of linguistic information. One approach is to improve the scanning scheme, known as feature scanning or marker segmentation: words with obvious features are first identified and cut out of the string to be analyzed, and, with these words as breakpoints, the original string is divided into smaller strings that are then mechanically segmented, reducing the matching error rate. Another approach combines segmentation with part-of-speech tagging: rich part-of-speech information assists segmentation decisions, and during tagging the segmentation result is in turn checked and adjusted, which greatly improves segmentation accuracy.
The segmentation method based on semantic understanding recognizes words by having the computer simulate a human's understanding of the sentence. Its basic idea is to perform syntactic and semantic analysis during segmentation and to use the syntactic and semantic information to resolve ambiguity. It usually comprises three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Coordinated by the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to resolve segmentation ambiguity; that is, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information.
The statistics-based segmentation method starts from the observation that, formally, a word is a stable combination of characters, so the more often adjacent characters occur together in context, the more likely they are to form a word. The frequency or probability with which adjacent characters co-occur therefore reflects fairly well how credible it is that they form a word. The frequency of each combination of adjacent co-occurring characters in a corpus can be counted and their mutual information calculated: mutual information is defined for two characters, and the adjacent co-occurrence probability of two Chinese characters X and Y is computed. Mutual information reflects how tightly Chinese characters are bound together. When the tightness exceeds a threshold, the character group is considered likely to form a word. This method only needs to count character-group frequencies in the corpus and requires no segmentation dictionary, so it is also called dictionary-free segmentation or statistical word extraction. It has limitations, however: it often extracts character groups that co-occur frequently but are not words, such as "this", "one of", "some", "my", and "many"; it recognizes common words with poor accuracy; and its time and space costs are high. Practical statistical segmentation systems use a basic segmentation dictionary (a dictionary of common words) for string-matching segmentation while using statistical methods to recognize new words. Combining string-frequency statistics with string matching retains the speed and efficiency of matching-based segmentation while exploiting the ability of dictionary-free segmentation to recognize new words from context and to resolve ambiguity automatically.
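As a rough illustration of the mutual-information idea above, the sketch below scores every pair of adjacent characters by pointwise mutual information, log2(P(X,Y) / (P(X)P(Y))), and keeps the pairs above a threshold. The patent does not give the exact formula or threshold, so the estimator, the function names, and the toy corpus are assumptions.

```python
import math
from collections import Counter


def adjacent_mutual_information(corpus):
    """Return {(x, y): PMI} for every pair of adjacent characters in the corpus."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def candidate_words(corpus, threshold):
    """Keep the character pairs whose mutual information exceeds the threshold."""
    scores = adjacent_mutual_information(corpus)
    return {x + y for (x, y), score in scores.items() if score > threshold}


# Hypothetical three-sentence corpus, far too small for real use.
toy_corpus = ["大雪压垮菜市场", "大雪封路", "今天下大雪"]
print("大雪" in candidate_words(toy_corpus, threshold=0.0))  # True
```

On a corpus this small, rare pairs receive inflated scores, which is one reason practical systems combine such statistics with a base dictionary, as the paragraph above notes.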
Another class of methods is based on statistical machine learning. A large amount of already segmented text is provided first, and a statistical machine learning model learns the regularities of word segmentation from it (this is called training), so that unseen text can then be segmented. Chinese characters differ in their ability to stand alone as words; in addition, some characters often appear as prefixes and others as suffixes. Combined with information on whether two adjacent characters form a word, this yields a great deal of segmentation-related knowledge. The method thus makes full use of the regularities of Chinese word formation.
In step S204 above, each of the multiple tokens is tagged with its part of speech. The part-of-speech categories may be noun, verb, adjective, adverb, conjunction, interjection, quantifier, and so on; the embodiments of the present invention are not limited in this respect.
For the entity class tagging of each token in step S204, an embodiment of the present invention provides an optional scheme: a sequence labeling model may be used to recognize the entity words among the tokens and tag their entity classes. The entity class here may be a person name, place name, organization name, brand name, software name, and so on; the embodiments of the present invention are not limited thereto.
In practical applications, the sequence labeling model may be an HMM (Hidden Markov Model), an MEMM (Maximum Entropy Markov Model), a CRF (Conditional Random Field), and so on. Unlike a general classification problem, a sequence labeling model outputs a sequence of labels. Labels are typically related to one another, and these relations constitute structural information between labels. By exploiting this structural information, sequence labeling models can often outperform traditional classification methods on sequence labeling problems.
The dependency types mentioned in step S206 above can be as shown in Table 1. Note that the dependency types and examples in Table 1 are only illustrative and do not limit the embodiments of the present invention.
Table 1
Dependency type | Tag | Description | Example
Subject-verb relation | SBV | subject-verb | I gave her a bunch of flowers (I <-- gave)
Verb-object relation | VOB | direct object, verb-object | I gave her a bunch of flowers (gave --> flowers)
Indirect-object relation | IOB | indirect object, indirect-object | I gave her a bunch of flowers (gave --> her)
Fronted object | FOB | fronted object, fronting-object | He reads any book (book <-- reads)
Attributive relation | ATT | attribute | red apple (red <-- apple)
Adverbial structure | ADV | adverbial | very beautiful (very <-- beautiful)
Verb-complement structure | CMP | complement | finished the homework (do --> finish)
Coordination | COO | coordinate | mountains and seas (mountains --> seas)
Preposition-object relation | POB | preposition-object | inside the trade zone (in --> inside)
Independent structure | IS | independent structure | two clauses that are structurally independent of each other
Head relation | HED | head | the core of the whole sentence
Pivotal construction | DBL | double | He invited me to dinner (invited --> me)
For step S206 above, in which dependency syntactic analysis is performed on the original news title based on the part-of-speech tag and entity class tag of each token to identify each token's dependency node index and dependency type, an embodiment of the present invention provides an optional scheme. In this scheme, the grammatical components of the original news title are identified from the part-of-speech tag and entity class tag of each token, and the dependency relations among the identified grammatical components are then analyzed to obtain each token's dependency node index and dependency type.
Building on the dependency syntactic analysis above, when step S104 extracts the sentence trunk content of the original news title based on the analysis result, a syntax tree may be generated from each token's part-of-speech tag, entity class tag, dependency node index, and dependency type, and the sentence trunk content of the original news title is then generated by filtering and pruning the syntax tree.
Fig. 3 shows a flowchart of a method for extracting the sentence trunk content of the original news title according to an embodiment of the present invention. As shown in Fig. 3, the method may include the following steps S302 to S306.
Step S302: select the head node corresponding to the head relation (HED) among the dependency types as the trunk predicate.
Step S304: if the part of speech of the head node after segmentation is a noun, merge all nouns of specific classes that depend on it at a relatively shallow level and update the predicate; if the part of speech of the head node after segmentation is a verb, set the head node as the predicate verb.
Step S306: identify negation-word attributes and merge them into the predicate.
In an optional embodiment of the present invention, the subject-predicate relation node may also be identified: the nodes surrounding the subject are merged, coordinated nodes keep their noun parts according to the subject rules, the remaining nodes are pruned, and the subject node is set. In addition, the object may be identified according to the object type: if it is a noun, all of its coordinated nodes are removed, and the object node is set.
The embodiments of the present invention use lexical and syntactic analysis to produce a compressed summary of a news title, so that the trunk content of the title is extracted while the key information of the original title is preserved as far as possible, yielding a more accurate and more rigorous news title.
For step S106 above, in which the quality of the candidate news title is assessed using a summary quality evaluation strategy for news titles, an embodiment of the present invention provides an optional scheme. Fig. 4 shows a flowchart of a method for assessing the quality of the candidate news title according to an embodiment of the present invention. As shown in Fig. 4, the method may include the following steps S402 to S406.
Step S402: compress the original news title with a neural machine translation model to obtain a rewritten news title.
In this step, the neural machine translation model may be trained in advance; for example, it may be trained with Seq2Seq combined with an attention mechanism on historical data pairs that went online after review together with a manually annotated data set.
Step S404: for the rewritten news title and the candidate news title, use a language model to compute the quality score of each sentence under that language model.
Step S406: take the computed quality scores as the result of assessing the quality of the candidate news title.
After the computed quality scores have been taken as the assessment result according to steps S402 to S406, the news summary title can further be determined from the assessment result.
Fig. 5 shows a flowchart of a method for determining the news summary title according to the assessment result, according to an embodiment of the present invention. As shown in Fig. 5, the method may include the following steps S502 to S504.
Step S502: among the rewritten news title and the candidate news title, determine the title with the highest computed quality score as the title to be selected.
Step S504: if the quality score of the title to be selected exceeds the quality score threshold, judge whether the title to be selected satisfies the preset review conditions; if so, determine the title to be selected as the news summary title.
Here, whether the title to be selected satisfies the preset review conditions may include at least one of the following:
whether the title to be selected has subject-predicate syntax;
whether the title to be selected has subject-predicate syntax and its predicate contains a verb component;
whether the edit distance between the title to be selected and the original news title is less than an edit distance threshold;
whether the semantic distance between the title to be selected and the original news title is less than a semantic distance threshold.
In practical applications, the title to be selected may be determined as the news summary title when it satisfies any one of the preset review conditions, when it satisfies a combination of any two or more of them, or only when it satisfies all of them. For example, it may first be judged whether the title to be selected has subject-predicate syntax; if so, it is then judged whether the predicate contains a verb component. If the predicate contains a verb component, it is then judged whether the edit distance between the title to be selected and the original news title is less than the edit distance threshold. If it is, it is then judged whether the semantic distance between the title to be selected and the original news title is less than the semantic distance threshold. If it is, the title to be selected is determined as the news summary title.
In an optional embodiment of the present invention, after the news summary title is determined according to the assessment result, the news summary title may also be supplied to a real-time trending product module, which displays the news summary title as a real-time trending topic. In practical applications, the real-time trending product module can display the news summary title as a real-time trending topic on the search results page, which improves the user's search experience and raises the click-through rate of the search result items generated by the search engine. As shown in Fig. 6, on the search results page for the search term "rural revitalization", news summary titles are displayed as real-time trending topics.
Various implementations of the individual steps of the embodiment shown in Fig. 1 have been described above. The implementation of the news title summarization method of the present invention is described in detail below through a specific embodiment.
Fig. 7 shows a flowchart of a news title summarization method according to another embodiment of the present invention. As shown in Fig. 7, the method may include the following steps S702 to S708.
Step S702: crawl news resources on the Internet and extract the corresponding original news titles.
Step S704: apply word segmentation, lexical analysis, syntactic analysis, and entity recognition to the original news title, and extract the sentence trunk content of the original news title.
Step S706: generate the corresponding rewrite candidates with the neural machine translation model.
Step S708: assess rewrite quality with a language model and semantic features, and automatically review the high-quality rewrite results.
The embodiment of the present invention uses syntactic analysis to produce a compressed summary of the original news title, so that the trunk content of the original title is extracted while the key information of the original news is preserved as far as possible. At the same time, a rewrite-summary quality scoring model is introduced to assess the rewritten summaries, and results with good summary quality are reviewed automatically, which reduces the cost of manual operational review and significantly reduces the summary push delay caused by manual review.
The specific implementation of each part is described in detail below through a concrete example, in which the original news title is "湖北安陆突降大雪压垮菜市场 已救出13人" (sudden heavy snow in Anlu, Hubei crushes a food market; 13 people rescued).
(1) Model pre-training and acquisition of existing models
A neural machine translation model is trained with Seq2Seq combined with an attention mechanism on historical data pairs that went online after review and on a manually annotated data set; the training tool is 360's existing neural machine translation toolkit.
The training data is a parallel corpus in the following format:
Ori: Bank account manager illegally lends 1.6 million, of which 1.38 million could not be recovered
Sum: Bank account manager illegally issues loans
360's existing language model is obtained to assess the rewrite quality score.
(2) Title acquisition and lexical analysis of the title
The original news titles are obtained from the crawl log of the web crawler.
The format is as follows: url_id+\t+url_title+\t+crawl_time.
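Reading titles out of a log in this three-field, tab-separated layout could look like the sketch below. Only the field names url_id, url_title, and crawl_time come from the format above; the record type, the rule for skipping malformed lines, and the sample values are assumptions.

```python
from typing import NamedTuple, Optional


class CrawlRecord(NamedTuple):
    url_id: str
    url_title: str
    crawl_time: str


def parse_crawl_line(line: str) -> Optional[CrawlRecord]:
    """Split one tab-separated log line; return None when it is malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3 or not fields[1].strip():
        return None
    return CrawlRecord(fields[0], fields[1].strip(), fields[2])


def extract_titles(lines):
    """Yield the original news title from every well-formed record."""
    for line in lines:
        record = parse_crawl_line(line)
        if record is not None:
            yield record.url_title


# Hypothetical log lines; the id and timestamp values are made up.
sample_log = [
    "u001\t湖北安陆突降大雪压垮菜市场 已救出13人\t2018-01-04 10:00:00\n",
    "broken line without tabs\n",
]
print(list(extract_titles(sample_log)))
```

Malformed lines are skipped rather than raised so that a single bad record does not stop a batch run; a production pipeline might log them instead.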
Lexical analysis is a basic step in natural language processing; the part-of-speech tags, dependency relations, and entity tag types it outputs are the basic features on which subsequent techniques such as sentence trunk extraction and compressed summarization rely. The output after calling the existing 360 segmentation module is as follows:
Example: 湖北安陆突降大雪压垮菜市场 已救出13人
After segmentation: 湖北/ns 安陆/ns 突/d 降/v 大雪/n 压垮/v 菜市场/n 已/d 救出/v 13人/mq
Here the text before each "/" is the coarse-grained segmentation result, and the text after the "/" is the part-of-speech tag of the token.
Based on the segmentation result, the proper names and entity words in it are recognized using sequence labeling.
The format of the raw data to be labeled is shown in the first column of Table 2, and the labels output by the sequence labeling model are shown in the second and third columns of Table 2. In Table 2, B denotes the beginning character, E denotes the ending character, and LOC denotes a location. Note that this example is only illustrative and does not limit the embodiments of the present invention.
Table 2
Character | Tag | Entity class
湖 | B | LOC
北 | E | LOC
安 | B | LOC
陆 | E | LOC
突 | 0 |
降 | 0 |
大 | 0 |
雪 | 0 |
压 | 0 |
垮 | 0 |
菜 | 0 |
市 | 0 |
场 | 0 |
已 | 0 |
救 | 0 |
出 | 0 |
13 | 0 |
人 | 0 |
The results in Table 2 above are merged with the segmentation result.
After segmentation and entity recognition:
湖北/ns/LOC 安陆/ns/LOC 突/d/ 降/v/ 大雪/n/ 压垮/v/ 菜市场/n/ 已/d/ 救出/v/ 13人/mq/
Fields are separated by "/": the first field is the coarse-grained segmentation result, the second is the part-of-speech tag of the token, and the third is the entity class tag.
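The merge of character-level labels into the token-level result can be sketched as below. The alignment rule (a token is an entity only if its characters share one entity class, start with B, and end with E) and the function name are assumptions, and the Chinese sample data is a back-translation of the example title rather than text taken verbatim from the source.

```python
def merge_entity_tags(tokens, char_tags):
    """Attach an entity class to each token.

    tokens:    [(word, pos)] from the segmenter.
    char_tags: [(unit, position_tag, entity_class)] from the sequence labeler,
               assumed to align with token boundaries.
    Returns "word/pos/entity" strings; entity is empty for non-entities.
    """
    merged, offset = [], 0
    for word, pos in tokens:
        span, covered = [], 0
        # Consume labeled units until they spell out the whole token.
        while covered < len(word):
            unit = char_tags[offset]
            span.append(unit)
            covered += len(unit[0])
            offset += 1
        classes = {entity for _, _, entity in span}
        is_entity = (len(classes) == 1 and "" not in classes
                     and span[0][1] == "B" and span[-1][1] == "E")
        entity = span[0][2] if is_entity else ""
        merged.append(f"{word}/{pos}/{entity}")
    return merged


# Shortened, hypothetical input in the shape of Table 2.
tokens = [("湖北", "ns"), ("安陆", "ns"), ("突", "d"), ("13人", "mq")]
char_tags = [("湖", "B", "LOC"), ("北", "E", "LOC"),
             ("安", "B", "LOC"), ("陆", "E", "LOC"),
             ("突", "0", ""), ("13", "0", ""), ("人", "0", "")]
print(" ".join(merge_entity_tags(tokens, char_tags)))
```

The inner loop counts characters rather than rows because a labeled unit such as "13" can span more than one character of a token.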
Based on the segmentation and recognition results, the 360 basic syntactic analysis module is called to complete the syntactic analysis. The final lexical analysis result is:
湖北/ns/LOC/2/ATT
安陆/ns/LOC/4/SBV
突/d//4/ADV
降/v//0/HEAD
大雪/n//4/VOB
压垮/v//4/COO
菜市场/n//6/VOB
已/d//9/ADV
救出/v//6/COO
13人/mq//9/VOB
Fields are separated by "/": the first field is the coarse-grained segmentation result, the second is the part-of-speech tag of the token, the third is the entity class tag, the fourth is the dependency node index from the dependency syntactic analysis, and the fifth is the dependency type.
(3) Extraction of sentence trunk content
According to the lexical analysis features output in (2) above, a syntax tree is generated, and the sentence trunk is produced by filtering and pruning the syntax tree. The specific rules and algorithm are as follows:
Select the head node of the dependency parse as the trunk predicate;
If the part of speech of the head node after segmentation is a noun:
    merge all nouns of specific classes that depend on it at a relatively shallow level and update the predicate;
If the part of speech of the head node after segmentation is a verb:
    set the head node as the predicate verb;
Identify negation-word attributes and merge them into the predicate;
Identify the subject-predicate logical relation node:
    merge the nodes surrounding the subject; for coordinated nodes, keep the noun parts according to the subject rules and prune the remaining nodes; set the subject node;
According to the object type, identify the object; if it is a noun, remove all of its coordinated nodes; set the object node.
Original sentence: 湖北安陆突降大雪压垮菜市场 已救出13人
Sentence trunk: 湖北安陆降大雪压垮菜市场
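A simplified sketch of trunk extraction over the five-field analysis lines is given below. It is not the patent's full rule set: it handles only a verb head, keeps subjects and objects with their attributive modifiers, keeps first-level coordinated clauses, keeps negation adverbials from a small hypothetical word list, and prunes everything else. The Chinese input is a back-translation of the example, and all names are assumptions.

```python
ANALYSIS = """\
湖北/ns/LOC/2/ATT
安陆/ns/LOC/4/SBV
突/d//4/ADV
降/v//0/HEAD
大雪/n//4/VOB
压垮/v//4/COO
菜市场/n//6/VOB
已/d//9/ADV
救出/v//6/COO
13人/mq//9/VOB
"""

NEGATION_WORDS = {"不", "没", "没有", "未", "无"}  # hypothetical list


def parse_analysis(text):
    """Turn 'word/pos/entity/head/relation' lines into dicts with 1-based ids."""
    nodes = []
    for idx, line in enumerate(text.strip().splitlines(), start=1):
        word, pos, entity, head, rel = line.split("/")
        nodes.append({"id": idx, "word": word, "pos": pos,
                      "entity": entity, "head": int(head), "rel": rel})
    return nodes


def extract_trunk(nodes):
    """Keep the head predicate, its arguments, and first-level coordinated clauses."""
    children = {}
    for node in nodes:
        children.setdefault(node["head"], []).append(node)
    root = next(n for n in nodes if n["head"] == 0)
    keep = {root["id"]}

    def keep_argument(node):
        # Keep a subject or object together with its attributive (ATT) modifiers.
        keep.add(node["id"])
        for child in children.get(node["id"], []):
            if child["rel"] == "ATT":
                keep_argument(child)

    def keep_clause(verb, depth):
        for child in children.get(verb["id"], []):
            if child["rel"] in ("SBV", "VOB"):
                keep_argument(child)
            elif child["rel"] == "COO" and depth == 0:
                keep.add(child["id"])
                keep_clause(child, depth + 1)
            elif child["rel"] == "ADV" and child["word"] in NEGATION_WORDS:
                keep.add(child["id"])  # negation is merged into the predicate
            # Other adverbials and deeper coordinated clauses are pruned.

    keep_clause(root, 0)
    return "".join(n["word"] for n in nodes if n["id"] in keep)


print(extract_trunk(parse_analysis(ANALYSIS)))  # 湖北安陆降大雪压垮菜市场
```

Under these assumed rules the adverbial 突 and the nested clause 已救出13人 are pruned, which reproduces the sentence trunk given in the example.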
(4) Rewriting and generalization with the neural machine translation model
For each original news title, after segmentation, the pre-trained neural machine translation model produces a compressed summary to generate a candidate, and the sentence trunk is added to the candidate set as well. Neural machine translation can produce compressed summaries of sentences and articles end to end.
Input sample: 湖北安陆突降大雪压垮菜市场 已救出13人
Output candidate set:
Original sentence trunk: 湖北安陆降大雪压垮菜市场
Neural machine translation result: 湖北大雪压垮菜市场
(5) Title rewrite review based on the language model
For each candidate output for a title, a language model computes the sentence's score under the model, named quality_score.
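The embodiment uses an existing 360 language model whose details are not given. Purely as a stand-in, the sketch below computes a quality_score with a character bigram model and add-one smoothing, taking the average log-probability per transition; the class name, smoothing choice, and toy training corpus are all assumptions.

```python
import math
from collections import Counter


class BigramLM:
    """Character bigram model with add-one smoothing (a toy stand-in)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        vocab = set()
        for sentence in corpus:
            chars = ["<s>"] + list(sentence) + ["</s>"]
            vocab.update(chars)
            self.unigrams.update(chars[:-1])
            self.bigrams.update(zip(chars, chars[1:]))
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen characters

    def quality_score(self, sentence):
        """Average log-probability per transition; higher means more fluent."""
        chars = ["<s>"] + list(sentence) + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(chars, chars[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob / (len(chars) - 1)


# Hypothetical training corpus and candidates.
lm = BigramLM(["大雪压垮菜市场", "湖北大雪"])
candidates = ["大雪压垮菜市场", "场市菜垮压雪大"]
best = max(candidates, key=lm.quality_score)
print(best)  # 大雪压垮菜市场
```

Normalizing by length keeps short candidates from being favored merely because they contain fewer transitions.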
(6) Rule-based screening of high-quality titles for automatic publication
Initialize the following parameters:
quality_threshold,
jaccard_semantic_gap_threshold,
ed_semantic_gap_threshold;
For the candidates under each original title:
final_candidate = the candidate with the highest quality score after all candidates are sorted by quality score.
For final_candidate, if its quality score is greater than quality_threshold:
and it has subject-predicate syntax and its predicate contains a verb component:
and its edit distance and Jaccard semantic distance from the original title are each less than the corresponding semantic_gap_threshold:
then final_candidate is the automatically reviewed compressed summary of the corresponding title.
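The screening rules above can be sketched as runnable code. The patent does not define how the Jaccard semantic distance is computed, so a Jaccard distance over character sets is assumed here; the subject-predicate check is passed in as a caller-supplied function because it depends on the parser, and the threshold values in the usage lines are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the character sets of the two titles."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def auto_review(original, scored_candidates, has_verb_predicate,
                quality_threshold, ed_threshold, jaccard_threshold):
    """Return the best candidate if it passes every check, otherwise None.

    scored_candidates:  list of (title, quality_score) pairs.
    has_verb_predicate: callable implementing the subject-predicate check.
    """
    if not scored_candidates:
        return None
    final_candidate, score = max(scored_candidates, key=lambda pair: pair[1])
    if score <= quality_threshold:
        return None
    if not has_verb_predicate(final_candidate):
        return None
    if edit_distance(final_candidate, original) >= ed_threshold:
        return None
    if jaccard_distance(final_candidate, original) >= jaccard_threshold:
        return None
    return final_candidate


# Hypothetical scores and thresholds for the running example.
original = "湖北安陆突降大雪压垮菜市场 已救出13人"
scored = [("湖北安陆降大雪压垮菜市场", -1.0), ("湖北大雪压垮菜市场", -1.5)]
print(auto_review(original, scored, lambda title: True,
                  quality_threshold=-2.0, ed_threshold=10,
                  jaccard_threshold=0.5))
```

Only the top-scoring candidate is checked, matching the rule that final_candidate is chosen first and then screened.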
The embodiments of the present invention significantly reduce the large amount of manual effort required by traditional title rewriting and solve the problem of inconsistent rewriting results caused by inconsistent subjective standards among operations staff. After the news summary title is obtained, it can also be supplied to the 360 Search real-time trending product, which can be displayed on the search home page, on the right side of search results pages, on the browser home page, or in 360 Navigation. After the product rewrote original news titles with this method and published them automatically, its click-through rate improved noticeably compared with the previous manually edited approach.
Note that, in practical applications, all of the optional implementations above may be combined in any way to form optional embodiments of the present invention, which are not described one by one here.
Based on the news title summarization method provided by the embodiments above and on the same inventive concept, an embodiment of the present invention further provides a news title summarization apparatus.
Fig. 8 shows a structural diagram of a news title summarization apparatus according to an embodiment of the present invention. As shown in Fig. 8, the apparatus may include an acquisition module 810, an analysis module 820, an extraction module 830, and a determining module 840.
The functions of the components of the news title summarization apparatus of the embodiment of the present invention and the connections among them are now described:
The acquisition module 810 is adapted to obtain the original news title;
The analysis module 820, coupled to the acquisition module 810, is adapted to perform lexical and syntactic analysis on the original news title to obtain an analysis result;
The extraction module 830, coupled to the analysis module 820, is adapted to extract the sentence trunk content of the original news title based on the analysis result and to take the extracted sentence trunk content as the candidate news title;
The determining module 840, coupled to the extraction module 830, is adapted to assess the quality of the candidate news title using the summary quality evaluation strategy for news titles and then to determine the news summary title according to the assessment result.
In an optional embodiment of the present invention, the acquisition module 810 is further adapted to:
obtain the crawl log of news resources captured by a web crawler;
extract the original news title from the crawl log.
In an optional embodiment of the present invention, the acquisition module 810 is further adapted to:
for each record about a news resource in the crawl log, extract the value of a specified field of the record as the original news title.
In an optional embodiment of the present invention, as shown in Fig. 9, the analysis module 820 shown in Fig. 8 above may include:
a segmentation unit 821, adapted to perform word segmentation on the original news title to obtain multiple tokens;
a tagging unit 822, coupled to the segmentation unit 821, adapted to tag each of the multiple tokens with its part of speech and entity class;
a recognition unit 823, coupled to the tagging unit 822, adapted to perform dependency syntactic analysis on the original news title based on the part-of-speech tag and entity class tag of each token and to identify each token's dependency node index and dependency type.
In an optional embodiment of the present invention, the method of performing word segmentation on the original news title includes at least one of the following:
a segmentation method based on string matching;
a segmentation method based on semantic understanding;
a segmentation method based on statistics.
In an optional embodiment of the present invention, the tagging unit 822 is further adapted to:
use a sequence labeling model to recognize the entity words among the multiple tokens and tag their entity classes.
In an optional embodiment of the present invention, the entity class includes any one of the following:
person name, place name, organization name, brand name, software name.
In an optional embodiment of the present invention, the recognition unit 823 is further adapted to:
identify the grammatical components of the original news title from the part-of-speech tag and entity class tag of each token;
analyze the dependency relations among the identified grammatical components to obtain each token's dependency node index and dependency type.
In an optional embodiment of the present invention, the extraction module 830 is further adapted to:
generate a syntax tree from each token's part-of-speech tag, entity class tag, dependency node index, and dependency type, and then generate the sentence trunk content of the original news title by filtering and pruning the syntax tree.
In an optional embodiment of the present invention, the extraction module 830 is further adapted to:
select the head node corresponding to the head relation among the dependency types as the trunk predicate;
if the part of speech of the head node after segmentation is a noun, merge all nouns of specific classes that depend on it at a relatively shallow level and update the predicate;
if the part of speech of the head node after segmentation is a verb, set the head node as the predicate verb;
identify negation-word attributes and merge them into the predicate.
In an optional embodiment of the present invention, the extraction module 830 is further adapted to:
identify the subject-predicate relation node, merge the nodes surrounding the subject, keep the noun parts of coordinated nodes according to the subject rules, prune the remaining nodes, and set the subject node.
In an optional embodiment of the present invention, the extraction module 830 is further adapted to:
identify the object according to the object type; if it is a noun, remove all of its coordinated nodes, and set the object node.
In an optional embodiment of the present invention, the determining module 840 is further adapted to:
compress the original news title with a neural machine translation model to obtain a rewritten news title;
for the rewritten news title and the candidate news title, use a language model to compute the quality score of each sentence under that language model;
take the computed quality scores as the result of assessing the quality of the candidate news title.
In an optional embodiment of the present invention, the determining module 840 is further adapted to:
among the rewritten news title and the candidate news title, determine the title with the highest computed quality score as the title to be selected;
if the quality score of the title to be selected exceeds the quality score threshold, judge whether the title to be selected satisfies the preset review conditions, and if so, determine the title to be selected as the news summary title.
In an optional embodiment of the present invention, whether the title to be selected satisfies the preset review conditions includes at least one of the following:
whether the title to be selected has subject-predicate syntax;
whether the title to be selected has subject-predicate syntax and its predicate contains a verb component;
whether the edit distance between the title to be selected and the original news title is less than an edit distance threshold;
whether the semantic distance between the title to be selected and the original news title is less than a semantic distance threshold.
In an optional embodiment of the present invention, as shown in Fig. 9, the apparatus shown in Fig. 8 above may further include:
a providing module 910, adapted to supply the news summary title to a real-time trending product module after the determining module 840 determines the news summary title according to the assessment result, so that the real-time trending product module displays the news summary title as a real-time trending topic.
Based on same inventive concept, the embodiment of the present invention additionally provides a kind of computer storage media, and the computer is deposited Storage media is stored with computer program code, when the computer program code is run on the computing device, leads to the meter Calculate method of abstracting of the equipment execution according to above-mentioned headline.
Based on same inventive concept, the embodiment of the present invention additionally provides a kind of computing device, including:Processor;And it deposits Contain the memory of computer program code;When the computer program code is run by the processor, lead to the meter Calculate method of abstracting of the equipment execution according to above-mentioned headline.
According to the combination of any one above-mentioned alternative embodiment or multiple alternative embodiments, the embodiment of the present invention can reach Following advantageous effect:
An embodiment of the present invention provides a kind of method of abstracting of headline, obtain the original header of news first, then Morphology syntactic analysis is carried out to the original header of news, obtains analysis result;Subsequently, based on analysis result, the original of news is extracted Sentence trunk content in beginning title, and using the sentence trunk content of extraction as news candidate's title;Later, news mark is utilized The abstract quality evaluation strategy of topic, assesses the quality of news candidate's title, and then determine that news is plucked according to assessment result Want title.It can be seen that the embodiment of the present invention carries out compression abstract using morphology syntactic analysis to headline, make news mark Trunk content in topic remains the keynote message in former headline as far as possible while being extracted, can obtain it is more acurrate, At the same time more rigorous headline introduces abstract quality evaluation strategy, assesses the quality of news candidate's title, right It is audited automatically in the abstract preferable result of quality, to reduce the cost of artificial operation audit, and significantly reduces artificial examine Abstract push delay caused by core.
Those skilled in the art will clearly understand that, for the specific working processes of the systems, devices and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; for brevity, they are not repeated here.
In addition, the functional units in the embodiments of the present invention may be physically independent of one another, two or more functional units may be integrated together, or all functional units may be integrated in a single processing unit. The integrated functional units may be implemented in hardware, or in software or firmware.
Those of ordinary skill in the art will understand that, if the integrated functional unit is implemented in software and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied as a software product. The computer software product is stored in a storage medium and comprises a number of instructions which, when run, cause a computing device (for example a personal computer, a server or a network device) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
Alternatively, all or some of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions (for example a computing device such as a personal computer, a server or a network device). The program instructions may be stored in a computer-readable storage medium, and when they are executed by a processor of the computing device, the computing device executes all or some of the steps of the methods of the embodiments of the present invention.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that, within the spirit and principles of the present invention, the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the corresponding technical solutions to depart from the scope of protection of the present invention.

Claims (10)

1. A news headline summarization method, comprising:
obtaining an original title of a news item, and performing lexical and syntactic analysis on the original title to obtain an analysis result;
based on the analysis result, extracting the sentence backbone content of the original title, and using the extracted sentence backbone content as a candidate news title;
assessing the quality of the candidate news title using a summary quality evaluation strategy for news headlines, and determining a news summary title according to the assessment result.
2. The method according to claim 1, wherein obtaining the original title of the news item comprises:
obtaining a crawl log of news resources crawled by a web crawler;
extracting the original title of the news item from the crawl log.
3. The method according to claim 1 or 2, wherein extracting the original title of the news item from the crawl log comprises:
for each record of a news resource in the crawl log, extracting the value of a specified field of that record as the original title of the news item.
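The per-record field extraction can be illustrated with a minimal sketch. It assumes, purely for illustration, that the crawl log is stored as one JSON object per line and that the title sits in a field named `title`; the source does not specify the log format or the field name, so both are hypothetical.

```python
import json
from typing import Iterable, Iterator


def extract_titles(log_lines: Iterable[str], field: str = "title") -> Iterator[str]:
    """Yield the value of `field` from each well-formed JSON log record."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            yield value.strip()
```

Malformed records and records without the field are skipped rather than raising, so that one bad log line does not stop the extraction.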
4. The method according to any one of claims 1-3, wherein performing lexical and syntactic analysis on the original title of the news item to obtain the analysis result comprises:
performing word segmentation on the original title to obtain a plurality of segmented words;
performing part-of-speech tagging and entity category tagging on each of the plurality of segmented words;
based on the part-of-speech tag and entity category tag of each segmented word, performing dependency parsing on the original title, and identifying the index of the dependency head and the dependency type of each segmented word.
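One way to picture the analysis result of this claim is as a list of annotated words, each carrying a part-of-speech tag, an optional entity category, the index of its dependency head and a dependency type. The sketch below is an assumption about the shape of that result, not the patent's data format; the tag names (`n`, `v`, `a`), the entity label `BRAND` and the relation labels (`SBV`, `HED`, `ATT`, `VOB`) are hypothetical examples.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AnalyzedToken:
    index: int             # position of the word in the title
    text: str              # the segmented word
    pos: str               # part-of-speech tag
    entity: Optional[str]  # entity category, or None if not an entity
    head: int              # index of the dependency head; -1 marks the root
    deprel: str            # dependency type


def find_root(tokens: List[AnalyzedToken]) -> AnalyzedToken:
    """Return the single root word; raise if the parse has zero or several roots."""
    roots = [t for t in tokens if t.head == -1]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0]


def children(tokens: List[AnalyzedToken], head_index: int) -> List[AnalyzedToken]:
    """Return the words directly governed by the word at `head_index`."""
    return [t for t in tokens if t.head == head_index]


# Hand-annotated example for the title "小米发布新手机" ("Xiaomi releases new phone").
EXAMPLE = [
    AnalyzedToken(0, "小米", "n", "BRAND", 1, "SBV"),
    AnalyzedToken(1, "发布", "v", None, -1, "HED"),
    AnalyzedToken(2, "新", "a", None, 3, "ATT"),
    AnalyzedToken(3, "手机", "n", None, 1, "VOB"),
]
```

With this representation, the root and its direct dependents can be read straight from the head indices, which is the information a backbone extraction step would need.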
5. The method according to any one of claims 1-4, wherein the method used to perform word segmentation on the original title of the news item comprises at least one of the following:
a segmentation method based on string matching;
a segmentation method based on semantic understanding;
a segmentation method based on statistics.
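As an illustration of the first option, forward maximum matching is one well-known string-matching segmentation technique: at each position it takes the longest dictionary word that matches, and falls back to a single character when nothing matches. The source does not say which string-matching algorithm is used, so this choice, the function name and the toy dictionary are assumptions.

```python
from typing import List, Optional, Set


def forward_max_match(
    text: str, dictionary: Set[str], max_len: Optional[int] = None
) -> List[str]:
    """Segment `text` greedily, preferring the longest dictionary word at each position."""
    if max_len is None:
        # Longest dictionary entry bounds the window; never go below 1.
        max_len = max(1, max((len(w) for w in dictionary), default=1))
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


words = {"北京", "大学", "北京大学", "发布", "报告"}
print(forward_max_match("北京大学发布报告", words))
```

Because the match is greedy, "北京大学" is kept as one word instead of being split into "北京" and "大学", which shows both the strength and the known limitation of dictionary-driven matching.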
6. The method according to any one of claims 1-5, wherein performing entity category tagging on each of the plurality of segmented words comprises:
using a sequence labeling model to identify the entity words among the plurality of segmented words and to label their entity categories.
7. The method according to any one of claims 1-6, wherein the entity category includes any one of the following:
person name, place name, organization name, brand name, software name.
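A sequence labeling model typically emits one tag per word, and entity spans are then read off those tags. The sketch below shows only that decoding step, assuming a BIO tagging scheme and hypothetical category codes such as `ORG` (organization name) and `SOFT` (software name); the source names neither the model nor the tagging scheme, and the tags are supplied by hand here rather than predicted.

```python
from typing import List, Optional, Tuple


def tags_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Convert per-word BIO tags into (entity_text, category) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    category: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any, and reset the buffer.
        nonlocal words, category
        if words and category is not None:
            entities.append(("".join(words), category))
        words, category = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()                      # a new entity begins
            words, category = [token], tag[2:]
        elif tag.startswith("I-") and category == tag[2:]:
            words.append(token)          # continue the current entity
        else:
            flush()                      # "O" or an inconsistent I- tag ends it
    flush()
    return entities
```

An `I-` tag that does not continue an open entity of the same category is treated like `O` in this sketch; other decoders repair such tags instead, which is an equally valid design choice.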
8. A news headline summarization device, comprising:
an acquisition module, adapted to obtain an original title of a news item;
an analysis module, adapted to perform lexical and syntactic analysis on the original title to obtain an analysis result;
an extraction module, adapted to extract, based on the analysis result, the sentence backbone content of the original title, and to use the extracted sentence backbone content as a candidate news title;
a determining module, adapted to assess the quality of the candidate news title using a summary quality evaluation strategy for news headlines, and to determine a news summary title according to the assessment result.
9. A computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the news headline summarization method according to any one of claims 1-7.
10. A computing device, comprising:
a processor; and
a memory storing computer program code;
wherein the computer program code, when run by the processor, causes the computing device to execute the news headline summarization method according to any one of claims 1-7.
CN201810247766.8A 2018-03-23 2018-03-23 The method of abstracting and device of headline Pending CN108491512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810247766.8A CN108491512A (en) 2018-03-23 2018-03-23 The method of abstracting and device of headline

Publications (1)

Publication Number Publication Date
CN108491512A true CN108491512A (en) 2018-09-04

Family

ID=63319650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810247766.8A Pending CN108491512A (en) 2018-03-23 2018-03-23 The method of abstracting and device of headline

Country Status (1)

Country Link
CN (1) CN108491512A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8051446B1 (en) * 1999-12-06 2011-11-01 Sharp Laboratories Of America, Inc. Method of creating a semantic video summary using information from secondary sources
CN102411621A (en) * 2011-11-22 2012-04-11 华中师范大学 Chinese inquiry oriented multi-document automatic abstraction method based on cloud mode
CN103699663A (en) * 2013-12-27 2014-04-02 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base
CN103838870A (en) * 2014-03-21 2014-06-04 武汉科技大学 News atomic event extraction method based on information unit fusion
CN106155999A (en) * 2015-04-09 2016-11-23 科大讯飞股份有限公司 Semantics comprehension on natural language method and system
CN105138507A (en) * 2015-08-06 2015-12-09 电子科技大学 Pattern self-learning based Chinese open relationship extraction method
CN105760546A (en) * 2016-03-16 2016-07-13 广州索答信息科技有限公司 Automatic generating method and device for Internet headlines
CN107038229A (en) * 2017-04-07 2017-08-11 云南大学 A kind of use-case extracting method based on natural semantic analysis
CN107656921A (en) * 2017-10-10 2018-02-02 上海数眼科技发展有限公司 A kind of short text dependency analysis method based on deep learning

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909021A (en) * 2018-09-12 2020-03-24 北京奇虎科技有限公司 Construction method and device of query rewriting model and application thereof
CN109471933A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 A kind of generation method of text snippet, storage medium and server
CN109829161A (en) * 2019-01-30 2019-05-31 延边大学 A kind of method of multilingual autoabstract
CN109829161B (en) * 2019-01-30 2023-08-04 延边大学 Method for automatically abstracting multiple languages
CN110516227A (en) * 2019-03-28 2019-11-29 苏州八叉树智能科技有限公司 Title text generation method, device, electronic equipment and computer-readable medium
CN110287491A (en) * 2019-06-25 2019-09-27 北京百度网讯科技有限公司 Event name generation method and device
CN110287491B (en) * 2019-06-25 2024-01-12 北京百度网讯科技有限公司 Event name generation method and device
CN113496118A (en) * 2020-04-07 2021-10-12 北京中科闻歌科技股份有限公司 News subject identification method, equipment and computer readable storage medium
CN112084381A (en) * 2020-09-11 2020-12-15 广东电网有限责任公司 Event extraction method, system, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN108491512A (en) The method of abstracting and device of headline
CN110825876B (en) Movie comment viewpoint emotion tendency analysis method
CN108460150A (en) The processing method and processing device of headline
US7461056B2 (en) Text mining apparatus and associated methods
CN108538286A (en) A kind of method and computer of speech recognition
Kim et al. Interpreting semantic relations in noun compounds via verb semantics
CN108470026A (en) The sentence trunk method for extracting content and device of headline
CN108399265A (en) Real-time hot news providing method based on search and device
KR20160026892A (en) Non-factoid question-and-answer system and method
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
CN108363700A (en) The method for evaluating quality and device of headline
JP2006004417A (en) Method and device for recognizing specific type of information file
Refaee et al. Subjectivity and sentiment analysis of arabic twitter feeds with limited resources
Ferschke et al. FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia.
CN109726274A (en) Problem generation method, device and storage medium
CN102253930A (en) Method and device for translating text
CN110472203B (en) Article duplicate checking and detecting method, device, equipment and storage medium
CN112015721A (en) E-commerce platform storage database optimization method based on big data
CN111353306A (en) Entity relationship and dependency Tree-LSTM-based combined event extraction method
CN109657064A (en) A kind of file classification method and device
CN110134934A (en) Text emotion analysis method and device
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
CN104778157A (en) Multi-document abstract sentence generating method
CN111475651B (en) Text classification method, computing device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180904