CN110377808A - Document processing method, device, electronic equipment and storage medium - Google Patents
Document processing method, device, electronic equipment and storage medium
- Publication number
- CN110377808A (application CN201910517936.4A)
- Authority
- CN
- China
- Prior art keywords
- news documents
- news
- text
- document
- documents
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure relates to a document processing method, a device, electronic equipment and a storage medium. The method includes: obtaining at least one news document set, in units of a preset duration, corresponding to an event keyword; determining relevance scores for multiple news documents, based on the event keyword and the multiple news documents in the at least one news document set; extracting, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, where N is a positive integer greater than or equal to 1; and determining, according to the document text of the top N news documents, the corresponding summary text, and using the summary text as the abstract of the top N news documents. The disclosure avoids information redundancy and requires no manual participation. Because the corresponding abstracts are extracted, subsequent public opinion monitoring or information integration can be carried out without manually reading news items one by one, which reduces labor cost.
Description
Technical field
The present disclosure relates to the technical field of news document processing, and in particular to a document processing method, a device, electronic equipment and a storage medium.
Background
With the rapid development of Internet technology, the network has become an indispensable part of people's lives.
On the Internet, all kinds of trending news events occur every day and appear on the trending search lists of the major platforms (such as Baidu, Weibo, Zhihu and Toutiao). However, many trending events are only a projection of an entire event storyline at one moment in time, not the complete story.
At present, text is usually integrated with a scheme that combines a crawler with event keywords. First, crawler technology is used to capture the keywords of a hot event. Then, methods such as filtering and cleaning are used to extract the news that contains the keywords. Finally, multiple news items are tested manually, and the current ranking algorithm is evaluated based on the NDCG (Normalized Discounted Cumulative Gain) value in order to rank the news and obtain the most relevant news events.
In the above scheme, news is crawled by keyword at a certain time node, and the news gathered from the whole network covers only what occurred at that time node, so there is great information redundancy. Manual ranking also increases human resource cost and is inefficient. In addition, mainstream news currently carries only a title, so subsequent public opinion monitoring or information integration requires manually reading news items one by one, which is time-consuming and takes considerable effort.
Summary
To overcome the problems in the related art, embodiments of the present disclosure provide a document processing method, a device, electronic equipment and a storage medium.
According to a first aspect of the embodiments of the present disclosure, a document processing method is provided, including: obtaining at least one news document set, in units of a preset duration, corresponding to an event keyword; determining relevance scores for multiple news documents, based on the event keyword and the multiple news documents in the at least one news document set; extracting, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, where N is a positive integer greater than or equal to 1; and determining, according to the document text of the top N news documents, the summary text corresponding to the top N news documents, and using the summary text as the abstract of the top N news documents.
In one specific implementation of the present disclosure, the step of obtaining at least one news document set, in units of a preset duration, corresponding to an event keyword includes: determining, based on the news event corresponding to the event keyword, popularity weights of multiple preset durations associated with the news event; extracting, from the multiple preset durations, at least one target preset duration whose popularity weight is greater than a weight threshold; and obtaining, based on the event keyword, the news document set within the at least one target preset duration.
In one specific implementation of the present disclosure, the step of determining relevance scores for the multiple news documents, based on the event keyword and the multiple news documents in the at least one news document set, includes: segmenting the event keyword to obtain at least one first segmented word; calculating at least one first term frequency of the at least one first segmented word in the multiple news documents, and at least one first inverse document frequency of the at least one first segmented word across all news document sets; for all the news documents, segmenting each news document to obtain multiple second segmented words; calculating multiple second term frequencies of the multiple second segmented words in each news document, and multiple second inverse document frequencies of the multiple second segmented words across all news document sets; constructing, according to the multiple second term frequencies and the multiple second inverse document frequencies, a document matrix corresponding to each news document; and determining the relevance score of each news document according to the at least one first term frequency, the at least one first inverse document frequency and the document matrix.
In one specific implementation of the present disclosure, the step of determining the relevance score of each news document according to the at least one first term frequency, the at least one first inverse document frequency and the document matrix includes: calculating a similarity value between the at least one first term frequency together with the at least one first inverse document frequency, and the news document corresponding to each document matrix; and using the similarity value as the relevance score of each news document.
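The similarity calculation described above can be sketched as follows. The disclosure does not name a specific similarity measure, so cosine similarity is used here purely as an assumption; the function name, the sparse dictionary representation and the sample weights are all hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector shares nothing with any document
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights for the segmented event keyword.
keyword_vec = {"earthquake": 1.2, "rescue": 0.8}

# Hypothetical rows of the document matrix, one per news document.
doc_vecs = {
    "doc1": {"earthquake": 0.9, "rescue": 0.6, "weather": 0.1},
    "doc2": {"football": 1.0, "weather": 0.4},
}

# The similarity value is used directly as the relevance score.
relevance = {doc: cosine_similarity(keyword_vec, vec) for doc, vec in doc_vecs.items()}
```

A document that shares no segmented word with the keyword scores 0, and a document whose weights point in nearly the same direction as the keyword vector scores close to 1.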
In one specific implementation of the present disclosure, the step of determining the summary text corresponding to the top N news documents according to the document text of the top N news documents includes: inputting the document text of the top N news documents into a pre-trained summary text network model; and receiving the summary text, corresponding to the top N news documents, output by the summary text network model.
In one specific implementation of the present disclosure, the step of inputting the document text of the top N news documents into the pre-trained summary text network model includes: for the top N news documents, splitting the document text of each news document in turn by sentence to obtain multiple sentence-format document texts; and inputting the multiple sentence-format document texts into the summary text network model. The step of receiving the summary text, corresponding to the top N news documents, output by the summary text network model includes: receiving multiple sentence-format summary texts, corresponding to the sentence-format document texts, output by the summary text network model; and merging the multiple sentence-format summary texts to obtain the summary text.
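The split, summarize and merge flow described above can be sketched as follows. The pre-trained summary text network model is not specified in the disclosure, so it is represented here by a hypothetical callable `model` that maps one sentence to one summary fragment; the sentence-splitting rule and the truncating stand-in model are likewise assumptions for illustration only.

```python
import re

# Sentence terminators for both Western and Chinese punctuation (an assumption).
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")

def split_sentences(text):
    """Split document text into sentence-format pieces, dropping empty ones."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

def summarize(document_text, model):
    """Feed each sentence to the model, then merge the per-sentence outputs."""
    sentences = split_sentences(document_text)
    fragments = [model(sentence) for sentence in sentences]
    return " ".join(f for f in fragments if f)

def truncating_model(sentence, max_words=4):
    """Stand-in for a trained network: keeps only the first few words."""
    return " ".join(sentence.split()[:max_words])

summary = summarize(
    "Floods hit the northern city on Monday. Rescue teams arrived within hours!",
    truncating_model,
)
```

Any trained model with the same one-sentence-in, one-fragment-out interface could be passed in place of `truncating_model`.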
According to a second aspect of the embodiments of the present disclosure, a document processing device is provided, including: a news document set obtaining module, configured to obtain at least one news document set, in units of a preset duration, corresponding to an event keyword; a relevance score determining module, configured to determine relevance scores for multiple news documents, based on the event keyword and the multiple news documents in the at least one news document set; a news document extraction module, configured to extract, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, where N is a positive integer greater than or equal to 1; and an abstract determining module, configured to determine, according to the document text of the top N news documents, the summary text corresponding to the top N news documents, and to use the summary text as the abstract of the top N news documents.
In one specific implementation of the present disclosure, the news document set obtaining module includes: a popularity weight determining submodule, configured to determine, based on the news event corresponding to the event keyword, popularity weights of multiple preset durations associated with the news event; a target duration extracting submodule, configured to extract, from the multiple preset durations, at least one target preset duration whose popularity weight is greater than a weight threshold; and a news document set obtaining submodule, configured to obtain, based on the event keyword, the news document set within the at least one target preset duration.
In one specific implementation of the present disclosure, the relevance score determining module includes: a first segmented word obtaining submodule, configured to segment the event keyword to obtain at least one first segmented word; a first term frequency calculating submodule, configured to calculate at least one first term frequency of the at least one first segmented word in the multiple news documents, and at least one first inverse document frequency of the at least one first segmented word across all news document sets; a second segmented word obtaining submodule, configured to segment each news document, for all the news documents, to obtain multiple second segmented words; a second term frequency calculating submodule, configured to calculate multiple second term frequencies of the multiple second segmented words in each news document, and multiple second inverse document frequencies of the multiple second segmented words across all news document sets; a document matrix constructing submodule, configured to construct, according to the multiple second term frequencies and the multiple second inverse document frequencies, a document matrix corresponding to each news document; and a relevance score determining submodule, configured to determine the relevance score of each news document according to the at least one first term frequency, the at least one first inverse document frequency and the document matrix.
In one specific implementation of the present disclosure, the relevance score determining submodule includes: a similarity value calculating submodule, configured to calculate a similarity value between the at least one first term frequency together with the at least one first inverse document frequency, and the news document corresponding to each document matrix; and a relevance score obtaining submodule, configured to use the similarity value as the relevance score of each news document.
In one specific implementation of the present disclosure, the abstract determining module includes: a document text input submodule, configured to input the document text of the top N news documents into a pre-trained summary text network model; and a summary text receiving submodule, configured to receive the summary text, corresponding to the top N news documents, output by the summary text network model.
In one specific implementation of the present disclosure, the document text input submodule includes: a sentence-format text obtaining submodule, configured to split, for the top N news documents, the document text of each news document in turn by sentence to obtain multiple sentence-format document texts; and a sentence-format text input submodule, configured to input the multiple sentence-format document texts into the summary text network model. The summary text receiving submodule includes: a sentence-format summary receiving submodule, configured to receive multiple sentence-format summary texts, corresponding to the sentence-format document texts, output by the summary text network model; and a summary text obtaining submodule, configured to merge the multiple sentence-format summary texts to obtain the summary text.
According to a third aspect of the embodiments of the present disclosure, electronic equipment is provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the document processing method of any of the above embodiments.
According to a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is also provided; when instructions in the storage medium are executed by a processor of a first terminal, the first terminal is enabled to perform the document processing method of any of the above.
The technical solutions provided by the embodiments of the present disclosure can include the following beneficial effects:
The embodiments of the present disclosure provide a document processing method that obtains at least one news document set, in units of a preset duration, corresponding to an event keyword; determines relevance scores for multiple news documents, based on the event keyword and the multiple news documents in the at least one news document set; extracts, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, where N is a positive integer greater than or equal to 1; and determines, according to the document text of the top N news documents, the corresponding summary text, using the summary text as the abstract of the top N news documents. The embodiments of the present disclosure can extract news document sets in units of a preset duration (such as a year or a season), score every news document automatically, and extract the top N highest-scoring news documents according to the relevance score of each news document, which avoids information redundancy and requires no manual participation. In addition, the corresponding abstracts can be extracted for the top N news documents, so that subsequent public opinion monitoring or information integration can be carried out without manually reading news items one by one, which reduces labor cost.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present disclosure.
Brief description of the drawings
The drawings herein are incorporated into and form part of this specification, show embodiments consistent with the present disclosure, and together with the specification serve to explain the principles of the present disclosure.
Fig. 1 is a flowchart of the steps of a document processing method according to an exemplary embodiment;
Fig. 2 is a flowchart of the steps of a document processing method according to an exemplary embodiment;
Fig. 3 is a block diagram of a document processing device according to an exemplary embodiment;
Fig. 4 is a block diagram of a document processing device according to an exemplary embodiment;
Fig. 5 is a block diagram of a document processing device according to an exemplary embodiment;
Fig. 6 is a block diagram of a document processing device according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Embodiment one
Fig. 1 is a flowchart of the steps of a document processing method according to an exemplary embodiment. As shown in Fig. 1, the document processing method includes the following steps:
In step S11, at least one news document set, in units of a preset duration, corresponding to an event keyword is obtained.
In the embodiments of the present disclosure, the event keyword is the keyword used to search for news documents. It can be a keyword entered by a user and extracted from a current trending news event.
A news document is a document, corresponding to a news event, that is found by searching with the event keyword.
The preset duration is a length of time, for example a year or a season.
A news document set is formed by grouping the retrieved news documents in units of the preset duration. For example, the news documents found for 2019 form one news document set and the news documents found for 2018 form another; alternatively, the news documents from spring 2019 form one news document set and the news documents from winter 2019 form another, and so on.
It should be understood that the above examples are given only for a better understanding of the solutions of the embodiments of the present disclosure and are not the only limitation on the embodiments of the present disclosure.
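Grouping retrieved news documents into sets by preset duration can be sketched as follows. This is a minimal illustration under stated assumptions: each document is represented as a hypothetical (title, publication date) pair, and calendar quarters stand in for the seasons mentioned in the text.

```python
from collections import defaultdict
from datetime import date

def period_key(published, unit):
    """Map a publication date to the preset-duration bucket it falls in."""
    if unit == "year":
        return (published.year,)
    if unit == "quarter":  # calendar quarter, used here in place of a season
        return (published.year, (published.month - 1) // 3 + 1)
    raise ValueError(f"unsupported unit: {unit}")

def group_by_period(docs, unit="year"):
    """Group (title, published) pairs into one news document set per period."""
    sets = defaultdict(list)
    for title, published in docs:
        sets[period_key(published, unit)].append(title)
    return dict(sets)

# Hypothetical search results for one event keyword.
docs = [
    ("report A", date(2019, 3, 14)),
    ("report B", date(2019, 11, 2)),
    ("report C", date(2018, 7, 30)),
]

by_year = group_by_period(docs, "year")
by_quarter = group_by_period(docs, "quarter")
```

With yearly units the two 2019 reports land in the same set; with quarterly units each report lands in its own set.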
After the event keyword entered by the user is obtained, web crawler technology can be used to search for the news documents within a preset duration that correspond to the event keyword, and the news documents within that preset duration are then combined into one news document set.
After at least one news document set, in units of a preset duration, corresponding to the event keyword is obtained, step S12 is performed.
In step S12, relevance scores for the multiple news documents are determined based on the event keyword and the multiple news documents in the at least one news document set.
The relevance score is the degree of association between each news document and the event keyword. A higher relevance score indicates a stronger association between the news document and the event keyword. For example, suppose the news documents include news 1, news 2, news 3 and news 4, with relevance scores of 0.2, 0.5, 0.7 and 0.6 respectively. Then news 3 is the most strongly associated with the event keyword and news 1 the least.
It should be understood that the above example is given only for a better understanding of the technical solutions of the embodiments of the present disclosure and is not the only limitation on the embodiments of the present disclosure.
After the event keyword and the at least one news document set are obtained, the event keyword can be segmented. Then, for each segmented word, at least one term frequency in the multiple news documents of the at least one news document set is obtained, together with at least one inverse document frequency of each segmented word across all news document sets.
In addition, for all the news documents, each news document is segmented to obtain multiple segmented words. The term frequencies of these segmented words in each news document are combined with their inverse document frequencies across all news document sets to construct a document matrix corresponding to each news document.
Then the at least one term frequency and the at least one inverse document frequency of the segmented words of the event keyword are combined with the document matrix of each news document to calculate the relevance score of each news document.
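The construction of the document matrix from term frequencies and inverse document frequencies can be sketched as follows. The disclosure does not give the exact formulas, so this sketch assumes a relative term frequency and a smoothed inverse document frequency; the function names are hypothetical, and the documents are assumed to be segmented already.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Relative frequency of each segmented word within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()} if total else {}

def inverse_document_frequencies(corpus):
    """Smoothed IDF over all documents (the smoothing is an assumption)."""
    n_docs = len(corpus)
    doc_freq = Counter(t for tokens in corpus for t in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def build_document_matrix(corpus):
    """One sparse TF-IDF row per news document."""
    idf = inverse_document_frequencies(corpus)
    matrix = [
        {t: f * idf[t] for t, f in term_frequencies(tokens).items()}
        for tokens in corpus
    ]
    return matrix, idf

# Hypothetical, already-segmented news documents.
corpus = [
    ["flood", "rescue", "flood"],
    ["flood", "election"],
    ["election", "vote"],
]
matrix, idf = build_document_matrix(corpus)
```

Segmented words that appear in fewer documents receive a larger inverse document frequency, so "rescue" outweighs "flood" per occurrence in this sample.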
The process of calculating the relevance score of each news document is described in detail in Embodiment two below and is not elaborated here.
After the relevance score of each news document is determined based on the event keyword and the multiple news documents in the at least one news document set, step S13 is performed.
In step S13, the top N news documents with the highest scores are extracted from the multiple news documents according to the relevance scores.
In the embodiments of the present disclosure, N can be a positive integer greater than or equal to 1, for example 1, 3, 8 or 12. The value can be chosen according to the actual situation, and the embodiments of the present disclosure place no restriction on it.
After the relevance score of each news document in the at least one news document set is obtained, the N highest-scoring news documents can be extracted from the multiple news documents according to those scores. For example, suppose N is 3 and the news documents include news a, news b, news c, news d and news e, with relevance scores of 0.9, 0.2, 0.6, 0.8 and 0.9 respectively. When the 3 highest-scoring news documents are extracted, news a, news e and news d are selected.
It should be understood that the above example is given only for a better understanding of the embodiments of the present disclosure and is not the only limitation on the embodiments of the present disclosure.
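The top N extraction in the example above can be sketched as follows, using the five scores given in the text. The function name is hypothetical; ties are resolved by keeping the original order, which is one reasonable choice among several.

```python
def top_n(scores, n):
    """Return the n documents with the highest relevance scores.

    Python's sort is stable, so documents with equal scores keep their
    original relative order.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:n]]

# Scores from the example: news a through news e.
scores = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.8, "e": 0.9}
top3 = top_n(scores, 3)  # ["a", "e", "d"]
```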
After the top N news documents with the highest scores are extracted from the multiple news documents according to the relevance scores, step S14 is performed.
In step S14, the summary text corresponding to the top N news documents is determined according to the document text of the top N news documents, and the summary text is used as the abstract of the top N news documents.
The summary text is a general description of each news document. It is equivalent to the abstract of the news, and reading it gives a rough understanding of what each news document describes.
In the present disclosure, after the top N news documents with the highest scores are extracted from the multiple news documents, they can be input into the summary text network model, which outputs the summary text corresponding to each news document. The summary text is then used as the abstract of the top N news documents.
By extracting news document sets in units of a preset duration, scoring every news document automatically, and extracting the top N highest-scoring news documents according to the relevance score of each news document, the embodiments of the present disclosure avoid information redundancy and require no manual participation. In addition, the corresponding abstracts can be extracted for the top N news documents, so that subsequent public opinion monitoring or information integration can be carried out without manually reading news items one by one.
The document processing method provided by the embodiments of the present disclosure obtains at least one news document set, in units of a preset duration, corresponding to an event keyword; determines relevance scores for multiple news documents, based on the event keyword and the multiple news documents in the at least one news document set; extracts, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, where N is a positive integer greater than or equal to 1; and determines, according to the document text of the top N news documents, the corresponding summary text, using the summary text as the abstract of the top N news documents. The embodiments of the present disclosure can extract news document sets in units of a preset duration (such as a year or a season), score every news document automatically, and extract the top N highest-scoring news documents according to the relevance score of each news document, which avoids information redundancy and requires no manual participation. In addition, the corresponding abstracts can be extracted for the top N news documents, so that subsequent public opinion monitoring or information integration can be carried out without manually reading news items one by one, which reduces labor cost.
Embodiment two
Fig. 2 is a flowchart of the steps of a document processing method according to an exemplary embodiment. As shown in Fig. 2, the document processing method includes the following steps:
In step S21, popularity weights of multiple preset durations associated with the news event are determined based on the news event corresponding to the event keyword.
In the embodiments of the present disclosure, the event keyword is the keyword used to search for news documents. It can be a keyword entered by a user and extracted from a current trending news event.
The popularity weight indicates how popular the news event corresponding to the event keyword was in each preset duration. It can be a weight value that a program preset by developers derives from searches of Internet data, and it represents the differing popularity of the news event across preset durations. For example, when the preset duration is a year and event a was more popular in 2019 than in 2018, a higher popularity weight can be set for 2019 and a popularity weight lower than that of 2019 can be set for 2018.
The popularity weight can also be obtained by having a computer program search for the news related to the news event within each preset duration and derive the popularity weight from the results.
Of course, in practical applications those skilled in the art can also use other approaches to obtain the popularity weight of each preset duration associated with the news event. The approach can be chosen according to the actual situation, and the embodiments of the present disclosure place no restriction on it.
After the popularity weights of the multiple preset durations associated with the news event are determined based on the news event corresponding to the event keyword, step S22 is performed.
In step S22, at least one target preset duration whose popularity weight is greater than a weight threshold is extracted from the multiple preset durations.
The weight threshold is a threshold for the popularity weight that developers set in advance as needed.
A target preset duration is a preset duration, associated with the news event, whose popularity weight is greater than the weight threshold. For example, take a year as the preset duration, with the years 2019, 2018, 2017 and 2016 associated with news event A. Suppose the popularity weights of these four years are 0.8 for 2019, 0.7 for 2018, 0.6 for 2017 and 0.5 for 2016, and the weight threshold is 0.6. The years whose weight is greater than the threshold are then 2018 and 2019, so 2018 and 2019 are taken as the target years (that is, the target preset durations).
It should be understood that the above example is given only for a better understanding of the technical solutions of the embodiments of the present disclosure and is not the only limitation on the embodiments of the present disclosure.
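The threshold filter in the example above can be sketched as follows, using the weights and threshold given in the text. The function name is hypothetical, and the comparison is strict because the text selects durations whose weight is greater than the threshold.

```python
def select_target_periods(weights, threshold):
    """Keep the preset durations whose popularity weight exceeds the threshold."""
    return sorted(period for period, weight in weights.items() if weight > threshold)

# Popularity weights of news event A, by year, from the example.
weights = {2019: 0.8, 2018: 0.7, 2017: 0.6, 2016: 0.5}
targets = select_target_periods(weights, 0.6)  # [2018, 2019]
```

The year 2017 is excluded because its weight equals the threshold rather than exceeding it.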
After at least one target preset time period whose temperature weight is greater than the weight threshold has been extracted from the multiple preset time periods, step S23 is executed.
In step S23, the news document set within the at least one target preset time period is obtained based on the event keyword.
After the target preset time periods have been determined, a search can be performed with the event keyword to obtain the news documents associated with the event keyword within each target preset time period. All news documents within one target preset time period are then treated as one news document set.
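The grouping of step S23 can be sketched as follows. This is only an illustration under stated assumptions: the documents are held in memory as dictionaries with hypothetical `year` and `text` fields, and a plain substring test stands in for whatever search the system actually uses.

```python
from collections import defaultdict


def build_document_sets(documents, keyword, target_years):
    """Group documents that mention the keyword by year, keeping only target years."""
    sets = defaultdict(list)
    for doc in documents:
        if doc["year"] in target_years and keyword in doc["text"]:
            sets[doc["year"]].append(doc)
    return dict(sets)


# Hypothetical sample data; the field names are assumptions for this sketch.
documents = [
    {"year": 2019, "text": "joint exercise held at sea"},
    {"year": 2018, "text": "joint exercise announced"},
    {"year": 2017, "text": "joint exercise postponed"},
    {"year": 2019, "text": "unrelated sports report"},
]
doc_sets = build_document_sets(documents, "joint exercise", {2018, 2019})
```

Each key of `doc_sets` is one target preset time period, and each value is the news document set for that period.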
After the news document set within the at least one target preset time period has been obtained based on the event keyword, step S24 is executed.
In step S24, the event keyword is segmented to obtain at least one first participle.
A first participle is a participle (word segment) obtained by segmenting the event keyword. For example, if the event keyword is "Kung Pao chicken and leek egg", segmentation yields "Kung Pao chicken", "and" and "leek egg". The word "and" is a conjunction and can simply be ignored, so "Kung Pao chicken" and "leek egg" are taken as the first participles.
The segmentation may use any word segmentation technique common in the prior art. The choice may depend on business requirements, and the embodiments of the present disclosure impose no limitation on it.
After the event keyword has been segmented to obtain at least one first participle, step S25 is executed.
In step S25, at least one first word frequency of the at least one first participle in the multiple news documents is calculated, together with at least one first inverse document frequency of the at least one first participle in all news document sets.
The first word frequency (Term Frequency, TF) is the frequency with which a first participle occurs in each news document. For example, if the first participle is participle a and the news documents are document 1 and document 2, the first word frequencies are the frequency with which participle a occurs in document 1 and the frequency with which it occurs in document 2. The first word frequency can be obtained by dividing the number of times the first participle occurs in a news document by the total number of words in that document.
The first inverse document frequency (Inverse Document Frequency, IDF) measures how widely a first participle occurs across all news document sets, that is, across all news documents. It can be obtained by dividing the total number of news documents by the number of news documents that contain the first participle, and then taking the logarithm of the quotient.
It should be understood that the calculation of word frequency and inverse document frequency is a mature technique in the art, and the embodiments of the present disclosure do not describe it in further detail here.
After the event keyword has been segmented to obtain at least one first participle, the first word frequency of each first participle in each news document can be calculated, as well as the first inverse document frequency of each first participle in all news document sets.
It should be understood that each first participle has one first word frequency in each news document, and one first inverse document frequency over all news document sets.
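The two quantities of step S25 follow directly from the definitions given: word frequency is the occurrence count divided by the document length, and inverse document frequency is the logarithm of the total document count divided by the number of documents containing the participle. The sketch below assumes that documents are already segmented into lists of participles; the function names, the sample corpus and the choice to return 0.0 when no document contains the participle are assumptions made for illustration.

```python
import math


def term_frequency(term, tokens):
    """Occurrences of the term divided by the total number of tokens in the document."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def inverse_document_frequency(term, corpus):
    """log(total documents / documents containing the term); 0.0 if none contains it."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing) if containing else 0.0


# Hypothetical corpus of four already-segmented news documents.
corpus = [
    ["joint", "exercise", "held", "at", "sea"],
    ["exercise", "announced"],
    ["sports", "report"],
    ["weather", "report", "today"],
]
first_tf = [term_frequency("exercise", doc) for doc in corpus]  # one value per document
first_idf = inverse_document_frequency("exercise", corpus)      # one value per corpus
```

This mirrors the statement that each first participle has one word frequency per news document but a single inverse document frequency over all news document sets.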
In step S26, for all of the news documents, each news document is segmented to obtain multiple second participles.
A second participle is a participle obtained by segmenting the document text of a news document. It can be understood that each news document contains multiple words, so segmenting each news document yields the multiple second participles of that news document.
After each of the news documents has been segmented to obtain multiple second participles, step S27 is executed.
In step S27, multiple second word frequencies of the multiple second participles in each news document are calculated, together with multiple second inverse document frequencies of the multiple second participles in all news document sets.
The second word frequency is the frequency with which a second participle of a news document occurs in that news document. It is calculated in a similar way to the first word frequency described above, so the calculation of the first word frequency may be referred to, and the embodiments of the present disclosure do not repeat it here.
The second inverse document frequency measures how widely a second participle occurs across all news document sets, that is, across all news documents. It can be obtained by dividing the total number of news documents by the number of news documents that contain the second participle, and then taking the logarithm of the quotient.
After each news document has been segmented to obtain multiple second participles, the second word frequency of each second participle in its news document can be calculated, as well as the second inverse document frequency of each second participle over all news documents. Step S28 is then executed.
In step S28, the document matrix corresponding to each news document is constructed according to the multiple second word frequencies and the multiple second inverse document frequencies.
A document matrix is the matrix corresponding to one news document. It is constructed from the second word frequencies and the second inverse document frequencies of the multiple second participles in that news document.
After the second word frequency of each second participle in its news document and the second inverse document frequency of each second participle have been calculated, a document matrix can be constructed for each news document by combining the second word frequency and the second inverse document frequency of each second participle in that news document.
The document matrix may be constructed using any matrix construction scheme common in the prior art, and the embodiments of the present disclosure do not describe it in further detail here.
The document matrix reflects the frequency with which each participle of a news document occurs in that news document, and the inverse document frequency of each participle over all news documents in the at least one news document set.
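One possible layout for the document matrix of step S28 is sketched below. The text leaves the construction scheme open, so the representation chosen here (one mapping per news document from participle to a pair of word frequency and inverse document frequency) is an assumption, as are the function name and the sample corpus.

```python
import math
from collections import Counter


def build_document_matrices(corpus):
    """Return, for each document, a mapping participle -> (word frequency, IDF)."""
    n_docs = len(corpus)
    # Number of documents in which each participle occurs at least once.
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    matrices = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        matrices.append({
            term: (count / total, math.log(n_docs / doc_freq[term]))
            for term, count in counts.items()
        })
    return matrices


# Hypothetical corpus of three already-segmented news documents.
corpus = [
    ["exercise", "held", "exercise"],
    ["exercise", "announced"],
    ["sports", "report"],
]
matrices = build_document_matrices(corpus)
```

Each entry in `matrices` has one row per second participle of the document, holding that participle's second word frequency and second inverse document frequency.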
In step S29, the relevance score of each news document is determined according to the at least one first word frequency, the at least one first inverse document frequency and the document matrix.
Once the at least one first word frequency, the at least one first inverse document frequency and the document matrices have been obtained in the above process, the relevance score of each news document can be determined from them.
The process of determining the relevance score of each news document is described in detail in the following specific implementation.
In a specific implementation of the present disclosure, the above step S29 may include:
Sub-step A1: calculating the similarity value between the at least one first word frequency and the at least one first inverse document frequency, on the one hand, and the news document corresponding to each document matrix, on the other.
The similarity value is the similarity between the word frequency and inverse document frequency of the event keyword and the news document corresponding to a document matrix, that is, the similarity between the event keyword and each news document.
After the first word frequency and the first inverse document frequency corresponding to the at least one first participle of the event keyword have been calculated, the at least one first participle can be matched against the multiple second participles of a news document to find one or more second participles that are similar or identical to it. The first word frequency and first inverse document frequency of the first participle are then combined with the second word frequency and second inverse document frequency recorded in the document matrix of that news document for those similar or identical second participles. Based on the similarity between the first participle and the second participles, and on the first word frequency, first inverse document frequency, second word frequency and second inverse document frequency, a first similarity value between the at least one word frequency and each document matrix and a second similarity value between the at least one inverse document frequency and each document matrix are calculated. The similarity value between the event keyword and the news document can then be obtained by multiplying the first similarity value and the second similarity value by their respective weights and summing the products.
After the similarity value of the news documents has been calculated, sub-step A2 is executed.
Sub-step A2: taking the similarity value as the relevance score of each news document.
After the similarity value between each news document and the event keyword has been calculated, the similarity value can be taken as the relevance score of that news document.
It should be understood that the above is only one way of calculating the relevance score of a news document, given for a better understanding of the technical solution of the embodiments of the present disclosure. In a concrete implementation, those skilled in the art may calculate the relevance score in other ways, and the embodiments of the present disclosure impose no limitation on this.
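The weighted combination described in sub-steps A1 and A2 can be sketched as follows. The text does not give an exact formula, so several choices here are assumptions made purely for illustration: only identical participles are matched, each similarity is computed as a sum of products over the matched participles, and the two weights default to 0.5. The function and parameter names are hypothetical.

```python
def relevance_score(query_stats, doc_matrix, tf_weight=0.5, idf_weight=0.5):
    """Weighted sum of a word-frequency similarity and an IDF similarity.

    query_stats: first participle -> (first word frequency, first IDF)
    doc_matrix:  second participle -> (second word frequency, second IDF)
    """
    tf_similarity = 0.0
    idf_similarity = 0.0
    for term, (query_tf, query_idf) in query_stats.items():
        if term in doc_matrix:  # assumption: only identical participles match
            doc_tf, doc_idf = doc_matrix[term]
            tf_similarity += query_tf * doc_tf
            idf_similarity += query_idf * doc_idf
    return tf_weight * tf_similarity + idf_weight * idf_similarity


# Hypothetical values for one keyword participle and two news documents.
query_stats = {"exercise": (0.5, 0.4)}
related_doc = {"exercise": (0.5, 0.4), "held": (0.25, 1.1)}
unrelated_doc = {"sports": (0.5, 1.1), "report": (0.5, 0.4)}
score_related = relevance_score(query_stats, related_doc)
score_unrelated = relevance_score(query_stats, unrelated_doc)
```

A document that shares no participle with the event keyword scores 0.0 under this sketch, so it ranks below any document that mentions the keyword.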
After the relevance score of each news document has been determined according to the at least one first word frequency, the at least one first inverse document frequency and the document matrix, step S210 is executed.
In step S210, the top N news documents with the highest scores are extracted from the multiple news documents according to the relevance scores.
In the embodiments of the present disclosure, N is a positive integer greater than or equal to 1, for example 1, 3, 8 or 12. The specific value may depend on the actual situation, and the embodiments of the present disclosure impose no limitation on it.
After the relevance score of each news document in the at least one news document set has been obtained, the N news documents with the highest scores can be extracted from the multiple news documents according to those scores. For example, suppose N is 3 and the news documents are news a, news b, news c, news d and news e, with relevance scores of 0.9, 0.2, 0.6, 0.8 and 0.9 respectively. When the top 3 news documents are extracted, news a, news e and news d are selected.
It should be understood that the above example is given only for a better understanding of the embodiments of the present disclosure, and is not the sole limitation on the embodiments of the present disclosure.
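The top N extraction of step S210 amounts to sorting by relevance score and keeping the first N entries. The sketch below uses the scores from the example (news a 0.9, news b 0.2, news c 0.6, news d 0.8, news e 0.9, N = 3); the function name and the dictionary layout are assumptions.

```python
def top_n_documents(scores, n):
    """Return the names of the n highest-scoring documents, best first."""
    # sorted() is stable, so documents with equal scores keep their input order.
    return sorted(scores, key=scores.get, reverse=True)[:n]


scores = {"news a": 0.9, "news b": 0.2, "news c": 0.6, "news d": 0.8, "news e": 0.9}
top3 = top_n_documents(scores, 3)
print(top3)  # ['news a', 'news e', 'news d']
```

How ties at the cut-off should be broken is not specified in the text; this sketch simply keeps the input order for equal scores.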
After the top N news documents with the highest scores have been extracted from the multiple news documents according to the relevance scores, step S211 is executed.
In step S211, the document text corresponding to the top N news documents is input into a pre-trained summary text network model.
The summary text network model is a network model for extracting the summary text of a document text.
After the top N news documents with the highest scores have been extracted from the multiple news documents, the document text corresponding to the top N news documents can be input into the summary text network model, which then outputs the summary text.
The input of the document text is described in detail in the following specific implementation.
In a specific implementation of the present disclosure, the above step S211 may include:
Sub-step B1: for the top N news documents, splitting the document text corresponding to each news document in turn by sentence format to obtain multiple sentence format file texts.
In the embodiments of the present disclosure, a sentence format file text is a sentence-level text obtained by splitting the document text by sentence format. For example, suppose the document text of a news document is: "The report points out that this is another joint exercise held by the Japan Maritime Self-Defense Force with a U.S. aircraft carrier in the South China Sea since last August. In this exercise, in addition to the 'Izumo', the Maritime Self-Defense Force also sent the escort vessels 'Murasame' and 'Akebono'. Together with the USS Ronald Reagan they formed a fleet and carried out tactical navigation drills." This document text contains three sentences. When split by sentence format, it yields "The report points out that this is another joint exercise held by the Japan Maritime Self-Defense Force with a U.S. aircraft carrier in the South China Sea since last August.", "In this exercise, in addition to the 'Izumo', the Maritime Self-Defense Force also sent the escort vessels 'Murasame' and 'Akebono'." and "Together with the USS Ronald Reagan they formed a fleet and carried out tactical navigation drills.", that is, each sentence is taken as one sentence format file text.
It should be understood that the above example is given only for a better understanding of the technical solution of the embodiments of the present disclosure, and is not the sole limitation on the embodiments of the present disclosure.
After the top N news documents have been obtained, the document text of each of the top N news documents can be split in turn by sentence format to obtain the multiple sentence format file texts of each news document.
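The sentence splitting of sub-step B1 can be approximated with a regular expression. The sketch below is an assumption about how the split might be done, not the disclosed method: it breaks after Latin sentence-ending punctuation followed by whitespace, and after full-width Chinese sentence-ending punctuation. The function name is hypothetical, and abbreviations such as "U.S." followed by a space would be split incorrectly by this simple rule.

```python
import re


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation on each sentence."""
    # Latin punctuation must be followed by whitespace; CJK punctuation need not be.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])\s*", text)
    return [part.strip() for part in parts if part.strip()]


text = "The exercise was held again. Escort ships also took part. They formed a fleet."
sentence_texts = split_sentences(text)
```

Each element of `sentence_texts` plays the role of one sentence format file text. Splitting on empty matches, as needed for the full-width punctuation case, requires Python 3.7 or later.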
After the document text corresponding to each of the top N news documents has been split in turn by sentence format to obtain multiple sentence format file texts, sub-step B2 is executed.
Sub-step B2: inputting the multiple sentence format file texts into the summary text network model.
After the multiple sentence format file texts of the document text of each news document have been obtained, the sentence format file texts of each news document can be input in turn into the summary text network model, which in the subsequent process outputs the summary text corresponding to each sentence format file text.
For the top N news documents, the corresponding document text can thus be input into the trained summary text network model, after which step S212 is executed.
In step S212, the summary text corresponding to the top N news documents that is output by the summary text network model is received.
After the top N news documents have been input into the summary text network model, the summary text corresponding to the top N news documents can be extracted by the model. Specifically, the document text of one of the top N news documents is first input into the summary text network model, which outputs the summary text corresponding to that news document, and this summary text is taken as the abstract text of that news document. The same process is then carried out in turn for the other top N news documents, yielding the abstract text of each of the top N news documents.
The detailed process of outputting the summary text is described in the following specific implementation.
In another specific implementation of the present disclosure, the above step S212 may include:
Sub-step C1: receiving the multiple sentence format summary texts, output by the summary text network model, that correspond to the sentence format file texts.
In the embodiments of the present disclosure, a sentence format summary text is the summary text corresponding to one sentence format file text.
After the multiple sentence format file texts have been input into the summary text network model, the model can output the summary text corresponding to each sentence format file text, that is, the sentence format summary texts. In the above example, for the sentence "The report points out that this is another joint exercise held by the Japan Maritime Self-Defense Force with a U.S. aircraft carrier in the South China Sea since last August.", the important summary texts can be obtained, such as "Japan and the U.S." and "joint exercise held in the South China Sea".
It should be understood that the above example is given only for a better understanding of the technical solution of the embodiments of the present disclosure, and is not the sole limitation on the embodiments of the present disclosure.
After the multiple sentence format summary texts corresponding to the sentence format file texts, output by the summary text network model, have been received, sub-step C2 is executed.
Sub-step C2: merging the multiple sentence format summary texts to obtain the summary text.
After the multiple sentence format summary texts of the multiple sentence format file texts of each news document have been obtained, they can be merged to obtain the summary text. For example, news document A contains the sentence format file texts text a, text b, text c and text d, whose sentence format summary texts are text 1, text 2, text 3 and text 4 respectively. Text 1, text 2, text 3 and text 4 can then be merged to obtain the summary text.
It should be understood that the above example is given only for a better understanding of the technical solution of the embodiments of the present disclosure, and is not the sole limitation on the embodiments of the present disclosure.
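Sub-steps C1 and C2 can be sketched as a loop that summarises each sentence and then merges the results in order. The trained summary text network model is not specified in the text, so the sketch passes the model in as a callable and uses a toy stand-in; the function names, the separator and the stand-in itself are all assumptions.

```python
def summarize_document(sentences, summarize_sentence, separator=" "):
    """Summarise each sentence with the given callable and merge the results in order."""
    sentence_summaries = [summarize_sentence(sentence) for sentence in sentences]
    # Skip empty summaries so that the merged text has no stray separators.
    return separator.join(summary for summary in sentence_summaries if summary)


def toy_model(sentence):
    """Hypothetical stand-in for the trained model: keeps the first three words."""
    return " ".join(sentence.split()[:3])


summary = summarize_document(
    ["text a is long here", "text b is also long"],
    toy_model,
    separator="; ",
)
```

Passing the model as a parameter keeps the merging logic independent of how the sentence-level summaries are actually produced.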
By aggregating the retrieved news and producing a text summary, the embodiments of the present disclosure greatly improve the efficiency of public opinion and news editing staff.
In addition to the beneficial effects of the document processing method provided in embodiment one above, the document processing method provided by this embodiment of the present disclosure can also aggregate the retrieved news and produce a text summary, greatly improving the efficiency of public opinion and news editing staff.
Embodiment three
Fig. 3 is a block diagram of a document processing device according to an exemplary embodiment. Referring to Fig. 3, the device includes a news document set acquisition module 131, a relevance score determining module 132, a news document extraction module 133 and a summary text determining module 134.
The news document set acquisition module 131 is configured to obtain at least one news document set, in units of a preset time period, corresponding to an event keyword;
The relevance score determining module 132 is configured to determine the relevance scores corresponding to multiple news documents based on the event keyword and the multiple news documents in the at least one news document set;
The news document extraction module 133 is configured to extract, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, N being a positive integer greater than or equal to 1;
The summary text determining module 134 is configured to determine the summary text corresponding to the top N news documents according to the document text corresponding to the top N news documents, and to take the summary text as the abstract text of the top N news documents.
The document processing device provided by this embodiment of the present disclosure obtains at least one news document set, in units of a preset time period, corresponding to an event keyword; determines the relevance scores of multiple news documents based on the event keyword and the multiple news documents in the at least one news document set; extracts, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, N being a positive integer greater than or equal to 1; and determines the summary text corresponding to the top N news documents according to their document text, taking the summary text as the abstract text of the top N news documents. The embodiments of the present disclosure can extract news document sets in units of a preset time period (such as a year or a quarter), score every news document automatically, and extract the top N news documents according to the relevance score of each news document, which avoids information redundancy and requires no manual involvement. Moreover, the corresponding abstract text can be extracted for the top N news documents, so that subsequent public opinion monitoring or information integration can be carried out without checking news items manually one by one, reducing labor costs.
Embodiment four
Fig. 4 is a block diagram of a document processing device according to an exemplary embodiment. Referring to Fig. 4, the device includes a news document set acquisition module 141, a relevance score determining module 142, a news document extraction module 143 and a summary text determining module 144.
The news document set acquisition module 141 is configured to obtain at least one news document set, in units of a preset time period, corresponding to an event keyword;
The relevance score determining module 142 is configured to determine the relevance scores corresponding to multiple news documents based on the event keyword and the multiple news documents in the at least one news document set;
The news document extraction module 143 is configured to extract, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, N being a positive integer greater than or equal to 1;
The summary text determining module 144 is configured to determine the summary text corresponding to the top N news documents according to the document text corresponding to the top N news documents, and to take the summary text as the abstract text of the top N news documents.
In a specific implementation of the present disclosure, the news document set acquisition module 141 includes:
A temperature weight determining submodule 1411, configured to determine, based on the media event corresponding to the event keyword, the temperature weights of multiple preset time periods associated with the media event;
A target time period extraction submodule 1412, configured to extract, from the multiple preset time periods, at least one target preset time period whose temperature weight is greater than a weight threshold;
A news document set acquisition submodule 1413, configured to obtain, based on the event keyword, the news document set within the at least one target preset time period.
In a specific implementation of the present disclosure, the relevance score determining module 142 includes:
A first participle acquisition submodule 1421, configured to segment the event keyword to obtain at least one first participle;
A first word frequency calculation submodule 1422, configured to calculate at least one first word frequency of the at least one first participle in the multiple news documents, and at least one first inverse document frequency of the at least one first participle in all news document sets;
A second participle acquisition submodule 1423, configured to segment, for all of the news documents, each news document to obtain multiple second participles;
A second word frequency calculation submodule 1424, configured to calculate multiple second word frequencies of the multiple second participles in each news document, and multiple second inverse document frequencies of the multiple second participles in all news document sets;
A document matrix construction submodule 1425, configured to construct the document matrix corresponding to each news document according to the multiple second word frequencies and the multiple second inverse document frequencies;
A relevance score determining submodule 1426, configured to determine the relevance score of each news document according to the at least one first word frequency, the at least one first inverse document frequency and the document matrix.
In a specific implementation of the present disclosure, the relevance score determining submodule 1426 includes:
A similarity value calculation submodule, configured to calculate the similarity value between the at least one first word frequency and the at least one first inverse document frequency, on the one hand, and the news document corresponding to each document matrix, on the other;
A relevance score acquisition submodule, configured to take the similarity value as the relevance score of each news document.
In a specific implementation of the present disclosure, the summary text determining module 144 includes:
A document text input submodule 1441, configured to input the document text corresponding to the top N news documents into a pre-trained summary text network model;
A summary text receiving submodule 1442, configured to receive the summary text corresponding to the top N news documents that is output by the summary text network model.
In a specific implementation of the present disclosure, the document text input submodule 1441 includes:
A sentence format text acquisition submodule, configured to split, for the top N news documents, the document text corresponding to each news document in turn by sentence format to obtain multiple sentence format file texts;
A sentence format text input submodule, configured to input the multiple sentence format file texts into the summary text network model;
The summary text receiving submodule 1442 includes:
A sentence format summary receiving submodule, configured to receive the multiple sentence format summary texts, output by the summary text network model, that correspond to the sentence format file texts;
A summary text acquisition submodule, configured to merge the multiple sentence format summary texts to obtain the summary text.
In addition to the beneficial effects of the document processing device provided in embodiment three above, the document processing device provided by this embodiment of the present disclosure can also aggregate the retrieved news and produce a text summary, greatly improving the efficiency of public opinion and news editing staff.
As for the device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method, and is not elaborated here.
In addition, the embodiments of the present disclosure further provide an electronic device, comprising: a processor;
A memory for storing instructions executable by the processor;
Wherein the processor is configured to execute the document processing method of any one of embodiment one to embodiment two.
Fig. 5 is a block diagram of a device 800 for text processing according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so on.
Referring to Fig. 5, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operations and recording operations. The processing component 802 may include one or more processors 820 to execute instructions so as to perform all or part of the steps of the methods described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.
The power component 806 supplies power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the device 800 is in an operating mode, such as a call mode, a recording mode or a voice recognition mode. The received audio signal may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a loudspeaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and so on. These buttons may include, but are not limited to, a home button, volume buttons, a start button and a lock button.
Sensor module 814 includes one or more sensors for providing status assessments of various aspects of device 800. For example, sensor module 814 can detect the open/closed state of device 800 and the relative positioning of components, for example the display and keypad of device 800. Sensor module 814 can also detect a change in position of device 800 or of one of its components, the presence or absence of user contact with device 800, the orientation or acceleration/deceleration of device 800, and a change in temperature of device 800. Sensor module 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor module 814 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor module 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
Communication component 816 is configured to facilitate wired or wireless communication between device 800 and other devices. Device 800 can access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example memory 804 including instructions, where the instructions can be executed by processor 820 of device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 6 is a block diagram of a text processing apparatus 1900 according to an exemplary embodiment. For example, device 1900 may be provided as a server. Referring to Fig. 6, device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by processing component 1922, such as application programs. The application programs stored in memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, processing component 1922 is configured to execute the instructions so as to perform the above method: obtaining at least one news document set, in units of a preset time period, corresponding to an event keyword; determining, based on the event keyword and multiple news documents in the at least one news document set, the relevance scores corresponding to the multiple news documents; extracting, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, N being a positive integer greater than or equal to 1; and determining, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents, and using that summary text as the summary of the top N news documents.
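As an illustration of the last two steps listed above (extracting the top N news documents by relevance score, then deriving a summary from their text), a minimal Python sketch follows. It is not the disclosed implementation: the function names are hypothetical, the scores are assumed to have been computed already, and `sentence_model` is a placeholder callable standing in for the pre-trained summary text network model, whose architecture is not specified here.

```python
import heapq
import re
from typing import Callable, List, Sequence


def top_n_documents(docs: Sequence[str], scores: Sequence[float], n: int) -> List[str]:
    """Return the n documents with the highest relevance scores (n >= 1)."""
    if n < 1:
        raise ValueError("N must be a positive integer")
    # Pair each score with the document index and keep the n largest scores.
    ranked = heapq.nlargest(n, zip(scores, range(len(docs))), key=lambda p: p[0])
    return [docs[i] for _, i in ranked]


def split_sentences(text: str) -> List[str]:
    """Split text into sentences on common Latin and CJK end punctuation.

    Assumption: a simple regular expression is enough for this sketch.
    """
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def summarize(docs: Sequence[str], sentence_model: Callable[[str], str]) -> str:
    """Feed each sentence to the model and merge the per-sentence outputs."""
    pieces = []
    for doc in docs:
        for sentence in split_sentences(doc):
            out = sentence_model(sentence)  # hypothetical stand-in for the trained network
            if out:
                pieces.append(out)
    return " ".join(pieces)
```

Passing the model in as a callable keeps the selection and merging logic independent of whichever summarization network is actually trained.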
Device 1900 may also include a power supply module 1926 configured to perform power management of device 1900, a wired or wireless network interface 1950 configured to connect device 1900 to a network, and an input/output (I/O) interface 1958. Device 1900 can operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A document processing method, characterized by comprising:
obtaining at least one news document set, in units of a preset time period, corresponding to an event keyword;
determining, based on the event keyword and multiple news documents in the at least one news document set, the relevance scores corresponding to the multiple news documents;
extracting, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, N being a positive integer greater than or equal to 1;
determining, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents, and using the summary text as the summary of the top N news documents.
2. The method according to claim 1, characterized in that the step of obtaining at least one news document set, in units of a preset time period, corresponding to an event keyword comprises:
determining, based on the news event corresponding to the event keyword, the popularity weights of multiple preset time periods associated with the news event;
extracting, from the multiple preset time periods, at least one target preset time period whose popularity weight is greater than a weight threshold;
obtaining, based on the event keyword, the news document set within the at least one target preset time period.
3. The method according to claim 1, characterized in that the step of determining, based on the event keyword and multiple news documents in the at least one news document set, the relevance scores corresponding to the multiple news documents comprises:
segmenting the event keyword into words to obtain at least one first segmented word;
calculating at least one first term frequency of the at least one first segmented word in the multiple news documents, and at least one first inverse document frequency of the at least one first segmented word in the set of all news documents;
for all news documents, segmenting each news document into words to obtain multiple second segmented words;
calculating multiple second term frequencies of the multiple second segmented words in each news document, and multiple second inverse document frequencies of the multiple second segmented words in the set of all news documents;
constructing, according to the multiple second term frequencies and the multiple second inverse document frequencies, the document matrix corresponding to each news document;
determining the relevance score of each news document according to the at least one first term frequency, the at least one first inverse document frequency, and the document matrix.
4. The method according to claim 3, characterized in that the step of determining the relevance score of each news document according to the at least one first term frequency, the at least one first inverse document frequency, and the document matrix comprises:
calculating a similarity value between the at least one first term frequency and the at least one first inverse document frequency, on the one hand, and the news document corresponding to each document matrix, on the other;
using the similarity value as the relevance score of each news document.
5. The method according to claim 1, characterized in that the step of determining, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents comprises:
inputting the document text corresponding to the top N news documents into a pre-trained summary text network model;
receiving the summary text corresponding to the top N news documents output by the summary text network model.
6. The method according to claim 5, characterized in that the step of inputting the document text corresponding to the top N news documents into a pre-trained summary text network model comprises:
for the top N news documents, splitting the document text corresponding to each news document into sentences in turn, to obtain multiple sentence-format document texts;
inputting the multiple sentence-format document texts into the summary text network model;
and the step of receiving the summary text corresponding to the top N news documents output by the summary text network model comprises:
receiving multiple sentence-format summary texts, corresponding to the sentence-format document texts, output by the summary text network model;
merging the multiple sentence-format summary texts to obtain the summary text.
7. A document processing device, characterized by comprising:
a news document set obtaining module, configured to obtain at least one news document set, in units of a preset time period, corresponding to an event keyword;
a relevance score determining module, configured to determine, based on the event keyword and multiple news documents in the at least one news document set, the relevance scores corresponding to the multiple news documents;
a news document extraction module, configured to extract, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, N being a positive integer greater than or equal to 1;
a summary text determining module, configured to determine, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents, and to use the summary text as the summary of the top N news documents.
8. The device according to claim 7, characterized in that the news document set obtaining module comprises:
a popularity weight determining submodule, configured to determine, based on the news event corresponding to the event keyword, the popularity weights of multiple preset time periods associated with the news event;
a target time period extraction submodule, configured to extract, from the multiple preset time periods, at least one target preset time period whose popularity weight is greater than a weight threshold;
a news document set obtaining submodule, configured to obtain, based on the event keyword, the news document set within the at least one target preset time period.
9. An electronic device, characterized by comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the document processing method according to any one of claims 1 to 6.
10. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a first terminal, enable the first terminal to perform the document processing method according to any one of claims 1 to 6.
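The time-period filtering of claim 2 and the relevance scoring of claims 3 and 4 can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the claimed implementation: all function names are hypothetical, whitespace tokenization stands in for a real word segmenter, the inverse document frequency uses a smoothed variant, cosine similarity is assumed as the similarity value, and the keyword vector weights each keyword term by its frequency within the keyword itself, which simplifies the first term frequency described in claim 3.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def select_hot_periods(weights: Dict[str, float], threshold: float) -> List[str]:
    """Keep the time periods whose weight exceeds the threshold (claim 2 sketch)."""
    return [period for period, w in weights.items() if w > threshold]


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace splitting stands in for a real word segmenter.
    return text.lower().split()


def inverse_document_frequency(corpus: Sequence[List[str]]) -> Dict[str, float]:
    """Smoothed IDF over the whole document set (an assumed variant)."""
    n_docs = len(corpus)
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance_scores(keyword: str, documents: Sequence[str]) -> List[float]:
    """Score each document by cosine similarity to the keyword's TF-IDF vector."""
    corpus = [tokenize(d) for d in documents]
    idf = inverse_document_frequency(corpus)
    query = tfidf_vector(tokenize(keyword), idf)
    return [cosine(query, tfidf_vector(tokens, idf)) for tokens in corpus]
```

Here each document's sparse vector plays the role of a row of the document matrix; a document that shares no term with the keyword receives a score of zero.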
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910517936.4A CN110377808A (en) | 2019-06-14 | 2019-06-14 | Document processing method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910517936.4A CN110377808A (en) | 2019-06-14 | 2019-06-14 | Document processing method, device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110377808A true CN110377808A (en) | 2019-10-25 |
Family
ID=68248831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910517936.4A Pending CN110377808A (en) | 2019-06-14 | 2019-06-14 | Document processing method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110377808A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112613296A (en) * | 2020-12-07 | 2021-04-06 | 深圳价值在线信息科技股份有限公司 | News importance degree acquisition method and device, terminal equipment and storage medium |
CN114780712A (en) * | 2022-04-06 | 2022-07-22 | 科技日报社 | Quality evaluation-based news topic generation method and device |
CN115391516A (en) * | 2022-10-31 | 2022-11-25 | 成都飞机工业(集团)有限责任公司 | Unstructured document extraction method, device, equipment and medium |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101446940A (en) * | 2007-11-27 | 2009-06-03 | 北京大学 | Method and device of automatically generating a summary for document set |
US8131735B2 (en) * | 2009-07-02 | 2012-03-06 | Battelle Memorial Institute | Rapid automatic keyword extraction for information retrieval and analysis |
CN105930314A (en) * | 2016-04-14 | 2016-09-07 | 清华大学 | Text summarization generation system and method based on coding-decoding deep neural networks |
CN106933878A (en) * | 2015-12-30 | 2017-07-07 | 腾讯科技(北京)有限公司 | A kind of information processing method and device |
CN107169131A (en) * | 2017-06-08 | 2017-09-15 | 广州优视网络科技有限公司 | A kind of video searching method, device and server |
CN107256251A (en) * | 2017-06-08 | 2017-10-17 | 广州优视网络科技有限公司 | A kind of application software searching method, device and server |
CN107273476A (en) * | 2017-06-08 | 2017-10-20 | 广州优视网络科技有限公司 | A kind of article search method, device and server |
CN107977420A (en) * | 2017-11-23 | 2018-05-01 | 广东工业大学 | The abstract extraction method, apparatus and readable storage medium storing program for executing of a kind of evolved document |
CN108280112A (en) * | 2017-06-22 | 2018-07-13 | 腾讯科技(深圳)有限公司 | Abstraction generating method, device and computer equipment |
CN108319668A (en) * | 2018-01-23 | 2018-07-24 | 义语智能科技(上海)有限公司 | Generate the method and apparatus of text snippet |
CN109241272A (en) * | 2018-07-25 | 2019-01-18 | 华南师范大学 | A kind of Chinese text abstraction generating method, computer-readable storage media and computer equipment |
CN109657051A (en) * | 2018-11-30 | 2019-04-19 | 平安科技(深圳)有限公司 | Text snippet generation method, device, computer equipment and storage medium |
CN109726281A (en) * | 2018-12-12 | 2019-05-07 | Tcl集团股份有限公司 | A kind of text snippet generation method, intelligent terminal and storage medium |
-
2019
- 2019-06-14 CN CN201910517936.4A patent/CN110377808A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111461089B (en) | Face detection method, and training method and device of face detection model | |
US11120078B2 (en) | Method and device for video processing, electronic device, and storage medium | |
CN110781305B (en) | Text classification method and device based on classification model and model training method | |
TWI728564B (en) | Method, device and electronic equipment for image description statement positioning and storage medium thereof | |
CN109522419B (en) | Session information completion method and device | |
CN109918669B (en) | Entity determining method, device and storage medium | |
CN110008401B (en) | Keyword extraction method, keyword extraction device, and computer-readable storage medium | |
WO2021027343A1 (en) | Human face image recognition method and apparatus, electronic device, and storage medium | |
CN110377808A (en) | Document processing method, device, electronic equipment and storage medium | |
CN111859020B (en) | Recommendation method, recommendation device, electronic equipment and computer readable storage medium | |
CN103650035A (en) | Identifying people that are proximate to a mobile device user via social graphs, speech models, and user context | |
CN109614482A (en) | Processing method, device, electronic equipment and the storage medium of label | |
WO2022166069A1 (en) | Deep learning network determination method and apparatus, and electronic device and storage medium | |
CN110399934A (en) | A kind of video classification methods, device and electronic equipment | |
CN107133354A (en) | The acquisition methods and device of description information of image | |
CN110069624A (en) | Text handling method and device | |
CN108345625A (en) | A kind of information mining method and device, a kind of device for information excavating | |
CN110929176A (en) | Information recommendation method and device and electronic equipment | |
CN112101216A (en) | Face recognition method, device, equipment and storage medium | |
CN116863286A (en) | Double-flow target detection method and model building method thereof | |
CN111222316A (en) | Text detection method, device and storage medium | |
CN111739535A (en) | Voice recognition method and device and electronic equipment | |
CN106156299B (en) | The subject content recognition methods of text information and device | |
CN112884040A (en) | Training sample data optimization method and system, storage medium and electronic equipment | |
CN110177284A (en) | Information displaying method, device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20191025 |