CN110377808A - Document processing method, device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN110377808A
Authority
CN
China
Prior art keywords
news documents
news
text
document
documents
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910517936.4A
Other languages
Chinese (zh)
Inventor
方轲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910517936.4A
Publication of CN110377808A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34: Browsing; Visualisation therefor
    • G06F16/345: Summarisation for human users
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a document processing method, a device, an electronic device, and a storage medium. The method includes: obtaining at least one news document set, organized by preset time period, that corresponds to an event keyword; determining relevance scores for multiple news documents based on the event keyword and the multiple news documents in the at least one news document set; extracting, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, where N is a positive integer greater than or equal to 1; and determining, according to the document text of the top N news documents, the corresponding summary text, and using the summary text as the abstract text of the top N news documents. The disclosure avoids information redundancy and requires no manual participation. Because the corresponding abstract texts are extracted, subsequent public-opinion monitoring or information integration can be carried out without manually reviewing news items one by one, which reduces labor cost.

Description

Document processing method, device, electronic equipment and storage medium
Technical field
This disclosure relates to the technical field of news document processing, and in particular to a document processing method, a device, an electronic device, and a storage medium.
Background
With the rapid development of Internet technology, the network has become an indispensable part of people's lives.
On the Internet, hot news events of all kinds occur every day and appear on the trending-search lists of the major platforms (such as Baidu, Weibo, Zhihu, and Toutiao). However, many trending events are only a snapshot of the whole event storyline at one moment and do not present the complete story.
Currently, text integration is usually performed with a scheme that combines a crawler with event keywords. First, crawler technology is used to grab the keywords of a hot event. Then, news containing the keywords is extracted by methods such as filtering and cleaning. Finally, multiple news items are tested manually, and the current ranking algorithm is evaluated on its NDCG (Normalized Discounted Cumulative Gain) value in order to rank the news and obtain the most relevant news events.
In the above scheme, news is crawled by keyword at a particular time node, and the news gathered from the whole network is the news occurring at that time node, so there is substantial information redundancy. Manual ranking also increases human resource cost and is inefficient. Moreover, mainstream news currently provides only headlines, so subsequent public-opinion monitoring or information integration requires manually reviewing news items one by one, which is time-consuming and requires considerable effort.
Summary
To overcome the problems in the related art, embodiments of the present disclosure provide a document processing method, a device, an electronic device, and a storage medium.
According to a first aspect of the embodiments of the present disclosure, a document processing method is provided, including: obtaining at least one news document set, organized by preset time period, that corresponds to an event keyword; determining relevance scores for multiple news documents based on the event keyword and the multiple news documents in the at least one news document set; extracting, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, where N is a positive integer greater than or equal to 1; and determining, according to the document text of the top N news documents, the corresponding summary text, and using the summary text as the abstract text of the top N news documents.
In a specific implementation of the present disclosure, the step of obtaining the at least one news document set includes: determining, based on the news event corresponding to the event keyword, hotness weights of multiple preset time periods associated with the news event; extracting, from the multiple preset time periods, at least one target preset time period whose hotness weight is greater than a weight threshold; and obtaining, based on the event keyword, the news document set within the at least one target preset time period.
In a specific implementation of the present disclosure, the step of determining the relevance scores includes: segmenting the event keyword to obtain at least one first word segment; calculating at least one first term frequency of the at least one first word segment in the multiple news documents, and at least one first inverse document frequency of the at least one first word segment across all news document sets; segmenting each of the news documents to obtain multiple second word segments; calculating multiple second term frequencies of the multiple second word segments in each news document, and multiple second inverse document frequencies of the multiple second word segments across all news document sets; constructing a document matrix for each news document from the multiple second term frequencies and the multiple second inverse document frequencies; and determining the relevance score of each news document from the at least one first term frequency, the at least one first inverse document frequency, and the document matrix.
In a specific implementation of the present disclosure, the step of determining the relevance score of each news document from the at least one first term frequency, the at least one first inverse document frequency, and the document matrix includes: calculating, from the at least one first term frequency and the at least one first inverse document frequency, a similarity value with the news document corresponding to each document matrix; and using the similarity value as the relevance score of each news document.
In a specific implementation of the present disclosure, the step of determining the summary text from the document text of the top N news documents includes: inputting the document text of the top N news documents into a pre-trained summary text network model; and receiving the summary text, output by the summary text network model, that corresponds to the top N news documents.
In a specific implementation of the present disclosure, the step of inputting the document text of the top N news documents into the pre-trained summary text network model includes: for the top N news documents, splitting the document text of each news document in turn by sentence to obtain multiple sentence-format document texts; and inputting the multiple sentence-format document texts into the summary text network model. The step of receiving the summary text output by the summary text network model includes: receiving multiple sentence-format summary texts, output by the summary text network model, that correspond to the sentence-format document texts; and merging the multiple sentence-format summary texts to obtain the summary text.
According to a second aspect of the embodiments of the present disclosure, a document processing device is provided, including: a news document set obtaining module, configured to obtain at least one news document set, organized by preset time period, that corresponds to an event keyword; a relevance score determining module, configured to determine relevance scores for multiple news documents based on the event keyword and the multiple news documents in the at least one news document set; a news document extraction module, configured to extract, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, where N is a positive integer greater than or equal to 1; and an abstract text determining module, configured to determine, according to the document text of the top N news documents, the corresponding summary text, and to use the summary text as the abstract text of the top N news documents.
In a specific implementation of the present disclosure, the news document set obtaining module includes: a hotness weight determining submodule, configured to determine, based on the news event corresponding to the event keyword, hotness weights of multiple preset time periods associated with the news event; a target period extraction submodule, configured to extract, from the multiple preset time periods, at least one target preset time period whose hotness weight is greater than a weight threshold; and a news document set obtaining submodule, configured to obtain, based on the event keyword, the news document set within the at least one target preset time period.
In a specific implementation of the present disclosure, the relevance score determining module includes: a first word segment obtaining submodule, configured to segment the event keyword to obtain at least one first word segment; a first term frequency calculation submodule, configured to calculate at least one first term frequency of the at least one first word segment in the multiple news documents and at least one first inverse document frequency of the at least one first word segment across all news document sets; a second word segment obtaining submodule, configured to segment each of the news documents to obtain multiple second word segments; a second term frequency calculation submodule, configured to calculate multiple second term frequencies of the multiple second word segments in each news document and multiple second inverse document frequencies of the multiple second word segments across all news document sets; a document matrix construction submodule, configured to construct a document matrix for each news document from the multiple second term frequencies and the multiple second inverse document frequencies; and a relevance score determining submodule, configured to determine the relevance score of each news document from the at least one first term frequency, the at least one first inverse document frequency, and the document matrix.
In a specific implementation of the present disclosure, the relevance score determining submodule includes: a similarity value calculation submodule, configured to calculate, from the at least one first term frequency and the at least one first inverse document frequency, a similarity value with the news document corresponding to each document matrix; and a relevance score obtaining submodule, configured to use the similarity value as the relevance score of each news document.
In a specific implementation of the present disclosure, the abstract text determining module includes: a document text input submodule, configured to input the document text of the top N news documents into a pre-trained summary text network model; and a summary text receiving submodule, configured to receive the summary text, output by the summary text network model, that corresponds to the top N news documents.
In a specific implementation of the present disclosure, the document text input submodule includes: a sentence-format text obtaining submodule, configured to split, for the top N news documents, the document text of each news document in turn by sentence to obtain multiple sentence-format document texts; and a sentence-format text input submodule, configured to input the multiple sentence-format document texts into the summary text network model. The summary text receiving submodule includes: a sentence-format summary receiving submodule, configured to receive multiple sentence-format summary texts, output by the summary text network model, that correspond to the sentence-format document texts; and a summary text obtaining submodule, configured to merge the multiple sentence-format summary texts to obtain the summary text.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the document processing method of any of the above.
According to a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is also provided. When instructions in the storage medium are executed by a processor of a first terminal, the first terminal is enabled to perform the document processing method of any of the above.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
The embodiments of the present disclosure provide a document processing method that obtains at least one news document set, organized by preset time period, corresponding to an event keyword; determines relevance scores for multiple news documents based on the event keyword and the multiple news documents in the at least one news document set; extracts, according to the relevance scores, the top N news documents with the highest scores, where N is a positive integer greater than or equal to 1; and determines, according to the document text of the top N news documents, the corresponding summary text, which is used as the abstract text of the top N news documents. The embodiments can extract news document sets organized by preset time period (for example, yearly or quarterly), score every news document automatically, and extract the top N highest-scoring news documents according to the relevance score of each news document, which avoids information redundancy and requires no manual participation. In addition, the corresponding abstract texts can be extracted for the top N news documents, so that subsequent public-opinion monitoring or information integration can be carried out without manually reviewing news items one by one, which reduces labor cost.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The drawings herein are incorporated into and form part of this specification. They show embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of the steps of a document processing method according to an exemplary embodiment;
Fig. 2 is a flowchart of the steps of a document processing method according to an exemplary embodiment;
Fig. 3 is a block diagram of a document processing device according to an exemplary embodiment;
Fig. 4 is a block diagram of a document processing device according to an exemplary embodiment;
Fig. 5 is a block diagram of a document processing device according to an exemplary embodiment;
Fig. 6 is a block diagram of a document processing device according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the disclosure as detailed in the appended claims.
Embodiment one
Fig. 1 is a flowchart of the steps of a document processing method according to an exemplary embodiment. As shown in Fig. 1, the document processing method includes the following steps:
In step S11, at least one news document set, organized by preset time period and corresponding to an event keyword, is obtained.
In the embodiments of the present disclosure, the event keyword is the keyword used to search for news documents. It may be a keyword entered by a user and extracted from a current hot news event.
A news document is a document, found by searching with the event keyword, that corresponds to a news event.
A preset time period is a span of time of a given length, for example a year or a season.
A news document set is formed by grouping the retrieved news documents by preset time period. For example, the news documents retrieved for 2019 form one news document set and those retrieved for 2018 form another; alternatively, the news documents from spring 2019 form one news document set and those from winter 2019 form another.
It should be understood that the above example is given only to aid understanding of the solution of the embodiments of the present disclosure and is not the sole limitation on the embodiments.
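The grouping described for step S11 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names `period_key` and `group_by_period`, the `(date, text)` input shape, and the use of calendar quarters in place of seasons are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def period_key(published: date, unit: str = "year") -> str:
    """Map a publication date to the label of its preset time period."""
    if unit == "year":
        return str(published.year)
    if unit == "quarter":
        quarter = (published.month - 1) // 3 + 1
        return f"{published.year}-Q{quarter}"
    raise ValueError(f"unsupported unit: {unit}")


def group_by_period(documents, unit="year"):
    """Group (published_date, text) pairs into one document set per period."""
    collections = defaultdict(list)
    for published, text in documents:
        collections[period_key(published, unit)].append(text)
    return dict(collections)


# Hypothetical crawl results for one event keyword.
docs = [
    (date(2019, 3, 14), "news A"),
    (date(2019, 11, 2), "news B"),
    (date(2018, 6, 30), "news C"),
]
by_year = group_by_period(docs, "year")
by_quarter = group_by_period(docs, "quarter")
```

Here `by_year` holds one set for 2019 and one for 2018, matching the yearly example in the text.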
After the event keyword entered by the user is obtained, web crawler technology can be used to search for the news documents within a preset time period that correspond to the event keyword, and the news documents within that preset time period are then combined into one news document set.
After the at least one news document set, organized by preset time period and corresponding to the event keyword, is obtained, step S12 is executed.
In step S12, relevance scores for multiple news documents are determined based on the event keyword and the multiple news documents in the at least one news document set.
The relevance score indicates the degree of association between each news document and the event keyword. Understandably, a higher relevance score indicates a stronger association between the news document and the event keyword. For example, suppose the news documents are news 1, news 2, news 3, and news 4, with relevance scores of 0.2, 0.5, 0.7, and 0.6 respectively; then news 3 is the most strongly associated with the event keyword and news 1 is the least.
It should be understood that the above example is given only to aid understanding of the technical solution of the embodiments of the present disclosure and is not the sole limitation on the embodiments.
After the event keyword and the at least one news document set are obtained, the event keyword can be segmented. Then, for each resulting word segment, at least one term frequency in the multiple news documents of the at least one news document set is obtained, together with at least one inverse document frequency across all news document sets.
In addition, each of the news documents is segmented to obtain multiple word segments, and a document matrix corresponding to each news document is constructed from the term frequencies of those word segments in each news document and their inverse document frequencies across all news document sets.
Then, the relevance score of each news document is calculated from the at least one term frequency and the at least one inverse document frequency of the keyword's word segments together with the document matrix of each news document.
The process of calculating the relevance score of each news document is described in detail in Embodiment Two below and is not repeated here.
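One common way to turn term frequencies and inverse document frequencies into a similarity value is TF-IDF weighting with cosine similarity. The sketch below follows that reading; the disclosure does not name a specific similarity measure, so cosine similarity, the smoothed IDF formula, and all function names here are assumptions. Documents and the keyword are assumed to be already segmented into token lists.

```python
import math
from collections import Counter


def tf(tokens):
    """Relative term frequency of each token within one token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}


def idf(corpus_tokens):
    """Smoothed inverse document frequency over the whole corpus."""
    n = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf_vector(tokens, idf_table):
    """Sparse TF-IDF vector; tokens unseen in the corpus get weight 0."""
    return {t: w * idf_table.get(t, 0.0) for t, w in tf(tokens).items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance_scores(keyword_tokens, corpus_tokens):
    """Score every document against the segmented event keyword."""
    idf_table = idf(corpus_tokens)
    query = tfidf_vector(keyword_tokens, idf_table)
    return [cosine(query, tfidf_vector(doc, idf_table)) for doc in corpus_tokens]


corpus = [
    ["earthquake", "rescue", "teams"],
    ["stock", "market", "rally"],
    ["earthquake", "relief", "fund", "earthquake"],
]
scores = relevance_scores(["earthquake"], corpus)
```

In this toy corpus the second document shares no token with the keyword and scores 0, while the third, which mentions the keyword twice, scores higher than the first.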
After the relevance score of each news document is determined based on the event keyword and the multiple news documents in the at least one news document set, step S13 is executed.
In step S13, the top N news documents with the highest scores are extracted from the multiple news documents according to the relevance scores.
In the embodiments of the present disclosure, N can be any positive integer greater than or equal to 1, for example 1, 3, 8, or 12. The specific value depends on the actual situation, and the embodiments of the present disclosure do not limit it.
After the relevance score of each news document in the at least one news document set is obtained, the N news documents with the highest scores can be extracted from the multiple news documents according to those scores. For example, suppose N is 3 and the news documents are news a, news b, news c, news d, and news e, with relevance scores of 0.9, 0.2, 0.6, 0.8, and 0.9 respectively. When the 3 highest-scoring news documents are extracted, news a, news e, and news d are selected.
It should be understood that the above example is given only to aid understanding of the embodiments of the present disclosure and is not the sole limitation on the embodiments.
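The top-N extraction of step S13 amounts to sorting by score and taking the first N entries. The sketch below reproduces the five-document example from the text; the function name `top_n_documents` and the dictionary input are assumptions for illustration.

```python
def top_n_documents(scores, n):
    """Return the n (name, score) pairs with the highest scores.

    Python's sort is stable, so documents with equal scores keep
    their original relative order.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


scores = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.8, "e": 0.9}
top3 = [name for name, _ in top_n_documents(scores, 3)]
```

With N = 3 this selects news a, news e, and news d, as in the example above.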
After the top N news documents with the highest scores are extracted from the multiple news documents according to the relevance scores, step S14 is executed.
In step S14, the summary text corresponding to the top N news documents is determined from the document text of the top N news documents, and the summary text is used as the abstract text of the top N news documents.
The summary text is a text that describes each news document in general terms. It is equivalent to the abstract information of the news, and the approximate content of each news document can be understood from it.
In the present disclosure, after the top N news documents with the highest scores are extracted from the multiple news documents, they can be input into the summary text network model, which outputs the summary text corresponding to each news document; the summary text is then used as the abstract text of the top N news documents.
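The summary of the disclosure also describes splitting each document by sentence before calling the model and merging the per-sentence outputs afterwards. That flow can be sketched as below. The summary text network model itself is not specified in the source, so `placeholder_model` is a trivial stand-in (it keeps the first five words of a sentence) and is purely hypothetical; `split_sentences` and `summarize_documents` are likewise illustrative names.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at Latin or CJK end punctuation."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]*", text)
    return [p.strip() for p in parts if p.strip()]


def summarize_documents(texts: List[str],
                        summarize_sentence: Callable[[str], str]) -> List[str]:
    """Split each document by sentence, summarize each piece, merge the results."""
    summaries = []
    for text in texts:
        pieces = [summarize_sentence(s) for s in split_sentences(text)]
        summaries.append(" ".join(p for p in pieces if p))
    return summaries


def placeholder_model(sentence: str) -> str:
    """Hypothetical stand-in for the trained model: keep the first five words."""
    return " ".join(sentence.split()[:5])


top_docs = ["The quake struck at dawn near the coast. Rescue teams arrived within hours!"]
summaries = summarize_documents(top_docs, placeholder_model)
```

In a real system the callable passed as `summarize_sentence` would wrap the trained model's inference call; the simple punctuation-based splitter would also need replacing for text containing abbreviations or decimals.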
By extracting news document sets organized by preset time period, the embodiments of the present disclosure score every news document automatically and extract the top N highest-scoring news documents according to the relevance score of each news document, which avoids information redundancy and requires no manual participation. In addition, the corresponding abstract texts can be extracted for the top N news documents, so that subsequent public-opinion monitoring or information integration can be carried out without manually reviewing news items one by one.
The document processing method provided by the embodiments of the present disclosure obtains at least one news document set, organized by preset time period, corresponding to an event keyword; determines relevance scores for multiple news documents based on the event keyword and the multiple news documents in the at least one news document set; extracts, according to the relevance scores, the top N news documents with the highest scores, where N is a positive integer greater than or equal to 1; and determines, according to the document text of the top N news documents, the corresponding summary text, which is used as the abstract text of the top N news documents. The embodiments can extract news document sets organized by preset time period (for example, by year or season), score every news document automatically, and extract the top N highest-scoring news documents according to the relevance score of each news document, which avoids information redundancy and requires no manual participation. In addition, the corresponding abstract texts can be extracted for the top N news documents, so that subsequent public-opinion monitoring or information integration can be carried out without manually reviewing news items one by one, which reduces labor cost.
Embodiment two
Fig. 2 is a kind of step flow chart of document processing method shown according to an exemplary embodiment, as shown in Fig. 2, The document processing method the following steps are included:
In the step s 21, it is based on the corresponding media event of the event keyword, determination is associated with the media event The temperature weight of multiple preset time periods.
In the embodiments of the present disclosure, event keyword refers to the keyword for searching for news documents, and event keyword can To be the keyword input by user extracted according to current hotspot media event.
Temperature weight refers to media event corresponding with event keyword in the temperature of each preset time period, it is possible to understand that Ground, temperature weight can be by the pre-set program of research staff, the weighted value as obtained from the search of internet data, Media event is represented in the different temperatures of each preset time period, for example, existing when preset time period is year for event a Temperature in 2019 is higher than in temperature in 2018, then for can be set within 2019 a higher temperature weight, and for A temperature weight for being lower than temperature weight in 2019 can be set within 2018.
The specific acquisition modes of temperature weight can also be searched for according to media event by computer program it is each pre- The related news in duration are set, thus the temperature weight provided.
Certainly, in practical applications, those skilled in the art can also be associated with using other way acquisition with media event The corresponding temperature weight of each preset time period, specifically, can according to the actual situation depending on, this is not added in the embodiment of the present disclosure With limitation.
After the popularity weights of the multiple preset time periods associated with the news event are determined based on the news event corresponding to the event keyword, step S22 is executed.
In step S22, at least one target preset time period whose popularity weight is greater than a weight threshold is extracted from the multiple preset time periods.
The weight threshold refers to a threshold for the popularity weight that is preset by developers as needed.
The target preset time period refers to a preset time period, among those associated with the news event, whose popularity weight is greater than the weight threshold. For example, taking a year as the preset time period, suppose the years are 2019, 2018, 2017 and 2016, the news event is A, and the popularity weights of the four years associated with news event A are 0.8 for 2019, 0.7 for 2018, 0.6 for 2017 and 0.5 for 2016, with a weight threshold of 0.6. The years whose popularity weight is greater than the weight threshold are then 2018 and 2019, so 2018 and 2019 are regarded as the target years (that is, the target preset time periods).
It can be understood that the above example is given only for a better understanding of the technical solution of the embodiments of the present disclosure and is not the sole limitation on the embodiments of the present disclosure.
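The threshold filtering of step S22 can be illustrated with a minimal Python sketch. The function name `select_target_periods` and the dictionary layout are assumptions made for illustration and do not come from the disclosure; the numbers are those of the example above.

```python
def select_target_periods(popularity_weights, weight_threshold):
    """Return the periods whose popularity weight is strictly greater than the threshold.

    popularity_weights: dict mapping a preset time period (here a year) to its weight.
    The dict layout is an illustrative assumption, not part of the disclosure.
    """
    return sorted(
        period
        for period, weight in popularity_weights.items()
        if weight > weight_threshold
    )


# Weights for news event A, taken from the example in the text.
weights = {2019: 0.8, 2018: 0.7, 2017: 0.6, 2016: 0.5}
target_years = select_target_periods(weights, 0.6)
print(target_years)  # [2018, 2019]
```

Because the comparison is strict, 2017 (weight 0.6) is not selected, which matches the "greater than" wording of step S22.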
After at least one target preset time period whose popularity weight is greater than the weight threshold is extracted from the multiple preset time periods, step S23 is executed.
In step S23, based on the event keyword, the news document set within the at least one target preset time period is obtained.
After the target preset time periods are determined, a search can be performed with the event keyword to obtain the news documents associated with the event keyword within each target preset time period, and all news documents within one target preset time period are then taken as one news document set.
After the news document set within the at least one target preset time period is obtained based on the event keyword, step S24 is executed.
In step S24, the event keyword is segmented to obtain at least one first segmented word.
The first segmented word refers to a word obtained by segmenting the event keyword. For example, if the event keyword is "Kung Pao chicken and leek egg", segmentation yields "Kung Pao chicken", "leek egg" and "and"; the word "and" is a conjunction and can simply be ignored, so "Kung Pao chicken" and "leek egg" can be taken directly as the first segmented words.
Word segmentation can use any segmentation technique that is common in the prior art, depending on business requirements; the embodiments of the present disclosure place no limitation on this.
After the event keyword is segmented to obtain at least one first segmented word, step S25 is executed.
In step S25, at least one first term frequency of the at least one first segmented word in the multiple news documents, and at least one first inverse document frequency of the at least one first segmented word in all the news document sets, are calculated.
The first term frequency (Term Frequency, TF) refers to the frequency with which a first segmented word occurs in each news document. For example, if the first segmented word is word a and the news documents include document 1 and document 2, the first term frequencies are the frequency with which word a occurs in document 1 and the frequency with which word a occurs in document 2. The first term frequency can be obtained by dividing the number of times the first segmented word occurs in a news document by the total number of words in that document.
The first inverse document frequency (Inverse Document Frequency, IDF) reflects how widely a first segmented word occurs across all the news document sets, that is, across all the news documents: the fewer documents contain the word, the higher the value. The first inverse document frequency can be obtained by dividing the total number of news documents by the number of news documents containing the first segmented word and then taking the logarithm of the quotient.
It can be understood that the calculation of term frequency and inverse document frequency is a mature technique in the art, and the embodiments of the present disclosure do not describe it in further detail here.
After the event keyword is segmented to obtain at least one first segmented word, the first term frequency of each first segmented word in each news document can be calculated, together with the first inverse document frequency of each first segmented word in all the news document sets.
It can be understood that each first segmented word corresponds to one first term frequency in each news document, and to one first inverse document frequency in all the news document sets.
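The two quantities of step S25 can be sketched in Python as follows. This is a minimal illustration of the textbook definitions stated above (count divided by document length, and logarithm of total documents over containing documents). The function names, the token-list representation of a document, and the choice to return 0.0 for a word that occurs in no document are assumptions, since the disclosure does not address that case.

```python
import math


def term_frequency(term, doc_tokens):
    """Occurrences of term in one document divided by the document's total word count."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def inverse_document_frequency(term, all_docs_tokens):
    """log(total number of documents / number of documents containing term)."""
    containing = sum(1 for tokens in all_docs_tokens if term in tokens)
    if containing == 0:
        # Assumption: the disclosure does not say how to treat a word that never occurs.
        return 0.0
    return math.log(len(all_docs_tokens) / containing)


# Two toy "news documents", already segmented into words.
docs = [["a", "b", "a"], ["b", "c"]]
tf_a_doc1 = term_frequency("a", docs[0])        # 2 occurrences / 3 words
idf_a = inverse_document_frequency("a", docs)   # log(2 / 1)
idf_b = inverse_document_frequency("b", docs)   # log(2 / 2) = 0.0
```

A word such as "b" that occurs in every document gets an inverse document frequency of zero, so it contributes nothing to distinguishing one document from another.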
In step S26, for all the news documents, each news document is segmented to obtain multiple second segmented words.
The second segmented words refer to the words obtained by segmenting the document text of each news document. It can be understood that each news document contains multiple words, so segmenting each news document yields the multiple second segmented words corresponding to that news document.
After each of the news documents is segmented to obtain multiple second segmented words, step S27 is executed.
In step S27, multiple second term frequencies of the multiple second segmented words in each news document, and multiple second inverse document frequencies of the multiple second segmented words in all the news document sets, are calculated.
The second term frequency refers to the frequency with which a second segmented word of a news document occurs in that news document. It is calculated in the same way as the first term frequency described above, and the embodiments of the present disclosure do not repeat the details here.
The second inverse document frequency reflects how widely a second segmented word occurs across all the news document sets, that is, across all the news documents. The second inverse document frequency can be obtained by dividing the total number of news documents by the number of news documents containing the second segmented word and then taking the logarithm of the quotient.
After each news document is segmented to obtain multiple second segmented words, the second term frequency of each second segmented word in its news document and the second inverse document frequency of each second segmented word in all the news documents can be calculated, and step S28 is then executed.
In step S28, the document matrix corresponding to each news document is constructed according to the multiple second term frequencies and the multiple second inverse document frequencies.
The document matrix refers to the matrix corresponding to each news document. It is constructed from the second term frequencies and the second inverse document frequencies of the multiple second segmented words in that news document.
After the second term frequency of each second segmented word in its news document and the second inverse document frequency of each second segmented word are calculated, a document matrix can be constructed for each news document by combining the second term frequency and the second inverse document frequency of each of its second segmented words.
The document matrix can be built with any matrix construction scheme that is common in the prior art, and the embodiments of the present disclosure do not describe it in further detail here.
The document matrix reflects the frequency with which each segmented word of a news document occurs in that news document, and the inverse document frequency of each segmented word in all the news documents of the at least one news document set.
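One possible layout for the document matrix of step S28 is sketched below in Python. The disclosure leaves the construction scheme open, so storing each document as a mapping from a segmented word to its (term frequency, inverse document frequency) pair is purely an assumption made for illustration, as is the name `build_document_matrices`.

```python
import math
from collections import Counter


def build_document_matrices(docs_tokens):
    """For each document, map every segmented word to its (TF, IDF) pair.

    docs_tokens: list of documents, each a list of already segmented words.
    The dict-of-pairs layout is an illustrative assumption.
    """
    total_docs = len(docs_tokens)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(token for tokens in docs_tokens for token in set(tokens))
    matrices = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        matrices.append({
            token: (count / len(tokens), math.log(total_docs / doc_freq[token]))
            for token, count in counts.items()
        })
    return matrices


docs = [["leek", "egg", "egg"], ["egg", "news"]]
matrices = build_document_matrices(docs)
# matrices[0]["leek"] holds (1/3, log(2)); matrices[0]["egg"] holds (2/3, 0.0).
```

Each entry records both values the text says the matrix should reflect: how often the word occurs in its own document, and its inverse document frequency over all documents.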
In step S29, the relevance score of each news document is determined according to the at least one first term frequency, the at least one first inverse document frequency and the document matrix.
After the at least one first term frequency, the at least one first inverse document frequency and the document matrices are obtained in the above process, the relevance score of each news document can be determined from them.
The process of determining the relevance score of each news document is described in detail in the following specific implementation.
In a specific implementation of the present disclosure, the above step S29 may include:
Sub-step A1: calculating the similarity value between the at least one first term frequency and the at least one first inverse document frequency, and the news document corresponding to each document matrix.
The similarity value refers to the similarity between the term frequencies and inverse document frequencies of the event keyword and the news document corresponding to a document matrix, that is, the similarity between the event keyword and each news document.
After the first term frequency and the first inverse document frequency of the at least one first segmented word of the event keyword are calculated, the one or more second segmented words that are similar or identical to the at least one first segmented word can be found among the multiple second segmented words of a news document. The first term frequency and first inverse document frequency of each first segmented word are then combined with the second term frequency and second inverse document frequency recorded in that news document's document matrix for the similar or identical second segmented words. Based on the similarity between the first and second segmented words and on these four quantities, a first similarity value between the at least one first term frequency and each document matrix, and a second similarity value between the at least one first inverse document frequency and each document matrix, are calculated. The similarity value between the event keyword and the news document can then be obtained by multiplying the first similarity value and the second similarity value by their respective weights and summing the products.
After the similarity value of each news document is calculated, sub-step A2 is executed.
Sub-step A2: taking the similarity value as the relevance score of each news document.
After the similarity value between each news document and the event keyword is calculated, the similarity value can be taken as the relevance score of that news document.
It can be understood that the above is only one way of calculating the relevance score of a news document, given for a better understanding of the technical solution of the embodiments of the present disclosure. In a concrete implementation, those skilled in the art may calculate the relevance score of a news document in other ways, and the embodiments of the present disclosure place no limitation on this.
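The weighted combination described in sub-steps A1 and A2 can be sketched in Python as follows. The disclosure does not give a formula for the first and second similarity values, so several things here are assumptions made only to obtain a runnable illustration: only identical words are matched, each similarity value is a sum of products over the matched words, and the two weights default to 0.5. The function and parameter names are likewise hypothetical.

```python
def relevance_score(first_stats, doc_matrix, tf_weight=0.5, idf_weight=0.5):
    """Weighted sum of a TF-based and an IDF-based similarity value.

    first_stats: dict mapping each first segmented word to (first TF, first IDF).
    doc_matrix:  dict mapping each second segmented word to (second TF, second IDF).
    The sum-of-products form and the default weights are assumptions; the source
    only states that two similarity values are weighted and added.
    """
    tf_similarity = 0.0
    idf_similarity = 0.0
    for token, (first_tf, first_idf) in first_stats.items():
        if token not in doc_matrix:
            continue  # this sketch matches identical words only
        second_tf, second_idf = doc_matrix[token]
        tf_similarity += first_tf * second_tf
        idf_similarity += first_idf * second_idf
    return tf_weight * tf_similarity + idf_weight * idf_similarity


first_stats = {"leek": (0.5, 1.0)}
doc_matrix = {"leek": (0.5, 1.0), "egg": (0.5, 0.0)}
score = relevance_score(first_stats, doc_matrix)
# TF similarity = 0.5 * 0.5 = 0.25, IDF similarity = 1.0 * 1.0 = 1.0,
# so the score is 0.5 * 0.25 + 0.5 * 1.0 = 0.625.
```

A document that shares no word with the event keyword receives a score of 0.0 under this sketch, so it would rank last in the subsequent extraction step.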
After the relevance score of each news document is determined according to the at least one first term frequency, the at least one first inverse document frequency and the document matrix, step S210 is executed.
In step S210, the top N news documents with the highest scores are extracted from the multiple news documents according to the relevance scores.
In the embodiments of the present disclosure, N is a positive integer greater than or equal to 1, for example 1, 3, 8 or 12; the specific value depends on the actual situation, and the embodiments of the present disclosure place no limitation on this.
After the relevance score of each news document in the at least one news document set is obtained, the N news documents with the highest scores can be extracted from the multiple news documents according to the relevance score of each news document. For example, suppose N is 3 and the news documents are news a, news b, news c, news d and news e, with relevance scores of 0.9, 0.2, 0.6, 0.8 and 0.9 respectively. Extracting the 3 highest-scoring news documents then yields news a, news e and news d.
It can be understood that the above example is given only for a better understanding of the embodiments of the present disclosure and is not the sole limitation on the embodiments of the present disclosure.
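The extraction of step S210 amounts to sorting by relevance score and keeping the first N entries. The Python sketch below reproduces the example above; the function name and the dictionary layout are assumptions for illustration. Ties (news a and news e both score 0.9) keep their original order because Python's sort is stable; the disclosure does not specify any tie-breaking rule.

```python
def top_n_documents(scores, n):
    """Return the names of the n documents with the highest relevance scores.

    scores: dict mapping a document name to its relevance score.
    Documents with equal scores keep their insertion order (stable sort).
    """
    return sorted(scores, key=scores.get, reverse=True)[:n]


scores = {"news a": 0.9, "news b": 0.2, "news c": 0.6, "news d": 0.8, "news e": 0.9}
top3 = top_n_documents(scores, 3)
print(top3)  # ['news a', 'news e', 'news d']
```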
After the top N news documents with the highest scores are extracted from the multiple news documents according to the relevance scores, step S211 is executed.
In step S211, the document text corresponding to the top N news documents is input into a pre-trained summary text network model.
The summary text network model refers to a network model used to extract the summary text of a document text.
After the top N news documents with the highest scores are extracted from the multiple news documents, the document text corresponding to the top N news documents can be input into the summary text network model, which then outputs the summary text.
The process of inputting the document text is described in detail in the following specific implementation.
In a specific implementation of the present disclosure, the above step S211 may include:
Sub-step B1: for the top N news documents, splitting the document text of each news document in turn by sentence format to obtain multiple sentence-format document texts.
In the embodiments of the present disclosure, a sentence-format document text refers to the sentence-format text obtained by splitting a document text sentence by sentence. For example, suppose the document text of a news document is: "The report points out that, since last August, the Japan Maritime Self-Defense Force has again held a joint exercise with a US aircraft carrier in the South China Sea. In addition to the "Izumo", the Maritime Self-Defense Force also dispatched the escort ships "Murasame" and "Akebono" for this exercise. They formed a fleet with the USS Ronald Reagan and carried out tactical navigation drills." This document text contains three sentences. When split by sentence format, it can be split into "The report points out that, since last August, the Japan Maritime Self-Defense Force has again held a joint exercise with a US aircraft carrier in the South China Sea.", "In addition to the "Izumo", the Maritime Self-Defense Force also dispatched the escort ships "Murasame" and "Akebono" for this exercise." and "They formed a fleet with the USS Ronald Reagan and carried out tactical navigation drills.", that is, each sentence is taken as one sentence-format document text.
It can be understood that the above example is given only for a better understanding of the technical solution of the embodiments of the present disclosure and is not the sole limitation on the embodiments of the present disclosure.
After the top N news documents are obtained, each of them can be split in turn by sentence format to obtain the multiple sentence-format document texts corresponding to each news document.
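The sentence splitting of sub-step B1 can be sketched in Python with a regular expression. The disclosure does not say how sentence boundaries are detected, so treating the Chinese and Western full stop, exclamation mark and question mark as sentence terminators is an assumption, and a simple rule like this will also split inside abbreviations or decimal numbers. The function name is hypothetical.

```python
import re

# A run of non-terminator characters followed by any terminators that close it.
# The set of terminators is an illustrative assumption.
_SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]*")


def split_sentences(document_text):
    """Split a document text into sentence-format texts, dropping empty pieces."""
    pieces = (piece.strip() for piece in _SENTENCE.findall(document_text))
    return [piece for piece in pieces if piece]


text = (
    "They formed a fleet with the aircraft carrier. "
    "They carried out tactical navigation drills!"
)
sentences = split_sentences(text)
# sentences holds two items, one per sentence, each keeping its final punctuation.
```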
After the document text of each of the top N news documents is split in turn by sentence format to obtain multiple sentence-format document texts, sub-step B2 is executed.
Sub-step B2: inputting the multiple sentence-format document texts into the summary text network model.
After the multiple sentence-format document texts of the document text of each news document are obtained, the sentence-format document texts of each news document can be input in turn into the summary text network model, which subsequently outputs the abstract text corresponding to each sentence-format document text.
Once the document text corresponding to the top N news documents has been input into the trained summary text network model, step S212 is executed.
In step S212, the summary text corresponding to the top N news documents that is output by the summary text network model is received.
After the top N news documents are input into the summary text network model, the model can extract the summary text corresponding to the top N news documents. Specifically, the document text of one of the top N news documents is first input into the summary text network model, which outputs the summary text corresponding to that news document, and this summary text is taken as the abstract text of that news document. The same process is then carried out in turn for the other news documents among the top N, so as to obtain the abstract text of each of the top N news documents.
The process of outputting the summary text is described in detail in the following specific implementation.
In another specific implementation of the present disclosure, the above step S212 may include:
Sub-step C1: receiving the multiple sentence-format abstract texts, output by the summary text network model, that correspond to the sentence-format document texts.
In the embodiments of the present disclosure, a sentence-format abstract text refers to the abstract text corresponding to one sentence-format document text.
After the multiple sentence-format document texts are input into the summary text network model, the model can output the abstract text corresponding to each sentence-format document text, that is, the sentence-format abstract text. Continuing the above example, from the sentence "The report points out that, since last August, the Japan Maritime Self-Defense Force has again held a joint exercise with a US aircraft carrier in the South China Sea", the important abstract texts can be obtained, such as "Japan and the US" and "joint exercise held in the South China Sea".
It can be understood that the above example is given only for a better understanding of the technical solution of the embodiments of the present disclosure and is not the sole limitation on the embodiments of the present disclosure.
After the multiple sentence-format abstract texts corresponding to the sentence-format document texts and output by the summary text network model are received, sub-step C2 is executed.
Sub-step C2: merging the multiple sentence-format abstract texts to obtain the summary text.
After the multiple sentence-format abstract texts of the multiple sentence-format document texts of each news document are obtained, they can be merged to obtain the summary text. For example, suppose news document A includes the sentence-format document texts text a, text b, text c and text d, whose sentence-format abstract texts are text 1, text 2, text 3 and text 4 respectively. Text 1, text 2, text 3 and text 4 can then be merged to obtain the summary text.
It can be understood that the above example is given only for a better understanding of the technical solution of the embodiments of the present disclosure and is not the sole limitation on the embodiments of the present disclosure.
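Sub-steps C1 and C2 can be sketched in Python as a per-sentence call followed by a merge. The summary text network model itself is not specified in the disclosure, so it is represented here by an arbitrary callable, and the stand-in `fake_model` below is a placeholder that only makes the sketch runnable; it is not a summarization model. Joining the sentence-format abstract texts with a space is also an assumption, since the disclosure only says they are merged.

```python
def summarize_document(sentences, summarize_sentence, separator=" "):
    """Summarize each sentence-format text, then merge the results into one summary text.

    summarize_sentence: any callable standing in for the summary text network model.
    Empty per-sentence outputs are skipped when merging.
    """
    sentence_abstracts = [summarize_sentence(sentence) for sentence in sentences]
    return separator.join(abstract for abstract in sentence_abstracts if abstract)


def fake_model(sentence):
    """Placeholder for the trained model: maps text a..d to text 1..4 as in the example."""
    lookup = {"text a": "text 1", "text b": "text 2", "text c": "text 3", "text d": "text 4"}
    return lookup.get(sentence, "")


summary = summarize_document(["text a", "text b", "text c", "text d"], fake_model)
print(summary)  # text 1 text 2 text 3 text 4
```

Passing the model in as a callable keeps the merge logic independent of whichever network is actually trained.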
By aggregating the retrieved news and producing a text summary, the embodiments of the present disclosure greatly improve the efficiency of public-opinion and news editors.
In addition to the beneficial effects of the document processing method provided in Embodiment one above, the document processing method provided by this embodiment of the present disclosure can aggregate the retrieved news and produce a text summary, which greatly improves the efficiency of public-opinion and news editors.
Embodiment three
Fig. 3 is a block diagram of a document processing device according to an exemplary embodiment. Referring to Fig. 3, the device includes a news document set acquisition module 131, a relevance score determination module 132, a news document extraction module 133 and an abstract text determination module 134.
The news document set acquisition module 131 is configured to obtain at least one news document set, in units of a preset time period, corresponding to an event keyword;
The relevance score determination module 132 is configured to determine the relevance scores corresponding to the multiple news documents based on the event keyword and the multiple news documents in the at least one news document set;
The news document extraction module 133 is configured to extract, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, N being a positive integer greater than or equal to 1;
The abstract text determination module 134 is configured to determine, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents, and to take the summary text as the abstract text of the top N news documents.
The document processing device provided by the embodiments of the present disclosure obtains at least one news document set, in units of a preset time period, corresponding to an event keyword; determines the relevance scores of the multiple news documents based on the event keyword and the multiple news documents in the at least one news document set; extracts from the multiple news documents, according to the relevance scores, the top N news documents with the highest scores, N being a positive integer greater than or equal to 1; determines, according to the document text of the top N news documents, the summary text corresponding to them; and takes the summary text as the abstract text of the top N news documents. The embodiments of the present disclosure can thus extract news document sets in units of a preset time period (such as a year or a quarter), score every news document automatically, and extract the N highest-scoring news documents according to the relevance score of each news document, which avoids information redundancy and requires no manual involvement. In addition, the corresponding abstract text can be extracted for the top N news documents for subsequent public-opinion monitoring or information integration, so that news need not be checked manually item by item, which reduces labor costs.
Embodiment four
Fig. 4 is a block diagram of a document processing device according to an exemplary embodiment. Referring to Fig. 4, the device includes a news document set acquisition module 141, a relevance score determination module 142, a news document extraction module 143 and an abstract text determination module 144.
The news document set acquisition module 141 is configured to obtain at least one news document set, in units of a preset time period, corresponding to an event keyword;
The relevance score determination module 142 is configured to determine the relevance scores corresponding to the multiple news documents based on the event keyword and the multiple news documents in the at least one news document set;
The news document extraction module 143 is configured to extract, according to the relevance scores, the top N news documents with the highest scores from the multiple news documents, N being a positive integer greater than or equal to 1;
The abstract text determination module 144 is configured to determine, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents, and to take the summary text as the abstract text of the top N news documents.
In a specific implementation of the present disclosure, the news document set acquisition module 141 includes:
A popularity weight determination submodule 1411, configured to determine, based on the news event corresponding to the event keyword, the popularity weights of the multiple preset time periods associated with the news event;
A target time period extraction submodule 1412, configured to extract, from the multiple preset time periods, at least one target preset time period whose popularity weight is greater than a weight threshold;
A news document set acquisition submodule 1413, configured to obtain, based on the event keyword, the news document set within the at least one target preset time period.
In a specific implementation of the present disclosure, the relevance score determination module 142 includes:
A first segmented word acquisition submodule 1421, configured to segment the event keyword to obtain at least one first segmented word;
A first term frequency calculation submodule 1422, configured to calculate at least one first term frequency of the at least one first segmented word in the multiple news documents, and at least one first inverse document frequency of the at least one first segmented word in all the news document sets;
A second segmented word acquisition submodule 1423, configured to segment, for all the news documents, each news document to obtain multiple second segmented words;
A second term frequency calculation submodule 1424, configured to calculate multiple second term frequencies of the multiple second segmented words in each news document, and multiple second inverse document frequencies of the multiple second segmented words in all the news document sets;
A document matrix construction submodule 1425, configured to construct the document matrix corresponding to each news document according to the multiple second term frequencies and the multiple second inverse document frequencies;
A relevance score determination submodule 1426, configured to determine the relevance score of each news document according to the at least one first term frequency, the at least one first inverse document frequency and the document matrix.
In a specific implementation of the present disclosure, the relevance score determination submodule 1426 includes:
A similarity value calculation submodule, configured to calculate the similarity value between the at least one first term frequency and the at least one first inverse document frequency, and the news document corresponding to each document matrix;
A relevance score acquisition submodule, configured to take the similarity value as the relevance score of each news document.
In a specific implementation of the present disclosure, the abstract text determination module 144 includes:
A document text input submodule 1441, configured to input the document text corresponding to the top N news documents into a pre-trained summary text network model;
A summary text receiving submodule 1442, configured to receive the summary text corresponding to the top N news documents that is output by the summary text network model.
In a specific implementation of the present disclosure, the document text input submodule 1441 includes:
A sentence-format text acquisition submodule, configured to split, for the top N news documents, the document text of each news document in turn by sentence format to obtain multiple sentence-format document texts;
A sentence-format text input submodule, configured to input the multiple sentence-format document texts into the summary text network model;
The summary text receiving submodule 1442 includes:
A sentence-format abstract receiving submodule, configured to receive the multiple sentence-format abstract texts, output by the summary text network model, that correspond to the sentence-format document texts;
A summary text acquisition submodule, configured to merge the multiple sentence-format abstract texts to obtain the summary text.
In addition to the beneficial effects of the document processing device provided in Embodiment three above, the document processing device provided by this embodiment of the present disclosure can aggregate the retrieved news and produce a text summary, which greatly improves the efficiency of public-opinion and news editors.
As for the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
In addition, the embodiments of the present disclosure further provide an electronic device, comprising: a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the document processing method of either Embodiment one or Embodiment two.
Fig. 5 is a block diagram of a device 800 for text processing according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 5, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814 and a communication component 816.
The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operations and recording operations. The processing component 802 may include one or more processors 820 to execute instructions so as to perform all or part of the steps of the methods described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phone book data, messages, pictures, videos and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.
The power component 806 supplies power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the device 800 is in an operation mode, such as a call mode, a recording mode or a voice recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a loudspeaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the device 800. For example, the sensor component 814 can detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the device 800. The sensor component 814 can also detect a change in position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In the exemplary embodiment, device 800 can be believed by one or more application specific integrated circuit (ASIC), number Number processor (DSP), digital signal processing appts (DSPD), programmable logic device (PLD), field programmable gate array (FPGA), controller, microcontroller, microprocessor or other electronic components are realized, for executing the above method.
In the exemplary embodiment, a kind of non-transitorycomputer readable storage medium including instruction, example are additionally provided It such as include the memory 804 of instruction, above-metioned instruction can be executed by the processor 820 of device 800 to complete the above method.For example, The non-transitorycomputer readable storage medium can be ROM, random access memory (RAM), CD-ROM, tape, floppy disk With optical data storage devices etc..
Fig. 6 is a block diagram of a text processing device 1900 according to an exemplary embodiment. For example, the device 1900 may be provided as a server. Referring to Fig. 6, the device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions to perform the above method: obtaining at least one news document set, organized by preset time period, corresponding to an event keyword; determining, based on the event keyword and a plurality of news documents in the at least one news document set, relevance scores corresponding to the plurality of news documents; extracting, according to the relevance scores, the top N news documents with the highest scores from the plurality of news documents, where N is a positive integer greater than or equal to 1; and determining, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents, and using the summary text as the summary of the top N news documents.
The device 1900 may also include a power supply module 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Those skilled in the art will readily conceive of other embodiments of the disclosure after considering the specification and practicing the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and examples are to be regarded as illustrative only, and the true scope and spirit of the disclosure are indicated by the following claims.
It should be understood that the disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the disclosure is limited only by the appended claims.

Claims (10)

1. A document processing method, comprising:
obtaining at least one news document set, organized by preset time period, corresponding to an event keyword;
determining, based on the event keyword and a plurality of news documents in the at least one news document set, relevance scores corresponding to the plurality of news documents;
extracting, according to the relevance scores, the top N news documents with the highest scores from the plurality of news documents, where N is a positive integer greater than or equal to 1;
determining, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents, and using the summary text as the summary of the top N news documents.
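The extraction step of claim 1 can be illustrated with a minimal Python sketch. It assumes the relevance scores have already been computed and are supplied as a list parallel to the documents; the function name `top_n_documents` is hypothetical and does not come from the patent.

```python
import heapq
from typing import List, Sequence


def top_n_documents(docs: Sequence[str], scores: Sequence[float], n: int) -> List[str]:
    """Return the n documents with the highest relevance scores.

    Assumption: docs[i] is scored by scores[i]; n is a positive integer.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    # Rank (score, index) pairs by score only, highest first.
    ranked = heapq.nlargest(n, zip(scores, range(len(docs))), key=lambda pair: pair[0])
    return [docs[index] for _, index in ranked]
```

If fewer than N documents are available, the sketch simply returns all of them in descending score order.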
2. The method according to claim 1, wherein the step of obtaining at least one news document set, organized by preset time period, corresponding to an event keyword comprises:
determining, based on the news event corresponding to the event keyword, popularity (heat) weights of a plurality of preset time periods associated with the news event;
extracting, from the plurality of preset time periods, at least one target preset time period whose popularity weight is greater than a weight threshold;
obtaining, based on the event keyword, the news document set within the at least one target preset time period.
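The time-period filtering of claim 2 might be sketched as follows. The claim does not state how the weight of a period is computed, so this sketch assumes, purely for illustration, that the weight is the period's share of all documents mentioning the event; the names `heat_weights` and `select_target_periods` are hypothetical.

```python
from typing import Dict, List


def heat_weights(doc_counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed weighting: each period's share of the total document count."""
    total = sum(doc_counts.values())
    if total == 0:
        return {period: 0.0 for period in doc_counts}
    return {period: count / total for period, count in doc_counts.items()}


def select_target_periods(weights: Dict[str, float], threshold: float) -> List[str]:
    """Keep only the periods whose weight is strictly greater than the threshold."""
    return [period for period, weight in weights.items() if weight > threshold]
```

Only the news document sets of the selected periods would then be fetched for the event keyword.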
3. The method according to claim 1, wherein the step of determining, based on the event keyword and a plurality of news documents in the at least one news document set, relevance scores corresponding to the plurality of news documents comprises:
segmenting the event keyword into words to obtain at least one first word segment;
calculating at least one first term frequency of the at least one first word segment in the plurality of news documents, and at least one first inverse document frequency of the at least one first word segment in the set of all news documents;
for all news documents, segmenting each news document into words to obtain a plurality of second word segments;
calculating a plurality of second term frequencies of the plurality of second word segments in each news document, and a plurality of second inverse document frequencies of the plurality of second word segments in the set of all news documents;
constructing, according to the plurality of second term frequencies and the plurality of second inverse document frequencies, a document matrix corresponding to each news document;
determining the relevance score of each news document according to the at least one first term frequency, the at least one first inverse document frequency, and the document matrix.
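The term-frequency and inverse-document-frequency steps of claim 3 can be sketched in Python as below. Several points are assumptions rather than details from the claim: documents are taken as already segmented into token lists (a real system would apply a Chinese word segmenter first), the smoothed IDF formula is one common choice, and the "document matrix" is represented as a sparse token-to-weight mapping.

```python
import math
from collections import Counter
from typing import Dict, List


def term_frequencies(tokens: List[str]) -> Dict[str, float]:
    """Relative frequency of each token within one document."""
    total = len(tokens)
    if total == 0:
        return {}
    return {token: count / total for token, count in Counter(tokens).items()}


def inverse_document_frequencies(corpus: List[List[str]]) -> Dict[str, float]:
    """Smoothed IDF over the whole corpus (assumed formula)."""
    n_docs = len(corpus)
    doc_freq: Counter = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each token once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF weights for one document; unseen tokens get weight 0."""
    return {t: tf * idf.get(t, 0.0) for t, tf in term_frequencies(tokens).items()}
```

The same `tfidf_vector` function can be applied to the segmented event keyword, which yields a query vector in the same space as the document vectors.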
4. The method according to claim 3, wherein the step of determining the relevance score of each news document according to the at least one first term frequency, the at least one first inverse document frequency, and the document matrix comprises:
calculating a similarity value between the at least one first term frequency together with the at least one first inverse document frequency, and the news document corresponding to each document matrix;
using the similarity value as the relevance score of each news document.
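Claim 4 does not name a particular similarity measure. As one plausible choice, the sketch below uses cosine similarity between a sparse keyword vector and a sparse document vector; both the measure and the dictionary representation are assumptions made for illustration.

```python
import math
from typing import Dict


def cosine_similarity(query: Dict[str, float], document: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors keyed by token."""
    dot = sum(weight * document.get(token, 0.0) for token, weight in query.items())
    query_norm = math.sqrt(sum(w * w for w in query.values()))
    doc_norm = math.sqrt(sum(w * w for w in document.values()))
    if query_norm == 0.0 or doc_norm == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (query_norm * doc_norm)
```

The returned value would serve directly as the relevance score of the document.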
5. The method according to claim 1, wherein the step of determining, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents comprises:
inputting the document text corresponding to the top N news documents into a pre-trained summary text network model;
receiving the summary text, corresponding to the top N news documents, output by the summary text network model.
6. The method according to claim 5, wherein the step of inputting the document text corresponding to the top N news documents into a pre-trained summary text network model comprises:
for the top N news documents, splitting the document text corresponding to each news document by sentence in turn, to obtain a plurality of sentence-format document texts;
inputting the plurality of sentence-format document texts into the summary text network model;
and the step of receiving the summary text, corresponding to the top N news documents, output by the summary text network model comprises:
receiving a plurality of sentence-format summary texts, corresponding to the sentence-format document texts, output by the summary text network model;
merging the plurality of sentence-format summary texts to obtain the summary text.
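The split, summarize, and merge flow of claims 5 and 6 can be sketched as follows. The trained summary text network model is not described in the claims, so it is represented here by a hypothetical callable `model` that maps a list of sentences to a list of summary sentences; the punctuation-based splitter and the space-joined merge are likewise simplifying assumptions.

```python
import re
from typing import Callable, List

# A sentence is a run of non-terminator characters plus any trailing terminators.
_SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]*")


def split_sentences(text: str) -> List[str]:
    """Split text on Chinese or Latin sentence-ending punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def summarize(documents: List[str], model: Callable[[List[str]], List[str]]) -> str:
    """Feed each document's sentences to the model and merge the outputs."""
    sentence_summaries: List[str] = []
    for text in documents:
        sentences = split_sentences(text)
        if sentences:
            sentence_summaries.extend(model(sentences))
    return " ".join(sentence_summaries)
```

For example, passing `lambda sentences: sentences[:1]` as the stand-in model keeps only the first sentence of each document. Note that this simple splitter would also break on decimal points, which a production segmenter would need to handle.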
7. A document processing device, comprising:
a news document set obtaining module, configured to obtain at least one news document set, organized by preset time period, corresponding to an event keyword;
a relevance score determining module, configured to determine, based on the event keyword and a plurality of news documents in the at least one news document set, relevance scores corresponding to the plurality of news documents;
a news document extraction module, configured to extract, according to the relevance scores, the top N news documents with the highest scores from the plurality of news documents, where N is a positive integer greater than or equal to 1;
a summary text determining module, configured to determine, according to the document text corresponding to the top N news documents, the summary text corresponding to the top N news documents, and to use the summary text as the summary of the top N news documents.
8. The device according to claim 7, wherein the news document set obtaining module comprises:
a popularity weight determining submodule, configured to determine, based on the news event corresponding to the event keyword, popularity (heat) weights of a plurality of preset time periods associated with the news event;
a target time period extraction submodule, configured to extract, from the plurality of preset time periods, at least one target preset time period whose popularity weight is greater than a weight threshold;
a news document set obtaining submodule, configured to obtain, based on the event keyword, the news document set within the at least one target preset time period.
9. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the document processing method according to any one of claims 1 to 6.
10. A non-transitory computer-readable storage medium, wherein when instructions in the storage medium are executed by a processor of a first terminal, the first terminal is enabled to perform the document processing method according to any one of claims 1 to 6.
CN201910517936.4A 2019-06-14 2019-06-14 Document processing method, device, electronic equipment and storage medium Pending CN110377808A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910517936.4A CN110377808A (en) 2019-06-14 2019-06-14 Document processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910517936.4A CN110377808A (en) 2019-06-14 2019-06-14 Document processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110377808A true CN110377808A (en) 2019-10-25

Family

ID=68248831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910517936.4A Pending CN110377808A (en) 2019-06-14 2019-06-14 Document processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110377808A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613296A (en) * 2020-12-07 2021-04-06 深圳价值在线信息科技股份有限公司 News importance degree acquisition method and device, terminal equipment and storage medium
CN114780712A (en) * 2022-04-06 2022-07-22 科技日报社 Quality evaluation-based news topic generation method and device
CN115391516A (en) * 2022-10-31 2022-11-25 成都飞机工业(集团)有限责任公司 Unstructured document extraction method, device, equipment and medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101446940A (en) * 2007-11-27 2009-06-03 北京大学 Method and device of automatically generating a summary for document set
US8131735B2 (en) * 2009-07-02 2012-03-06 Battelle Memorial Institute Rapid automatic keyword extraction for information retrieval and analysis
CN105930314A (en) * 2016-04-14 2016-09-07 清华大学 Text summarization generation system and method based on coding-decoding deep neural networks
CN106933878A (en) * 2015-12-30 2017-07-07 腾讯科技(北京)有限公司 A kind of information processing method and device
CN107169131A (en) * 2017-06-08 2017-09-15 广州优视网络科技有限公司 A kind of video searching method, device and server
CN107256251A (en) * 2017-06-08 2017-10-17 广州优视网络科技有限公司 A kind of application software searching method, device and server
CN107273476A (en) * 2017-06-08 2017-10-20 广州优视网络科技有限公司 A kind of article search method, device and server
CN107977420A (en) * 2017-11-23 2018-05-01 广东工业大学 The abstract extraction method, apparatus and readable storage medium storing program for executing of a kind of evolved document
CN108280112A (en) * 2017-06-22 2018-07-13 腾讯科技(深圳)有限公司 Abstraction generating method, device and computer equipment
CN108319668A (en) * 2018-01-23 2018-07-24 义语智能科技(上海)有限公司 Generate the method and apparatus of text snippet
CN109241272A (en) * 2018-07-25 2019-01-18 华南师范大学 A kind of Chinese text abstraction generating method, computer-readable storage media and computer equipment
CN109657051A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text snippet generation method, device, computer equipment and storage medium
CN109726281A (en) * 2018-12-12 2019-05-07 Tcl集团股份有限公司 A kind of text snippet generation method, intelligent terminal and storage medium

Similar Documents

Publication Publication Date Title
CN111461089B (en) Face detection method, and training method and device of face detection model
US11120078B2 (en) Method and device for video processing, electronic device, and storage medium
CN110781305B (en) Text classification method and device based on classification model and model training method
TWI728564B (en) Method, device and electronic equipment for image description statement positioning and storage medium thereof
CN109522419B (en) Session information completion method and device
CN109918669B (en) Entity determining method, device and storage medium
CN110008401B (en) Keyword extraction method, keyword extraction device, and computer-readable storage medium
WO2021027343A1 (en) Human face image recognition method and apparatus, electronic device, and storage medium
CN110377808A (en) Document processing method, device, electronic equipment and storage medium
CN111859020B (en) Recommendation method, recommendation device, electronic equipment and computer readable storage medium
CN103650035A (en) Identifying people that are proximate to a mobile device user via social graphs, speech models, and user context
CN109614482A (en) Processing method, device, electronic equipment and the storage medium of label
WO2022166069A1 (en) Deep learning network determination method and apparatus, and electronic device and storage medium
CN110399934A (en) A kind of video classification methods, device and electronic equipment
CN107133354A (en) The acquisition methods and device of description information of image
CN110069624A (en) Text handling method and device
CN108345625A (en) A kind of information mining method and device, a kind of device for information excavating
CN110929176A (en) Information recommendation method and device and electronic equipment
CN112101216A (en) Face recognition method, device, equipment and storage medium
CN116863286A (en) Double-flow target detection method and model building method thereof
CN111222316A (en) Text detection method, device and storage medium
CN111739535A (en) Voice recognition method and device and electronic equipment
CN106156299B (en) The subject content recognition methods of text information and device
CN112884040A (en) Training sample data optimization method and system, storage medium and electronic equipment
CN110177284A (en) Information displaying method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191025