CN113177117B - News material acquisition method and device, storage medium and electronic device - Google Patents

News material acquisition method and device, storage medium and electronic device

Info

Publication number
CN113177117B
Authority
CN
China
Prior art keywords
news
data
keyword set
keyword
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110292933.2A
Other languages
Chinese (zh)
Other versions
CN113177117A (en)
Inventor
程刚
张剑
王昕�
黄仁杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raixun Information Technology Co ltd
Original Assignee
Shenzhen Raixun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raixun Information Technology Co ltd filed Critical Shenzhen Raixun Information Technology Co ltd
Priority to CN202110292933.2A priority Critical patent/CN113177117B/en
Publication of CN113177117A publication Critical patent/CN113177117A/en
Application granted granted Critical
Publication of CN113177117B publication Critical patent/CN113177117B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 - Browsing; Visualisation therefor
    • G06F16/345 - Summarisation for human users
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a news material acquisition method and device, a storage medium and an electronic device. The method includes: collecting first news source data of a target topic from a specified data source in a source-limited manner; extracting a first keyword set from the first news source data, and collecting second news source data from a search engine in a non-source-limited manner based on the first keyword set; extracting a second keyword set from the second news source data; and generating news material of the target topic according to the first keyword set and the second keyword set. The invention solves the technical problem in the related art of low accuracy in collecting news material, improves the collection efficiency of multi-source text data, and reduces the redundancy of the news data.

Description

News material acquisition method and device, storage medium and electronic device
Technical Field
The invention relates to the field of computers, in particular to a news material acquisition method and device, a storage medium and an electronic device.
Background
In the related art, with the digitization of news, online news information has grown explosively. There is a huge amount of news content on the network; much of it is written differently by different reporters while still covering the same news topic in essence. Meanwhile, a large amount of online news is presented in article form, so people have to spend considerable extra time when they want to obtain a piece of news. With the rapid development of modern information technology and storage technology and the rapid spread of the internet, network news that takes the network as its carrier has emerged, and some false news may exist among it, so a key problem is how to distinguish and screen out news with high credibility. At the same time, the pace of modern work and life has accelerated, and enabling people to quickly grasp news information in a short time is another important problem. Facing these two challenging problems with manpower alone is not only inefficient but also difficult to achieve, so artificial intelligence technology capable of intelligently processing massive data has become a current research hotspot, has developed rapidly in recent years, and has given rise to a steady stream of systems based on it. With text summarization technology, a brief and reliable news summary can be obtained from news information that covers the same topic but differs in its descriptions, and people can quickly understand the news content through that summary.
In the related art, the crawling strategy adopted by artificial intelligence application systems in the data acquisition stage is single, and the acquired data may degrade the performance of subsequent machine learning. For news information data in particular, if false news is not handled during acquisition, it may enter the machine learning process and adversely affect the final application system.
In view of the above problems in the related art, no effective solution has been found at present.
Disclosure of Invention
The embodiment of the invention provides a news material acquisition method and device, a storage medium and an electronic device.
According to an embodiment of the present invention, a method for collecting news material is provided, including: collecting first news source data of a target topic from a specified data source in a source-limited manner; extracting a first keyword set from the first news source data, and collecting second news source data from a search engine in a non-source-limited manner based on the first keyword set; extracting a second keyword set from the second news source data; and generating news material of the target topic according to the first keyword set and the second keyword set.
Optionally, generating the news material of the target topic according to the first keyword set and the second keyword set includes: comparing the first keyword set with the second keyword set, and extracting a first common keyword set of the first keyword set and the second keyword set in a first acquisition cycle; judging whether the number of keywords in the first common keyword set is smaller than a preset threshold; if it is smaller than the preset threshold, outputting the first keyword set and the second keyword set as the news material of the target topic; and if it is greater than or equal to the preset threshold, continuing to iteratively extract keyword sets with the second keyword set as the starting keywords until the number of keywords in the nth common keyword set in the nth acquisition cycle after iteration is smaller than the preset threshold, where n is an integer greater than 0.
Optionally, in a second acquisition cycle, iteratively extracting keyword sets with the second keyword set as the starting keywords includes: collecting third news source data from the specified data source in a source-limited manner with the second keyword set as search keywords; extracting a third keyword set from the third news source data, and collecting fourth news source data from the search engine in a non-source-limited manner based on the third keyword set; and extracting a fourth keyword set from the fourth news source data.
Optionally, the method further includes: performing word segmentation on the news source data to obtain a word sequence; configuring first label information of the word sequence to generate a news data set, where the news data set includes the word sequence and the corresponding first label information, and the news source data includes the first news source data and the second news source data; recognizing the news data set with a target named entity recognition (NER) model and outputting entity information of the news data set, where the entity information includes a valid character sequence; and selecting news feature material matching the entity information from the news material.
Optionally, before recognizing the news data set with the target NER model, the method further includes: dividing the news data set into a training set, a verification set and a test set; and iteratively training an initial NER model with the training set, the verification set and the test set until the latest target NER model meets a preset condition.
Optionally, iteratively training the initial NER model with the training set, the verification set and the test set includes: segmenting the training set, the verification set and the test set into a first character sequence; taking the first character sequence as input data, extracting feature information of the first character sequence, and generating a feature vector set based on the feature information; extracting a hidden state sequence of the feature vector set with a bidirectional long short-term memory (BiLSTM) network, where the hidden state sequence contains character-to-character relation feature information; performing entity label detection on the characters in the first character sequence according to the hidden state sequence to obtain second label information, and generating third label information with a Viterbi algorithm according to the first label information and the second label information to obtain a second character sequence, where the second character sequence includes a word sequence and the corresponding third label information; and taking the second character sequence as input data and iteratively training the initial NER model until the NER model of the current iteration cycle meets a preset condition.
Optionally, extracting the hidden state sequence of the feature vector set with the BiLSTM network includes: extracting feature information of the words according to the feature vector set, and inputting the feature vectors corresponding to the words into a BiLSTM network, where the BiLSTM network includes a forward LSTM and a reverse LSTM; the forward LSTM produces a forward hidden state sequence from the input feature vectors, and the reverse LSTM produces a reverse hidden state sequence from the input feature vectors; and splicing the forward hidden state sequence and the reverse hidden state sequence to obtain a complete hidden state sequence.
According to another embodiment of the present invention, a news material collecting apparatus is provided, including: a first acquisition module, configured to collect first news source data of a target topic from a specified data source in a source-limited manner; a second acquisition module, configured to extract a first keyword set from the first news source data and collect second news source data from a search engine in a non-source-limited manner based on the first keyword set; an extraction module, configured to extract a second keyword set from the second news source data; and a generating module, configured to generate news material of the target topic according to the first keyword set and the second keyword set.
Optionally, the generating module includes: an extraction unit, configured to compare the first keyword set with the second keyword set and extract a first common keyword set of the first keyword set and the second keyword set in a first acquisition cycle; a judging unit, configured to judge whether the number of keywords in the first common keyword set is smaller than a preset threshold; and a processing unit, configured to output the first keyword set and the second keyword set as the news material of the target topic if the number of common keywords is smaller than the preset threshold, and, if the number of common keywords is greater than or equal to the preset threshold, to continue iteratively extracting keyword sets with the second keyword set as the starting keywords until the number of keywords in the nth common keyword set in the nth acquisition cycle after iteration is smaller than the preset threshold, where n is an integer greater than 0.
Optionally, for the second acquisition cycle, the processing unit includes: a first acquisition subunit, configured to collect third news source data from the specified data source in a source-limited manner with the second keyword set as search keywords; a second acquisition subunit, configured to extract a third keyword set from the third news source data and collect fourth news source data from the search engine in a non-source-limited manner based on the third keyword set; and an extraction subunit, configured to extract a fourth keyword set from the fourth news source data.
Optionally, the apparatus further comprises: the word segmentation module is used for carrying out word segmentation processing on the news source data to obtain a word sequence; a configuration module, configured to configure first tag information of the word sequence, and generate a news data set, where the news data set includes the word sequence and corresponding first tag information, and the news source data includes the first news source data and the second news source data; the identification module is used for identifying the news data set by adopting a target named entity identification NER model and outputting entity information of the news data set, wherein the entity information comprises an effective character sequence; and the selecting module is used for selecting news characteristic materials matched with the entity information from the news materials.
Optionally, the apparatus further comprises: the dividing module is used for dividing the news data set into a training set, a verification set and a test set before the identification module adopts a target NER model to identify the news data set; and the training module is used for iteratively training the initial NER model by adopting the training set, the verification set and the test set until the latest target NER model meets a preset condition.
Optionally, the training module includes: a segmentation unit, configured to segment the training set, the verification set and the test set into a first character sequence; a first extraction unit, configured to take the first character sequence as input data, extract feature information of the first character sequence, and generate a feature vector set based on the feature information; a second extraction unit, configured to extract a hidden state sequence of the feature vector set with a bidirectional long short-term memory (BiLSTM) network, where the hidden state sequence contains character-to-character relation feature information; a processing unit, configured to perform entity label detection on the characters in the first character sequence according to the hidden state sequence to obtain second label information, and to generate third label information with a Viterbi algorithm according to the first label information and the second label information to obtain a second character sequence, where the second character sequence includes a word sequence and the corresponding third label information; and a training unit, configured to take the second character sequence as input data and iteratively train the initial NER model until the NER model of the current iteration cycle meets a preset condition.
Optionally, the second extraction unit includes: an input subunit, configured to input the feature vectors corresponding to the words into a BiLSTM network according to the feature information of the words extracted from the feature vector set, where the BiLSTM network includes a forward LSTM and a reverse LSTM; an output subunit, configured to obtain a forward hidden state sequence from the forward LSTM according to the input feature vectors and a reverse hidden state sequence from the reverse LSTM according to the input feature vectors; and a splicing subunit, configured to splice the forward hidden state sequence and the reverse hidden state sequence to obtain a complete hidden state sequence.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, first news source data of a target topic is collected from a specified data source in a source-limited manner, a first keyword set is extracted from the first news source data, and second news source data is collected from a search engine in a non-source-limited manner based on the first keyword set; a second keyword set is extracted from the second news source data, and news material of the target topic is generated according to the first keyword set and the second keyword set. By combining the two collection modes, limited-source and non-limited-source, the collection of false news data can be prevented and the accuracy of the data and of news manuscripts is improved while the data volume is guaranteed, which solves the technical problem in the related art of low accuracy in collecting news material, improves the collection efficiency of multi-source text data, and reduces the redundancy of the news data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a news material collecting computer according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of collecting news material according to an embodiment of the present invention;
FIG. 3 is a data collection flow diagram of an embodiment of the present invention;
fig. 4 is a block diagram of a system for collecting news material according to an embodiment of the present invention;
fig. 5 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
The method provided in Embodiment 1 of the present application may be executed on a server, a computer, a mobile phone, or a similar computing device. Taking running on a server as an example, fig. 1 is a block diagram of the hardware structure of a server according to an embodiment of the present invention. As shown in fig. 1, the server may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the server. For example, the server may include more or fewer components than shown in fig. 1, or have a different configuration from that shown in fig. 1.
The memory 104 may be used to store a server program, for example, a software program and a module of application software, such as a server program corresponding to a news material collecting method in an embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the server program stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a server over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the server. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In this embodiment, a method for collecting news material is provided. Fig. 2 is a flowchart of a method for collecting news material according to an embodiment of the present invention; as shown in fig. 2, the flow includes the following steps:
Step S202, collecting first news source data of a target topic from a specified data source in a source-limited manner;
This embodiment relates to legally collecting data from specified websites by means of a collection technique. For example, when news information needs to be collected, collection scripts may be set up for news websites such as Sina News and Tencent News so as to collect news data.
Step S204, extracting a first keyword set from the first news source data, and collecting second news source data from a search engine in a non-source-limited manner based on the first keyword set;
This embodiment collects news data in two ways: limited-source data collection and non-limited-source data collection. Limited-source data collection means that news information in a specific field is gathered from a specified, limited number of data sources (such as large news websites); for example, news related to 5G technology is collected from the 5G topic pages of well-known large websites such as Tencent News and Sina News. Non-limited-source data refers to news information obtained by searching a search engine with the keywords, without restricting the data source.
Step S206, extracting a second keyword set from the second news source data;
Step S208, generating news material of the target topic according to the first keyword set and the second keyword set. The news material in this embodiment is a set comprising a plurality of keywords.
Through the above steps, first news source data of a target topic is collected from a specified data source in a source-limited manner, a first keyword set is extracted from the first news source data, and second news source data is collected from a search engine in a non-source-limited manner based on the first keyword set; a second keyword set is extracted from the second news source data, and news material of the target topic is generated according to the first keyword set and the second keyword set. By combining the two collection modes, limited-source and non-limited-source, the collection of false news data can be prevented and the accuracy of the data and of news manuscripts is improved while the data volume is guaranteed, which solves the technical problem in the related art of low accuracy in collecting news material, improves the collection efficiency of multi-source text data, and reduces the redundancy of the news data.
In one implementation of this embodiment, generating the news material of the target topic from the first keyword set and the second keyword set comprises:
S11, comparing the first keyword set with the second keyword set, and extracting a first common keyword set of the first keyword set and the second keyword set in a first acquisition cycle;
S12, judging whether the number of keywords in the first common keyword set is less than a preset threshold;
S13, if the number of common keywords is smaller than the preset threshold, outputting the first keyword set and the second keyword set as the news material of the target topic; if it is greater than or equal to the preset threshold, continuing to iteratively extract keyword sets with the second keyword set as the starting keywords until the number of keywords in the nth common keyword set in the nth acquisition cycle after iteration is smaller than the preset threshold, where n is an integer greater than 0.
In one embodiment, generating the news material of the target topic from the first keyword set and the second keyword set comprises: calculating the total number of keywords in the first keyword set and the second keyword set; if the total number of keywords is smaller than a first threshold, comparing the first keyword set with the second keyword set and extracting a first common keyword set of the first keyword set and the second keyword set in the first acquisition cycle; and if the number of common keywords is larger than a second threshold, outputting the first keyword set and the second keyword set as the news material of the target topic. In this embodiment, if the total number of keywords collected in the two ways is small and their coincidence rate is high, the news source data collected in the two ways is relatively reliable, and the common keywords are output as the news material of the target topic.
Optionally, the final keyword set may be determined iteratively: in the Nth round, news source data N of the target topic is collected from the specified data source in a source-limited manner, news source data N+1 is collected from the search engine in a non-source-limited manner based on news source data N, and the keywords in news source data N and news source data N+1 are compared; when the number of identical keywords falls below a preset value, the iteration stops and the keywords in news source data N and news source data N+1 are taken as the final keyword set, where N is a positive integer.
In one example, in the second acquisition cycle, iteratively extracting keyword sets with the second keyword set as the starting keywords comprises: collecting third news source data from a specified data source in a source-limited manner with the second keyword set as search keywords; extracting a third keyword set from the third news source data, and collecting fourth news source data from the search engine in a non-source-limited manner based on the third keyword set; and extracting a fourth keyword set from the fourth news source data.
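A minimal sketch of the stop criterion in steps S11 to S13 follows: the two keyword sets are compared, and the material is output once their overlap falls below the preset threshold; the threshold value used here is only a placeholder for the engineering parameter.

```python
# Sketch of the stop criterion in steps S11 to S13: when the two keyword sets share
# fewer than THRESHOLD keywords, output the material; otherwise keep iterating with
# the second keyword set as the new starting keywords. THRESHOLD is a placeholder.
THRESHOLD = 5

def common_keywords(first_set: set[str], second_set: set[str]) -> set[str]:
    return first_set & second_set

def should_output_material(first_set: set[str], second_set: set[str]) -> bool:
    return len(common_keywords(first_set, second_set)) < THRESHOLD
```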
Fig. 3 is a data acquisition flow chart of an embodiment of the present invention. The flow is used to acquire news data from the network and count the corresponding keywords, and it is encapsulated as a data acquisition module. The data acquisition flow includes:
S31, collecting limited-source data and initializing a keyword set key_list_0 = {kword_1, kword_2, …, kword_n}, where kword_n represents the nth keyword and key_list_0 represents the initial keyword set.
S32, collecting non-limited-source data according to the keyword set.
S33, performing data cleaning on the limited-source data and the non-limited-source data.
S34, extracting keywords with a keyword extraction technique and then updating the keyword set to obtain key_list_1, where _1 indicates that the keyword set results from the first update.
S35, repeating steps S31 to S33 and comparing the new and old keyword sets; for example, if the number of identical keywords in the old and new keyword sets is less than m (m is an engineering experience parameter), the iteration stops and the data acquisition module ends, the current limited-source data and non-limited-source data are kept as the news data set raw_data = {new_data_1, new_data_2, …, new_data_n}, where new_data_n represents the nth news item in the data set, and the current keyword set is kept as the final keyword list key_list = {kword_1, kword_2, …, kword_n}.
By using the two data collection modes, limited-source and non-limited-source, and crawling the data through multiple rounds of keyword-driven iteration, the acquisition of false news data can be prevented, and the accuracy of the data and of news manuscripts is improved while the data volume is guaranteed, as sketched below.
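The following is a minimal sketch of the multi-round acquisition loop of steps S31 to S35; because the patent does not define the collection, cleaning and keyword-extraction routines, they are passed in as functions, and m = 5 is only a placeholder for the engineering parameter.

```python
# Sketch of the multi-round acquisition loop (steps S31 to S35). The helper
# functions are injected because their implementations are not defined here.
from typing import Callable

def acquire_news_data(
    initial_keywords: set[str],
    collect_limited: Callable[[set[str]], list[str]],    # limited-source collection (S31)
    collect_unlimited: Callable[[set[str]], list[str]],  # non-limited-source collection (S32)
    clean: Callable[[list[str]], list[str]],             # data cleaning (S33)
    extract_keywords: Callable[[list[str]], set[str]],   # keyword extraction (S34)
    m: int = 5,                                          # engineering experience parameter (S35)
) -> tuple[list[str], set[str]]:
    key_list = set(initial_keywords)                     # key_list_0
    while True:
        raw_data = clean(collect_limited(key_list)) + clean(collect_unlimited(key_list))
        new_key_list = extract_keywords(raw_data)
        # S35: stop when the old and new keyword sets share fewer than m keywords.
        if len(key_list & new_key_list) < m:
            return raw_data, key_list | new_key_list     # news data set and final keyword list
        key_list = new_key_list
```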
In one implementation of this embodiment, the method further includes: performing word segmentation on the news source data to obtain a word sequence; configuring first label information of the word sequence to generate a news data set, where the news data set includes the word sequence and the corresponding first label information, and the news source data includes the first news source data and the second news source data; recognizing the news data set with a target named entity recognition (NER) model and outputting entity information of the news data set, where the entity information includes a valid character sequence; and selecting news feature material matching the entity information from the news material.
In this embodiment, examples of entities include person names, place names, organization names, and proper nouns. For example, in the sentence "Xiao Ming is in class at school", "Xiao Ming" is a person name and "school" is a place name; both "Xiao Ming" and "school" are entities of the sentence. For news information data, identifying the main entity components can effectively improve the efficiency of understanding the news information.
Optionally, before recognizing the news data set with the target NER model, the method further includes: dividing the news data set into a training set, a verification set and a test set; and iteratively training the initial NER model with the training set, the verification set and the test set until the latest target NER model meets a preset condition.
In one implementation of this embodiment, iteratively training the initial NER model with the training set, the verification set and the test set includes: segmenting the training set, the verification set and the test set into a first character sequence; taking the first character sequence as input data, extracting feature information of the first character sequence, and generating a feature vector set based on the feature information; extracting a hidden state sequence of the feature vector set with a bidirectional long short-term memory (BiLSTM) network, where the hidden state sequence contains character-to-character relation feature information; performing entity label detection on the characters in the first character sequence according to the hidden state sequence to obtain second label information, and generating third label information with a Viterbi algorithm according to the first label information and the second label information to obtain a second character sequence, where the second character sequence includes a word sequence and the corresponding third label information; and taking the second character sequence as input data and iteratively training the initial NER model until the NER model of the current iteration cycle meets a preset condition.
Optionally, extracting the hidden state sequence of the feature vector set with the BiLSTM network includes: extracting feature information of the words according to the feature vector set, and inputting the feature vectors corresponding to the words into a BiLSTM network, where the BiLSTM network includes a forward LSTM and a reverse LSTM; the forward LSTM produces a forward hidden state sequence from the input feature vectors, and the reverse LSTM produces a reverse hidden state sequence from the input feature vectors; and splicing the forward hidden state sequence and the reverse hidden state sequence to obtain a complete hidden state sequence.
Because the news data set obtained by the data acquisition module is in a text form which cannot be directly understood by a computer, the data set needs to be processed. The processing flow comprises the following steps: data labeling, data partitioning and model training. The following is a detailed description:
step A, data annotation: named entity recognition is a classification task and is mainly performed based on supervised learning, data are marked to be supervised data with labels, and the link is essential for the supervised learning. The data annotation methods that can be used for the recognition task vary according to the named entity. In this embodiment, for example, by using an entity tagging method of bio (begin inside out), there may be B-PER, I-PER representing a first character of a person and a non-first character of a person, B-LOC, I-LOC representing a first character of a place name and a non-first character of a place name, B-ORG, I-ORG representing a first character of an organization name and a non-first character of an organization name, etc., which all belong to entity tags, and O represents that the word does not belong to a part of a named entity and belongs to an invalid tag. Labeling example: horse [ B-PER ] cloud [ I-PER ] exit [0] agent [0] state [ B-ORG ] family [ I-ORG ] exhibition [ I-ORG ] convention [ I-ORG ]. Through the entity labeling mode, the original news data set raw _ data is processed into a news data set tag _ data with entity tags in each word. The news text firstly carries out Chinese word segmentation through a trained Chinese word segmentation model, obtains a word sequence of the news text after word segmentation, and then manually marks the word sequence to obtain a news entity label of the news text.
Step B, data division: the news data set tag_data annotated in step A) is divided into the following three parts:
(1) training set: used as the training data of the model;
(2) verification set: used to verify the performance of the model during training;
(3) test set: used to test the effect of the final model. The division ratio is A : B : C, that is, the training set accounts for A% of the news data set raw_data, the verification set accounts for B%, and the test set accounts for C%, where A, B and C are preset engineering experience parameters.
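For illustration, the data division can be sketched as follows; the 0.8/0.1/0.1 split is only a placeholder for the engineering experience parameters A, B and C.

```python
# Sketch of the data division in step B; the 0.8/0.1/0.1 split is only a
# placeholder for the engineering experience parameters A, B and C.
import random

def split_dataset(tag_data: list, a: float = 0.8, b: float = 0.1, seed: int = 42):
    """Divide the annotated news data set into training, verification and test sets."""
    data = list(tag_data)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * a)
    n_val = int(len(data) * b)
    train_set = data[:n_train]
    verification_set = data[n_train:n_train + n_val]
    test_set = data[n_train + n_val:]              # the remaining C% of the samples
    return train_set, verification_set, test_set
```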
Step C, model training: after the data has been preprocessed, it can be input into a named entity recognition model (the specific model is optional) for model training. The flow is described here taking a BERT (Bidirectional Encoder Representations from Transformers) + BiLSTM (Bi-directional Long Short-Term Memory) + CRF (Conditional Random Field) named entity recognition model as an example:
C.1) The news texts in the training set, verification set and test set obtained in step B) are divided into character sequences char_list_i = {char_1, char_2, …, char_n}, where char_list_i represents the character sequence of the ith news item and char_n represents the nth character.
C.2) The first layer of the model is the input layer. A Chinese pre-trained BERT model provided by Google (an optional choice of technology) is used as the input layer, i.e., BERT serves as the feature extractor. The character sequences segmented in C.1) are input into the input layer, which converts the text into a vector form understandable by a computer, yielding a feature vector set vector = {v_1, v_2, …, v_n}, where v_n represents the feature vector of the nth character.
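For illustration, extracting character feature vectors with a pre-trained Chinese BERT might look as follows; the Hugging Face transformers API and the "bert-base-chinese" checkpoint are assumptions, since this embodiment only requires a Chinese pre-trained BERT, and each Chinese character is assumed to map to a single token.

```python
# Sketch: a pre-trained Chinese BERT used as the feature extractor (input layer).
# The Hugging Face transformers API and the "bert-base-chinese" checkpoint are
# assumptions for illustration; the embodiment only requires a Chinese BERT.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def char_feature_vectors(chars: list[str]) -> torch.Tensor:
    """Return one feature vector per character (the set vector = {v_1, ..., v_n})."""
    # is_split_into_words=True treats the character list as one pre-split sequence;
    # it is assumed here that each Chinese character maps to exactly one token.
    enc = tokenizer(chars, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[0, 1:-1, :]      # drop [CLS]/[SEP]: shape (n, 768)

vectors = char_feature_vectors(list("马云出席国家展会"))
```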
C.3) The second layer of the model is the BiLSTM layer; LSTM refers to the long short-term memory artificial neural network model. The BiLSTM consists of a forward LSTM and a reverse LSTM. The BiLSTM training procedure is as follows:
(1) the BiLSTM extracts character feature information from the feature vector set obtained in C.2); the feature vectors corresponding to all characters of a news item are used as the input of the BiLSTM layer;
(2) the forward LSTM outputs a forward hidden state sequence h_forward = {h_forward_1, h_forward_2, …, h_forward_n} from the input feature vectors, where h_forward_n represents the forward hidden state of the nth character;
(3) the reverse LSTM outputs a reverse hidden state sequence h_reverse = {h_reverse_1, h_reverse_2, …, h_reverse_n} from the input feature vectors, where h_reverse_n represents the reverse hidden state of the nth character;
(4) the forward hidden state sequence h_forward and the reverse hidden state sequence h_reverse are spliced to obtain the complete hidden state sequence h = [h_forward, h_reverse], which contains the character-to-character relation feature information that the BiLSTM has learned from the character feature vectors.
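For illustration, the BiLSTM layer of steps (1) to (4) can be sketched with PyTorch as follows; nn.LSTM with bidirectional=True performs the forward pass, reverse pass and splicing internally, and the hidden size here is a placeholder.

```python
# Sketch of the BiLSTM layer: a forward and a reverse LSTM whose hidden states
# are spliced into the complete hidden state sequence h. The hidden size is a
# placeholder; the input size 768 matches the BERT feature vectors used above.
import torch
import torch.nn as nn

class BiLSTMLayer(nn.Module):
    def __init__(self, input_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        # bidirectional=True runs a forward and a reverse LSTM and concatenates
        # their hidden states at every character position.
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, char_vectors: torch.Tensor) -> torch.Tensor:
        # char_vectors: (batch, seq_len, input_dim), e.g. the BERT outputs
        h, _ = self.bilstm(char_vectors)
        return h                                   # (batch, seq_len, 2 * hidden_dim)

# h = BiLSTMLayer()(vectors.unsqueeze(0))          # one news item as a batch of one
```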
C.4) The last layer of the model is the CRF layer, where CRF refers to the conditional random field model. The CRF performs entity label prediction on the characters according to the obtained hidden state sequence h. When predicting the tag of a character, the CRF can use the information of the entity tags preceding that character, and the optimal entity-tag result is then obtained with the Viterbi algorithm, giving the character sequence annotated with entity tags.
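As an illustration of the Viterbi decoding used by the CRF layer, the following minimal sketch recovers the best tag path from per-character emission scores and tag-to-tag transition scores; both score matrices are assumed to come from the trained model.

```python
# Minimal Viterbi decoding sketch for the CRF layer: recover the best tag path
# from per-character emission scores and tag-to-tag transition scores.
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """emissions: (seq_len, n_tags); transitions: (n_tags, n_tags)."""
    seq_len, _ = emissions.shape
    score = emissions[0].copy()                    # best score of a path ending in each tag
    backptr = np.zeros_like(emissions, dtype=int)
    for t in range(1, seq_len):
        # total[i, j] = best path ending in tag i at t-1, then moving to tag j at t
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]                   # best final tag
    for t in range(seq_len - 1, 0, -1):            # trace the path backwards
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]                              # one tag index per character
```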
C.5) The model training steps C.2) to C.4) are repeated with the entity-tagged character sequence obtained in C.4) as input data until the iteration-ending condition is met. Finally, the NER model obtained when iteration ends is used to recognize the news data set and output its entity information, and the character sequences carrying valid entity tags, together with the final character feature vectors corresponding to them, are recorded.
The entity-tagged character sequences obtained in step C.5) are processed into word sequences, and the corresponding character feature vectors are added and averaged to obtain word feature vectors. The word sequences are matched against the keyword list key_list obtained earlier; the words that also appear in key_list are recorded as the entity word set entity_set = {entity_1, entity_2, …, entity_n}, where entity_n represents the feature keyword sequence of the nth news item together with the feature vectors corresponding to those feature keywords.
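A sketch of this post-processing follows: tagged characters are merged into entity words, their character vectors are averaged into word vectors, and only words that also appear in key_list are kept; the data layouts used here are assumptions for illustration.

```python
# Sketch of step C.5) post-processing: merge tagged characters into entity words,
# average their character vectors into word vectors, and keep only the words that
# also appear in the final keyword list key_list. Data layouts are assumptions.
import numpy as np

def entity_words_with_vectors(tagged_chars, char_vecs, key_list):
    """tagged_chars: [(char, tag), ...]; char_vecs: matching character vectors."""
    keywords = set(key_list)
    entities, word, vecs = [], "", []

    def flush():
        if word and word in keywords:
            entities.append((word, np.mean(vecs, axis=0)))   # word feature vector

    for (ch, tag), vec in zip(tagged_chars, char_vecs):
        if tag.startswith("B-"):                   # a new entity begins
            flush()
            word, vecs = ch, [vec]
        elif tag.startswith("I-") and word:        # the current entity continues
            word += ch
            vecs.append(vec)
        else:                                      # tag "O": close any open entity
            flush()
            word, vecs = "", []
    flush()
    return entities                                # the entity word set for one news item
```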
According to the scheme of this embodiment, the corresponding entities are extracted from the news data with named entity recognition technology, and the entity information and the news information are then used together as the clustering objects, so that not only the text-level similarity but also the entity-level key information can be exploited, which improves the clustering effect and yields more valuable news material.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, a news material collecting device is further provided for implementing the above embodiments and preferred embodiments, which have already been described and will not be described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a block diagram of a configuration of an apparatus for collecting news material according to an embodiment of the present invention, as shown in fig. 4, the apparatus including: a first acquisition module 40, a second acquisition module 42, an extraction module 44, a generation module 46, wherein,
the first acquisition module 40 is configured to collect first news source data of a target topic from a specified data source in a source-limited manner;
the second acquisition module 42 is configured to extract a first keyword set from the first news source data and to collect second news source data from a search engine in a non-source-limited manner based on the first keyword set;
the extraction module 44 is configured to extract a second keyword set from the second news source data;
the generation module 46 is configured to generate news material of the target topic according to the first keyword set and the second keyword set.
Optionally, the generating module includes: an extraction unit, configured to compare the first keyword set with the second keyword set and extract a first common keyword set of the first keyword set and the second keyword set in a first acquisition cycle; a judging unit, configured to judge whether the number of keywords in the first common keyword set is smaller than a preset threshold; and a processing unit, configured to output the first keyword set and the second keyword set as the news material of the target topic if the number of common keywords is smaller than the preset threshold, and, if the number of common keywords is greater than or equal to the preset threshold, to continue iteratively extracting keyword sets with the second keyword set as the starting keywords until the number of keywords in the nth common keyword set in the nth acquisition cycle after iteration is smaller than the preset threshold, where n is an integer greater than 0.
Optionally, for the second acquisition cycle, the processing unit includes: a first acquisition subunit, configured to collect third news source data from the specified data source in a source-limited manner with the second keyword set as search keywords; a second acquisition subunit, configured to extract a third keyword set from the third news source data and collect fourth news source data from the search engine in a non-source-limited manner based on the third keyword set; and an extraction subunit, configured to extract a fourth keyword set from the fourth news source data.
Optionally, the apparatus further comprises: the word segmentation module is used for carrying out word segmentation processing on the news source data to obtain a word sequence; a configuration module, configured to configure first tag information of the word sequence, and generate a news data set, where the news data set includes the word sequence and corresponding first tag information, and the news source data includes the first news source data and the second news source data; the identification module is used for identifying the news data set by adopting a target named entity identification NER model and outputting entity information of the news data set, wherein the entity information comprises an effective character sequence; and the selecting module is used for selecting news characteristic materials matched with the entity information from the news materials.
Optionally, the apparatus further comprises: the dividing module is used for dividing the news data set into a training set, a verification set and a test set before the identification module adopts a target NER model to identify the news data set; and the training module is used for iteratively training the initial NER model by adopting the training set, the verification set and the test set until the latest target NER model meets a preset condition.
Optionally, the training module includes: a segmentation unit, configured to segment the training set, the verification set and the test set into a first character sequence; a first extraction unit, configured to take the first character sequence as input data, extract feature information of the first character sequence, and generate a feature vector set based on the feature information; a second extraction unit, configured to extract a hidden state sequence of the feature vector set with a bidirectional long short-term memory (BiLSTM) network, where the hidden state sequence contains character-to-character relation feature information; a processing unit, configured to perform entity label detection on the characters in the first character sequence according to the hidden state sequence to obtain second label information, and to generate third label information with a Viterbi algorithm according to the first label information and the second label information to obtain a second character sequence, where the second character sequence includes a word sequence and the corresponding third label information; and a training unit, configured to take the second character sequence as input data and iteratively train the initial NER model until the NER model of the current iteration cycle meets a preset condition.
Optionally, the second extraction unit includes: an input subunit, configured to input the feature vectors corresponding to the words into a BiLSTM network according to the feature information of the words extracted from the feature vector set, where the BiLSTM network includes a forward LSTM and a reverse LSTM; an output subunit, configured to obtain a forward hidden state sequence from the forward LSTM according to the input feature vectors and a reverse hidden state sequence from the reverse LSTM according to the input feature vectors; and a splicing subunit, configured to splice the forward hidden state sequence and the reverse hidden state sequence to obtain a complete hidden state sequence.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 3
Fig. 5 is a structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 5, the electronic device includes a processor 51, a communication interface 52, a memory 53 and a communication bus 54, where the processor 51, the communication interface 52, and the memory 53 complete communication with each other through the communication bus 54, and the memory 53 is used for storing a computer program;
the processor 51 is configured to implement the following steps when executing the program stored in the memory 53: collecting first news source data of a target topic from a specified data source in a source-limited manner; extracting a first keyword set from the first news source data, and collecting second news source data from a search engine in a non-source-limited manner based on the first keyword set; extracting a second keyword set from the second news source data; and generating news material of the target topic according to the first keyword set and the second keyword set.
Optionally, generating the news material of the target topic according to the first keyword set and the second keyword set includes: comparing the first keyword set with the second keyword set, and extracting a first common keyword set of the first keyword set and the second keyword set in a first acquisition cycle; judging whether the number of keywords in the first common keyword set is smaller than a preset threshold; if it is smaller than the preset threshold, outputting the first keyword set and the second keyword set as the news material of the target topic; and if it is greater than or equal to the preset threshold, continuing to iteratively extract keyword sets with the second keyword set as the starting keywords until the number of keywords in the nth common keyword set in the nth acquisition cycle after iteration is smaller than the preset threshold, where n is an integer greater than 0.
Optionally, in a second acquisition cycle, iteratively extracting keyword sets with the second keyword set as the starting keywords includes: collecting third news source data from the specified data source in a source-limited manner with the second keyword set as search keywords; extracting a third keyword set from the third news source data, and collecting fourth news source data from the search engine in a non-source-limited manner based on the third keyword set; and extracting a fourth keyword set from the fourth news source data.
Optionally, the method further includes: performing word segmentation processing on the news source data to obtain a word sequence; configuring first label information of the word sequence to generate a news data set, wherein the news data set comprises the word sequence and corresponding first label information, and the news source data comprises the first news source data and the second news source data; identifying the news data set by adopting a target named entity identification NER model, and outputting entity information of the news data set, wherein the entity information comprises an effective character sequence; and selecting news characteristic materials matched with the entity information from the news materials.
Optionally, before identifying the news data set using the target NER model, the method further includes: dividing the news data set into a training set, a verification set and a test set; and iteratively training an initial NER model by adopting the training set, the verification set and the test set until the latest target NER model meets a preset condition.
Optionally, iteratively training an initial NER model using the training set, the validation set, and the test set, includes: segmenting the training set, the validation set, and the test set into a first sequence of characters; taking the first character sequence as input data, extracting feature information of the first character sequence, and generating a feature vector set based on the feature information; extracting a hidden state sequence of the feature vector set by adopting a bidirectional long-term short-term memory (BilSTM) network, wherein the hidden state sequence comprises character-to-character relation feature information; performing entity tag detection on characters in the first character sequence according to the hidden state sequence to obtain second tag information, and generating third tag information by adopting a Viterbi algorithm according to the first tag information and the second tag information to obtain a second character sequence, wherein the second character sequence comprises a word sequence and corresponding third tag information; and taking the second character sequence as input data, and iteratively training the initial NER model until the NER model of the current iteration period meets a preset condition.
Optionally, extracting the hidden state sequence of the feature vector set by using the BiLSTM network includes: extracting the feature information of each word according to the feature vector set, and inputting the feature vector corresponding to each word into the BiLSTM network, wherein the BiLSTM network includes a forward LSTM and a reverse LSTM; the forward LSTM produces a forward hidden state sequence from the input feature vectors, and the reverse LSTM produces a reverse hidden state sequence from the input feature vectors; and splicing the forward hidden state sequence and the reverse hidden state sequence to obtain a complete hidden state sequence.
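The splicing of the forward and reverse hidden state sequences can be made explicit by running two unidirectional LSTMs and concatenating their outputs along the feature dimension, as in the hypothetical PyTorch sketch below (dimensions are assumed); using a single `nn.LSTM` with `bidirectional=True` would be the equivalent shortcut.

```python
import torch
import torch.nn as nn


class ExplicitBiLSTM(nn.Module):
    """Forward and reverse LSTMs whose hidden state sequences are spliced."""

    def __init__(self, input_dim=64, hidden_dim=128):
        super().__init__()
        self.forward_lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.reverse_lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, feature_vectors):           # (batch, seq_len, input_dim)
        forward_states, _ = self.forward_lstm(feature_vectors)
        # Reverse the sequence, run the second LSTM, then restore the order.
        reversed_input = torch.flip(feature_vectors, dims=[1])
        reverse_states, _ = self.reverse_lstm(reversed_input)
        reverse_states = torch.flip(reverse_states, dims=[1])
        # Splice the two hidden state sequences along the feature dimension.
        return torch.cat([forward_states, reverse_states], dim=-1)
```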
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The memory may include a random access memory (RAM) or a non-volatile memory, for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided by the present application, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute the method for collecting news material as described in any one of the above embodiments.
In a further embodiment provided by the present application, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of gathering news material as described in any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, each embodiment is described with its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described apparatus embodiments are merely illustrative; for example, the division of the units is only a division of logical functions, and other divisions are possible in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (9)

1. A method for collecting news materials is characterized by comprising the following steps:
collecting first news source data of a target theme from a specified data source in a source-limited mode;
extracting a first keyword set in the first news source data, and acquiring second news source data from a search engine in a non-source-limited mode based on the first keyword set;
extracting a second keyword set in the second news source data;
generating news materials of the target theme according to the first keyword set and the second keyword set;
generating news materials of the target theme according to the first keyword set and the second keyword set comprises the following steps: comparing the first keyword set with the second keyword set, and extracting a first common related keyword set of the two sets in a first acquisition period; judging whether the number of the first common related keywords is smaller than a preset threshold; if the number of the first common related keywords is smaller than the preset threshold, outputting the first keyword set and the second keyword set as news materials of the target theme; and if the number of the first common related keywords is greater than or equal to the preset threshold, continuing to iteratively extract keyword sets with the second keyword set as the starting keyword, until the number of keywords in the nth common keyword set of the nth acquisition period is smaller than the preset threshold, wherein n is an integer greater than 0.
2. The method of claim 1, wherein, in a second acquisition period, continuing to iteratively extract keyword sets with the second keyword set as the starting keyword comprises:
collecting third news source data from the specified data source in a source-limited mode by taking the second keyword set as a search keyword;
extracting a third keyword set in the third news source data, and acquiring fourth news source data from the search engine in a non-source-limited mode based on the third keyword set;
extracting a fourth keyword set in the fourth news source data;
and comparing the third keyword set with the fourth keyword set, and extracting a second common keyword set of the third keyword set and the fourth keyword set in a second acquisition period.
3. The method of claim 1, further comprising:
performing word segmentation processing on the news source data to obtain a word sequence;
configuring first label information of the word sequence to generate a news data set, wherein the news data set comprises the word sequence and the corresponding first label information, and the news source data comprises the first news source data and the second news source data; identifying the news data set by using a target named entity recognition (NER) model, and outputting entity information of the news data set, wherein the entity information comprises an effective character sequence;
and selecting news characteristic materials matched with the entity information from the news materials.
4. The method of claim 3, wherein prior to identifying the news dataset using a target NER model, the method further comprises:
dividing the news data set into a training set, a validation set and a test set;
and iteratively training an initial NER model by using the training set, the validation set and the test set until the resulting target NER model meets a preset condition.
5. The method of claim 4, wherein iteratively training an initial NER model using the training set, the validation set, and the test set comprises:
segmenting the training set, the validation set, and the test set into a first sequence of characters;
taking the first character sequence as input data, extracting feature information of the first character sequence, and generating a feature vector set based on the feature information;
extracting a hidden state sequence of the feature vector set by using a bidirectional long short-term memory (BiLSTM) network, wherein the hidden state sequence comprises character-to-character relation feature information;
performing entity tag detection on characters in the first character sequence according to the hidden state sequence to obtain second label information, and generating third label information by using the Viterbi algorithm according to the first label information and the second label information to obtain a second character sequence, wherein the second character sequence comprises a word sequence and the corresponding third label information;
and taking the second character sequence as input data, and iteratively training the initial NER model until the NER model of the current iteration period meets a preset condition.
6. The method of claim 5, wherein extracting the hidden state sequence of the feature vector set by using the BiLSTM network comprises:
extracting the feature information of each word according to the feature vector set, and inputting the feature vector corresponding to each word into the BiLSTM network, wherein the BiLSTM network comprises a forward LSTM and a reverse LSTM;
the forward LSTM produces a forward hidden state sequence from the input feature vectors, and the reverse LSTM produces a reverse hidden state sequence from the input feature vectors;
and splicing the forward hidden state sequence and the reverse hidden state sequence to obtain a complete hidden state sequence.
7. A news material acquisition device, comprising:
the first acquisition module is used for collecting first news source data of a target theme from a specified data source in a source-limited mode;
the second acquisition module is used for extracting a first keyword set in the first news source data and acquiring second news source data from a search engine in a non-source-limited mode based on the first keyword set;
the extraction module is used for extracting a second keyword set in the second news source data;
the generating module is used for generating news materials of the target theme according to the first keyword set and the second keyword set;
wherein the generating module comprises: the extraction unit is used for comparing the first keyword set with the second keyword set and extracting a first common related keyword set of the two sets in a first acquisition period; the judging unit is used for judging whether the number of the first common related keywords is smaller than a preset threshold; and the processing unit is used for outputting the first keyword set and the second keyword set as news materials of the target theme if the number of the first common related keywords is smaller than the preset threshold, and, if the number of the first common related keywords is greater than or equal to the preset threshold, continuing to iteratively extract keyword sets with the second keyword set as the starting keyword until the number of keywords in the nth common keyword set of the nth acquisition period is smaller than the preset threshold, wherein n is an integer greater than 0.
8. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 6 when executed.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
CN202110292933.2A 2021-03-18 2021-03-18 News material acquisition method and device, storage medium and electronic device Active CN113177117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110292933.2A CN113177117B (en) 2021-03-18 2021-03-18 News material acquisition method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110292933.2A CN113177117B (en) 2021-03-18 2021-03-18 News material acquisition method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN113177117A CN113177117A (en) 2021-07-27
CN113177117B true CN113177117B (en) 2022-03-08

Family

ID=76922238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110292933.2A Active CN113177117B (en) 2021-03-18 2021-03-18 News material acquisition method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN113177117B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985229A (en) * 2019-05-21 2020-11-24 腾讯科技(深圳)有限公司 Sequence labeling method and device and computer equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294765A (en) * 2016-08-11 2017-01-04 乐视控股(北京)有限公司 Process the method and device of news data
US10387568B1 (en) * 2016-09-19 2019-08-20 Amazon Technologies, Inc. Extracting keywords from a document
CN108536414B (en) * 2017-03-06 2021-10-22 腾讯科技(深圳)有限公司 Voice processing method, device and system and mobile terminal
CN109753656A (en) * 2018-12-29 2019-05-14 咪咕互动娱乐有限公司 A kind of data processing method, device and storage medium
US10915756B2 (en) * 2019-07-02 2021-02-09 Baidu Usa Llc Method and apparatus for determining (raw) video materials for news

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985229A (en) * 2019-05-21 2020-11-24 腾讯科技(深圳)有限公司 Sequence labeling method and device and computer equipment

Also Published As

Publication number Publication date
CN113177117A (en) 2021-07-27

Similar Documents

Publication Publication Date Title
CN108959431B (en) Automatic label generation method, system, computer readable storage medium and equipment
US11645554B2 (en) Method and apparatus for recognizing a low-quality article based on artificial intelligence, device and medium
CN106649818B (en) Application search intention identification method and device, application search method and server
CN109872162B (en) Wind control classification and identification method and system for processing user complaint information
CN112148881B (en) Method and device for outputting information
CN112164391A (en) Statement processing method and device, electronic equipment and storage medium
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
CN110968684A (en) Information processing method, device, equipment and storage medium
CN107862058B (en) Method and apparatus for generating information
CN111859093A (en) Sensitive word processing method and device and readable storage medium
CN111177375A (en) Electronic document classification method and device
CN111708909A (en) Video tag adding method and device, electronic equipment and computer-readable storage medium
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN114398968B (en) Method and device for labeling similar customer-obtaining files based on file similarity
CN115953123A (en) Method, device and equipment for generating robot automation flow and storage medium
CN111177421A (en) Method and device for generating email historical event axis facing digital human
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN113569118B (en) Self-media pushing method, device, computer equipment and storage medium
CN116032741A (en) Equipment identification method and device, electronic equipment and computer storage medium
CN113626704A (en) Method, device and equipment for recommending information based on word2vec model
CN112417874A (en) Named entity recognition method and device, storage medium and electronic device
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN113177117B (en) News material acquisition method and device, storage medium and electronic device
CN112949299A (en) Method and device for generating news manuscript, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant