CN109492097A

CN109492097A - A method for risk classification of corporate news data

Info

Publication number: CN109492097A
Application number: CN201811239290.XA
Authority: CN
Inventors: 陈玮; 刘德彬; 孙世通; 吴万杰; 严开
Original assignee: Chongqing Yu Yu Da Data Technology Co Ltd
Current assignee: Chongqing Yucun Technology Co ltd
Priority date: 2018-10-23
Filing date: 2018-10-23
Publication date: 2019-03-19
Anticipated expiration: 2038-10-23
Also published as: CN109492097B

Abstract

The invention discloses a method for risk classification of corporate news data, comprising the following steps: obtaining the related attributes of a target enterprise according to its company name, combining the related attributes in pairs and searching with each pair as keywords to obtain news material related to the target enterprise, and extracting from the news material the sentences that contain the related attributes; inputting the sentences containing the related attributes into a CNN sentence classification model to obtain the sentence category of each sentence, the sentence category being positive or negative; weighting each sentence category, and taking the category with the larger weighted value as the news category of the current news item, the news category being positive or negative. The invention extracts sentences according to the enterprise entity and predicts the sentence categories, thereby predicting the category of the news material with respect to that entity.

Description

A method for risk classification of corporate news data

Technical field

The invention belongs to the technical field of data processing, and in particular relates to a method for risk classification of corporate news data.

Background art

At present, a large number of text classification models and sentiment analysis models exist, and the algorithms are comparatively mature. Existing text classification models and sentiment analysis models are mutually independent algorithms. The mainstream algorithms used for text classification are Bi-LSTM, CNN and FastText, which can be trained at the character level or the word level using the entire news article as training corpus data. Because the full text serves as the training corpus, a given news article receives only one category. When several companies appear in one article, however, the appropriate category may differ from company to company. For example, suppose an article reports negative information about company A and positive information about company B. Classifying the full text always yields a single category. That category may be correct for company A, but when the categories of company A and company B differ (negative for company A, positive for company B), the existing approach cannot label the same article differently for different entities. Sentiment analysis mostly uses the Bi-LSTM algorithm and usually outputs only the sentiment orientation of the whole article, namely a positive probability and a negative probability, with no finer-grained distinction of sentiment categories. Relying entirely on a single model therefore makes accuracy highly dependent on the preparation of the news corpus data; and because news styles vary widely, with the same story possibly written in entirely different styles by different writers, this approach has limitations.

Summary of the invention

To solve the above problems in the prior art, the object of the invention is to provide a method for risk classification of corporate news data that can classify news with respect to a specific entity.

The technical solution adopted by the invention is as follows:

A method for risk classification of corporate news data, comprising the following steps:

obtaining the related attributes of a target enterprise according to the company name of the target enterprise, combining the related attributes in pairs and searching with each pair as keywords to obtain news material related to the target enterprise, and extracting from the news material the sentences containing the related attributes;

inputting the sentences containing the related attributes into a CNN sentence classification model to obtain the sentence category of each sentence, the sentence category being positive or negative;

weighting each sentence category, and taking the category with the larger weighted value as the news category of the current news item, the news category being positive or negative.

Further, the related attributes include, but are not limited to, the legal representative's name, senior executives' names, the company's short name, the stock short name, the company's former names, and product names.

Further, the CNN sentence classification model is a company news classification model trained with a CNN algorithm.

Further, the CNN sentence classification model is trained by the following method:

preparing training corpus data;

inputting the sentences in the training corpus data into a CNN sentence classification training model, and training to obtain the CNN sentence classification model.

Further, preparing the training corpus data comprises the following steps:

crawling enterprise news material from news data sources with a web crawler, and storing the enterprise news material in a database as text;

compiling the required news categories according to the news topics of concern to enterprises;

defining a series of strong rules for the different news categories;

screening the database, according to the defined strong rules, for news material that matches the strong rules, as candidate corpus data;

manually checking the candidate corpus data screened out by the strong rules, and selecting first training corpus data;

manually collecting data of the different news categories from major websites as second training corpus data;

merging the first training corpus data and the second training corpus data to obtain the training corpus data.

The beneficial effects of the invention are as follows:

The invention extracts sentences according to the enterprise entity and predicts the sentence categories, thereby predicting the category of the news material with respect to that entity. Because every extracted sentence contains a related attribute of the target enterprise, the prediction result necessarily concerns that enterprise. If one news article involves several enterprise entities, the method can extract different sentences for the different entities and obtain a news category for each entity, so that classification is more accurate.

Brief description of the drawings

Fig. 1 is a flowchart of the invention.

Fig. 2 is a flowchart of preparing the training corpus data.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings and specific embodiments. The following embodiments serve only to illustrate the invention more clearly; they are examples and do not limit the scope of protection of the invention.

Embodiment:

An embodiment of the invention provides a method for risk classification of corporate news data which, as shown in Fig. 1, comprises the following steps:

S101, obtain the related attributes of the target enterprise according to its company name, combine the related attributes in pairs and search with each pair as keywords to obtain news material related to the target enterprise, and extract from the news material the sentences containing the related attributes.

The target enterprise is the enterprise whose news data is to be analyzed for risk. Its related attributes are obtained according to its company name and include, but are not limited to, the legal representative's name, senior executives' names, the company's short name, the stock short name, the company's former names, and product names.

Combining in pairs means that the two related attributes are joined by an AND relationship. Searching for news material with pairs of related attributes as keywords is more accurate, because it prevents the search from returning news material unrelated to the target enterprise merely because a different company shares the same attribute value, which would distort subsequent calculation. For example, Chongqing Yucun Big Data Technology Co., Ltd. and Beijing Yucun Big Data Technology Co., Ltd. might both use the short name "Yucun Big Data". If the search used only a single related attribute, it would be impossible to tell whether a news item in the results concerned Chongqing Yucun Big Data Technology Co., Ltd. or Beijing Yucun Big Data Technology Co., Ltd.

The related attributes of the target enterprise are combined in pairs and each pair is used as keywords for an Internet search to obtain news material related to the target enterprise; the sentences containing the related attributes (keywords) of the target enterprise are then extracted from the news material.
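Step S101 can be illustrated with a minimal Python sketch. It is not the patented implementation: the attribute values "Zhang San" and "Credit Radar", the function names, and the naive punctuation-based sentence splitter are all assumptions made for illustration, and the actual Internet search is replaced by a local check that both attributes of a pair occur in the text.

```python
import re
from itertools import combinations


def keyword_pairs(attributes):
    """Return every unordered pair of distinct attributes; each pair is one AND query."""
    unique = list(dict.fromkeys(attributes))  # drop duplicates, keep order
    return list(combinations(unique, 2))


def matches_pair(text, pair):
    """AND relationship: both attributes of the pair must occur in the text."""
    first, second = pair
    return first in text and second in text


def split_sentences(text):
    """Naive splitter (assumption): break after Chinese or Western end punctuation."""
    parts = re.split(r"(?<=[。！？!?])|(?<=\.)\s+", text)
    return [p.strip() for p in parts if p and p.strip()]


def extract_sentences(article, attributes):
    """Keep only the sentences that mention at least one related attribute."""
    return [s for s in split_sentences(article) if any(a in s for a in attributes)]


# Hypothetical attribute values for one target enterprise.
attributes = ["Yucun Big Data", "Zhang San", "Credit Radar"]
article = (
    "Yucun Big Data released Credit Radar this week. "
    "The weather in Chongqing was mild. "
    "Zhang San said the product targets banks."
)

pairs = keyword_pairs(attributes)
relevant = any(matches_pair(article, p) for p in pairs)
sentences = extract_sentences(article, attributes)
```

Three attributes yield three keyword pairs. The article is treated as relevant because at least one pair occurs in it in full, and only the two sentences that mention an attribute are passed on to classification.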

S102, input the sentences containing the related attributes into the CNN sentence classification model to obtain the sentence category of each sentence, the sentence category being positive or negative.

The CNN sentence classification model is a company news classification model trained with a CNN algorithm; it can be trained with an existing text classification model training method. The CNN sentence classification model predicts the category of each sentence, which is either positive or negative. Because every sentence contains a related attribute of the target enterprise, the sentence-category prediction is a prediction made with respect to that enterprise.
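The text does not specify the architecture of the CNN sentence classification model. As a rough illustration only, the following pure-Python sketch shows the forward pass of one common design: token embeddings, convolution filters of several widths, ReLU, max-over-time pooling, and a softmax over the two categories. The class name TinyTextCNN, the dimensions, and the random untrained weights are assumptions; a real model would be trained on the training corpus with a deep learning library.

```python
import math
import random

LABELS = ("positive", "negative")


def conv_max_pool(embedded, filt, bias):
    """Slide one filter over the token sequence, apply ReLU, keep the maximum."""
    width = len(filt)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        total = bias + sum(
            w * x
            for f_row, e_row in zip(filt, window)
            for w, x in zip(f_row, e_row)
        )
        best = max(best, total)
    return best


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyTextCNN:
    """Untrained toy model (assumption): one filter per width, random weights."""

    def __init__(self, vocab, embed_dim=4, widths=(2, 3), seed=0):
        rng = random.Random(seed)

        def rand():
            return rng.uniform(-0.5, 0.5)

        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        # One extra embedding row at the end is shared by all unknown tokens.
        self.embed = [[rand() for _ in range(embed_dim)] for _ in range(len(vocab) + 1)]
        self.filters = [[[rand() for _ in range(embed_dim)] for _ in range(w)] for w in widths]
        self.biases = [0.0 for _ in widths]
        self.out = [[rand() for _ in widths] for _ in LABELS]

    def predict_proba(self, tokens):
        unk = len(self.vocab)
        embedded = [self.embed[self.vocab.get(t, unk)] for t in tokens]
        features = [conv_max_pool(embedded, f, b) for f, b in zip(self.filters, self.biases)]
        scores = [sum(w * x for w, x in zip(row, features)) for row in self.out]
        return dict(zip(LABELS, softmax(scores)))


vocab = ["company", "fined", "for", "tax", "evasion"]
tokens = ["company", "fined", "for", "tax", "evasion"]
model = TinyTextCNN(vocab)
probs = model.predict_proba(tokens)
```

Because the weights are random, the probabilities carry no meaning; the sketch only shows how a sentence is reduced to one feature per filter and then mapped to the positive and negative categories.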

S103, weight each sentence category, and take the category with the larger weighted value as the news category of the current news item, the news category being positive or negative.

In this embodiment, the news headline is given a weight of 3 and every other sentence a weight of 1, because the headline usually best represents the author's sentiment orientation. The weighted positive sentences and the weighted negative sentences in the news material are summed separately, and the category with the larger sum becomes the news category of the news material: if the positive sum is larger, the news category is positive; if the negative sum is larger, the news category is negative.
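The weighting in step S103 can be sketched as follows, using the weights of this embodiment (3 for the headline, 1 for every other sentence). The function name classify_news is hypothetical, and returning None on a tie is an assumption, since the text does not say how equal sums are handled.

```python
TITLE_WEIGHT = 3  # headline weight given in this embodiment
BODY_WEIGHT = 1   # weight of every other sentence


def classify_news(title_label, body_labels):
    """Sum the weighted sentence categories and return the larger one."""
    totals = {"positive": 0, "negative": 0}
    if title_label is not None:
        totals[title_label] += TITLE_WEIGHT
    for label in body_labels:
        totals[label] += BODY_WEIGHT
    if totals["positive"] == totals["negative"]:
        return None, totals  # assumption: ties are left undecided
    return max(totals, key=totals.get), totals


# Negative headline, two positive and one negative body sentence.
category, totals = classify_news("negative", ["positive", "positive", "negative"])
```

Here the negative sum is 3 + 1 = 4 and the positive sum is 2, so the news category is negative even though positive body sentences outnumber negative ones.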

The invention makes predictions only for enterprise news (such as the finance and company sections of news sites). By using the CNN sentence classification model to predict news data risk, it can predict the risk information of an enterprise entity in the news more accurately.
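Steps S101 to S103 can be combined to show how one article yields different categories for different enterprises. The sketch below is illustrative only: stub_sentence_classifier is a keyword-based stand-in for the trained CNN sentence classification model, the company names and cue words are hypothetical, and counting the headline only when it mentions the enterprise is an assumption.

```python
import re

NEGATIVE_CUES = ("fined", "fines", "lawsuit")  # hypothetical cue words


def stub_sentence_classifier(sentence):
    """Stand-in for the CNN model: negative if any cue word appears."""
    lowered = sentence.lower()
    return "negative" if any(cue in lowered for cue in NEGATIVE_CUES) else "positive"


def classify_per_company(title, body, companies,
                         classify=stub_sentence_classifier, title_weight=3):
    """Return one news category per company for a single article."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", body) if s.strip()]
    results = {}
    for name, attributes in companies.items():
        totals = {"positive": 0, "negative": 0}
        # Assumption: the headline counts only if it mentions this company.
        if any(a in title for a in attributes):
            totals[classify(title)] += title_weight
        for sentence in sentences:
            if any(a in sentence for a in attributes):
                totals[classify(sentence)] += 1
        if totals["positive"] == totals["negative"]:
            results[name] = None  # assumption: ties are left undecided
        else:
            results[name] = max(totals, key=totals.get)
    return results


title = "Company A fined over late disclosure"
body = (
    "Company A was fined 2 million yuan. "
    "Company B signed a partnership with a major bank. "
    "Company B also launched a new product."
)
companies = {"Company A": ["Company A"], "Company B": ["Company B"]}
results = classify_per_company(title, body, companies)
```

The same article is classified as negative for Company A and positive for Company B, which a single full-text classifier cannot express.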

Training the CNN sentence classification model requires a training corpus. Referring to Fig. 2, the training corpus data preparation method of the invention comprises the following steps:

S201, crawl as much enterprise news material as possible from news data sources with a web crawler, and store the enterprise news material in a database as text.

The news data sources include the corporate news and financial news sections of the major national portal websites, as well as small and medium-sized websites related to finance, enterprises and the like.

S202, compile the required news categories according to the news topics of concern to enterprises.

The news categories include, but are not limited to, "tax evasion", "policy and regulation", "dishonesty risk", "delinquency", "accident information", "product problems", "win-win cooperation", "business changes", "plagiarism and infringement", "disputes", "equity changes", "regulatory violations", "wage arrears", "product upgrades", "executive departure", "investment and financing", "operational risk", "absconding", "corruption and bribery", "fraud", "achievements and awards", "layoffs and pay cuts", "failed listing", "stock-positive news", "breach", "strategic risk", "disclosure errors", "announcements and public notices", "mortgages and pledges", "suspension for rectification", "stock-negative news", "debt information", "performance losses", "financial risk", "corporate debt", "other", and "cooperation risk".

Most of the news categories are risk categories, such as tax evasion, which directly indicate that the news describes negative information about the enterprise entity, so that the user gains a basic understanding of the enterprise.

S203, define a series of strong rules for the different news categories.

S204, according to the strong rules defined in step S203, screen the database for news material that matches the strong rules, as candidate corpus data.
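The text does not define the form of a strong rule. One plausible reading, used here purely as an assumption, is a set of high-precision patterns per news category. The sketch below applies such rules to stored news texts to collect candidate corpus data; the rule patterns, the name STRONG_RULES, and the sample texts are all hypothetical.

```python
import re

# Hypothetical strong rules: a few high-precision patterns per news category.
STRONG_RULES = {
    "tax evasion": [
        re.compile(r"tax (evasion|fraud)", re.IGNORECASE),
        re.compile(r"evad\w+ tax", re.IGNORECASE),
    ],
    "wage arrears": [
        re.compile(r"(unpaid|overdue) wages", re.IGNORECASE),
    ],
}


def candidate_corpus(stored_news, rules=STRONG_RULES):
    """Group stored news texts by every category whose rules they match."""
    candidates = {category: [] for category in rules}
    for text in stored_news:
        for category, patterns in rules.items():
            if any(p.search(text) for p in patterns):
                candidates[category].append(text)
    return candidates


stored_news = [
    "Company X was investigated for tax evasion last year.",
    "Workers at Company Y protested over unpaid wages.",
    "Company Z opened a new office.",
]
candidates = candidate_corpus(stored_news)
```

Texts that match no rule are simply left out; the candidates still require the manual check of step S205 before they become training data.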

S205, manually check the candidate corpus data screened out by the strong rules, and select the first training corpus data.

In a specific embodiment, the candidate corpus data screened out by the specified strong rules is checked manually as needed to determine whether it belongs to the specified news category, which guards against errors in the strong rules. Because news writing styles vary widely and depend heavily on the writer, the data screened out by the strong rules is sometimes not entirely the data wanted. Adding the manual check makes the training corpus data more accurate and so ensures a higher accuracy for the trained model.

S206, manually collect data of the different news categories from major websites as the second training corpus data.

S207, merge the first training corpus data and the second training corpus data to obtain the training corpus data.

In the training corpus data, each news category has no fewer than 5000 training samples.

The first training corpus data and the second training corpus data are prepared in a 1:1 ratio, and the first training corpus data does not duplicate the second training corpus data.
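The merging constraints of step S207 (no duplicates between the two corpora, a 1:1 ratio, and at least 5000 samples per news category) can be sketched as below. The function name merge_corpora is hypothetical, and truncating the larger side to reach the 1:1 ratio and reporting shortfalls instead of failing are assumptions; the text only states the constraints themselves.

```python
MIN_PER_CATEGORY = 5000  # minimum number of samples per news category


def merge_corpora(first, second, min_per_category=MIN_PER_CATEGORY):
    """Merge rule-screened (first) and manually collected (second) corpora.

    Returns the merged corpus and the categories that fall below the minimum.
    """
    merged, shortfalls = {}, []
    for category in sorted(set(first) | set(second)):
        checked = list(dict.fromkeys(first.get(category, [])))
        seen = set(checked)
        # Drop manual samples that repeat a sample of the first corpus.
        manual = [t for t in dict.fromkeys(second.get(category, [])) if t not in seen]
        size = min(len(checked), len(manual))  # assumption: truncate to keep 1:1
        merged[category] = checked[:size] + manual[:size]
        if len(merged[category]) < min_per_category:
            shortfalls.append(category)
    return merged, shortfalls


first = {"tax evasion": ["a", "b", "c"]}
second = {"tax evasion": ["c", "d", "e"]}
merged, shortfalls = merge_corpora(first, second, min_per_category=4)
```

In this toy run the duplicate "c" is removed from the second corpus, two samples are taken from each side, and the category meets the lowered minimum of 4. With the default minimum of 5000 the same category would be reported as a shortfall.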

The sentences in the training corpus are input into the CNN sentence classification training model, and an open-source CNN algorithm is used to train the CNN sentence classification model.

The invention is not limited to the optional embodiments described above. Anyone may derive products in various other forms under the teaching of the invention; however, regardless of any change in shape or structure, every technical solution that falls within the scope defined by the claims of the invention falls within the scope of protection of the invention.

Claims

1. A method for risk classification of corporate news data, characterized by comprising the following steps:

obtaining the related attributes of a target enterprise according to the company name of the target enterprise, combining the related attributes in pairs and searching with each pair as keywords to obtain news material related to the target enterprise, and extracting from the news material the sentences containing the related attributes;

inputting the sentences containing the related attributes into a CNN sentence classification model to obtain the sentence category of each sentence, the sentence category being positive or negative;

weighting each sentence category, and taking the category with the larger weighted value as the news category of the current news item, the news category being positive or negative.

2. The method for risk classification of corporate news data according to claim 1, characterized in that the related attributes include, but are not limited to, the legal representative's name, senior executives' names, the company's short name, the stock short name, the company's former names, and product names.

3. The method for risk classification of corporate news data according to claim 1, characterized in that the CNN sentence classification model is a company news classification model trained with a CNN algorithm.

4. The method for risk classification of corporate news data according to claim 3, characterized in that the CNN sentence classification model is trained by the following method:

preparing training corpus data;

inputting the sentences in the training corpus data into a CNN sentence classification training model, and training to obtain the CNN sentence classification model.

5. The method for risk classification of corporate news data according to claim 4, characterized in that preparing the training corpus data comprises the following steps:

crawling enterprise news material from news data sources with a web crawler, and storing the enterprise news material in a database as text;

defining a series of strong rules for the different news categories;

screening the database, according to the defined strong rules, for news material that matches the strong rules, as candidate corpus data;