CN112507190B

CN112507190B - Method and system for extracting keywords of financial and economic news

Info

Publication number: CN112507190B
Application number: CN202011495561.5A
Authority: CN
Inventors: 李明玉
Original assignee: Xinhua Zhiyun Technology Co ltd
Current assignee: Xinhua Zhiyun Technology Co ltd
Priority date: 2020-12-17
Filing date: 2020-12-17
Publication date: 2023-04-07
Anticipated expiration: 2040-12-17
Also published as: CN112507190A

Abstract

The invention discloses a method and a system for extracting keywords from financial news flashes. The method comprises the following steps: acquiring financial news flash text data and labeling the text; inputting the labeled text data into a pre-trained convolutional neural network to obtain a glyph embedded feature vector for each character of the text; inputting the labeled text data into a pre-trained RoBERTa-wwm model to obtain a semantic embedded feature vector for each character; concatenating the glyph embedded feature vector and the semantic embedded feature vector and reducing the dimension of the result to obtain a combined character feature vector; inputting the combined character feature vector into a conditional random field layer and obtaining the output character labels by adjusting training parameters; and extracting keywords according to the character labels. The method and the system use a Chinese RoBERTa-wwm pre-trained model to represent the characters of the financial news flash text as vectors and additionally represent each character by its five-stroke (Wubi) features; combining these glyph features of Chinese characters can improve the accuracy of keyword extraction.

Description

Method and system for extracting keywords of financial and economic news

Technical Field

The invention relates to the field of artificial intelligence, in particular to a method and a system for extracting keywords of financial and economic news.

Background

At present, most text keyword extraction algorithms are unsupervised. Existing keyword extraction methods include methods based on statistical features, methods based on word-graph features, methods based on topic models, and combinations of these. However, existing methods depend heavily on the performance of a Chinese word segmenter, and Chinese word segmenters mis-segment a high proportion of the domain-specific nouns in the financial field, so the extracted keywords are inaccurate. Moreover, for short texts such as financial news flashes, and even ultra-short texts of only a few dozen characters, the statistical, word-graph and topic features used by existing schemes are weak. The keywords extracted by existing schemes therefore cannot effectively express the core of a financial news flash, and the precision and recall of the keyword algorithm are low.

Disclosure of Invention

One of the main objects of the invention is to provide a method and a system for extracting keywords from financial news flashes. The method and the system use a Chinese RoBERTa-wwm pre-trained model to represent the characters of the financial news flash text as vectors and additionally represent each character by its five-stroke (Wubi) features; combining these glyph features of Chinese characters can improve the accuracy of keyword extraction.

Another object of the invention is to provide a method and a system for extracting keywords from financial news flashes in which the mixed character vectors of the news flash text are fed into a conditional random field (CRF). The CRF imposes part-of-speech and syntactic constraints that correct the keyword labels, so that the type of each character can be determined from the output.

Another object of the invention is to provide a method and a system for extracting keywords from financial news flashes that represent a news flash by combining the glyph features and the semantic features of its characters, which can improve the relevance of the extracted keywords.

Another object of the invention is to provide a method and a system for extracting keywords from financial news flashes that use supervised learning to obtain the keyword extraction model. The keywords of the news flash text are sequence-labeled according to the naming rules of financial news flashes, and the acquired text is cleaned before labeling, so as to improve the accuracy with which the model extracts keywords.

In order to achieve at least one of the above objects, the present invention provides a method for extracting keywords from financial news flashes, comprising the steps of:

acquiring financial news flash text data and labeling the text;

inputting the labeled text data into a pre-trained convolutional neural network to obtain a glyph embedded feature vector for each character of the text;

inputting the labeled text data into a pre-trained RoBERTa-wwm model to obtain a semantic embedded feature vector for each character of the text;

concatenating the glyph embedded feature vector and the semantic embedded feature vector and reducing the dimension of the result to obtain a combined character feature vector;

inputting the combined character feature vector into a conditional random field layer and obtaining the output character labels by adjusting training parameters; and

extracting keywords according to the character labels.

According to a preferred embodiment of the present invention, a five-stroke (Wubi) glyph feature vector is obtained for each character of a single financial news flash text, and a five-stroke glyph feature vector matrix of the single news flash is built, from which the glyph embedded feature vector of that news flash is obtained.

According to a preferred embodiment of the present invention, at least three convolution-kernel sliding windows of different sizes are established, the feature map produced by sliding each window over the five-stroke glyph feature vector matrix is computed, and a pooling operation is performed on the resulting feature maps.

According to a preferred embodiment of the present invention, a max-pooling training parameter $\alpha$ and an average-pooling training parameter $\beta$ are obtained, and the pooled window output features are computed as:

$$[O_1, O_2, O_3] = \alpha\,\mathrm{MaxPool}[m_1, m_2, m_3] + \beta\,\mathrm{MeanPool}[m_1, m_2, m_3]$$

where $[m_1, m_2, m_3]$ are the feature maps of the different windows and $[O_1, O_2, O_3]$ are the output features of the different windows.

According to a preferred embodiment of the present invention, the pooled output features $O_1$, $O_2$, $O_3$ of the different windows are concatenated to obtain the five-stroke (Wubi) glyph embedded feature vector.

According to a preferred embodiment of the present invention, the semantic embedded feature vector is obtained as follows: the same single financial news flash text is input into the encoder of the RoBERTa-wwm model to obtain the semantic embedded feature vector of each character.

According to a preferred embodiment of the present invention, a dimension-reduction training parameter $W_O$ is obtained, the semantic embedded feature vector and the glyph embedded feature vector are concatenated, and the concatenated result is reduced in dimension by $W_O$ to obtain the final combined character feature vector.

According to a preferred embodiment of the present invention, the label probability distribution of each character in the single financial news flash text is computed from the output of the conditional random field layer, and the keywords of that news flash are obtained from the probability distribution using the BIEO labeling rule.
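The BIEO rule can be illustrated with a minimal sketch. The function name `extract_keywords` and the sample sentence are hypothetical, and the handling of malformed tag sequences (an unfinished span is simply dropped) is an assumption; the patent does not specify it.

```python
from typing import List


def extract_keywords(chars: str, tags: List[str]) -> List[str]:
    """Collect every span that opens at a B tag and closes at the next E tag."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    keywords: List[str] = []
    start = None  # index of the currently open B tag, if any
    for idx, tag in enumerate(tags):
        if tag == "B":
            start = idx                      # open a new span (an unfinished one is dropped)
        elif tag == "I":
            continue                         # middle of a span; nothing to record
        elif tag == "E":
            if start is not None:
                keywords.append(chars[start:idx + 1])
            start = None
        else:                                # "O" (or an unknown tag) closes any open span
            start = None
    return keywords


# Hypothetical example: one tag per character.
print(extract_keywords("央行下调存款准备金率", list("BEOOBIIIIE")))
```

Because the BIEO scheme as described has no single-character tag, this sketch only returns keywords of two or more characters.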

According to a preferred embodiment of the present invention, the CRF layer is decoded with a first-order Viterbi algorithm and the entire model is trained with a log-likelihood loss function that includes a regularization term, wherein $\lambda$ is a training parameter, $N$ is the total number of news flash samples labeled with keywords, $\theta$ denotes the overall model parameters, and $P(y_i \mid s_i)$ is the label probability distribution of the characters.
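First-order Viterbi decoding over a linear-chain CRF can be sketched as follows. This is a generic textbook version, not the patent's implementation: the names `emissions` and `transitions` and the additive score layout are assumptions, and start/end transition scores are omitted for brevity.

```python
import numpy as np


def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Return the highest-scoring tag path for one character sequence.

    emissions:   (n_chars, n_tags) per-character tag scores
    transitions: (n_tags, n_tags) score of moving from tag a to tag b
    """
    n_chars, n_tags = emissions.shape
    score = emissions[0].copy()                      # best score of a path ending in each tag
    backptr = np.zeros((n_chars, n_tags), dtype=int)
    for t in range(1, n_chars):
        # candidate[a, b]: best path ending in tag a, then a -> b, then emit b at step t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best = [int(score.argmax())]
    for t in range(n_chars - 1, 0, -1):              # follow the back-pointers to the start
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

The returned indices would then be mapped back to the B, I, E and O labels in whatever order the tag vocabulary defines.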

In order to achieve at least one of the above objects, the present invention further provides a system for extracting keywords from financial news flashes, wherein the system employs the above keyword extraction method.

Drawings

FIG. 1 is a schematic flow diagram of the keyword extraction method for financial news flashes according to the present invention;

FIG. 2 is a schematic diagram of the model structure of the keyword extraction system for financial news flashes according to the present invention;

FIG. 3 is a schematic diagram of the convolution used to obtain the five-stroke (Wubi) glyph feature vectors in the keyword extraction method for financial news flashes.

Detailed Description

The following description is presented to disclose the invention so as to enable any person skilled in the art to practice the invention. The preferred embodiments in the following description are given by way of example only, and other obvious variations will occur to those skilled in the art. The basic principles of the invention, as defined in the following description, may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.

It will be understood by those skilled in the art that in the present disclosure, the terms "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate an orientation or positional relationship based on that shown in the drawings. They are used only for convenience and simplicity of description and do not indicate or imply that the device or component referred to must have a particular orientation or be constructed and operated in a particular orientation; the terms are therefore not to be construed as limiting the invention.

It is understood that the terms "a" and "an" should be interpreted as "at least one" or "one or more"; that is, the number of an element may be one in one embodiment and more than one in another embodiment, and the terms "a" and "an" should not be interpreted as limiting the number.

Referring to FIGS. 1-3, the present invention provides a keyword extraction method for financial news flashes, shown as a schematic flow diagram and a schematic model structure diagram. The method uses pre-trained models to extract features of the characters in the financial text, obtains a label for each character of the financial news flash by sequence labeling, and extracts keywords according to the labels. Because the invention extracts keywords at the level of individual characters and uses the structural features of Chinese characters to identify them, it avoids the noise introduced by inaccurate segmentation in a traditional Chinese word segmenter, which is a clear advantage.

The keyword extraction method specifically comprises the following steps. First, text data of financial news flashes is acquired, the text of a single news flash being the basic unit for keyword extraction, and the acquired text data is cleaned. The cleaning comprises:

deleting special characters and invisible characters left in the news flash text by web crawling;

removing leading and trailing whitespace, line breaks and the like from the news flash text;

removing URL links from the news flash text;

using rules to remove news-agency datelines and sign-offs (the leading and trailing tags of a dispatch) from the news flash text, such as "(Cailian Press, news of day XX)";

discarding news flashes whose text is shorter than 10 characters; and

truncating news flash texts that are still longer than 512 characters after the above steps.

After cleaning, every financial news flash meets the length and format requirements.
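The cleaning steps could be sketched as below. The regular expressions are illustrative assumptions (the patent does not give its rules); in particular `DATELINE_RE` is a hypothetical rule that only strips one leading bracketed tag, and the function name `clean_flash` is invented for this example.

```python
import re
from typing import Optional

MIN_CHARS = 10     # flashes shorter than this are discarded
MAX_CHARS = 512    # longer flashes are truncated

# ASCII-only URL pattern so that adjacent Chinese text is not swallowed.
URL_RE = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")
# Hypothetical dateline rule: one leading parenthesised or bracketed tag.
DATELINE_RE = re.compile(r"^\s*[（(【\[][^）)】\]]{1,30}[）)】\]]\s*")
# Control and zero-width characters that crawled pages often carry.
INVISIBLE_RE = re.compile(r"[\u0000-\u0008\u000b-\u001f\u007f\u200b-\u200f\ufeff]")


def clean_flash(text: str) -> Optional[str]:
    """Return the cleaned flash text, or None if it is too short to keep."""
    text = INVISIBLE_RE.sub("", text)        # special / invisible characters
    text = URL_RE.sub("", text)              # URL links
    text = text.replace("\n", "")            # line breaks inside the text
    text = DATELINE_RE.sub("", text)         # leading dateline tag
    text = text.strip()                      # leading / trailing whitespace
    if len(text) < MIN_CHARS:
        return None
    return text[:MAX_CHARS]


raw = "  （财联社1日讯）央行今日宣布下调存款准备金率0.5个百分点 https://example.com/a?b=1\n"
print(clean_flash(raw))
```

Returning `None` for short flashes lets the caller filter a batch with a simple comprehension before labeling.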

The cleaned data is then labeled: entity labeling is performed on the text of each financial news flash according to the naming rules of financial news flashes. The labeled entity content is as follows: entities in the text such as person names, place names, organization names and dates need to be represented; the keywords of a financial news flash are required to reflect market fluctuations, the impact on industries, financial concepts and the like; and the entities to be labeled include related keywords such as futures, market sectors, industries, industry-chain nouns and financial-event nouns.

Further, features are extracted from the labeled financial news flash text. The feature extraction includes extracting the glyph features of each character in the text with a convolutional neural network (CNN), which is obtained by pre-training with the relevant training parameters configured. Specifically, the character information of each financial news flash can be represented as $S_i = \{w_1, w_2, \ldots, w_n\}$, where $S_i$ is the character information of a single news flash and $n$ is the number of characters in it, with $10 \le n \le 512$. Denoting a single Chinese character by $w_j$, the five-stroke (Wubi) input of each Chinese character is $\mathrm{wubi}(w_j) = \{b_{j1}, b_{j2}, \ldots, b_{jk}\}$, where $b_{jk}$ is a glyph component of the five-stroke input of a single Chinese character, $k$ indexes the five-stroke glyph components, and $j$ indexes the character. The five-stroke input feature vectors are obtained through the trained convolutional neural network. The acquired five-stroke input feature vectors are converted into exponential form and, with the five-stroke vector dimension set to $d$, a five-stroke vector matrix $B_i \in \mathbb{R}^{k \times d}$ is computed for each character in the single financial news flash. In the structure of this five-stroke input feature vector matrix, $\mathrm{wubi}(c_{jk})$ denotes the five-stroke input feature vector of a character and $e$ is the base of the natural exponential.
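A minimal sketch of building a $k \times d$ five-stroke matrix for one character follows. Everything concrete here is an assumption for illustration: the two-entry `TOY_WUBI` table stands in for a full Wubi dictionary (its entries are placeholders), the embedding table is random rather than trained, and the exponential transformation mentioned in the text is not reproduced.

```python
import numpy as np

WUBI_KEYS = "abcdefghijklmnopqrstuvwxy"   # the 25 letter keys used by Wubi codes
MAX_CODE_LEN = 4                            # k: a Wubi code has at most 4 keys
PAD_ID = len(WUBI_KEYS)                     # extra embedding row used for padding

# Hypothetical lookup table; a real system would load a complete dictionary.
TOY_WUBI = {"中": "khk", "国": "lgyi"}


def wubi_matrix(char: str, embedding: np.ndarray) -> np.ndarray:
    """Map one character to a (k, d) matrix of key embeddings, padded to k rows."""
    code = TOY_WUBI.get(char, "")           # unknown characters become all padding
    ids = [WUBI_KEYS.index(key) for key in code][:MAX_CODE_LEN]
    ids += [PAD_ID] * (MAX_CODE_LEN - len(ids))
    return embedding[ids]


d = 8                                       # assumed five-stroke vector dimension
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(WUBI_KEYS) + 1, d))
embedding[PAD_ID] = 0.0                     # padding rows contribute nothing
B = wubi_matrix("国", embedding)
print(B.shape)
```

Padding every code to four rows keeps the matrix shape fixed, so that sliding windows of sizes 2, 3 and 4 all fit.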

Further, different sliding windows $[a_1, a_2, a_3]$ are established with convolution kernels. In one preferred embodiment of the present invention, the sliding-window sizes of the convolution kernels can be set to $[2, 3, 4]$. The different sliding windows are moved over the five-stroke vector matrix to obtain the feature maps $m_1$, $m_2$, $m_3$ under the sliding windows of different sizes. Average pooling and max pooling are then applied to the feature maps of the different window sizes: a trainable average-pooling parameter $\beta$ and a trainable max-pooling parameter $\alpha$ are set, the pooling operation is performed according to $\alpha$ and $\beta$, and the output features are:

$$[O_1, O_2, O_3] = \alpha\,\mathrm{MaxPool}[m_1, m_2, m_3] + \beta\,\mathrm{MeanPool}[m_1, m_2, m_3]$$

where MaxPool is the max-pooling operation, MeanPool is the average-pooling operation, and $[O_1, O_2, O_3]$ are the output features under the different sliding windows. The output feature vectors are then concatenated to obtain the final glyph embedded feature vector.
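The sliding-window convolution and the mixed pooling can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: a single filter per window size (a practical network would use many), no bias or activation, and hypothetical function names.

```python
import numpy as np


def conv_feature_map(B: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a (window, d) kernel down the rows of B; one value per position."""
    window = kernel.shape[0]
    positions = B.shape[0] - window + 1
    return np.array([np.sum(B[p:p + window] * kernel) for p in range(positions)])


def glyph_embedding(B, kernels, alpha, beta):
    """Pool each feature map as alpha * max + beta * mean, then concatenate."""
    outputs = []
    for kernel in kernels:
        m = conv_feature_map(B, kernel)               # feature map m_1, m_2 or m_3
        outputs.append(alpha * m.max() + beta * m.mean())
    return np.array(outputs)                          # [O_1, O_2, O_3]


B = np.arange(8.0).reshape(4, 2)                      # toy 4 x 2 five-stroke matrix
kernels = [np.ones((size, 2)) for size in (2, 3, 4)]  # window sizes [2, 3, 4]
print(glyph_embedding(B, kernels, alpha=0.5, beta=0.5))
```

With $\alpha$ and $\beta$ trainable, the model can learn how much weight to give the strongest response versus the average response of each window.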

It should be noted that vector concatenation ("splicing") in the present invention means extending vectors in the horizontal or vertical direction. For example, given the one-dimensional vectors $m_1 = [1, 2]$ and $m_2 = [3, 4]$, concatenating them yields the vector $m_3 = [1, 2, 3, 4]$. Vectors of two or more dimensions are concatenated in the same way.

The invention further uses a pre-trained RoBERTa-wwm model to obtain the semantic embedded feature vectors. Specifically, the character information $S_i$ of the same financial news flash text is input into the encoder of the RoBERTa-wwm model to obtain the semantic embedded feature vector of each character.

The semantic embedded feature vector and the glyph embedded feature vector of the same news flash are concatenated, and the concatenated result is reduced in dimension by the trainable dimension-reduction parameter $W_O$, finally yielding the combined character feature vector of that news flash. It should be explained that the trainable dimension-reduction parameter $W_O$ is a matrix; multiplying it with the concatenation of the semantic embedded feature vector and the glyph embedded feature vector produces a combined character feature vector with a smaller output dimension.
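The concatenation followed by projection with $W_O$ can be sketched as below. The dimensions (768, 64, 256) and the random stand-in arrays are assumptions for illustration only; in the described method the inputs would come from the RoBERTa-wwm encoder and the glyph CNN, and $W_O$ would be learned.

```python
import numpy as np


def combine_features(semantic: np.ndarray, glyph: np.ndarray, W_O: np.ndarray) -> np.ndarray:
    """Concatenate per-character vectors and project them with W_O."""
    spliced = np.concatenate([semantic, glyph], axis=-1)   # (n, d_sem + d_glyph)
    return spliced @ W_O                                    # (n, d_out)


n, d_sem, d_glyph, d_out = 12, 768, 64, 256       # assumed sizes
rng = np.random.default_rng(0)
semantic = rng.normal(size=(n, d_sem))            # stand-in for encoder output
glyph = rng.normal(size=(n, d_glyph))             # stand-in for glyph embeddings
W_O = rng.normal(size=(d_sem + d_glyph, d_out)) * 0.01
combined = combine_features(semantic, glyph, W_O)
print(combined.shape)
```

Each row of `combined` is the feature vector of one character and is what would be passed on to the CRF layer.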

The combined character feature vector obtained after dimension reduction is input into the conditional random field layer (CRF layer), and the label probability distribution of each character in the news flash text sequence is obtained through the CRF's constrained decoding of the lexical structure.

In this distribution, $y$ is the target label, $s$ is the character information, $y'$ ranges over all possible label sequences, $w_j'$ denotes the corresponding Chinese characters, and $W_{CRF}$ and $b_{CRF}$ are the weight parameter and bias term of the CRF layer. The CRF layer is decoded with a first-order Viterbi algorithm, and the entire model is trained with a log-likelihood loss function with an L2 regularization term.

In the log-likelihood loss function, $N$ is the total number of news flash samples labeled with keywords, $\theta$ denotes the overall model parameters, $\lambda$ is a training parameter, and $P(y_i \mid s_i)$ is the label probability distribution of the characters; each character is assigned the label with the highest probability. Keywords are then extracted from the output of the CRF layer. The labels follow the BIEO labeling rule, in which B marks the first character of a keyword, I a middle character, E the last character, and O a non-keyword character; the characters from a B to the following E are automatically taken as a keyword of the financial news flash.
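As a hedged illustration, a common form of a log-likelihood loss with an L2 term, written with the symbols defined above, is given below. The sign convention, the factor $\tfrac{1}{2}$ and the generic sequence score $\mathrm{score}(s, y)$ are assumptions; they are not taken from the patent's own formula.

```latex
\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log P(y_i \mid s_i) + \frac{\lambda}{2}\,\lVert \theta \rVert_2^2,
\qquad
P(y \mid s) = \frac{\exp\bigl(\mathrm{score}(s, y)\bigr)}{\sum_{y'} \exp\bigl(\mathrm{score}(s, y')\bigr)}
```

Under this reading, minimizing the first term raises the probability of the labeled sequences, while the second term penalizes large parameter values.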

It should be noted that the whole model, shown in FIG. 2, comprises the trained convolutional neural network for obtaining the five-stroke (Wubi) input feature vectors, the RoBERTa-wwm model and the CRF layer.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wire segments, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In contrast, in this application a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It will be understood by those skilled in the art that the embodiments of the present invention described above and illustrated in the accompanying drawings are illustrative only and not restrictive of the broad invention, and that the objects of the invention have been fully and effectively achieved and that the functional and structural principles of the present invention have been shown and described in the embodiments and that modifications and variations may be resorted to without departing from the principles described herein.

Claims

1. A method for extracting keywords of financial news flashes is characterized by comprising the following steps:

acquiring financial news flash text data and labeling the text;

extracting keywords according to the character labels;

and obtaining a five-stroke (Wubi) glyph feature vector for each character of a single financial news flash text, and building a five-stroke glyph feature vector matrix of the single news flash, from which the glyph embedded feature vector of that news flash is obtained.

2. The method of claim 1, wherein at least three convolution-kernel sliding windows of different sizes are established, the feature map produced by sliding each window over the five-stroke feature vector matrix is computed, and a pooling operation is performed on the resulting feature maps.

3. The method of claim 2, wherein a max-pooling training parameter $\alpha$ and an average-pooling training parameter $\beta$ are obtained, and the pooled window output features are computed as:

$$[O_1, O_2, O_3] = \alpha\,\mathrm{MaxPool}[m_1, m_2, m_3] + \beta\,\mathrm{MeanPool}[m_1, m_2, m_3];$$

wherein $[m_1, m_2, m_3]$ are the feature maps of the different windows and $[O_1, O_2, O_3]$ are the output features of the different windows.

4. The method of claim 3, wherein the pooled output features $O_1$, $O_2$, $O_3$ of the different windows are concatenated to obtain the five-stroke (Wubi) embedded feature vector.

5. The method for extracting keywords of financial news flashes as claimed in claim 4, wherein the semantic embedded feature vector is obtained as follows: the same single financial news flash text is input into the encoder of the RoBERTa-wwm model to obtain the semantic embedded feature vector of each character.

6. The method of claim 5, wherein a dimension-reduction training parameter $W_O$ is obtained, the semantic embedded feature vector and the glyph embedded feature vector are concatenated, and the concatenated result is reduced in dimension by $W_O$ to obtain the final combined character feature vector.

7. The method as claimed in claim 6, wherein the probability distribution of the label of each character in the single financial news text is calculated according to the label output from the conditional random field layer, and the keyword of the single financial news is obtained according to the probability distribution by using BIEO labeling rule.

8. The method of claim 7, wherein the CRF layer is decoded using a first-order Viterbi algorithm and the entire model is trained using a log-likelihood loss function that includes a regularization term, wherein $\lambda$ is a training parameter, $N$ is the total number of news flash samples labeled with keywords, $\theta$ denotes the overall model parameters, and $P(y_i \mid s_i)$ is the label probability distribution of the characters.

9. A keyword extraction system for financial news flashes, characterized in that the system adopts the method for extracting keywords of financial news flashes according to any one of claims 1 to 8.