CN112597761A - Temporary report semantic information mining method and device, storage medium and electronic equipment

Info

Publication number: CN112597761A
Application number: CN202011415777.6A
Authority: CN
Inventors: 蒋翠清; 吕喜梅; 王钊; 丁勇; 王建飞; 殷畅
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2020-12-07
Filing date: 2020-12-07
Publication date: 2021-04-02

Abstract

The invention provides a method and a device for mining temporary report semantic information, a storage medium and electronic equipment, and relates to the technical field of natural language processing. The method constructs a word list from acquired temporary report text data, trains a word vector for each word in the word list using a BERT model, obtains a document vector for each temporary report from the word vector and the TFIDF value of each word, obtains an enterprise vector from the document vectors, and reduces the dimension of the enterprise vector; the enterprise vector dimensions remaining after dimension reduction are the temporary report semantic information. The invention fills the gap in the prior art, in which information mining could not be carried out on temporary reports, and makes the mining of temporary report semantic information automatic, accurate and effective.

Description

Temporary report semantic information mining method and device, storage medium and electronic equipment

Technical Field

The invention relates to a natural language processing technology, in particular to a method and a device for mining temporary report semantic information, a storage medium and electronic equipment.

Background

In the era of big data, traditional quantitative information can no longer meet the needs of users, and attention has turned to qualitative text data. Regular reports (annual reports, quarterly reports and the like) and temporary reports are the main documents that convey the business operating conditions and development trends of enterprises, and they are widely followed by all kinds of users. However, regular reports and temporary reports usually exist in text form and are qualitative and unstructured, so users have difficulty reading them accurately and mining effective information from them.

At present, research on enterprise report texts mainly focuses on regular report texts, and the common analysis methods are manual reading, analysis and scoring; automatic analysis methods for temporary report texts have not yet been reported. In addition, unlike a regular report, a temporary report covers a wide range of information, has no fixed and standardized format, describes different events in different ways, is short, and lacks management sentiment information. Conventional text analysis methods designed for regular reports are therefore not suitable for mining and analyzing temporary report texts, and cannot deeply mine the effective information hidden in the content of temporary reports.

Therefore, the prior art cannot automatically and accurately analyze the temporary report text.

Disclosure of Invention

(I) Technical problem to be solved

In view of the defects of the prior art, the invention provides a method and a device for mining temporary report semantic information, a storage medium and electronic equipment, and solves the problem that the prior art cannot automatically and accurately analyze temporary report text.

(II) technical scheme

In order to achieve the purpose, the invention is realized by the following technical scheme:

In a first aspect, the present invention provides a mining method for temporary report semantic information, where the method includes:

acquiring temporary report text data, and constructing a word list based on the temporary report text data;

training a word vector of each word in the word list based on a BERT model, and simultaneously obtaining a TFIDF value of each word;

obtaining a document vector for each temporary report based on the word vector and the TFIDF value of each word, and obtaining an enterprise vector based on all the document vectors;

reducing the dimension of the enterprise vector, and acquiring the semantic information of the temporary report.

Preferably, the obtaining of the temporary report text data and the constructing of the vocabulary based on the temporary report text data include:

acquiring temporary report text data, converting the temporary report text data into a format which can be recognized by a computer, and merging the converted temporary report text data into a corpus;

acquiring a user-defined dictionary and a stop word dictionary based on the corpus;

and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.

Preferably, the obtaining of a document vector of each temporary report based on the word vector and the TFIDF value of each word, and the obtaining of an enterprise vector based on all the document vectors, include:

taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding the result to obtain a document vector of each temporary report;

and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.

Preferably, the dimensionality reduction of the enterprise vector includes: reducing the dimension of the enterprise vector by using a principal component analysis method.

In a second aspect, the present invention provides an apparatus for mining temporary report semantic information, including:

the data acquisition module is used for acquiring temporary report text data and constructing a word list based on the temporary report text data;

the word vector training module is used for training the word vector of each word in the word list based on a BERT model and simultaneously acquiring the TFIDF value of each word;

an enterprise vector acquisition module, configured to acquire a document vector of each temporary report based on the word vector and the TFIDF value of each word, and acquire an enterprise vector based on all the document vectors;

and the semantic information mining module is used for reducing the dimension of the enterprise vector and acquiring the semantic information of the temporary report.

Preferably, the data acquisition module acquiring temporary report text data and constructing a word list based on the temporary report text data includes:

Preferably, the enterprise vector acquisition module acquiring a document vector of each temporary report based on the word vector and the TFIDF value of each word, and acquiring an enterprise vector based on all the document vectors, includes:

taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding the result to obtain the document vector of the temporary report;

Preferably, the semantic information mining module reducing the dimension of the enterprise vector includes: reducing the dimension of the enterprise vector by using a principal component analysis method.

In a third aspect, the present invention provides a computer-readable storage medium storing a computer program for temporary report semantic information mining, wherein the computer program causes a computer to execute the temporary report semantic information mining method as described above.

In a fourth aspect, the present invention provides an electronic device, including:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the temporary report semantic information mining method as described above.

(III) advantageous effects

The invention provides a method and a device for mining temporary report semantic information, a storage medium and electronic equipment. Compared with the prior art, the method has the following beneficial effects:

The method constructs a word list from acquired temporary report text data, trains a word vector for each word in the word list using a BERT model, obtains a document vector for each temporary report from the word vector and the TFIDF value of each word, obtains an enterprise vector from the document vectors, and reduces the dimension of the enterprise vector; the enterprise vector dimensions remaining after dimension reduction are the temporary report semantic information. The automatic and accurate analysis method for temporary report semantic information provided by the invention fills the gap in the prior art, in which information mining could not be carried out on temporary reports, and makes the mining of temporary report semantic information more automatic, accurate and effective.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a mining method of temporary report semantic information according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

By providing the mining method, device, storage medium and electronic equipment for the temporary report semantic information, the problem that the temporary report text cannot be automatically and accurately analyzed in the prior art is solved, and the purpose of automatically and accurately analyzing and mining the temporary report semantic information is achieved.

In order to solve the technical problems, the general idea of the embodiment of the application is as follows:

In order to mine temporary report text automatically, accurately and effectively, the method first obtains temporary report text data and constructs a word list, then trains a word vector for each word in the word list using a BERT model, obtains a document vector for each temporary report based on the word vector and the TFIDF value of each word, obtains an enterprise vector from the document vectors, reduces the dimension of the enterprise vector, and finally takes the enterprise vector dimensions remaining after dimension reduction as the semantic information of the temporary report.

In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.

Example 1:

In a first aspect, an embodiment of the present invention first provides a method for mining temporary report semantic information. Referring to FIG. 1, the method includes:

S1, acquiring temporary report text data, and constructing a word list based on the temporary report text data;

S2, training a word vector of each word in the word list based on a BERT model, and simultaneously obtaining a TFIDF value of each word;

S3, obtaining a document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining an enterprise vector based on all the document vectors;

S4, reducing the dimension of the enterprise vector, and acquiring the semantic information of the temporary report.

The word list is constructed from the acquired temporary report text data, the word vector of each word in the word list is trained using a BERT model, the document vector of each temporary report is obtained from the word vector and the TFIDF value of each word, the enterprise vector is obtained from the document vectors, and the dimension of the enterprise vector is reduced; the enterprise vector dimensions remaining after dimension reduction are the semantic information of the temporary report. The automatic and accurate analysis method for temporary report semantic information provided by the invention fills the gap in the prior art, in which information mining could not be carried out on temporary reports, and makes the mining of temporary report semantic information more automatic, accurate and effective.

In the embodiment of the present invention, in order to obtain a more accurate word list, a preferred way of acquiring temporary report text data and constructing a word list based on it includes the following steps: acquiring temporary report text data, converting the temporary report text data into a format which can be recognized by a computer, and merging the converted temporary report text data into a corpus; acquiring a user-defined dictionary and a stop word dictionary based on the corpus; and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.

In addition, in the embodiment of the present invention, in order to make the mining result of the temporary report semantic information more accurate and effective, the technical solution uses the TFIDF value of each word to represent the importance degree of each word in the document, and specifically, when obtaining the document vector of each temporary report based on the word vector of each word and the TFIDF value thereof, and obtaining the enterprise vector based on all the document vectors, includes the following steps: taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding the result to obtain the document vector of the temporary report; and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.

In the embodiment of the present invention, in order to ensure the accuracy and effectiveness of the semantic information mining result, a better processing method is to use a principal component analysis method to perform dimensionality reduction on the enterprise vector when performing dimensionality reduction on the enterprise vector.

The following describes the implementation of an embodiment of the present invention in detail with reference to the explanation of specific steps.

Fig. 1 is a flowchart of a mining method for temporary report semantic information, and referring to fig. 1, a concrete process of mining temporary report semantic information includes:

S1, acquiring temporary report text data, and constructing a word list based on the temporary report text data.

First, temporary report text data is acquired. Temporary report PDF files belonging to the same enterprise are downloaded in batches from websites such as stock exchanges using crawler software (such as Octopus) or crawler code; the PDF files are then converted in batches into a format that a computer can recognize (for example, into csv or txt format using a conversion tool such as WPS); finally, all converted temporary report files are merged into one document to form a corpus (that is, a document set).

Then the temporary report documents are preprocessed for word list construction:

1) Loading the custom dictionary and the stop word dictionary. Based on the corpus, indivisible words are added to a custom dictionary (a collection of words that must not be split; for example, 'rights and interests change' must not be divided into 'rights and interests' and 'change'), so that improper word segmentation is avoided. Meaningless words found in the corpus are added to a general stop word dictionary to form the stop word dictionary (a collection of unimportant words, such as 'yes').

2) Removing punctuation, letters and digits. Punctuation, letters and digits in each temporary report document are removed using a regular expression.

3) Word segmentation and stop word removal. Each temporary report document is segmented using the accurate mode of the jieba word segmentation tool, and words contained in the stop word dictionary are removed.

4) Removing sparse words. Sparse words in each temporary report document are removed at a sparsity rate of 0.99 (the sparsity rate refers to the number of documents in which a word appears divided by the total number of documents).

Finally, the word list is constructed: the words in all processed temporary report documents are deduplicated to form the word list.
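The preprocessing steps of S1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `build_vocabulary`, the parameter names, and the whitespace-based default segmenter are hypothetical. In particular, the 0.99 sparsity rate is read here as "drop words that appear in fewer than 1% of documents" (`min_doc_ratio=0.01`), which is an assumption about the intended meaning.

```python
import re
from collections import Counter

# Anything that is not a CJK ideograph or whitespace (punctuation, letters,
# digits) is replaced by a space.
NON_CJK = re.compile(r"[^\u4e00-\u9fff\s]")


def build_vocabulary(documents, stop_words, segment=str.split, min_doc_ratio=0.01):
    """Clean, segment and filter raw report texts; return (tokenized_docs, vocabulary).

    `segment` is a stand-in for a real word segmenter (assumption: a callable
    mapping a string to a list of tokens).
    """
    tokenized = []
    for text in documents:
        cleaned = NON_CJK.sub(" ", text)
        tokens = [t for t in segment(cleaned) if t.strip() and t not in stop_words]
        tokenized.append(tokens)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Assumed reading of "sparse word removal": keep words whose document
    # ratio reaches the threshold.
    n_docs = len(tokenized)
    keep = {w for w, df in doc_freq.items() if df / n_docs >= min_doc_ratio}

    tokenized = [[t for t in tokens if t in keep] for tokens in tokenized]
    vocabulary = sorted(keep)  # deduplicated word list
    return tokenized, vocabulary
```

With the jieba package installed, `jieba.load_userdict(path)` followed by passing `segment=jieba.lcut` would be one way to plug in the custom dictionary and accurate-mode segmentation described above; the file path and dictionary contents are left to the reader.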

S2, training the word vector of each word in the word list based on the BERT model, and simultaneously obtaining the TFIDF value of each word.

The BERT model (Bidirectional Encoder Representations from Transformers) is a pre-trained text representation method that can extract high-quality linguistic features from text data. It is pre-trained mainly with a masked language model objective: words are first masked at random, and the model is then trained to predict them, which yields word embedding representations. A word vector is trained for each word in the word list using the BERT word vector technique in a TensorFlow environment, with a default dimension of 768, so that each word in the word list is converted from text into a 768-dimensional real-valued vector. The TFIDF value of each word in each temporary report document is also obtained, giving a term frequency-inverse document frequency matrix that reflects the importance of each word of the word list in each document. Specifically, a document-term matrix (DTM) is created in which each column corresponds to a word in the word list, each row corresponds to a document, and each entry is the term frequency-inverse document frequency (TFIDF) value of that word in that document; the TFIDF value represents the degree of importance of the word in the document.

Assuming that the word list contains 1000 words and the total number of temporary report documents is D, a D × 1000 TFIDF matrix is formed. Meanwhile, the 1000 768-dimensional real-valued vectors obtained form a 1000 × 768 word vector matrix.
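As an illustration of how the D × V TFIDF matrix might be built, the sketch below uses one common TFIDF variant (term frequency normalized by document length, and the unsmoothed inverse document frequency log(D / df)). The text does not state which variant is used, so this formula and the name `tfidf_matrix` are assumptions.

```python
import math
from collections import Counter


def tfidf_matrix(tokenized_docs, vocabulary):
    """Return a D x V list-of-lists: rows are documents, columns are words."""
    n_docs = len(tokenized_docs)

    # Document frequency of every word.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    matrix = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        row = []
        for word in vocabulary:
            tf = counts[word] / total
            # Assumed variant: plain log(D / df); words absent everywhere get 0.
            idf = math.log(n_docs / doc_freq[word]) if doc_freq[word] else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return matrix
```

Note that under this unsmoothed variant a word occurring in every document receives a weight of zero; smoothed variants avoid that, and which one suits temporary reports is a design choice the text leaves open.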

S3, obtaining the document vector of each temporary report based on the word vector of each word and the TFIDF value, and obtaining the enterprise vector based on all the document vectors.

The TFIDF value of each word in each temporary report document is taken as a weight and multiplied by the word vector of the corresponding word, and the results are added to obtain the 768-dimensional vector of that temporary report document (that is, the TFIDF matrix is multiplied by the word vector matrix). When the total number of documents is D, a D × 768 document vector matrix is obtained. The arithmetic mean of all temporary report document vectors of the same enterprise is then calculated and used as the enterprise vector of that enterprise.

S4, reducing the dimension of the enterprise vector, and acquiring semantic soft information of the temporary report.
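The weighted sum and the averaging in S3 amount to a matrix product followed by a column-wise mean. The following sketch shows this with plain Python lists and toy 2-dimensional word vectors in place of 768-dimensional ones; the function names are hypothetical.

```python
def document_vectors(tfidf, word_vectors):
    """Multiply the D x V TFIDF matrix by the V x H word vector matrix.

    Each document vector is the TFIDF-weighted sum of the word vectors.
    """
    dim = len(word_vectors[0])
    docs = []
    for weights in tfidf:
        vec = [0.0] * dim
        for weight, word_vec in zip(weights, word_vectors):
            for k in range(dim):
                vec[k] += weight * word_vec[k]
        docs.append(vec)
    return docs


def enterprise_vector(doc_vectors):
    """Arithmetic mean of all document vectors of one enterprise."""
    n = len(doc_vectors)
    return [sum(column) / n for column in zip(*doc_vectors)]


# Toy data: 2 documents, 3 words, 2-dimensional word vectors.
tfidf = [[1.0, 0.0, 2.0],
         [0.0, 1.0, 0.0]]
word_vectors = [[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]]
docs = document_vectors(tfidf, word_vectors)   # [[3.0, 2.0], [0.0, 1.0]]
company = enterprise_vector(docs)              # [1.5, 1.5]
```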

The 768-dimensional enterprise vector is reduced in dimension using principal component analysis (PCA), retaining 85% of the information according to the cumulative variance contribution ratio, and the remaining enterprise vector dimensions are taken as the finally extracted temporary report semantic information. Specifically, the data are first preprocessed: the mean of the enterprise vectors is calculated for each dimension, and the corresponding mean is subtracted from each element of the enterprise vector matrix, so as to center the data and reduce the possibility of overfitting. The covariance matrix is then computed, together with its eigenvalues and eigenvectors. Finally, the eigenvalues are arranged in descending order, the k largest eigenvalues are selected, and the corresponding k eigenvectors are taken as column vectors to form an eigenvector matrix. The retained dimensions are thus those with the largest sample variance, and k is chosen according to the cumulative variance contribution ratio. In general this ratio is set to 85%, that is, 85% of the original data information is retained, and the remaining enterprise vector dimensions are used as the finally extracted temporary report semantic information.
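The PCA procedure described above can be sketched with NumPy as follows. The sketch assumes that the enterprise vectors of several enterprises are stacked as rows of one matrix (PCA needs more than one sample), and that the 85% threshold is applied to the cumulative share of the eigenvalues; the function name `pca_reduce` and its parameters are hypothetical.

```python
import numpy as np


def pca_reduce(enterprise_matrix, variance_ratio=0.85):
    """Project rows (one enterprise each) onto the top-k principal components.

    k is the smallest number of components whose eigenvalues account for at
    least `variance_ratio` of the total variance. Returns (reduced, k).
    """
    X = np.asarray(enterprise_matrix, dtype=float)
    centered = X - X.mean(axis=0)            # subtract the per-dimension mean

    cov = np.cov(centered, rowvar=False)     # covariance between dimensions
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

    order = np.argsort(eigvals)[::-1]        # re-sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    eigvecs = eigvecs[:, order]

    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, variance_ratio) + 1)
    k = min(k, len(eigvals))

    return centered @ eigvecs[:, :k], k
```

For example, a matrix whose rows vary almost entirely along one direction is reduced to a single column, while data spread evenly over two directions keeps both. The sketch assumes the rows are not all identical, since the total variance would then be zero.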

Therefore, the whole process of the mining method for the temporary report semantic information is completed.

Example 2:

in a second aspect, in an embodiment of the present invention, an apparatus for mining temporary report semantic information is provided, the apparatus including:

Optionally, the data acquisition module acquiring temporary report text data and constructing a word list based on the temporary report text data includes:

performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the custom dictionary and the stop word dictionary to form a word list;

Optionally, the enterprise vector acquisition module acquiring a document vector of each temporary report based on the word vector and the TFIDF value of each word, and acquiring an enterprise vector based on all the document vectors, includes:

taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding to obtain the document vector;

Optionally, the semantic information mining module reducing the dimension of the enterprise vector includes: reducing the dimension of the enterprise vector by using a principal component analysis method.

It can be understood that the device for mining the temporary report semantic information provided by the embodiment of the present invention corresponds to the method for mining the temporary report semantic information, and the explanation, example, beneficial effects and the like of the relevant contents thereof may refer to the corresponding contents in the method for mining the temporary report semantic information, and are not described herein again.

Example 3:

In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program for temporary report semantic information mining, wherein the computer program causes a computer to execute the following steps:

Optionally, the obtaining of the temporary report text data and the constructing of a word list based on the temporary report text data include:

Optionally, the obtaining of a document vector of each temporary report based on the word vector and the TFIDF value of each word, and the obtaining of an enterprise vector based on all the document vectors, include:

Optionally, the dimensionality reduction of the enterprise vector includes: reducing the dimension of the enterprise vector by using a principal component analysis method.

Example 4:

in a fourth aspect, an embodiment of the present invention provides an electronic device, including:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the steps of:

In summary, compared with the prior art, the method has the following beneficial effects:

1. The method constructs a word list from acquired temporary report text data, trains a word vector for each word in the word list using a BERT model, obtains a document vector for each temporary report from the word vector and the TFIDF value of each word, obtains an enterprise vector from the document vectors, and reduces the dimension of the enterprise vector; the enterprise vector dimensions remaining after dimension reduction are the temporary report semantic information. The automatic and accurate analysis method for temporary report semantic information provided by the invention fills the gap in the prior art, in which information mining could not be carried out on temporary reports, and makes the mining of temporary report semantic information more automatic, accurate and effective;

2. the invention trains the word vector of each word in the word list by using the BERT model, and can more deeply dig out effective information hidden in the temporary report text;

3. the invention obtains the document vector of each temporary report by using the word vector of each word and the TFIDF value of each word, considers the importance degree of each word in the document, and can ensure that the mining result of the semantic information of the temporary report is more accurate and effective.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for mining temporary report semantic information, the method comprising:

2. The method of claim 1, wherein said obtaining temporary report text data, building a vocabulary based on said temporary report text data, comprises:

3. The method of claim 1, wherein said obtaining a document vector for each temporary report based on the word vector and the TFIDF value of each word, and obtaining an enterprise vector based on all of the document vectors, comprises:

4. The method of claim 1, wherein the dimensionality reduction of the enterprise vector comprises: reducing the dimension of the enterprise vector by using a principal component analysis method.

5. An apparatus for mining temporary report semantic information, the apparatus comprising:

6. The apparatus of claim 5, wherein the data acquisition module acquires temporary report text data, constructs a vocabulary based on the temporary report text data, comprising:

7. The apparatus of claim 5, wherein the enterprise vector acquisition module acquiring a document vector for each temporary report based on the word vector and the TFIDF value of each word, and acquiring an enterprise vector based on all of the document vectors, comprises:

8. The apparatus of claim 5, wherein the semantic information mining module reducing the dimension of the enterprise vector comprises: reducing the dimension of the enterprise vector by using a principal component analysis method.

9. A computer-readable storage medium storing a computer program for mining temporary report semantic information, wherein the computer program causes a computer to execute the temporary report semantic information mining method according to any one of claims 1 to 4.

10. An electronic device, comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the temporary report semantic information mining method of any of claims 1-4.