CN112597761A - Temporary report semantic information mining method and device, storage medium and electronic equipment
- Publication number
- CN112597761A (application CN202011415777.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- temporary report
- text data
- enterprise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/216—Parsing using statistical methods
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/3344—Query execution using natural language analysis
- G06F40/242—Dictionaries
- G06F40/30—Semantic analysis
Abstract
The invention provides a method and a device for mining temporary report semantic information, a storage medium and electronic equipment, and relates to the technical field of natural language processing. The method constructs a word list from acquired temporary report text data, trains a word vector for each word in the word list using a BERT model, obtains a document vector for each temporary report from the word vector and the TFIDF value of each word, derives an enterprise vector from the document vectors, and reduces the dimension of the enterprise vector; the enterprise vector dimensions remaining after dimension reduction are the temporary report semantic information. The invention fills the gap left by the prior art, which cannot mine information from temporary reports, and makes the mining of temporary report semantic information automatic, accurate and effective.
Description
Technical Field
The invention relates to a natural language processing technology, in particular to a method and a device for mining temporary report semantic information, a storage medium and electronic equipment.
Background
In the era of big data, traditional quantitative information can no longer meet users' needs, and attention has turned to qualitative text data. Regular reports (annual reports, quarterly reports and the like) and temporary reports are the main documents that convey an enterprise's operating conditions and development trends, and they are closely followed by many kinds of users. However, regular and temporary reports usually exist as text and are qualitative and unstructured, so users have difficulty reading them accurately and mining useful information from them.
At present, research on enterprise report texts focuses mainly on regular reports, and the common analysis methods are manual reading, analysis and scoring; no automatic analysis method for temporary report texts has been reported. Moreover, unlike regular reports, temporary reports cover a wide range of information, have no fixed or standardized format, describe different events differently, are short, and lack the sentiment information of management. Conventional text analysis methods designed for regular reports are therefore unsuitable for mining and analyzing temporary report texts, and cannot uncover the useful information hidden in their content.
Therefore, the prior art cannot automatically and accurately analyze the temporary report text.
Disclosure of Invention
(I) Technical problem to be solved
In view of the shortcomings of the prior art, the invention provides a method and a device for mining temporary report semantic information, a storage medium and electronic equipment, which solve the problem that the prior art cannot automatically and accurately analyze temporary report text.
(II) Technical solution
To achieve the above purpose, the invention is realized by the following technical solution:
In a first aspect, the present invention provides a method for mining temporary report semantic information, the method including:
acquiring temporary report text data, and constructing a word list based on the temporary report text data;
training a word vector of each word in the word list based on a BERT model, and simultaneously obtaining a TFIDF value of each word;
obtaining a document vector for each provisional report based on the word vector for each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors;
and reducing the dimension of the enterprise vector, and acquiring the semantic information of the temporary report.
Preferably, the obtaining of the temporary report text data and the constructing of the vocabulary based on the temporary report text data include:
acquiring temporary report text data, converting the temporary report text data into a machine-readable format, and aggregating the temporary report text data into a corpus;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.
Preferably, the obtaining a document vector of each temporary report based on the word vector of each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors includes:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding the result to obtain a document vector of each temporary report;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
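The two steps above can be written compactly as follows. This is only a notational sketch: the symbols are assumptions introduced here for illustration and do not appear in the original text. V is the number of words in the word list, D the number of temporary reports of one enterprise, w_i the word vector of word i, and t_{j,i} the TFIDF value of word i in report j.

```latex
\mathbf{d}_j = \sum_{i=1}^{V} t_{j,i}\,\mathbf{w}_i ,
\qquad
\mathbf{e} = \frac{1}{D} \sum_{j=1}^{D} \mathbf{d}_j
```

Here d_j is the document vector of report j and e is the enterprise vector.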
Preferably, the dimensionality reduction of the enterprise vector includes: and reducing the dimension of the enterprise vector by using a principal component analysis method.
In a second aspect, the present invention provides an apparatus for mining temporary report semantic information, including:
the data acquisition module is used for acquiring temporary report text data and constructing a word list based on the temporary report text data;
the word vector training module is used for training the word vector of each word in the word list based on a BERT model and simultaneously acquiring the TFIDF value of each word;
an enterprise vector acquisition module, configured to acquire a document vector of each provisional report based on the word vector and the TFIDF value of each word, and acquire an enterprise vector based on all the document vectors;
and the semantic information mining module is used for reducing the dimension of the enterprise vector and acquiring the semantic information of the temporary report.
Preferably, the data obtaining module obtains temporary report text data, and constructs a word list based on the temporary report text data, including:
acquiring temporary report text data, converting the temporary report text data into a machine-readable format, and aggregating the temporary report text data into a corpus;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.
Preferably, the enterprise vector acquisition module obtains a document vector of each temporary report based on the word vector of each word and the TFIDF value, and obtains an enterprise vector based on all the document vectors, including:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding the result to obtain the document vector of the temporary report;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
Preferably, the semantic information mining module performs dimensionality reduction on the enterprise vector, and includes: and reducing the dimension of the enterprise vector by using a principal component analysis method.
In a third aspect, the present invention provides a computer-readable storage medium storing a computer program for temporary report semantic information mining, wherein the computer program causes a computer to execute the temporary report semantic information mining method as described above.
In a fourth aspect, the present invention provides an electronic device, including:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the temporary report semantic information mining method as described above.
(III) Advantageous effects
The invention provides a method and a device for mining temporary report semantic information, a storage medium and electronic equipment. Compared with the prior art, it has the following beneficial effects:
The method constructs a word list from acquired temporary report text data, trains a word vector for each word in the word list using a BERT model, obtains a document vector for each temporary report from the word vector and the TFIDF value of each word, derives an enterprise vector from the document vectors, and reduces the dimension of the enterprise vector; the enterprise vector dimensions remaining after dimension reduction are the temporary report semantic information. The automatic and accurate analysis method provided by the invention fills the gap left by the prior art, which cannot mine information from temporary reports, and makes the mining of temporary report semantic information more automatic, accurate and effective.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a mining method of temporary report semantic information according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
By providing the mining method, device, storage medium and electronic equipment for the temporary report semantic information, the problem that the temporary report text cannot be automatically and accurately analyzed in the prior art is solved, and the purpose of automatically and accurately analyzing and mining the temporary report semantic information is achieved.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
To mine temporary report text automatically, accurately and effectively, the method first acquires temporary report text data and constructs a word list, then trains a word vector for each word in the word list using a BERT model, obtains a document vector for each temporary report from the word vector and the TFIDF value of each word, derives an enterprise vector from the document vectors, reduces the dimension of the enterprise vector, and finally takes the enterprise vector dimensions remaining after dimension reduction as the semantic information of the temporary report.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Example 1:
In a first aspect, an embodiment of the present invention provides a method for mining temporary report semantic information. Referring to FIG. 1, the method includes:
S1, acquiring temporary report text data, and constructing a word list based on the temporary report text data;
S2, training a word vector of each word in the word list based on a BERT model, and simultaneously obtaining a TFIDF value of each word;
S3, obtaining a document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining an enterprise vector based on all the document vectors;
S4, reducing the dimension of the enterprise vector, and acquiring the semantic information of the temporary report.
The word list is constructed from the acquired temporary report text data, the word vector of each word in the word list is trained using a BERT model, the document vector of each temporary report is obtained from the word vector and the TFIDF value of each word, the enterprise vector is derived from the document vectors, and the dimension of the enterprise vector is reduced; the enterprise vector dimensions remaining after dimension reduction are the semantic information of the temporary report. The automatic and accurate analysis method provided by the invention fills the gap left by the prior art, which cannot mine information from temporary reports, and makes the mining of temporary report semantic information more automatic, accurate and effective.
In the embodiment of the present invention, to obtain a more accurate word list, a preferred approach to acquiring temporary report text data and constructing a word list based on it includes the following steps: acquiring temporary report text data, converting it into a machine-readable format, and aggregating it into a corpus; acquiring a custom dictionary and a stop word dictionary based on the corpus; and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus using the custom dictionary and the stop word dictionary to form a word list.
In addition, to make the mining result of the temporary report semantic information more accurate and effective, the technical solution uses the TFIDF value of each word to represent the importance of that word in the document. Specifically, obtaining the document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining the enterprise vector based on all the document vectors, includes the following steps: taking the TFIDF value of each word as a weight, multiplying it by the word vector of that word, and summing the results to obtain the document vector of the temporary report; and obtaining the enterprise vector as the arithmetic mean of all the document vectors.
In the embodiment of the present invention, to ensure the accuracy and effectiveness of the semantic information mining result, a preferred approach is to reduce the dimension of the enterprise vector using principal component analysis.
The following describes the implementation of an embodiment of the present invention in detail with reference to the explanation of specific steps.
Fig. 1 is a flowchart of a mining method for temporary report semantic information, and referring to fig. 1, a concrete process of mining temporary report semantic information includes:
S1, acquiring temporary report text data, and constructing a word list based on the temporary report text data.
First, temporary report text data is acquired. Temporary report PDF files belonging to the same enterprise are downloaded in batches from websites such as stock exchanges using crawler software (such as Octopus) or crawler code. The PDF files are then batch-converted into a machine-readable format (for example, converted into csv or txt format with a conversion tool such as WPS). Finally, all converted temporary report files are merged into one document to form a corpus (that is, a document set).
Then, the temporary report documents are preprocessed for word list construction:
1) Load the custom dictionary and the stop word dictionary. Based on the corpus, words that must not be split are added to a custom dictionary (a collection of indivisible words; for example, "rights and interests change" must not be split into "rights and interests" and "change") to avoid improper word segmentation. Meaningless words found in the corpus are added to a general stop word dictionary to form the stop word dictionary (a collection of unimportant words, such as "yes").
2) Remove punctuation, letters and digits. Punctuation, letters and digits in each temporary report document are removed using a regular expression.
3) Segment words and remove stop words. Each temporary report document is segmented using the precise mode of the jieba word segmenter, and words contained in the stop word dictionary are removed.
4) Remove sparse words. Sparse words in each temporary report document are removed at a sparsity rate of 0.99 (the sparsity rate refers to the number of documents in which a word appears divided by the total number of documents).
Finally, the word list is constructed: the words in all processed temporary report documents are deduplicated to form the word list.
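The preprocessing in S1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation described by the patent: the function name `build_vocabulary`, its parameters and the sample sentences are hypothetical; the tokenizer defaults to whitespace splitting so the sketch runs without third-party packages (a jieba-based segmenter could be passed in as `tokenize` if jieba is installed); and the sparsity threshold is interpreted, as an assumption, as dropping words that are absent from more than `max_sparsity` of the documents.

```python
import re
from collections import Counter


def build_vocabulary(documents, stop_words, tokenize=str.split, max_sparsity=0.99):
    """Clean and tokenize documents, then return (token_lists, vocabulary)."""
    token_lists = []
    for text in documents:
        # Replace Latin letters, digits, underscores and punctuation with spaces.
        text = re.sub(r"[A-Za-z\d_]|[^\w\s]", " ", text)
        # Segment the text and drop stop words.
        tokens = [t for t in tokenize(text) if t.strip() and t not in stop_words]
        token_lists.append(tokens)

    # Document frequency: in how many documents does each word occur?
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))

    # Assumed sparsity definition: share of documents that do NOT contain the word.
    keep = {w for w, df in doc_freq.items() if 1 - df / n_docs <= max_sparsity}

    token_lists = [[t for t in tokens if t in keep] for tokens in token_lists]
    vocabulary = sorted(keep)  # deduplicated word list
    return token_lists, vocabulary


# Hypothetical, pre-segmented sample sentences (whitespace-separated).
docs = ["权益 变动 公告 2020！", "权益 变动 的 说明", "董事会 公告 ABC", "权益 质押"]
tokens, vocab = build_vocabulary(docs, stop_words={"的"}, max_sparsity=0.5)
```

With the toy threshold of 0.5, only words that occur in at least half of the four sample documents stay in the word list.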
S2, training the word vector of each word in the word list based on the BERT model, and simultaneously obtaining the TFIDF value of each word.
BERT is a pre-trained text representation method that can extract high-quality linguistic features from text data. The name is an abbreviation of Bidirectional Encoder Representations from Transformers. BERT is pre-trained mainly with a masked language model objective: words are first masked at random, and the model is then trained to predict them, which yields the word embedding representations. The word vector of each word in the word list is trained using the BERT word vector technique in a TensorFlow environment with the default dimension of 768, so that each word in the word list is converted from text into a 768-dimensional real-valued vector. The TFIDF value of each word in each temporary report document is also obtained, giving a term frequency-inverse document frequency matrix that reflects the importance of each word of the word list in each document. Specifically, a document-term matrix (DTM) is created in which each column corresponds to a word in the word list, each row corresponds to a document, and each entry is the term frequency-inverse document frequency (TFIDF) value of that word in that document; the TFIDF value represents the importance of the word in the document.
Assuming that the word list contains 1000 words and the total number of temporary report documents is D, a D × 1000 TFIDF matrix is formed. Meanwhile, the 1000 768-dimensional real-valued vectors form a 1000 × 768 word vector matrix.
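The document-term TFIDF matrix can be illustrated with a short Python sketch. The text does not state which TFIDF variant is used, so the formula below (relative term frequency multiplied by the natural logarithm of the total document count over the document frequency) is an assumption, and the function name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(token_lists, vocabulary):
    """Return a D x V matrix (list of rows); entry [j][i] is the TFIDF of word i in document j."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        row = []
        for word in vocabulary:
            tf = counts[word] / total
            # Assumed IDF variant: log(D / df); words unseen in the corpus get 0.
            idf = math.log(n_docs / doc_freq[word]) if doc_freq[word] else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return matrix


# Two toy documents over a three-word list give a 2 x 3 matrix.
m = tfidf_matrix([["a", "b"], ["a", "c"]], ["a", "b", "c"])
```

With 1000 words and D documents, the same function would return the D × 1000 matrix described above. Under this variant a word that occurs in every document receives a weight of zero.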
S3, obtaining the document vector of each temporary report based on the word vector of each word and the TFIDF value, and obtaining the enterprise vector based on all the document vectors.
The TFIDF value of each word in each temporary report document is taken as a weight and multiplied by the word vector of that word, and the results are summed to obtain the 768-dimensional vector of the temporary report document (that is, the TFIDF matrix is multiplied by the word vector matrix). When the total number of documents is D, a D × 768 document vector matrix is obtained. The arithmetic mean of all temporary report document vectors of the same enterprise is then calculated and used as the enterprise vector of that enterprise.
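The weighted sum and the averaging in S3 amount to a matrix product followed by a column mean. The sketch below shows this with plain Python lists and 3-dimensional toy vectors instead of 768-dimensional BERT vectors; the function names are hypothetical.

```python
def document_vectors(tfidf, word_vectors):
    """Multiply a D x V TFIDF matrix by a V x dim word vector matrix, giving D x dim."""
    dim = len(word_vectors[0])
    result = []
    for weights in tfidf:  # one row of TFIDF weights per document
        vec = [0.0] * dim
        for weight, word_vec in zip(weights, word_vectors):
            if weight:  # skip words that do not contribute to this document
                for k in range(dim):
                    vec[k] += weight * word_vec[k]
        result.append(vec)
    return result


def enterprise_vector(doc_vectors):
    """Arithmetic mean of all document vectors of one enterprise."""
    n = len(doc_vectors)
    return [sum(column) / n for column in zip(*doc_vectors)]


tfidf = [[1.0, 0.0], [0.5, 0.5]]                  # 2 documents, 2 words
word_vectors = [[2.0, 0.0, 4.0], [0.0, 2.0, 0.0]]  # 2 words, 3 dimensions
docs = document_vectors(tfidf, word_vectors)
enterprise = enterprise_vector(docs)
```

Here `docs` holds one vector per temporary report and `enterprise` is their element-wise mean.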
S4, reducing the dimension of the enterprise vector, and acquiring the semantic soft information of the temporary report.
The 768-dimensional enterprise vector is reduced in dimension using principal component analysis (PCA), retaining 85% of the information according to the cumulative variance ratio, and the remaining enterprise vector dimensions are taken as the finally extracted temporary report semantic information. Specifically, the data is first preprocessed: the mean of the enterprise vectors is calculated for each dimension, and the corresponding mean is subtracted from each element of the enterprise vector matrix to center the data and reduce the possibility of overfitting. The covariance matrix is then computed, together with its eigenvalues and eigenvectors. Finally, the eigenvalues are sorted in descending order, the k largest are selected, and the corresponding k eigenvectors are used as column vectors to form an eigenvector matrix. The retained dimensions are thus those with large sample variance, and k is chosen according to the cumulative variance ratio. In general, this ratio is set to 85%, that is, 85% of the information in the original data is retained, and the remaining enterprise vector dimensions are used as the finally extracted temporary report semantic information.
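The PCA steps described above (centering, covariance matrix, eigendecomposition, choosing k by a cumulative variance ratio) can be sketched with NumPy. This is an illustrative sketch, not the patented implementation: the function name `pca_reduce` is hypothetical, the input is assumed to be a matrix with one row per enterprise (at least two enterprises and two dimensions), and the toy data is 2-dimensional rather than 768-dimensional.

```python
import numpy as np


def pca_reduce(enterprise_matrix, variance_ratio=0.85):
    """Project rows onto the top-k principal components; return (reduced, k)."""
    x = np.asarray(enterprise_matrix, dtype=float)
    centered = x - x.mean(axis=0)            # subtract the per-dimension mean
    cov = np.cov(centered, rowvar=False)     # covariance matrix of the dimensions
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    eigvecs = eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Smallest k whose cumulative variance ratio reaches the threshold.
    k = int(np.searchsorted(cumulative, variance_ratio)) + 1
    k = min(k, len(eigvals))
    return centered @ eigvecs[:, :k], k


# Toy data: four "enterprises"; nearly all variance lies along the first axis.
x = [[0, 0], [2, 0.1], [4, 0], [6, 0.1]]
reduced, k = pca_reduce(x)
_, k_all = pca_reduce(x, variance_ratio=1.0)
```

Each row of `reduced` is the lower-dimensional representation of one enterprise. Raising `variance_ratio` toward 1.0 keeps more components.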
Therefore, the whole process of the mining method for the temporary report semantic information is completed.
Example 2:
in a second aspect, in an embodiment of the present invention, an apparatus for mining temporary report semantic information is provided, the apparatus including:
the data acquisition module is used for acquiring temporary report text data and constructing a word list based on the temporary report text data;
the word vector training module is used for training the word vector of each word in the word list based on a BERT model and simultaneously acquiring the TFIDF value of each word;
an enterprise vector acquisition module, configured to acquire a document vector of each provisional report based on the word vector and the TFIDF value of each word, and acquire an enterprise vector based on all the document vectors;
and the semantic information mining module is used for reducing the dimension of the enterprise vector and acquiring the semantic information of the temporary report.
Optionally, the data obtaining module obtains temporary report text data, and constructs a vocabulary based on the temporary report text data, including:
acquiring temporary report text data, converting the temporary report text data into a machine-readable format, and aggregating the temporary report text data into a corpus;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the custom dictionary and the stop word dictionary to form a word list.
Optionally, the enterprise vector acquisition module obtains a document vector of each temporary report based on the word vector of each word and the TFIDF value, and obtains an enterprise vector based on all the document vectors, including:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding to obtain the document vector;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
Optionally, the reducing the dimension of the enterprise vector by the semantic information mining module includes: and reducing the dimension of the enterprise vector by using a principal component analysis method.
It can be understood that the device for mining the temporary report semantic information provided by the embodiment of the present invention corresponds to the method for mining the temporary report semantic information, and the explanation, example, beneficial effects and the like of the relevant contents thereof may refer to the corresponding contents in the method for mining the temporary report semantic information, and are not described herein again.
Example 3:
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program for temporary report semantic information mining, wherein the computer program causes a computer to execute the following steps:
acquiring temporary report text data, and constructing a word list based on the temporary report text data;
training a word vector of each word in the word list based on a BERT model, and simultaneously obtaining a TFIDF value of each word;
obtaining a document vector for each provisional report based on the word vector for each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors;
and reducing the dimension of the enterprise vector, and acquiring the semantic information of the temporary report.
Optionally, the obtaining of the temporary report text data and constructing a vocabulary table based on the temporary report text data include:
acquiring temporary report text data, converting the temporary report text data into a machine-readable format, and aggregating the temporary report text data into a corpus;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.
Optionally, the obtaining a document vector of each temporary report based on the word vector of each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors includes:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding to obtain the document vector;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
Optionally, the dimensionality reduction of the enterprise vector includes: and reducing the dimension of the enterprise vector by using a principal component analysis method.
Example 4:
in a fourth aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the steps of:
acquiring temporary report text data, and constructing a word list based on the temporary report text data;
training a word vector of each word in the word list based on a BERT model, and simultaneously obtaining a TFIDF value of each word;
obtaining a document vector for each temporary report based on the word vector for each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors;
and reducing the dimension of the enterprise vector, and acquiring the semantic information of the temporary report.
Optionally, acquiring the temporary report text data and constructing a word list based on the temporary report text data includes:
acquiring temporary report text data, converting the temporary report text data into a machine-readable format, and aggregating the temporary report text data into a corpus;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.
Optionally, obtaining a document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining an enterprise vector based on all the document vectors, includes:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding to obtain the document vector;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
Optionally, reducing the dimension of the enterprise vector includes: reducing the dimension of the enterprise vector by using a principal component analysis method.
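Principal component analysis needs more than one sample, so the sketch below assumes that the enterprise vectors of several enterprises are stacked as the rows of a matrix. That stacking, the function name `pca_reduce`, and the use of power iteration with deflation (chosen so the example runs on the Python standard library alone) are assumptions made for illustration, not details taken from this embodiment.

```python
import math
import random


def pca_reduce(rows, k, iters=500, seed=0):
    """Project the rows (n x d, n >= 2) onto their top-k principal components."""
    n, d = len(rows), len(rows[0])
    # Center every column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(x[a] * x[b] for x in X) / (n - 1) for b in range(d)]
           for a in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(k):
        # Random unit start vector for power iteration.
        v = [rng.uniform(-1, 1) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:  # no variance left to explain
                break
            v = [x / norm for x in w]
        # Eigenvalue estimate (Rayleigh quotient) for the found direction.
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                  for a in range(d))
        components.append(v)
        # Deflation: subtract the found component from the covariance matrix.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)]
               for a in range(d)]

    # Coordinates of every centered row along each component.
    return [[sum(x[j] * c[j] for j in range(d)) for c in components] for x in X]
```

The sign of each component is arbitrary, so two runs with different seeds may return coordinates that differ by a factor of -1; the relative positions of the enterprises are unaffected.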
In summary, compared with the prior art, the method has the following beneficial effects:
1. the invention constructs a word list from the acquired temporary report text data, trains a word vector for each word in the word list using a BERT model, obtains a document vector for each temporary report from the word vector and the TFIDF value of each word, obtains an enterprise vector from the document vectors, and reduces the dimension of the enterprise vector; the enterprise vector dimensions remaining after dimension reduction are the temporary report semantic information. This automatic analysis method fills a gap in the prior art, in which information could not be mined from temporary reports, and makes the mining of temporary report semantic information more automatic, accurate and effective;
2. the invention trains the word vector of each word in the word list by using the BERT model, and can more deeply dig out effective information hidden in the temporary report text;
3. the invention obtains the document vector of each temporary report by using the word vector of each word and the TFIDF value of each word, considers the importance degree of each word in the document, and can ensure that the mining result of the semantic information of the temporary report is more accurate and effective.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for mining temporary report semantic information, the method comprising:
acquiring temporary report text data, and constructing a word list based on the temporary report text data;
training a word vector of each word in the word list based on a BERT model, and simultaneously obtaining a TFIDF value of each word;
obtaining a document vector for each temporary report based on the word vector for each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors;
and reducing the dimension of the enterprise vector, and acquiring the semantic information of the temporary report.
2. The method of claim 1, wherein said acquiring temporary report text data and constructing a word list based on said temporary report text data comprises:
acquiring temporary report text data, converting the temporary report text data into a machine-readable format, and aggregating the temporary report text data into a corpus;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.
3. The method of claim 1, wherein said obtaining a document vector for each temporary report based on the word vector for each word and the TFIDF value, and obtaining an enterprise vector based on all of the document vectors, comprises:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding the result to obtain a document vector of each temporary report;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
4. The method of claim 1, wherein the dimensionality reduction of the enterprise vector comprises: reducing the dimension of the enterprise vector by using a principal component analysis method.
5. An apparatus for mining temporary report semantic information, the apparatus comprising:
the data acquisition module is used for acquiring temporary report text data and constructing a word list based on the temporary report text data;
the word vector training module is used for training the word vector of each word in the word list based on a BERT model and simultaneously acquiring the TFIDF value of each word;
an enterprise vector acquisition module, configured to acquire a document vector of each temporary report based on the word vector and the TFIDF value of each word, and acquire an enterprise vector based on all the document vectors;
and the semantic information mining module is used for reducing the dimension of the enterprise vector and acquiring the semantic information of the temporary report.
6. The apparatus of claim 5, wherein the data acquisition module acquiring temporary report text data and constructing a word list based on the temporary report text data comprises:
acquiring temporary report text data, converting the temporary report text data into a machine-readable format, and aggregating the temporary report text data into a corpus;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.
7. The apparatus of claim 5, wherein the enterprise vector acquisition module acquiring a document vector for each temporary report based on the word vector for each word and the TFIDF value, and acquiring an enterprise vector based on all of the document vectors, comprises:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding the result to obtain the document vector of the temporary report;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
8. The apparatus of claim 5, wherein the semantic information mining module reducing the dimension of the enterprise vector comprises: reducing the dimension of the enterprise vector by using a principal component analysis method.
9. A computer-readable storage medium storing a computer program for mining temporary report semantic information, wherein the computer program causes a computer to execute the temporary report semantic information mining method according to any one of claims 1 to 4.
10. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the temporary report semantic information mining method of any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011415777.6A CN112597761A (en) | 2020-12-07 | 2020-12-07 | Temporary report semantic information mining method and device, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112597761A (en) | 2021-04-02 |
Family
ID=75188956
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011415777.6A Pending CN112597761A (en) | 2020-12-07 | 2020-12-07 | Temporary report semantic information mining method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112597761A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200026759A1 (en) * | 2018-07-18 | 2020-01-23 | The Dun & Bradstreet Corporation | Artificial intelligence engine for generating semantic directions for websites for automated entity targeting to mapped identities |
CN109960802A (en) * | 2019-03-19 | 2019-07-02 | 四川大学 | The information processing method and device of narrative text are reported for aviation safety |
US20200349199A1 (en) * | 2019-05-03 | 2020-11-05 | Servicenow, Inc. | Determining semantic content of textual clusters |
CN111538836A (en) * | 2020-04-22 | 2020-08-14 | 哈尔滨工业大学(威海) | Method for identifying financial advertisements in text advertisements |
Non-Patent Citations (2)
Title |
---|
FENG MAI ET AL.: "deep learning models for bankruptcy prediction using textual disclosures", 《EUROPEAN JOURNAL OF OPERATIONAL RESEARCH》, 16 April 2019 (2019-04-16), pages 743 - 758 * |
姚加权;张锟澎;罗平;: "金融学文本大数据挖掘方法与研究进展", 经济学动态, no. 04, 18 April 2020 (2020-04-18), pages 145 - 160 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117574916A (en) * | 2023-12-12 | 2024-02-20 | 合肥工业大学 | Temporary report semantic analysis method and system |
CN117574916B (en) * | 2023-12-12 | 2024-05-10 | 合肥工业大学 | Temporary report semantic analysis method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||