CN112597761A - Temporary report semantic information mining method and device, storage medium and electronic equipment - Google Patents

Temporary report semantic information mining method and device, storage medium and electronic equipment

Info

Publication number
CN112597761A
Authority
CN
China
Prior art keywords
word
vector
temporary report
text data
enterprise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011415777.6A
Other languages
Chinese (zh)
Inventor
蒋翠清
吕喜梅
王钊
丁勇
王建飞
殷畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202011415777.6A
Publication of CN112597761A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for mining temporary report semantic information, a storage medium and electronic equipment, and relates to the technical field of natural language processing. The method comprises: constructing a word list based on acquired temporary report text data; training a word vector of each word in the word list using a BERT model; obtaining a document vector of each temporary report based on the word vector of each word and the TFIDF value of that word; obtaining an enterprise vector from the document vectors; and reducing the dimension of the enterprise vector, the enterprise vector dimensions remaining after dimension reduction being the temporary report semantic information. The invention fills the gap in the prior art, in which information mining could not be carried out on temporary reports, and makes the mining of temporary report semantic information automatic, accurate and effective.

Description

Temporary report semantic information mining method and device, storage medium and electronic equipment
Technical Field
The invention relates to a natural language processing technology, in particular to a method and a device for mining temporary report semantic information, a storage medium and electronic equipment.
Background
In the era of big data, traditional quantitative information can no longer meet the needs of users, and attention has turned to qualitative text data. Regular reports (annual reports, quarterly reports and the like) and temporary reports are the main documents conveying the business operation conditions and development trends of enterprises, and are followed closely by many kinds of users. However, regular reports and temporary reports usually exist in text form and are qualitative and unstructured, so users have difficulty reading them accurately and mining effective information from them.
At present, research on enterprise report texts focuses mainly on regular report texts, and common analysis methods include manual reading, analysis and scoring; automatic analysis methods for temporary report texts have not yet been reported. In addition, unlike regular reports, temporary reports cover a wide range of information, have no fixed and standardized format, describe different events differently, are short, and lack emotional information from management. Conventional text analysis methods suited to regular reports are therefore not suitable for mining and analyzing temporary report texts, and cannot deeply mine the effective information hidden in temporary report content.
Therefore, the prior art cannot automatically and accurately analyze the temporary report text.
Disclosure of Invention
(I) Technical problem to be solved
In view of the defects of the prior art, the invention provides a method and a device for mining temporary report semantic information, a storage medium and electronic equipment, which solve the problem that the prior art cannot automatically and accurately analyze temporary report text.
(II) Technical solution
To achieve the above purpose, the invention is realized by the following technical solution:
In a first aspect, the present invention provides a method for mining temporary report semantic information, the method including:
acquiring temporary report text data, and constructing a word list based on the temporary report text data;
training a word vector of each word in the word list based on a BERT model, and at the same time obtaining a TFIDF value of each word;
obtaining a document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining an enterprise vector based on all the document vectors;
reducing the dimension of the enterprise vector, and obtaining the temporary report semantic information.
Preferably, acquiring the temporary report text data and constructing the word list based on the temporary report text data includes:
acquiring temporary report text data, converting the temporary report text data into a computer-readable format, and merging the temporary report text data into a corpus;
obtaining a custom dictionary and a stop word dictionary based on the corpus;
performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus using the custom dictionary and the stop word dictionary to form the word list.
Preferably, obtaining the document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining the enterprise vector based on all the document vectors, includes:
taking the TFIDF value of each word as a weight, multiplying the TFIDF value of each word by the word vector of that word, and summing the results to obtain the document vector of each temporary report;
obtaining the enterprise vector as the arithmetic mean of all the document vectors.
Preferably, reducing the dimension of the enterprise vector includes: reducing the dimension of the enterprise vector using principal component analysis.
In a second aspect, the present invention provides a device for mining temporary report semantic information, including:
a data acquisition module, configured to acquire temporary report text data and construct a word list based on the temporary report text data;
a word vector training module, configured to train a word vector of each word in the word list based on a BERT model and at the same time obtain a TFIDF value of each word;
an enterprise vector acquisition module, configured to obtain a document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtain an enterprise vector based on all the document vectors;
a semantic information mining module, configured to reduce the dimension of the enterprise vector and obtain the temporary report semantic information.
Preferably, the data acquisition module acquiring temporary report text data and constructing a word list based on the temporary report text data includes:
acquiring temporary report text data, converting the temporary report text data into a computer-readable format, and merging the temporary report text data into a corpus;
obtaining a custom dictionary and a stop word dictionary based on the corpus;
performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus using the custom dictionary and the stop word dictionary to form the word list.
Preferably, the enterprise vector acquisition module obtaining the document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining the enterprise vector based on all the document vectors, includes:
taking the TFIDF value of each word as a weight, multiplying the TFIDF value of each word by the word vector of that word, and summing the results to obtain the document vector of each temporary report;
obtaining the enterprise vector as the arithmetic mean of all the document vectors.
Preferably, the semantic information mining module reducing the dimension of the enterprise vector includes: reducing the dimension of the enterprise vector using principal component analysis.
In a third aspect, the present invention provides a computer-readable storage medium storing a computer program for mining temporary report semantic information, wherein the computer program causes a computer to execute the method for mining temporary report semantic information described above.
In a fourth aspect, the present invention provides an electronic device, including:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the method for mining temporary report semantic information described above.
(III) Beneficial effects
The invention provides a method and a device for mining temporary report semantic information, a storage medium and electronic equipment. Compared with the prior art, the invention has the following beneficial effects:
The invention constructs a word list based on acquired temporary report text data, trains a word vector of each word in the word list using a BERT model, obtains a document vector of each temporary report based on the word vector of each word and the TFIDF value of that word, obtains an enterprise vector from the document vectors, and reduces the dimension of the enterprise vector; the enterprise vector dimensions remaining after dimension reduction are the temporary report semantic information. This automatic and accurate method for analyzing temporary report semantic information fills the gap in the prior art, in which information mining could not be carried out on temporary reports, and makes the mining of temporary report semantic information more automatic, accurate and effective.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a mining method of temporary report semantic information according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
By providing a method and a device for mining temporary report semantic information, a storage medium and electronic equipment, the embodiments of the application solve the problem that the prior art cannot automatically and accurately analyze temporary report text, and achieve automatic and accurate analysis and mining of temporary report semantic information.
To solve the above technical problem, the general idea of the embodiments of the application is as follows:
To mine temporary report text automatically, accurately and effectively, the method first acquires temporary report text data and constructs a word list; then trains a word vector of each word in the word list using a BERT model; obtains a document vector of each temporary report based on the word vector and the TFIDF value of each word; obtains an enterprise vector from the document vectors; reduces the dimension of the enterprise vector; and finally takes the enterprise vector dimensions remaining after dimension reduction as the temporary report semantic information.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Example 1:
In a first aspect, an embodiment of the present invention provides a method for mining temporary report semantic information. Referring to Fig. 1, the method includes:
S1. acquiring temporary report text data, and constructing a word list based on the temporary report text data;
S2. training a word vector of each word in the word list based on a BERT model, and at the same time obtaining a TFIDF value of each word;
S3. obtaining a document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining an enterprise vector based on all the document vectors;
S4. reducing the dimension of the enterprise vector, and obtaining the temporary report semantic information.
In this embodiment, the word list is constructed based on the acquired temporary report text data, the word vector of each word in the word list is trained using a BERT model, the document vector of each temporary report is obtained based on the word vector of each word and the TFIDF value of that word, the enterprise vector is obtained from the document vectors, and the dimension of the enterprise vector is reduced; the enterprise vector dimensions remaining after dimension reduction are the temporary report semantic information. This automatic and accurate method for analyzing temporary report semantic information fills the gap in the prior art, in which information mining could not be carried out on temporary reports, and makes the mining of temporary report semantic information more automatic, accurate and effective.
In the embodiment of the present invention, in order to obtain a more accurate word list, a preferred processing method for acquiring temporary report text data and constructing a word list based on it includes the following steps: acquiring temporary report text data, converting the temporary report text data into a computer-readable format, and merging the temporary report text data into a corpus; obtaining a custom dictionary and a stop word dictionary based on the corpus; and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus using the custom dictionary and the stop word dictionary to form the word list.
In addition, in order to make the mining result of the temporary report semantic information more accurate and effective, the technical solution uses the TFIDF value of each word to represent the importance of that word in the document. Specifically, obtaining the document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining the enterprise vector based on all the document vectors, includes the following steps: taking the TFIDF value of each word as a weight, multiplying the TFIDF value of each word by the word vector of that word, and summing the results to obtain the document vector of the temporary report; and obtaining the enterprise vector as the arithmetic mean of all the document vectors.
In the embodiment of the present invention, in order to ensure the accuracy and effectiveness of the semantic information mining result, a preferred processing method is to reduce the dimension of the enterprise vector using principal component analysis.
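The weighting and averaging steps just described can be written compactly. The notation below is an assumption introduced only for illustration and does not appear in the source: $w_i$ is the word vector of the $i$-th word in a word list of $V$ words, $t_{j,i}$ is the TFIDF value of that word in the $j$-th temporary report, and $D$ is the number of temporary reports of the enterprise.

```latex
\[
  d_j = \sum_{i=1}^{V} t_{j,i}\, w_i \quad (j = 1, \dots, D),
  \qquad
  e = \frac{1}{D} \sum_{j=1}^{D} d_j
\]
```

Here $d_j$ is the document vector of the $j$-th temporary report and $e$ is the enterprise vector; both have the same dimension as the word vectors.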
The following describes the implementation of an embodiment of the present invention in detail with reference to the explanation of specific steps.
Fig. 1 is a flowchart of the method for mining temporary report semantic information. Referring to Fig. 1, the specific process of mining temporary report semantic information includes:
S1. Acquiring temporary report text data, and constructing a word list based on the temporary report text data.
First, temporary report text data is acquired. Temporary report PDF files belonging to the same enterprise are downloaded in batches from websites such as stock exchanges using crawler software (such as Octopus) or crawler code; the PDF files are then converted in batches into a computer-readable format (for example, into csv or txt format using a conversion tool such as WPS); finally, all the converted temporary report files are merged into one document to form a corpus (that is, a document set).
Then, the temporary report documents are preprocessed for word list construction. 1) Load the custom dictionary and the stop word dictionary. Based on the corpus, indivisible words are added to a custom dictionary (a collection of words that must not be split; for example, 'rights and interests change' must not be split into 'rights and interests' and 'change') so as to avoid improper word segmentation; meaningless words found in the corpus are added to a general stop word dictionary to form the stop word dictionary (a collection of unimportant words, such as 'yes'). 2) Remove punctuation, letters and digits. Punctuation, letters and digits in each temporary report document are removed using regular expressions. 3) Segment words and remove stop words. Each temporary report document is segmented using the precise mode of jieba word segmentation, and words contained in the stop word dictionary are removed. 4) Remove sparse words. Sparse words in each temporary report document are removed at a sparsity rate of 0.99 (the sparsity rate refers to the number of documents in which a word appears divided by the total number of documents).
Finally, the word list is constructed. The words in all the processed temporary report documents are de-duplicated to form the word list.
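The preprocessing steps of S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_vocabulary`, the pluggable `segment` callable and the reading of the 0.99 sparsity rate (a word is dropped when it is absent from more than 99% of the documents) are all assumptions. The segmenter is passed in so that the sketch runs with the standard library alone; with the jieba package installed, `jieba.lcut` could be supplied after loading a custom dictionary with `jieba.load_userdict`.

```python
import re
from collections import Counter

# Matches every run of characters that is NOT a CJK unified ideograph, so
# punctuation, Latin letters and digits are all dropped in one pass.
NON_CJK = re.compile(r"[^\u4e00-\u9fff]+")


def build_vocabulary(documents, segment, stop_words, sparsity=0.99):
    """Return (tokenized_documents, vocabulary) for a list of raw report texts.

    `segment` is any callable mapping a string to a list of words
    (hypothetical interface; e.g. jieba.lcut when jieba is available).
    """
    tokenized = []
    for text in documents:
        cleaned = NON_CJK.sub(" ", text)  # remove punctuation, letters, digits
        words = [w for w in segment(cleaned) if w.strip() and w not in stop_words]
        tokenized.append(words)

    # Document frequency: number of documents in which each word occurs.
    doc_freq = Counter(w for words in tokenized for w in set(words))

    # ASSUMPTION: a word is "sparse" when it is absent from more than
    # `sparsity` of the documents, i.e. its document ratio is below 1 - sparsity.
    min_ratio = 1.0 - sparsity
    total = len(documents)
    kept = {w for w, df in doc_freq.items() if df / total >= min_ratio}

    tokenized = [[w for w in words if w in kept] for words in tokenized]
    vocabulary = sorted(kept)  # de-duplicated word list
    return tokenized, vocabulary
```

Keeping the segmenter as a parameter also makes it easy to compare segmentation with and without the custom dictionary on the same corpus.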
S2. Training a word vector of each word in the word list based on the BERT model, and at the same time obtaining the TFIDF value of each word.
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained text representation method that can be used to extract high-quality linguistic features from text data. It is pre-trained mainly with a masked language model: words are first masked at random, and the model is then trained to predict them, which yields word embedding representations. A word vector of each word in the word list is trained using BERT word vector technology in a TensorFlow environment; the default dimension is 768, so each word in the word list is converted from text into a 768-dimensional real-valued vector. The TFIDF value of each word in each temporary report document is also obtained, forming a term frequency-inverse document frequency matrix that reflects the importance of each word of the word list in each document. Specifically, a document-term matrix (DTM) is created in which each column corresponds to a word in the word list, each row corresponds to a document, and each entry is the term frequency-inverse document frequency (TFIDF) value of that word in that document, the TFIDF value representing the importance of the word in the document.
Assuming that the word list contains 1000 words and the total number of temporary report documents is D, a D × 1000 TFIDF matrix is formed. Meanwhile, the 1000 768-dimensional real-valued vectors form a 1000 × 768 word vector matrix.
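As an illustration of the TFIDF matrix described above, the following sketch builds a D × V matrix from already tokenized documents. The source does not state which TFIDF variant is used, so the formula here (term count divided by document length, multiplied by the natural logarithm of D over the document frequency) is an assumption, as is the function name `tfidf_matrix`.

```python
import math
from collections import Counter


def tfidf_matrix(tokenized_docs, vocabulary):
    """Return a D x V matrix (list of rows) of TFIDF weights.

    ASSUMPTION: tf = count / document length and idf = ln(D / df);
    the source does not specify the exact TFIDF variant.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))

    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        row = []
        for word in vocabulary:
            df = doc_freq.get(word, 0)
            if df == 0 or counts[word] == 0:
                row.append(0.0)  # word absent from this document
            else:
                row.append(counts[word] / length * math.log(n_docs / df))
        matrix.append(row)
    return matrix
```

With a word list of 1000 words and D documents, the returned matrix has D rows of 1000 entries each, matching the D × 1000 shape given in the text. Note that under this variant a word occurring in every document receives weight zero.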
S3. Obtaining a document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining an enterprise vector based on all the document vectors.
The TFIDF value of each word in each temporary report document is taken as a weight and multiplied by the word vector of that word, and the results are summed to obtain the 768-dimensional vector of the temporary report document (that is, the TFIDF matrix is multiplied by the word vector matrix). When the total number of documents is D, a D × 768 document vector matrix is obtained. The arithmetic mean of all temporary report document vectors of the same enterprise is then calculated and used as the enterprise vector of that enterprise.
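The matrix product and the averaging in S3 can be sketched in plain Python as below. The function names `document_vectors` and `enterprise_vector` are hypothetical, and nested lists stand in for the matrices; a numerical library would normally be used for 768-dimensional vectors.

```python
def document_vectors(tfidf, word_vectors):
    """Multiply a D x V TFIDF matrix by a V x H word vector matrix (D x H result)."""
    dim = len(word_vectors[0])
    docs = []
    for weights in tfidf:
        vec = [0.0] * dim
        for weight, word_vec in zip(weights, word_vectors):
            if weight:  # skip words that carry no weight in this document
                for k in range(dim):
                    vec[k] += weight * word_vec[k]
        docs.append(vec)
    return docs


def enterprise_vector(doc_vectors):
    """Arithmetic mean of all document vectors belonging to one enterprise."""
    n = len(doc_vectors)
    return [sum(column) / n for column in zip(*doc_vectors)]
```

For example, with a 2 × 2 TFIDF matrix and 3-dimensional word vectors, `document_vectors` returns two 3-dimensional document vectors and `enterprise_vector` returns their element-wise mean.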
S4. Reducing the dimension of the enterprise vector, and obtaining the semantic soft information of the temporary reports.
The dimension of the 768-dimensional enterprise vector is reduced using principal component analysis (PCA), 85% of the information is retained according to the cumulative variance proportion, and the remaining enterprise vector dimensions are taken as the finally extracted temporary report semantic information. Specifically, the data is first preprocessed: the mean of the enterprise vectors is calculated for each dimension, and the corresponding mean is subtracted from each element of the enterprise vector matrix, which centers the data and reduces the possibility of overfitting. The covariance matrix and its eigenvalues and eigenvectors are then computed. Finally, the eigenvalues are arranged in descending order, the k largest eigenvalues are selected, and the corresponding k eigenvectors are taken as column vectors to form an eigenvector matrix. In this way the sample variance along each retained dimension is large; k is chosen according to the cumulative variance proportion. In general, this proportion is set to 85%, that is, 85% of the original data information is retained, and the remaining enterprise vector dimensions are used as the finally extracted temporary report semantic information.
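The choice of k from the sorted eigenvalues can be sketched as follows. This covers only the selection step; the eigenvalues of the covariance matrix are assumed to have been computed already (for example with a numerical library), and the function name `choose_k` is hypothetical.

```python
def choose_k(eigenvalues, ratio=0.85):
    """Smallest k whose k largest eigenvalues reach `ratio` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)  # descending order
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= ratio:
            return k
    return len(ordered)  # fallback if rounding keeps the ratio just below target
```

With eigenvalues 5, 3, 1 and 1, the cumulative proportions are 0.5, 0.8, 0.9 and 1.0, so three components are needed to retain at least 85% of the variance. The centered enterprise vector matrix would then be projected onto the corresponding k eigenvectors.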
Therefore, the whole process of the mining method for the temporary report semantic information is completed.
Example 2:
In a second aspect, an embodiment of the present invention provides a device for mining temporary report semantic information, the device including:
a data acquisition module, configured to acquire temporary report text data and construct a word list based on the temporary report text data;
a word vector training module, configured to train a word vector of each word in the word list based on a BERT model and at the same time obtain a TFIDF value of each word;
an enterprise vector acquisition module, configured to obtain a document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtain an enterprise vector based on all the document vectors;
a semantic information mining module, configured to reduce the dimension of the enterprise vector and obtain the temporary report semantic information.
Optionally, the data acquisition module acquiring temporary report text data and constructing a word list based on the temporary report text data includes:
acquiring temporary report text data, converting the temporary report text data into a computer-readable format, and merging the temporary report text data into a corpus;
obtaining a custom dictionary and a stop word dictionary based on the corpus;
performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus using the custom dictionary and the stop word dictionary to form the word list.
Optionally, the enterprise vector acquisition module obtaining the document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining the enterprise vector based on all the document vectors, includes:
taking the TFIDF value of each word as a weight, multiplying the TFIDF value of each word by the word vector of that word, and summing the results to obtain the document vector;
obtaining the enterprise vector as the arithmetic mean of all the document vectors.
Optionally, the semantic information mining module reducing the dimension of the enterprise vector includes: reducing the dimension of the enterprise vector using principal component analysis.
It can be understood that the device for mining temporary report semantic information provided by the embodiment of the present invention corresponds to the method for mining temporary report semantic information described above; for the explanation, examples and beneficial effects of the relevant content, reference may be made to the corresponding content of the method, which is not repeated here.
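One way to picture the four modules of the device is as four callables chained in the order S1 to S4. The sketch below is only an illustration of that arrangement: the class name `ReportMiningDevice`, the field names and the signatures are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class ReportMiningDevice:
    """Four modules chained in the order described; all names are hypothetical."""

    # data acquisition module: raw reports -> (tokenized documents, word list)
    acquire: Callable[[Sequence[str]], Tuple[List[List[str]], List[str]]]
    # word vector training module: -> (word vectors, TFIDF matrix)
    train: Callable[[List[List[str]], List[str]], Tuple[List[Vector], List[Vector]]]
    # enterprise vector acquisition module: -> enterprise vector
    aggregate: Callable[[List[Vector], List[Vector]], Vector]
    # semantic information mining module: -> reduced enterprise vector
    reduce: Callable[[Vector], Vector]

    def run(self, raw_reports: Sequence[str]) -> Vector:
        tokenized, vocabulary = self.acquire(raw_reports)
        word_vectors, tfidf = self.train(tokenized, vocabulary)
        enterprise = self.aggregate(word_vectors, tfidf)
        return self.reduce(enterprise)
```

Because each module is just a callable, any one stage (for instance the dimension reduction) can be swapped out or stubbed in tests without touching the others.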
Example 3:
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program for mining temporary report semantic information, wherein the computer program causes a computer to execute the following steps:
acquiring temporary report text data, and constructing a word list based on the temporary report text data;
training a word vector of each word in the word list based on a BERT model, and at the same time obtaining a TFIDF value of each word;
obtaining a document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining an enterprise vector based on all the document vectors;
reducing the dimension of the enterprise vector, and obtaining the temporary report semantic information.
Optionally, acquiring the temporary report text data and constructing the word list based on the temporary report text data includes:
acquiring temporary report text data, converting the temporary report text data into a computer-readable format, and merging the temporary report text data into a corpus;
obtaining a custom dictionary and a stop word dictionary based on the corpus;
performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus using the custom dictionary and the stop word dictionary to form the word list.
Optionally, obtaining the document vector of each temporary report based on the word vector and the TFIDF value of each word, and obtaining the enterprise vector based on all the document vectors, includes:
taking the TFIDF value of each word as a weight, multiplying the TFIDF value of each word by the word vector of that word, and summing the results to obtain the document vector;
obtaining the enterprise vector as the arithmetic mean of all the document vectors.
Optionally, reducing the dimension of the enterprise vector includes: reducing the dimension of the enterprise vector using principal component analysis.
Example 4:
in a fourth aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the steps of:
acquiring temporary report text data, and constructing a word list based on the temporary report text data;
training a word vector of each word in the word list based on a BERT model, and simultaneously obtaining a TFIDF value of each word;
obtaining a document vector for each provisional report based on the word vector for each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors;
and reducing the dimension of the enterprise vector, and acquiring the semantic information of the temporary report.
Optionally, the obtaining of the temporary report text data and constructing a vocabulary table based on the temporary report text data include:
acquiring temporary report text data, converting the temporary report text data into a format which can be recognized by a computer, and summarizing the temporary report text data into a language library;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.
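The word list construction described above may be sketched as follows. This is a minimal illustration under stated assumptions: the names `segment` and `build_vocabulary`, the `min_count` threshold for sparse words, and the toy corpus and dictionaries are all hypothetical, and forward maximum matching against the user-defined dictionary stands in for whatever word segmenter an implementation actually uses.

```python
import re
from collections import Counter

# ASCII letters, digits and underscores, plus anything that is not a word
# character (whitespace and punctuation, including CJK punctuation).
_NOISE = re.compile(r"[A-Za-z0-9_]|\W")


def segment(text, dictionary, max_len=4):
    """Forward maximum matching: a simple stand-in for a real segmenter."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def build_vocabulary(corpus, user_dict, stop_words, min_count=2):
    """Clean, segment and filter the corpus; return (word list, token lists)."""
    tokenized = []
    for text in corpus:
        cleaned = _NOISE.sub("", text)
        tokens = [w for w in segment(cleaned, user_dict) if w not in stop_words]
        tokenized.append(tokens)
    counts = Counter(w for tokens in tokenized for w in tokens)
    # Sparse word removal: drop words seen fewer than min_count times.
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    return vocab, tokenized


# Toy corpus: two invented announcement sentences, for illustration only.
corpus = ["公司为子公司提供担保2000万元。", "公司涉及重大诉讼，担保责任ABC待定。"]
user_dict = {"公司", "子公司", "担保", "诉讼", "重大", "提供", "责任", "待定", "涉及", "万元"}
stop_words = {"为", "的"}
vocab, tokenized = build_vocabulary(corpus, user_dict, stop_words)
```

In this toy run only the words for "company" (公司) and "guarantee" (担保) occur in both reports, so they are the only entries that survive sparse word removal.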
Optionally, the obtaining a document vector of each temporary report based on the word vector of each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors, includes:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding to obtain the document vector;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
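The two operations above (a TFIDF-weighted sum of word vectors per report, then an arithmetic mean over the reports of one enterprise) may be illustrated by the sketch below. The function names, the two-dimensional toy word vectors and the TFIDF values are assumptions chosen so the arithmetic is easy to follow; word vectors produced by a BERT model would have far more dimensions.

```python
def document_vector(tfidf_weights, word_vectors):
    """Sum the word vectors of one report, each scaled by its TFIDF value."""
    dim = len(next(iter(word_vectors.values())))
    doc_vec = [0.0] * dim
    for word, weight in tfidf_weights.items():
        vec = word_vectors.get(word)
        if vec is None:  # no trained vector for this word; skipped in this sketch
            continue
        for k in range(dim):
            doc_vec[k] += weight * vec[k]
    return doc_vec


def enterprise_vector(doc_vectors):
    """Arithmetic mean of all document vectors belonging to one enterprise."""
    n = len(doc_vectors)
    return [sum(column) / n for column in zip(*doc_vectors)]


# Hypothetical word vectors and per-report TFIDF values.
word_vectors = {
    "guarantee": [1.0, 0.0],
    "loan": [0.0, 2.0],
    "lawsuit": [1.0, 1.0],
}
reports = [
    {"guarantee": 0.5, "loan": 0.25},
    {"lawsuit": 1.0},
]
doc_vecs = [document_vector(r, word_vectors) for r in reports]
company_vec = enterprise_vector(doc_vecs)
# doc_vecs    -> [[0.5, 0.5], [1.0, 1.0]]
# company_vec -> [0.75, 0.75]
```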
Optionally, the dimensionality reduction of the enterprise vector includes: reducing the dimension of the enterprise vector by using a principal component analysis method.
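By way of example only, principal component analysis over a set of enterprise vectors could look like the sketch below. It finds the leading components by power iteration with deflation on the sample covariance matrix, using only the Python standard library; the function name `pca_reduce`, the iteration count and the toy data are assumptions, and a practical implementation would more likely call a numerical library routine.

```python
import math


def pca_reduce(rows, n_components, iters=200):
    """Project rows onto their leading principal components."""
    n, dim = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix (dim x dim) of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]

    components = []
    for _ in range(n_components):
        # Arbitrary non-zero start vector, normalized to unit length.
        start = [1.0 / (i + 1) for i in range(dim)]
        scale = math.sqrt(sum(x * x for x in start))
        v = [x / scale for x in start]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to explain
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(dim) for j in range(dim))
        components.append(v)
        # Deflation: remove the variance explained by the component just found.
        for i in range(dim):
            for j in range(dim):
                cov[i][j] -= eigenvalue * v[i] * v[j]

    return [[sum(c * x for c, x in zip(comp, row)) for comp in components]
            for row in centered]


# Four hypothetical enterprise vectors lying on one line in two dimensions.
enterprise_vectors = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
reduced = pca_reduce(enterprise_vectors, 1)
```

Because the toy vectors lie on a single line, one component captures all of their variance, and each enterprise is described by a single coordinate after reduction.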
In summary, compared with the prior art, the method has the following beneficial effects:
1. The invention constructs a word list from the acquired temporary report text data, trains a word vector for each word in the word list using a BERT model, obtains a document vector for each temporary report from the word vector and the TFIDF value of each word, obtains an enterprise vector from the document vectors, and reduces the dimension of the enterprise vector; the enterprise vector dimensions that remain after dimension reduction are the temporary report semantic information. This automatic analysis method fills the gap in the prior art, in which information could not be mined from temporary reports, and makes the mining of temporary report semantic information more automatic, accurate and effective;
2. The invention trains the word vector of each word in the word list using the BERT model, and can therefore mine more deeply the effective information hidden in the temporary report text;
3. The invention obtains the document vector of each temporary report from the word vector and the TFIDF value of each word, so that the importance of each word in the document is taken into account, which makes the mined temporary report semantic information more accurate and effective.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for mining temporary report semantic information, the method comprising:
acquiring temporary report text data, and constructing a word list based on the temporary report text data;
training a word vector of each word in the word list based on a BERT model, and simultaneously obtaining a TFIDF value of each word;
obtaining a document vector for each temporary report based on the word vector for each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors;
and reducing the dimension of the enterprise vector, and acquiring the semantic information of the temporary report.
2. The method of claim 1, wherein said acquiring temporary report text data and constructing a word list based on the temporary report text data comprises:
acquiring temporary report text data, converting the temporary report text data into a format which can be recognized by a computer, and summarizing the temporary report text data into a corpus;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.
3. The method of claim 1, wherein said obtaining a document vector for each temporary report based on the word vector for each word and the TFIDF value, and obtaining an enterprise vector based on all the document vectors, comprises:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding the result to obtain a document vector of each temporary report;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
4. The method of claim 1, wherein said reducing the dimension of the enterprise vector comprises: reducing the dimension of the enterprise vector by using a principal component analysis method.
5. An apparatus for mining temporary report semantic information, the apparatus comprising:
the data acquisition module is used for acquiring temporary report text data and constructing a word list based on the temporary report text data;
the word vector training module is used for training the word vector of each word in the word list based on a BERT model and simultaneously acquiring the TFIDF value of each word;
an enterprise vector acquisition module, configured to acquire a document vector of each temporary report based on the word vector and the TFIDF value of each word, and acquire an enterprise vector based on all the document vectors;
and the semantic information mining module is used for reducing the dimension of the enterprise vector and acquiring the semantic information of the temporary report.
6. The apparatus of claim 5, wherein the data acquisition module acquiring temporary report text data and constructing a word list based on the temporary report text data comprises:
acquiring temporary report text data, converting the temporary report text data into a format which can be recognized by a computer, and summarizing the temporary report text data into a corpus;
acquiring a user-defined dictionary and a stop word dictionary based on the corpus;
and performing punctuation removal, letter removal, digit removal, word segmentation, stop word removal and sparse word removal on the corpus by utilizing the user-defined dictionary and the stop word dictionary to form a word list.
7. The apparatus of claim 5, wherein the enterprise vector acquisition module acquiring a document vector of each temporary report based on the word vector and the TFIDF value of each word, and acquiring an enterprise vector based on all the document vectors, comprises:
taking the TFIDF value of each word as weight, multiplying the TFIDF value of each word by the word vector of each word correspondingly, and adding the result to obtain the document vector of the temporary report;
and obtaining the enterprise vector by carrying out arithmetic mean on all the document vectors.
8. The apparatus of claim 5, wherein the semantic information mining module reducing the dimension of the enterprise vector comprises: reducing the dimension of the enterprise vector by using a principal component analysis method.
9. A computer-readable storage medium storing a computer program for mining temporary report semantic information, wherein the computer program causes a computer to execute the temporary report semantic information mining method according to any one of claims 1 to 4.
10. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the temporary report semantic information mining method of any one of claims 1 to 4.
CN202011415777.6A 2020-12-07 2020-12-07 Temporary report semantic information mining method and device, storage medium and electronic equipment Pending CN112597761A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011415777.6A CN112597761A (en) 2020-12-07 2020-12-07 Temporary report semantic information mining method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112597761A (en) 2021-04-02

Family

ID=75188956

Country Status (1)

Country Link
CN (1) CN112597761A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960802A (en) * 2019-03-19 2019-07-02 Sichuan University The information processing method and device of narrative text are reported for aviation safety
US20200026759A1 (en) * 2018-07-18 2020-01-23 The Dun & Bradstreet Corporation Artificial intelligence engine for generating semantic directions for websites for automated entity targeting to mapped identities
CN111538836A (en) * 2020-04-22 2020-08-14 Harbin Institute of Technology (Weihai) Method for identifying financial advertisements in text advertisements
US20200349199A1 (en) * 2019-05-03 2020-11-05 Servicenow, Inc. Determining semantic content of textual clusters

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FENG MAI ET AL.: "deep learning models for bankruptcy prediction using textual disclosures", 《EUROPEAN JOURNAL OF OPERATIONAL RESEARCH》, 16 April 2019 (2019-04-16), pages 743 - 758 *
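YAO JIAQUAN; ZHANG KUNPENG; LUO PING: "Text big data mining methods and research progress in finance" (金融学文本大数据挖掘方法与研究进展), Economic Perspectives (经济学动态), no. 04, 18 April 2020 (2020-04-18), pages 145 - 160 *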

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117574916A (en) * 2023-12-12 2024-02-20 Hefei University of Technology Temporary report semantic analysis method and system
CN117574916B (en) * 2023-12-12 2024-05-10 Hefei University of Technology Temporary report semantic analysis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination