CN114841173A - Academic text semantic feature extraction method and system based on pre-training model and storage medium - Google Patents
Academic text semantic feature extraction method and system based on pre-training model and storage medium
Info
- Publication number
- CN114841173A (application number CN202210778073.8A)
- Authority
- CN
- China
- Prior art keywords
- model
- academic
- training
- text
- training model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a method, a system and a storage medium for extracting academic text semantic features based on a pre-training model, wherein the method comprises the following steps: acquiring academic resource text data; inputting the acquired academic resource text data into a pre-training model to obtain a multi-dimensional academic text semantic feature vector, the pre-training model being a student pre-training model obtained by fine-tuning a Bert pre-training model with a multiple negative sample loss function and then training the student model by knowledge distillation, with the fine-tuned Bert pre-training model serving as the teacher model; and performing dimension reduction compression on the multi-dimensional academic text semantic feature vector and outputting the final academic text semantic features. The method and the system improve the quality of vector generation while accelerating the speed of vector generation, and are suitable for text vector generation in academic big data scenes.
Description
Technical Field
The invention relates to the technical field of big data, in particular to an academic text semantic feature extraction method and system based on a pre-training model and a storage medium.
Background
Academic resources exhibit more complex features than traditional internet data, and word-frequency-based statistical models, topic models or deep-learning-based word vector representation methods are generally used to extract their text representation features. TF-IDF (term frequency-inverse document frequency) is a typical text vector representation method: it extracts text features statistically, computes document word weights from word frequency and inverse document frequency, and constructs a document vector representation from the weights of all words in the document. Wuji et al. proposed the TTF-LDA (Time + TF-IDF + Latent Dirichlet Allocation) algorithm, which builds on TF-IDF and LDA and uses topic analysis to process abstracts of academic literature. Mikolov et al. presented the Word2Vec model, which uses the continuous bag-of-words and Skip-Gram architectures to obtain hidden-layer vector representations of words through word prediction tasks.
With the rapid development of artificial intelligence, deep learning is also used for feature extraction of academic texts. Autoencoders can efficiently learn semantic representations of text data, and Eisa et al. proposed extracting a vocabulary feature set with a deep autoencoder technique. The recurrent neural network (RNN) can process inputs at different time steps, is well suited to extracting features from sequential text data, and is widely used in text processing tasks. On this basis, to address the vanishing-gradient problem of RNNs over long distances, the RNN has been improved into LSTM (long short-term memory) and GRU (gated recurrent unit) models, which retain long-distance semantic information by combining memory, forget and output stages, so that they achieve better feature extraction on long texts. Devlin et al. proposed the Bert (Bidirectional Encoder Representations from Transformers) pre-training model, which uses a multi-head self-attention mechanism to capture contextual semantics to the greatest extent and achieves good results on multiple natural language tasks.
The Bert pre-training model is an encoder built on a bidirectional transformer. Compared with traditional models, during pre-training the Bert model captures word-level semantic representations of a text with a masked language model (Masked LM) and captures semantic relations between sentences with next sentence prediction (NSP) to obtain sentence-level semantic representations. To ensure pre-training quality, the Bert pre-training model probabilistically selects adjacent and non-adjacent sentence pairs as input, ensuring that the model can understand the associations between different sentences. The bidirectional transformer of the Bert pre-training model introduces an attention mechanism to learn the internal associations within sentences, within target sentences and between source and target sentences, using a multi-head attention mechanism and a full-attention structure.
The pre-trained Bert model can be fine-tuned for a specific NLP (Natural Language Processing) task, so that a single model can adapt to many different NLP tasks and computing resources are saved. However, the text vectors produced by the current Bert model perform poorly for academic resource text feature representation: the generated vector representations exhibit a "collapse" phenomenon, that is, the Bert model tends to encode all sentences into a small region of the embedding space, which gives most sentence pairs a high similarity score, even pairs that are completely semantically unrelated. In addition, the vector generation speed of the current Bert model is too slow, which greatly limits the speed of academic text feature extraction.
Therefore, how to accelerate the vector generation speed while improving the vector generation quality for the generation of text vectors in academic big data scenes is a problem to be solved urgently.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an academic text semantic feature extraction method and system based on a pre-training model, so as to improve the vector generation quality and improve the vector generation speed.
One aspect of the invention provides an academic text semantic feature extraction method based on a pre-training model, which comprises the following steps:
acquiring academic resource text data;
inputting the acquired academic resource text data into a pre-training model to obtain a multi-dimensional academic text semantic feature vector; the pre-training model is a student pre-training model obtained by fine-tuning the Bert pre-training model with a multiple negative sample loss function and training the student model by knowledge distillation, with the fine-tuned Bert pre-training model serving as the teacher model;
and performing dimensionality reduction compression on the multidimensional academic text semantic feature vector, and outputting a final academic text semantic feature vector.
In some embodiments of the invention, webpage academic resource data is crawled with crawler technology to obtain academic resource text data; in the process of crawling webpage academic resource text data with a Scrapy crawler, for a webpage to be crawled that is protected by an anti-crawling mechanism, the document ID is extracted from the original URL of the webpage, a new URL is constructed from the extracted document ID, and the crawler is directed to a detail page without the anti-crawling mechanism, so that the complete document information of the webpage to be crawled is obtained.
In some embodiments of the present invention, fine-tuning the Bert pre-training model with the multiple negative sample loss function and training the student model by knowledge distillation, with the fine-tuned Bert pre-training model as the teacher model, includes: fine-tuning the Bert pre-training model on a natural language inference data set or a semantic textual similarity benchmark data set using the multiple negative sample loss function; and training the student model by knowledge distillation on a wiki data set, with the fine-tuned Bert pre-training model as the teacher model; the input to the Bert pre-training model is sentence pairs carrying relation labels from the natural language inference data set.
In some embodiments of the invention, the multiple negative sample loss function satisfies the following equation:

L = -\frac{1}{K}\sum_{i=1}^{K}\left(\sigma(u_i,v_i) - \log\sum_{j=1}^{K}e^{\sigma(u_i,v_j)}\right)

wherein u and v respectively denote the sentence vector sequences [u_1, …, u_i, …, u_K] and [v_1, …, v_i, …, v_K] obtained from the Bert pre-training model, σ(u_i, v_j) denotes the dot product between the sentence vectors u_i and v_j computed from the pre-training model outputs, and K denotes the number of sentence pairs input to the Bert pre-training model.
In some embodiments of the present invention, the loss function used in the student model training process is an MSE loss function, which is expressed as:

MSE = \frac{1}{N}\sum_{n=1}^{N}\left(S_t^{(n)} - S_s^{(n)}\right)^2

wherein S_t^{(n)} denotes a sentence vector generated by the teacher model, S_s^{(n)} denotes the corresponding sentence vector generated by the student model, and N denotes the number of sentence vectors.
In some embodiments of the invention, the method further comprises a student pre-training model training step, which comprises: fine-tuning the Bert pre-training model with the multiple negative sample loss function on a natural language inference data set or a semantic textual similarity benchmark data set to obtain the fine-tuned Bert pre-training model; and training the student model by knowledge distillation, with the fine-tuned Bert pre-training model as the teacher model, to obtain the student pre-training model.
In some embodiments of the present invention, performing dimension reduction compression on the multidimensional academic text semantic feature vector comprises: and performing dimensionality reduction compression on the multidimensional academic text feature vector output by the pre-training model by using a principal component analysis dimensionality reduction algorithm.
In some embodiments of the invention, the academic resource text data comprises structured academic resource text data and/or unstructured academic resource text data.
In some embodiments of the invention, when training the student model by knowledge distillation with the fine-tuned Bert pre-training model as the teacher model, the teacher model comprises 12 hidden layers, and hidden layers [1,4,7,10] of the teacher model are retained as the hidden layers of the student model.
In another aspect of the present invention, there is also provided an academic text semantic feature extraction system based on a pre-trained model, the system including a processor and a memory, the memory storing computer instructions, the processor being configured to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the system implementing the steps of the method as described above.
In another aspect of the present invention, a computer storage medium is also provided, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method as set forth above.
The academic text semantic feature extraction method and system based on the pre-training model improve the quality of vector generation while accelerating the vector generation speed, and are suitable for text vector generation in academic big data scenes.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a diagram illustrating fine tuning of a Bert pre-trained model according to an embodiment of the present invention.
FIG. 2 is a schematic illustration of model compression by knowledge distillation of a fine tuned Bert model in one embodiment of the present invention.
Fig. 3 is a schematic flow chart of an academic text semantic feature extraction method based on a pre-training model in an embodiment of the present invention.
FIG. 4 is a diagram illustrating the collection, pre-processing, and storage of academic resources according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in the drawings and the description, the same reference numerals are used for similar or identical parts, and the drawings may be simplified for convenience. Furthermore, implementations not shown or described in the drawings are of a form known to those of ordinary skill in the art. Additionally, although examples of parameters with particular values may be provided herein, it should be appreciated that the parameters need not be exactly equal to those values, but may approximate them within acceptable error margins or design constraints.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details not so related to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
Aiming at the problem that the text vectors produced by the existing Bert pre-training model (Bert model for short) perform poorly in academic resource text feature representation scenarios, the invention proposes a new pre-training model built on the Bert pre-training model and provides an academic text semantic feature extraction method based on this new pre-training model. The new pre-training model is a deep learning model generated by combining model fine-tuning (Finetune), knowledge distillation (Distilling) and principal component analysis (PCA) dimensionality reduction on the basis of the existing Bert model; based on this new pre-training model, text semantic features can be extracted from academic resource texts accurately and quickly, so that text vectors can be generated more accurately and more rapidly. In the embodiment of the invention, the proposed new pre-training model is called the Bert-FDP pre-training model, where FDP is an acronym of Finetune, Distilling and PCA, representing fine-tuning, knowledge distillation and PCA dimensionality reduction optimization.
The Bert-FDP pre-training model proposed by the embodiment of the present invention will be explained below.
Model fine-tuning means that an already trained model is trained again in a new task scenario to adjust the model parameters, so that the model achieves a better effect on its original basis. In an embodiment of the present invention, the Bert model is fine-tuned on a natural language inference (NLI) data set to learn the latent semantic entailment relations in that data set. The natural language inference data set used as training data may adopt training samples from existing natural language inference data sets, which consist of a large number of sentence pairs. FIG. 1 is a schematic diagram illustrating fine-tuning of the Bert pre-training model in an embodiment of the present invention. As shown in FIG. 1, the Bert pre-training model is trained on the natural language inference data set with a two-tower structure. The training data consist of a number of sentence pairs with entailment relations, and the two inputs of the two-tower structure are sentence sequences: the first input is the sentence sequence headed by sentence 1, and the second input is the sentence sequence headed by sentence 2. Sentence 1 and sentence 2 form a similar sentence pair, while sentence 1 is not similar to the other sentences in the second input and sentence 2 is not similar to the other sentences in the first input. The input sentence pairs are processed by the Bert base network structure, which is formed by stacking multiple layers of transformer encoders; each encoder layer of the Bert base network consists of a multi-head attention layer and a feed-forward layer. The Bert pre-training model captures the bidirectional relations between sentences; the sentence vectors output by the Bert base network are reduced in dimension by a pooling layer to obtain the sentence vector sequences [u_1, u_2, …, u_n] and [v_1, v_2, …, v_n], and then the similarity between the sentences in the two sequences is computed, for example by dot product.
In the embodiment of the invention, the training data of the multiple negative sample loss function consist of sentence pairs (a_i, b_i), where (a_i, b_i) is a similar sentence pair and (a_i, b_j) with i ≠ j is a dissimilar pair. The multiple negative sample loss function minimizes the distance between the (a_i, b_i) pairs while simultaneously maximizing the distance between the (a_i, b_j) pairs for all i ≠ j. For the natural language inference data set, the invention inputs the sentence pairs carrying entailment labels into the pre-training model of the invention as positive examples; for a batch formed by a group of sentence pairs, the specific loss function L takes the following form:

L = -\frac{1}{K}\sum_{i=1}^{K}\left(\sigma(u_i,v_i) - \log\sum_{j=1}^{K}e^{\sigma(u_i,v_j)}\right)

wherein u and v respectively denote the sentence vector sequences [u_1, …, u_i, …, u_K] and [v_1, …, v_i, …, v_K], σ(u_i, v_j) denotes the dot product between the sentence vectors u_i and v_j computed from the pre-training model outputs, and K denotes the number of sentence pairs input to the Bert pre-training model.
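By way of illustration only, a minimal PyTorch sketch of this batch-level multiple negative sample loss is given below; the function name and tensor shapes are assumptions made for this example and are not part of the original disclosure.

```python
import torch
import torch.nn.functional as F

def multiple_negatives_loss(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Multiple negative sample loss for a batch of K sentence pairs.

    u, v: (K, d) sentence vectors; (u_i, v_i) is a similar pair, and every
    (u_i, v_j) with i != j acts as an in-batch negative pair.
    """
    scores = u @ v.T                                   # (K, K) dot products sigma(u_i, v_j)
    labels = torch.arange(u.size(0), device=u.device)  # the matching column for each row
    # Row-wise cross-entropy equals -sigma(u_i, v_i) + log(sum_j exp(sigma(u_i, v_j))),
    # averaged over i, i.e. the loss L described above.
    return F.cross_entropy(scores, labels)
```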
In the embodiment of the invention, the vectors generated by the Bert model are used to construct the set of dot products σ(u_i, v_j), and the Bert model is then fine-tuned with the multiple negative sample loss function. Compared with the existing approach of combining the vectors u and v into a single joint vector and feeding it to a Softmax classifier with a fully connected layer, fine-tuning the Bert model with the multiple negative sample loss function yields a better sentence representation effect.
The multiple negative sample loss function reduces the distance between similar sentence vectors while pushing the sentence vectors of negative examples farther apart, so that the Bert-FDP model can generate more accurate and reasonable sentence vectors.
For the pre-training model fine-tuned with the multiple negative sample loss function, in order to ensure the sentence embedding generation speed in a big data scene, the invention further compresses the model by knowledge distillation to produce a student network model, and uses this student network model as the pre-training model to generate sentence embedding vectors of academic texts (such as titles of scientific research results), thereby optimizing the vector generation speed while preserving the sentence vector quality. Knowledge distillation refers to using the outputs of a teacher network model to train a student network model (student model for short) that has fewer layers and runs faster. In the embodiment of the invention, the teacher network model is the Bert pre-training model fine-tuned with the multiple negative sample loss function. FIG. 2 is a schematic illustration of model compression by knowledge distillation of the fine-tuned Bert model in one embodiment of the present invention. In an embodiment of the present invention, the student model is trained on a wiki data set. The teacher model may include 12 hidden layers; for example, in selecting the student model, hidden layers [1,4,7,10] of the teacher model are retained as the hidden layers of the student model. On this basis, an external sentence corpus is used to generate sentence vectors, and an MSE (Mean Squared Error) loss function is used to compare the generated vectors. The hidden layers of the teacher model retained as the hidden layers of the student model here are merely an example; the present invention is not limited thereto, and more, fewer, or other layers may be used as the hidden layers of the student model.
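As an illustration only, the following sketch shows one way such a 4-layer student could be built from a 12-layer teacher using the HuggingFace transformers library; the checkpoint path is a hypothetical placeholder, and this is not asserted to be the exact construction used in the embodiment.

```python
import copy
import torch
from transformers import BertConfig, BertModel

# Hypothetical path to the Bert teacher fine-tuned with the multiple negative sample loss.
teacher = BertModel.from_pretrained("./bert-teacher-finetuned")

keep = [1, 4, 7, 10]  # teacher hidden layers retained for the student

student_config = BertConfig.from_pretrained("./bert-teacher-finetuned")
student_config.num_hidden_layers = len(keep)
student = BertModel(student_config)

# Initialize the student from the teacher: copy the embeddings and the selected encoder layers.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
student.encoder.layer = torch.nn.ModuleList(
    [copy.deepcopy(teacher.encoder.layer[i]) for i in keep]
)
```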
In one embodiment of the present invention, the MSE loss function is expressed as:

MSE = \frac{1}{N}\sum_{n=1}^{N}\left(S_t^{(n)} - S_s^{(n)}\right)^2

wherein S_t^{(n)} denotes a sentence vector generated by the teacher model, S_s^{(n)} denotes the corresponding sentence vector generated by the student model, and N denotes the number of sentence vectors. In the embodiment of the invention, after the student model has been generated by knowledge distillation on the basis of the fine-tuned Bert pre-training model, the student model is used as the pre-training model to extract semantic features of academic texts, which improves the vector generation quality while accelerating the vector generation speed.
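For illustration only, a minimal PyTorch sketch of one distillation step is given below: the teacher's sentence vectors supervise the student's sentence vectors through the MSE loss. The mean-pooling helper and function names are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state, attention_mask):
    # Average the token embeddings (ignoring padding) to obtain one sentence vector.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def distillation_step(teacher, student, batch, optimizer):
    with torch.no_grad():
        t_hidden = teacher(**batch).last_hidden_state
        t_vec = mean_pool(t_hidden, batch["attention_mask"])   # teacher sentence vectors S_t
    s_hidden = student(**batch).last_hidden_state
    s_vec = mean_pool(s_hidden, batch["attention_mask"])       # student sentence vectors S_s
    loss = F.mse_loss(s_vec, t_vec)  # mean squared error between student and teacher vectors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```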
After the trained student model is used as a pre-training model for extraction of semantic features of academic texts, the vector output by the student model can be reduced by using a dimension reduction algorithm in the embodiment of the invention, so that the time consumption of similarity comparison is further reduced. A dimension reduction algorithm can be selected from the existing dimension reduction algorithms to perform dimension reduction compression on the feature vectors output by the student model. Existing dimension reduction algorithms may include Principal Component Analysis (PCA) dimension reduction algorithms, Independent Component Analysis (ICA) dimension reduction algorithms, Linear Discriminant Analysis (LDA) dimension reduction algorithms, and the like, but the present invention is not limited thereto.
Taking the PCA dimension reduction algorithm as an example: after data preprocessing (data cleaning), PCA reduces high-dimensional data through a linear transformation (eigendecomposition of the covariance matrix), projecting the high-dimensional data onto a low-dimensional space. Assuming the data output by the student model are n pieces of d-dimensional data, they can be arranged into a matrix X with n rows and d columns. After zero-centering each column of X (i.e., subtracting the column mean), the covariance matrix is computed and its eigenvalues and corresponding eigenvectors are obtained; the eigenvectors are then arranged as rows from top to bottom in descending order of eigenvalue, and the first k rows form a matrix P. The data reduced to k dimensions are obtained as Y = XPᵀ. Since the PCA dimension reduction algorithm in the embodiment of the present invention may adopt an existing PCA dimension reduction algorithm, it is not described in further detail here.
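As a non-limiting sketch, the NumPy code below follows the steps just described (zero-centering, covariance, eigendecomposition, projection); the function name is an assumption made for this example.

```python
import numpy as np

def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Reduce an n x d matrix X of sentence vectors to n x k dimensions."""
    X_centered = X - X.mean(axis=0)            # zero-center each column
    cov = np.cov(X_centered, rowvar=False)     # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    P = eigvecs[:, top].T                      # k x d projection matrix
    return X_centered @ P.T                    # Y = X P^T, the n x k reduced data
```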
The embodiment of the invention effectively combines fine-tuning of the pre-training model, knowledge distillation of the fine-tuned model, and compression of the vector dimension by applying principal component analysis (PCA) to the vectors generated by the student model, thereby not only ensuring the sentence embedding quality of the pre-training model but also accelerating the sentence embedding generation speed and the vector similarity calculation speed.
The actual effect of generating sentence vectors with Bert-FDP was verified on a number of Chinese and English text-matching data sets. In the comparison experiments on English data sets, a data set containing 980,000 sentence pairs was constructed from the SNLI (Stanford Natural Language Inference) and MultiNLI (Multi-Genre Natural Language Inference) corpora as training data for fine-tuning the Bert model; in the knowledge distillation process, the student model was trained on a wiki data set containing 7.87 million sentences; finally, the sentence vectors produced after PCA dimension reduction were used in comparison experiments on public semantic similarity matching data sets.
Experimental results show that the correlation coefficient values of the trained Bert-FDP model in each data set are improved compared with the correlation coefficient values of a common text feature extraction model.
In another embodiment of the present invention, the Semantic Textual Similarity Benchmark (STS-B) data set may be used as training data to fine-tune the Bert pre-training model with the multiple negative sample loss function. Experiments show that fine-tuning the Bert pre-training model on the STS-B data set can improve the feature vector extraction accuracy by 13%, which indicates that the Bert-FDP model can generate better text vectors, thereby effectively alleviating the problems of academic text vector collapse and low discrimination.
The following describes an implementation method for performing semantic feature extraction on academic texts based on the pre-training model generated as above in the embodiment of the present invention. Fig. 3 is a schematic flow chart of the academic text semantic feature extraction method based on the pre-training model in the embodiment of the present invention. As shown in fig. 3, the method comprises the steps of:
step S110, acquiring academic resource text data.
By way of example, in acquiring academic resource text data, online academic data (i.e., webpage academic resource data) can be crawled and collated using crawler technology (e.g., Scrapy crawler technology). Regular expressions, XPath or CSS (Cascading Style Sheets) selectors can be adopted to formulate corresponding crawling rules for different types of webpage academic resource data, so as to acquire academic information such as papers, scholars, patents and projects from the Internet and various knowledge bases. For example, as shown in FIG. 4, crawling of paper details and patent details can be realized with URL parsing and splicing rules for webpage academic resource data such as paper index pages and patent publication pages, and crawling of subject field tree content can be realized by recursive crawling of subject field tree pages. In acquiring and processing academic resource data, when a Scrapy crawler is used to crawl webpage academic resource data related to scientific research results such as papers and patents, the data items of the original detail pages are often incomplete because the original webpages frequently employ an anti-crawling mechanism. To solve this problem, during crawling the document ID is extracted from the original URL of the webpage to be crawled and used to construct a new URL, and the crawler is directed to a detail page without the anti-crawling mechanism, so that the complete document information of the webpage to be crawled, and thus complete scientific research result data, can be obtained.
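As an illustration only, a minimal sketch of this ID-based URL reconstruction inside a Scrapy spider is given below; the URL pattern, the detail-page template and the selector are hypothetical placeholders, not the actual target site.

```python
import re
import scrapy

# Hypothetical patterns: the query parameter carrying the document ID on the protected
# page, and the template of a detail page that is not behind the anti-crawling mechanism.
DOC_ID_PATTERN = re.compile(r"[?&]docid=([A-Za-z0-9]+)")
DETAIL_URL_TEMPLATE = "https://example.org/detail/{doc_id}"

class PaperSpider(scrapy.Spider):
    name = "paper_spider"
    start_urls = ["https://example.org/list?docid=ABC123"]  # hypothetical index page

    def parse(self, response):
        match = DOC_ID_PATTERN.search(response.url)
        if match is None:
            return
        # Build the new URL from the extracted document ID and direct the crawler
        # to the detail page so the complete document information can be collected.
        detail_url = DETAIL_URL_TEMPLATE.format(doc_id=match.group(1))
        yield scrapy.Request(detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {"title": response.css("h1::text").get(), "url": response.url}
```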
Based on the crawled document content, academic resource text data may be obtained, which may include structured text data and unstructured text data, where structured text data refers to text data that can be expressed with a two-dimensional table logic, and unstructured text data refers to text data that cannot conveniently be represented in that way. For example, the unstructured text data may include paper abstract texts, patent abstract texts and other texts, while the structured text data obtained by parsing may include formatted text data under attributes such as scholar, research institution, subject field, patent, paper and conference. The Scrapy crawler technology can preliminarily realize the screening and capture of academic resources.
In the process of crawling academic resource text data, the acquired structured text data and their attributes are stored directly in a MySQL database, and the structured text data may also comprise structured entity data; for unstructured text data, in order to facilitate text retrieval by the knowledge service component, the invention stores the unstructured text data in an Elasticsearch database based on an inverted index. Furthermore, the invention can use an information extraction algorithm based on deep semantic representation to extract the associations between academic subject-term entities, subject fields and academic entities in the academic texts; by combining the structured entity data stored in the MySQL database with the relations between entities, entity-relation-entity triples can be constructed and stored in a Neo4j database.
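For illustration only, the sketch below shows how the two kinds of text data could be routed to the two stores described above; the connection parameters, table schema and index name are assumptions made for this example.

```python
import pymysql
from elasticsearch import Elasticsearch

# Hypothetical connection settings.
mysql_conn = pymysql.connect(host="localhost", user="root", password="***", database="academic")
es = Elasticsearch("http://localhost:9200")

def store_structured(record: dict) -> None:
    # Structured attribute data (scholar, institution, patent, paper, ...) goes to MySQL.
    with mysql_conn.cursor() as cursor:
        cursor.execute(
            "INSERT INTO paper (doc_id, title, authors) VALUES (%s, %s, %s)",
            (record["doc_id"], record["title"], record["authors"]),
        )
    mysql_conn.commit()

def store_unstructured(doc_id: str, abstract: str) -> None:
    # Unstructured abstract text goes to the inverted-index based Elasticsearch store.
    es.index(index="abstracts", id=doc_id, document={"abstract": abstract})
```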
Step S110 thus covers both the acquisition and the storage of academic resource text data: structured and unstructured academic resource text data can be obtained and stored in the corresponding databases, realizing the structuring of multi-source, multi-field academic achievements. In this step, academic resource text information such as papers, scholars, patents and projects can be obtained from the Internet and various knowledge bases.
And step S120, inputting the acquired academic resource text data into a pre-training model to obtain a multi-dimensional academic text semantic feature vector.
In this step, the pre-training model is the student pre-training model obtained through the steps described above. That is, the student pre-training model is obtained by fine-tuning the Bert pre-training model with the multiple negative sample loss function and then training the student model by knowledge distillation, with the fine-tuned Bert pre-training model as the teacher model.
The academic resource text data obtained in step S110 may be input to a student pre-training model, and the student pre-training model outputs a multidimensional academic text semantic feature vector.
In the embodiment of the invention, the academic resource text data input to the student pre-training model can be the acquired structured text data or unstructured text data. For example, in the case that the input data is unstructured text data, it may be text data such as paper abstract texts, patent abstract texts or other academic text types; in the case that the input data is structured text data, it may be text data under a specific attribute or entity content text data.
And step S130, performing dimensionality reduction compression on the multi-dimensional academic text semantic feature vector output by the student pre-training model, and outputting the final academic text semantic features.
In the embodiment of the invention, a dimension reduction algorithm can be selected from the existing dimension reduction algorithms to perform dimension reduction compression on the feature vector. Existing dimension reduction algorithms may include Principal Component Analysis (PCA) dimension reduction algorithms, Independent Component Analysis (ICA) dimension reduction algorithms, Linear Discriminant Analysis (LDA) dimension reduction algorithms, and the like, but the present invention is not limited thereto.
As an example, the multidimensional vector data output by the student pre-training model is subjected to dimensionality reduction through a Principal Component Analysis (PCA) dimensionality reduction algorithm to obtain a final academic text semantic feature vector.
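By way of illustration only, an end-to-end sketch of steps S120 and S130 is given below, assuming the distilled student model has been exported in sentence-transformers format under a hypothetical local path; the example texts and the target dimension are likewise assumed placeholder values.

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Hypothetical path to the distilled student pre-training model.
student = SentenceTransformer("./bert-fdp-student")

# A few academic titles standing in for the texts obtained in step S110.
texts = [
    "A survey of knowledge distillation for pre-trained language models",
    "Semantic feature extraction from patent abstracts",
    "Principal component analysis for sentence embedding compression",
]

embeddings = student.encode(texts)   # step S120: multi-dimensional semantic feature vectors

# Step S130: PCA dimension reduction; in practice the target dimension (e.g. 128) is fit
# on the full corpus, here it is bounded by the toy sample size so the example runs as written.
k = min(128, len(texts), embeddings.shape[1])
features = PCA(n_components=k).fit_transform(embeddings)
print(features.shape)
```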
According to the invention, the transformer model is first pre-trained on large-scale corpora before any specific downstream task is executed; based on the pre-trained parameters, the Bert pre-training model is then fine-tuned with the multiple negative sample loss function under the specific downstream task, and the fine-tuned Bert pre-training model is used as the teacher model to train the student model by knowledge distillation. Using the resulting student pre-training model as the pre-training model for academic text semantic feature extraction achieves a better effect: the vector generation speed is accelerated while the vector generation quality is improved, which suits text vector generation in academic big data scenes.
In the embodiment of the invention, because the Bert pre-training model is trained on the natural language inference data set, the pre-training model fine-tuned on the basis of the Bert pre-training model contains more semantic information, and the multiple negative sample loss function reduces the distance between similar sentence vectors while pushing the sentence vectors of negative examples farther apart, so that the pre-training model can generate more accurate and reasonable sentence vectors. Meanwhile, PCA dimension reduction corrects the distribution of sentence vectors to a certain degree, making the computed similarities more reasonable. And because the number of model layers is reduced, the model generates vectors faster than the original model.
The embodiment of the invention effectively combines fine-tuning of the pre-training model, knowledge distillation and PCA dimension reduction. The method improves the quality of vector generation, accelerates the speed of vector generation, and is suitable for text vector generation in academic big data scenes. Because the Bert pre-training model is fine-tuned with the multiple negative sample loss function on the natural language inference data set, the distribution of semantic vectors in academic big data scenes is markedly improved. The Bert-FDP pre-training model provided by the invention realizes accurate vector representation of academic resource texts while ensuring the sentence embedding generation speed in a big data scene.
Experimental results show that the correlation coefficient values of the trained Bert-FDP model on each data set are improved over those of commonly used text feature extraction models, indicating that the Bert-FDP model can generate better text vectors and effectively alleviates the problems of academic text vector collapse and low discrimination.
Corresponding to the above method, the invention also provides an academic text semantic feature extraction system based on the pre-training model, which comprises a processor and a memory, wherein the memory is used for storing computer instructions and the processor is used for executing the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method described above.
Embodiments of the present invention further provide a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the academic text semantic feature extraction method described above. The computer-readable storage medium may be a tangible storage medium such as Random Access Memory (RAM), Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disks, removable storage disks, CD-ROMs, or any other form of storage medium known in the art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A semantic feature extraction method for academic texts based on a pre-training model is characterized by comprising the following steps:
acquiring academic resource text data;
inputting the acquired academic resource text data into a pre-training model to obtain a multi-dimensional academic text semantic feature vector; the pre-training model is a student pre-training model obtained by fine-tuning a Bert pre-training model with a multiple negative sample loss function and training the student model by knowledge distillation, with the fine-tuned Bert pre-training model serving as the teacher model;
and performing dimensionality reduction compression on the multidimensional academic text semantic feature vector, and outputting a final academic text semantic feature vector.
2. The method of claim 1, wherein obtaining academic resource text data comprises: crawling webpage academic resource data through a crawler technology to obtain academic resource text data;
in the process of crawling webpage academic resource text data with a Scrapy crawler, for a webpage to be crawled that is protected by an anti-crawling mechanism, extracting the document ID from the original URL of the webpage to be crawled, constructing a new URL with the extracted document ID, and directing the crawler to a detail page without the anti-crawling mechanism, thereby obtaining the complete document information of the webpage to be crawled.
3. The method of claim 1, wherein fine-tuning the Bert pre-training model with the multiple negative sample loss function and training the student model by knowledge distillation, with the fine-tuned Bert pre-training model as the teacher model, comprises: fine-tuning the Bert pre-training model on a natural language inference data set or a semantic textual similarity benchmark data set using the multiple negative sample loss function, and training the student model by knowledge distillation on a wiki data set, with the fine-tuned Bert pre-training model as the teacher model;
and the input of the Bert pre-training model is sentence pairs containing relation labels from the natural language inference data set.
4. The method of claim 3, wherein the multiple negative sample loss function satisfies the following equation:

L = -\frac{1}{K}\sum_{i=1}^{K}\left(\sigma(u_i,v_i) - \log\sum_{j=1}^{K}e^{\sigma(u_i,v_j)}\right)

wherein u and v respectively denote the sentence vector sequences [u_1, …, u_i, …, u_K] and [v_1, …, v_i, …, v_K] obtained from the Bert pre-training model, σ(u_i, v_j) denotes the dot product between the sentence vectors u_i and v_j computed from the pre-training model outputs, and K denotes the number of sentence pairs input to the Bert pre-training model.
5. The method of claim 1, wherein the loss function used in the student model training process is an MSE loss function, the MSE loss function being expressed as:

MSE = \frac{1}{N}\sum_{n=1}^{N}\left(S_t^{(n)} - S_s^{(n)}\right)^2

wherein S_t^{(n)} denotes a sentence vector generated by the teacher model, S_s^{(n)} denotes the corresponding sentence vector generated by the student model, and N denotes the number of sentence vectors.
6. The method of any one of claims 1-5, wherein performing dimension reduction compression on the multidimensional academic text semantic feature vector comprises: and performing dimensionality reduction compression on the multidimensional academic text feature vector output by the pre-training model by using a principal component analysis dimensionality reduction algorithm.
7. The method of any of claims 1-5, wherein the academic resource text data comprises structured academic resource text data and/or unstructured academic resource text data.
8. The method of claim 1, wherein, when training the student model by knowledge distillation with the fine-tuned Bert pre-training model as the teacher model, the teacher model comprises 12 hidden layers, and hidden layers [1,4,7,10] of the teacher model are retained as the hidden layers of the student model.
9. An academic text semantic feature extraction system based on a pre-trained model, the system comprising a processor and a memory, wherein the memory stores computer instructions, the processor is used for executing the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the system realizes the steps of the method according to any one of claims 1 to 8.
10. A computer storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210778073.8A CN114841173B (en) | 2022-07-04 | 2022-07-04 | Academic text semantic feature extraction method and system based on pre-training model and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210778073.8A CN114841173B (en) | 2022-07-04 | 2022-07-04 | Academic text semantic feature extraction method and system based on pre-training model and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114841173A true CN114841173A (en) | 2022-08-02 |
CN114841173B CN114841173B (en) | 2022-11-18 |
Family
ID=82573934
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210778073.8A Active CN114841173B (en) | 2022-07-04 | 2022-07-04 | Academic text semantic feature extraction method and system based on pre-training model and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114841173B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116187163A (en) * | 2022-12-20 | 2023-05-30 | 北京知呱呱科技服务有限公司 | Construction method and system of pre-training model for patent document processing |
CN117116408A (en) * | 2023-10-25 | 2023-11-24 | 湖南科技大学 | Relation extraction method for electronic medical record analysis |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110852426A (en) * | 2019-11-19 | 2020-02-28 | 成都晓多科技有限公司 | Pre-training model integration acceleration method and device based on knowledge distillation |
CN111062489A (en) * | 2019-12-11 | 2020-04-24 | 北京知道智慧信息技术有限公司 | Knowledge distillation-based multi-language model compression method and device |
US20210182662A1 (en) * | 2019-12-17 | 2021-06-17 | Adobe Inc. | Training of neural network based natural language processing models using dense knowledge distillation |
-
2022
- 2022-07-04 CN CN202210778073.8A patent/CN114841173B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110852426A (en) * | 2019-11-19 | 2020-02-28 | 成都晓多科技有限公司 | Pre-training model integration acceleration method and device based on knowledge distillation |
CN111062489A (en) * | 2019-12-11 | 2020-04-24 | 北京知道智慧信息技术有限公司 | Knowledge distillation-based multi-language model compression method and device |
US20210182662A1 (en) * | 2019-12-17 | 2021-06-17 | Adobe Inc. | Training of neural network based natural language processing models using dense knowledge distillation |
Non-Patent Citations (2)
Title |
---|
LU, WH等: "TwinBERT: Distilling Knowledge to Twin-Structured Compressed BERT Models for Large-Scale Retrieval", 《2020 | CIKM 20: PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT》 * |
岳增营等: "基于语言模型的预训练技术研究综述", 《中文信息学报》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116187163A (en) * | 2022-12-20 | 2023-05-30 | 北京知呱呱科技服务有限公司 | Construction method and system of pre-training model for patent document processing |
CN116187163B (en) * | 2022-12-20 | 2024-02-20 | 北京知呱呱科技有限公司 | Construction method and system of pre-training model for patent document processing |
CN117116408A (en) * | 2023-10-25 | 2023-11-24 | 湖南科技大学 | Relation extraction method for electronic medical record analysis |
CN117116408B (en) * | 2023-10-25 | 2024-01-26 | 湖南科技大学 | Relation extraction method for electronic medical record analysis |
Also Published As
Publication number | Publication date |
---|---|
CN114841173B (en) | 2022-11-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Karpathy et al. | Deep visual-semantic alignments for generating image descriptions | |
Hofmann | The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data | |
CN114841173B (en) | Academic text semantic feature extraction method and system based on pre-training model and storage medium | |
Gao et al. | Convolutional neural network based sentiment analysis using Adaboost combination | |
CN111581364B (en) | Chinese intelligent question-answer short text similarity calculation method oriented to medical field | |
CN118093834B (en) | AIGC large model-based language processing question-answering system and method | |
Wehrmann et al. | Order embeddings and character-level convolutions for multimodal alignment | |
CN111968700A (en) | Method and system for extracting rice phenomics knowledge map relation based on BERT | |
Cheng et al. | A semi-supervised deep learning image caption model based on Pseudo Label and N-gram | |
Miao et al. | Application of CNN-BiGRU Model in Chinese short text sentiment analysis | |
CN114265936A (en) | Method for realizing text mining of science and technology project | |
Patil et al. | Convolutional neural networks for text categorization with latent semantic analysis | |
CN115759119A (en) | Financial text emotion analysis method, system, medium and equipment | |
CN115344668A (en) | Multi-field and multi-disciplinary science and technology policy resource retrieval method and device | |
CN112100382B (en) | Clustering method and device, computer readable storage medium and processor | |
Imad et al. | Automated Arabic News Classification using the Convolutional Neural Network. | |
CN116595170A (en) | Medical text classification method based on soft prompt | |
CN114298020B (en) | Keyword vectorization method based on topic semantic information and application thereof | |
CN110674293A (en) | Text classification method based on semantic migration | |
Zhang et al. | Research on answer selection based on LSTM | |
Phat et al. | Vietnamese text classification algorithm using long short term memory and Word2Vec | |
CN112836014A (en) | Multi-field interdisciplinary-oriented expert selection method | |
Sangeetha et al. | Sentiment Analysis on Movie Reviews: A Comparative Analysis | |
El-Gayar | Automatic generation of image caption based on semantic relation using deep visual attention prediction | |
Nguyen et al. | Combining Multi-vision Embedding in Contextual Attention for Vietnamese Visual Question Answering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |