CN114818698A - Mixed word embedding method of natural language text and mathematical language text - Google Patents
- Publication number
- CN114818698A (application CN202210469691.4A)
- Authority
- CN
- China
- Prior art keywords
- mathematical
- language text
- expression
- relative position
- natural language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a mixed word embedding method for natural language text and mathematical language text, comprising the following steps: identifying and preprocessing the mixed text to obtain a mathematical resource data set consisting of text and mathematical expressions; position-coding the tree-structured mathematical expressions so that relative positions in the tree are translation-invariant; applying a unified position coding to the linearly structured text and the tree-structured mathematical expressions; and feeding the relative position codes into the attention module of a pre-training model, which is pre-trained on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction. After pre-training, each symbol obtains an embedded vector representation rich in context information.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a mixed word embedding method for natural language text and mathematical language text.
Background
Mathematical text refers to natural language text containing mathematical expressions; it is ambiguous and polymorphic, and appears widely in the STEM disciplines and in higher education. Natural language text has a linear structure, whereas a mathematical expression has a tree structure, and word embedding representations of such mixed text play a vital role in fields related to mathematical text. Traditional word embedding techniques are suited to processing text with linear characteristics and struggle with tree-structured mathematical expressions.
A mathematical expression can be represented by two principal tree structures. One is the Symbol Layout Tree (SLT), constructed according to the written layout of the expression and carrying its appearance information; the other is the Operator Tree (OPT), constructed according to the operator hierarchy in the expression and carrying its semantic information. In 2021, Peng et al. of Peking University proposed MathBERT, a BERT-based pre-training model for mathematical expressions that can produce word embeddings of mixed text. The authors feed the LaTeX sequence of the expression, the in-order traversal sequence of the OPT tree, and the context text sequence into BERT, and extract the OPT tree structure with an attention masking matrix so that adjacent tree nodes are mutually visible in the mask. Finally, a masked-structure prediction task is added to the masked language model and context prediction tasks to train the BERT model. However, this method artificially limits the range of attention computation, making it difficult to acquire word embedding information that depends on long distances. In the same year, Shen et al. of Pennsylvania State University proposed a MathBERT model for mathematics education, innovatively fine-tuning BERT with an automatic scoring task and a knowledge tracing prediction task. But the authors use a simple linear sequence of the mathematical text as input, ignoring the tree structure of mathematical expressions, so the word embeddings lack mathematical semantic information.
Disclosure of Invention
Mathematical text is widely ambiguous and polymorphic depending on context, and prior methods struggle to extract the semantic relations of long-distance-dependent mathematical expressions, so their word embedding representations are insufficiently comprehensive and accurate. To address these technical problems, the invention position-codes tree-structured mathematical expressions according to the positional representation principle of mathematical structure and the structural characteristics of mixed natural language and mathematical language text, unifies the position coding of linearly structured text and tree-structured expressions, and then obtains word embedding representations of the mixed text by fine-tuning a pre-training model on mathematical language processing tasks.
In order to achieve the above object, the present invention provides a mixed word embedding method for natural language text and mathematical language text, comprising:

S1: preprocessing learning resources that contain natural language text and mathematical language text to obtain a mathematical resource data set, where the mathematical language text is a mathematical expression with a tree structure and the natural language text is a context with linear sequence characteristics;

S2: applying branch-based absolute position coding to the tree-structured mathematical expression, and computing the relative position code of any two nodes in the tree from the absolute position codes;

S3: applying negative-integer position coding, represented in two's complement, to the context with linear sequence characteristics; taking the root node of the tree as the first node of the linear sequence to unify the position coding of the mathematical expression and the context; and computing the relative position code of any two nodes in the tree and the linear sequence from the unified position coding;

S4: inputting the mathematical resource data set obtained in step S1 into a BERT pre-training model equipped with a position coding module and an attention module; feeding the unified position codes obtained in step S3 into the position coding module and the relative position codes of any two nodes computed in step S3 into the attention module for training; and pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction to obtain a trained word embedding model;

S5: processing the natural language text and the mathematical language text with the trained word embedding model to obtain the final mixed word embedding representation.
In one embodiment, the step S1 of preprocessing the learning resource containing the natural language text and the mathematical language text includes:
processing a learning resource containing natural language text and mathematical language text into a symbol sequence, where the mathematical expressions are in LaTeX format and the mathematical resource data set is the set L = {L_1, L_2, …, L_i, …, L_{N'}}, with L_i denoting the ith mathematical resource.
In one embodiment, processing a learning resource containing natural language text and mathematical language text into a sequence of symbols includes:
segmenting the LaTeX-format mathematical expressions with the im2markup segmentation tool to obtain the symbol sequence of each expression's segmentation result; and converting each LaTeX-format expression into an operator tree (OPT) with the Tangent-S tool and performing a depth-first traversal of the OPT to obtain the symbol sequence of the tree-structure traversal result. For the jth mathematical expression of the ith mathematical resource, the first sequence gives the n'th symbol of the LaTeX segmentation and the second gives the kth symbol obtained by depth-first traversal of its OPT tree. Each mathematical resource consists of natural language text and mathematical expressions, the text being the context of the expressions; the context of the mathematical expression M_{i,j} is C_{i,j} = {t_z | t_z ∈ L_i, |z − p_{ij}| ≤ R}, where t_z denotes the zth natural language word, p_{ij} is the position of M_{i,j} as a whole in the sequence, and R is at most 64;

obtaining the representation of each mathematical resource from the natural language and the symbolic form of the mathematical expressions, where the ith mathematical resource is represented as a sequence whose total natural-language-text length is N_T;

when the mathematical expression M_{i,j} consists of several consecutive equalities or inequalities, splitting it at the equality and inequality signs into sub-expressions; and obtaining the mathematical resource data set from the representations of the mathematical resources as the pre-training model data set, where i is the learning resource index, j is the mathematical expression index, and w is the sub-expression index.
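A minimal sketch of the sub-expression split described above, under the assumption that `=`, `\leq`, `\geq`, `\neq`, `<` and `>` are the LaTeX relation tokens that mark derivation steps (the function name and regex are my own, not from the patent):

```python
import re

# Split markers: one (in)equality sign per derivation step (my assumption
# about which LaTeX relation tokens count as such markers).
REL_OPS = r"(?:=|\\leq|\\geq|\\neq|<|>)"

def split_subexpressions(latex):
    """'a = b = c' has two derivation steps, 'a = b' and 'b = c';
    each sub-expression keeps exactly one (in)equality sign."""
    parts = [p.strip() for p in re.split(REL_OPS, latex)]
    ops = re.findall(REL_OPS, latex)
    return [f"{l} {op} {r}" for l, op, r in zip(parts, ops, parts[1:])]
```

Under this reading, consecutive steps share their middle term (`a = b = c` becomes `a = b` and `b = c`), so each sub-expression contains exactly one equality or inequality sign, as the patent requires.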
In one embodiment, S2 introduces a shift operation into the absolute position coding. The mathematical expression is an N-ary tree, and the root node is assigned the all-zero code; any subsequent child node is encoded as follows:

S2.1: represent the child nodes of all branches by one-hot codes of N bits; for the child node of the rth branch, the rth bit of the one-hot code counted from the right is 1 and the remaining bits are 0. S2.2: shift the position code of the parent node left by N bits and add the one-hot code of the branch child node, obtaining the final absolute position code of that node. Any node n in the final expression tree thus carries an absolute position code, where D_n is the decimal form of the code and ℓ_n is the bit length of the binary code of D_n. The relative position of two nodes in the tree is calculated as follows:

where PE denotes the relative position calculation function, T denotes the tree, PE_T(n, m) is the relative position calculation function of nodes n and m in the mathematical expression tree, D_m is the decimal form of the absolute position code of node m, ℓ_m is the bit length of the binary code of D_m, and << denotes the left-shift operator.
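The branch-based coding of S2.1 and S2.2 can be sketched as follows; this is a hedged illustration in which the tree representation (nested dicts) and the function name are mine, not the patent's:

```python
# Hedged sketch of the branch-based absolute position coding of step S2.
# Each node's code is the concatenation of one-hot branch codes along
# the path from the root; the root itself is the all-zero code (0).

def encode_tree_positions(tree, n_branches):
    """tree: nested dict {"id": ..., "children": [...]};
    n_branches: N, the branching factor of the N-ary tree.
    Returns a mapping node id -> integer absolute position code."""
    codes = {}

    def walk(node, code):
        codes[node["id"]] = code
        for r, child in enumerate(node.get("children", []), start=1):
            # one-hot code of the r-th branch: bit r (from the right) is 1
            one_hot = 1 << (r - 1)
            # shift the parent code left by N bits, then add the one-hot code
            walk(child, (code << n_branches) + one_hot)

    walk(tree, 0)
    return codes
```

The relative position of two nodes can then be derived from their codes D_n, D_m and the bit length ℓ_m via the same shift-and-add arithmetic, which is what makes the coding translation-invariant under subtree moves.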
In one embodiment, step S3 includes:
performing relative position coding on the natural language text with linear sequence features, where the relative position between two words at absolute positions a and b is defined as the difference of their absolute positions. The positions of the linear sequence are coded with negative integers represented in two's complement, and the length of the linear sequence position code is L_S = n_T × l_T, where n_T is the maximum branching factor of the tree structure and l_T is the maximum number of layers of the tree;

taking the root node of the tree structure as the head node of the linear sequence to unify the position coding of the two structures. Under the unified coding, the relative position of any two nodes is computed piecewise: if both nodes lie in the linear sequence S, by the sequence relative position calculation function; if both lie in the tree T, by the tree relative position calculation function; and if one node lies in each structure, by composing the relative position from that node to the root node with the relative position from the root node to the other node.
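A sketch of this piecewise rule, under the assumption that context words carry negative positions, the root sits at position 0, and tree nodes carry non-negative codes; all names are mine, and the tree-side rule is abstracted as a callback rather than re-implemented:

```python
# Hedged sketch of the unified relative-position rule of step S3.
# Convention assumed here: sequence positions are negative integers,
# the root node is position 0, tree nodes are non-negative codes.

def seq_rel(a, b):
    """Relative position inside the linear sequence: the difference
    of absolute positions."""
    return b - a

def unified_rel(x, y, tree_rel, root=0):
    """Piecewise relative position. tree_rel(n, m) is the tree-side
    relative position function (shift-based, per step S2). A cross-
    structure pair is routed through the root and returned as a
    (sequence part, tree part) pair."""
    x_in_seq, y_in_seq = x < 0, y < 0
    if x_in_seq and y_in_seq:
        return seq_rel(x, y)                       # both in the sequence
    if not x_in_seq and not y_in_seq:
        return tree_rel(x, y)                      # both in the tree
    if x_in_seq:                                   # sequence -> root -> tree
        return seq_rel(x, root), tree_rel(root, y)
    return tree_rel(x, root), seq_rel(root, y)     # tree -> root -> sequence
```

Routing cross-structure pairs through the shared root is what lets a single relative-position notion span both the context and the expression tree.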
In one embodiment, when the relative position codes of any two nodes in the tree structure and the linear sequence are fed into the attention module of the BERT pre-training model for training, the relative position codes act on the attention weights as:

α^l_{A,B} = (x^l_A W^{Q,l})(x^l_B W^{K,l} + p^l_{A,B})^T / √d

where p^l_{A,B} is the relative position embedding vector of A and B in the lth Transformer layer of the BERT model, x^l_A is the Ath word embedding vector of the lth layer, x^l_B is the Bth word embedding vector of the lth layer, W^{Q,l} is the Query matrix of the lth layer, W^{K,l} is the Key matrix of the lth layer, d is the word vector dimension, and α^l_{A,B} is the unnormalized attention weight.
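A NumPy sketch of how such a relative-position term can enter the unnormalized attention weight, written in the expanded Shaw-style form QKᵀ plus a query–position term, scaled by √d; this is my reconstruction, not code from the patent:

```python
import numpy as np

# Hedged sketch: relative-position embeddings p_{A,B} added into the
# unnormalized attention weights of one layer (names are mine).

def rel_attention_scores(X, Wq, Wk, P):
    """X: (n, d) word embeddings of one layer; Wq, Wk: (d, d) Query/Key
    matrices; P: (n, n, d) relative-position embeddings p_{A,B}.
    Returns the (n, n) matrix of unnormalized attention weights."""
    d = X.shape[1]
    Q = X @ Wq                                     # queries
    K = X @ Wk                                     # keys
    # content term (Q K^T) plus position term (q_A . p_{A,B})
    scores = Q @ K.T + np.einsum("ad,abd->ab", Q, P)
    return scores / np.sqrt(d)
```

With P set to zero this reduces to ordinary scaled dot-product attention, which is a quick way to sanity-check the position term.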
In one embodiment, the final mixed word embedding expression is calculated as:
x^{l+1}_A = Σ_{B=1}^{n_1} softmax(α^l_{A,B}) · x^l_B W^{V,l}

where softmax(α^l_{A,B}) denotes the normalization of α^l_{A,B} over all B, x^l_B denotes the Bth word embedding vector of the lth layer, W^{V,l} is the Value matrix of the lth layer, and n_1 is the total number of words. When layer l is the last layer, x^{l+1}_A, the embedding of the Ath word, is taken as the final mixed word embedding representation of that symbol.
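The final aggregation — a row-wise softmax over the unnormalized weights followed by a weighted sum of Value-projected embeddings — can be sketched as follows (function and variable names are mine):

```python
import numpy as np

# Hedged sketch of the output step: softmax-normalized attention weights
# combine the Value-projected embeddings of all n_1 words; at the last
# layer the result is the final mixed word embedding.

def attention_output(scores, X, Wv):
    """scores: (n, n) unnormalized weights alpha; X: (n, d) layer-l
    embeddings; Wv: (d, d) Value matrix. Returns (n, d) layer l+1
    embeddings."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    attn = e / e.sum(axis=1, keepdims=True)                 # row-wise normalize
    return attn @ (X @ Wv)
```

With uniform (zero) scores each output row is simply the mean of the Value-projected rows, which makes the normalization easy to verify.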
One or more technical solutions in the embodiments of the present application achieve at least the following technical effects:

The invention provides a mixed word embedding method for natural language text and mathematical language text, comprising the following steps: identifying and preprocessing the mixed text to obtain a mathematical resource data set consisting of text and mathematical expressions; applying absolute position coding to the tree-structured mathematical expressions and computing relative position codes from the absolute codes, so that relative positions in the tree are translation-invariant; applying a unified position coding to the linearly structured text and the tree-structured expressions; and feeding the relative position codes into the attention module of a pre-training model, which is pre-trained on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction. After pre-training, each symbol can be represented by an embedded vector rich in context information, so the final word embedding representation carries richer information and is more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a mixed word embedding method for natural language text and mathematical language text according to an embodiment of the present invention;
FIG. 2 is a flow chart of data preprocessing for a word embedding method in an embodiment of the present invention;
FIG. 3 is a diagram of a mathematical expression tree in an embodiment of the present invention;
FIG. 4 is a schematic diagram of tree position coding in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a unified position code in an embodiment of the present invention;
FIG. 6 is a diagram of a pre-training model in an embodiment of the invention.
Detailed Description
The invention provides a mixed word embedding method for natural language text and mathematical language text, comprising the following steps: identifying and preprocessing the mixed text to obtain a mathematical resource data set consisting of text and mathematical expressions; position-coding the tree-structured mathematical expressions so that relative positions in the tree are translation-invariant; applying a unified position coding to the linearly structured text and the tree-structured mathematical expressions; and feeding the relative position codes into the attention module of a pre-training model, which is pre-trained on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction. After pre-training, each symbol obtains an embedded vector representation rich in context information.

Compared with the prior art, the invention position-codes the tree structure of the mathematical expression while guaranteeing the relative-position translation invariance of the tree position coding, and applies a unified position coding to the linearly structured text and the tree-structured mathematical expression for use in a BERT pre-training model, from which the word embedding representation is extracted.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a mixed word embedding method for natural language text and mathematical language text, comprising the following steps:

S1: preprocessing learning resources that contain natural language text and mathematical language text to obtain a mathematical resource data set, where the mathematical language text is a mathematical expression with a tree structure and the natural language text is a context with linear sequence characteristics;

S2: applying branch-based absolute position coding to the tree-structured mathematical expression, and computing the relative position code of any two nodes in the tree from the absolute position codes;

S3: applying negative-integer position coding, represented in two's complement, to the context with linear sequence characteristics; taking the root node of the tree as the first node of the linear sequence to unify the position coding of the mathematical expression and the context; and computing the relative position code of any two nodes in the tree and the linear sequence from the unified position coding;

S4: inputting the mathematical resource data set obtained in step S1 into a BERT pre-training model equipped with a position coding module and an attention module; feeding the unified position codes obtained in step S3 into the position coding module and the relative position codes of any two nodes computed in step S3 into the attention module for training; and pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction to obtain a trained word embedding model;

S5: processing the natural language text and the mathematical language text with the trained word embedding model to obtain the final mixed word embedding representation.
Please refer to fig. 1, which is a flowchart illustrating a method for embedding mixed words of natural language text and mathematical language text according to an embodiment of the present invention.
Specifically, S1 is the preprocessing of learning resources; S2 is the coding of the tree-structured mathematical expression; S3 is the position coding of the context and the unified position coding of expression and context; S4 is the training of the BERT pre-training model using the unified position coding; and S5 is the application of the trained word embedding model.
In one embodiment, the step S1 of preprocessing the learning resource containing the natural language text and the mathematical language text includes:
processing a learning resource containing natural language text and mathematical language text into a symbol sequence, where the mathematical expressions are in LaTeX format and the mathematical resource data set is the set L = {L_1, L_2, …, L_i, …, L_{N'}}, with L_i denoting the ith mathematical resource.
N' represents the total number of mathematical resources.
In one embodiment, processing a learning resource containing natural language text and mathematical language text into a sequence of symbols includes:
segmenting the LaTeX-format mathematical expressions with the im2markup segmentation tool to obtain the symbol sequence of each expression's segmentation result; and converting each LaTeX-format expression into an operator tree (OPT) with the Tangent-S tool and performing a depth-first traversal of the OPT to obtain the symbol sequence of the tree-structure traversal result. For the jth mathematical expression of the ith mathematical resource, the first sequence gives the n'th symbol of the LaTeX segmentation and the second gives the kth symbol obtained by depth-first traversal of its OPT tree. Each mathematical resource consists of natural language text and mathematical expressions, the text being the context of the expressions; the context of the mathematical expression M_{i,j} is C_{i,j} = {t_z | t_z ∈ L_i, |z − p_{ij}| ≤ R}, where t_z denotes the zth natural language word, p_{ij} is the position of M_{i,j} as a whole in the sequence, and R is at most 64;

obtaining the representation of each mathematical resource from the natural language and the symbolic form of the mathematical expressions, where the ith mathematical resource is represented as a sequence whose total natural-language-text length is N_T;

when the mathematical expression M_{i,j} consists of several consecutive equalities or inequalities, splitting it at the equality and inequality signs into sub-expressions; and obtaining the mathematical resource data set from the representations of the mathematical resources as the pre-training model data set, where i is the learning resource index, j is the mathematical expression index, and w is the sub-expression index.
In a specific implementation, as shown in fig. 2, the data are preprocessed to obtain the mathematical resource data set. The learning resources containing natural language and mathematical language, including problem-solving processes, are processed into symbol sequences through the Mathpix OCR interface, which extracts text and mathematical expressions from pictures and converts the mathematical formulas to LaTeX format. The whole set of mathematical resources is denoted L = {L_1, L_2, …, L_i, …, L_N}, where L_i represents the ith mathematical resource.
For the mathematical expressions of each resource, the LaTeX-format expression is segmented with the im2markup segmentation tool to obtain the expression's symbol sequence. With the Tangent-S tool, the LaTeX-format expression is converted into an operator tree (OPT), as shown in FIG. 3. A depth-first traversal of the OPT yields a second symbol sequence. One mathematical expression therefore yields two symbol sequences, and the jth expression of the ith resource can be expressed as:
where the first sequence gives the n'th symbol of the LaTeX segmentation of the jth expression and the second gives the kth operator or operand obtained by depth-first traversal of its OPT tree.
Each mathematical resource is composed of text and mathematical expressions, the text being the context of the expressions; the context may contain description and interpretation of an expression and is the key to the semantic association between mathematical symbols and natural language. Since the length of the input data is limited, the context of an expression must be bounded: the context of the mathematical expression M_{i,j} is defined as C_{i,j} = {t_z | t_z ∈ L_i, |z − p_{ij}| ≤ R}, where t_z denotes the zth natural language word, p_{ij} is the position of M_{i,j} as a whole in the sequence, and R is at most 64, i.e. the 64 natural language text symbols before and the 64 after the expression, 128 in total, are taken as its context.
in summary, for the ith mathematical resource, it can be expressed as:
In learning resources, a mathematical expression M_{i,j} often contains a multi-step derivation, i.e. several consecutive equalities or inequalities together form one expression. Such an expression can be further segmented into sub-expressions at the equality and inequality signs; each sub-expression contains exactly one equality or inequality sign, i.e. one derivation step, and the sub-expressions of a multi-step expression share one context.
Finally, all learning resources are processed as above to form the pre-training model data set, where i is the learning resource index, j is the mathematical expression index, and w is the sub-expression index.
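The bounded context window C_{i,j} defined above, with up to R = 64 natural-language tokens on each side of the expression position p_{ij}, can be sketched as follows (function name and token representation are my own):

```python
# Hedged sketch of the context window C_{i,j} from the preprocessing step:
# up to R tokens on each side of the expression position, 2R in total.

R = 64

def context_window(tokens, p_ij, radius=R):
    """tokens: the full token sequence L_i of one learning resource;
    p_ij: index of the mathematical expression within it.
    Returns the surrounding natural-language tokens, expression excluded."""
    left = tokens[max(0, p_ij - radius):p_ij]
    right = tokens[p_ij + 1:p_ij + 1 + radius]
    return left + right
```

Near the start or end of a resource the window is simply truncated, so shorter contexts are handled without padding logic here.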
In one embodiment, S2 introduces a shift operation in absolute position encoding, the mathematical expression is N-ary tree, and the root node is defined asFor any subsequent child node, the coding mode is as follows:
s2.1: representing the sub-nodes of all branches by using one hot codes, wherein the one hot codes have N bits, and for the sub-node of the r-th branch, the r-th bit of the one hot codes from the right to the left is 1, and the rest bits are 0;
s2.2: shift the position code of the parent node left by N bits and add the one-hot code of the branch child node, obtaining the final absolute position code of that node. Any node in the final expression tree is represented as $p_{D_n,l_n}$, wherein n is the node's absolute position code, $D_n$ is the decimal representation of the absolute position code, and $l_n$ is the length of the binary code of $D_n$. The relative position of nodes in the tree is computed as:

$$PE_T\left(p_{D_n,l_n},\ p_{D_m,l_m}\right) = D_n - \left(D_m \ll (l_n - l_m)\right)$$

wherein PE denotes the relative position calculation function, T denotes the tree, $PE_T(p_{D_n,l_n}, p_{D_m,l_m})$ denotes the relative position calculation function of node $p_{D_n,l_n}$ and node $p_{D_m,l_m}$ in the mathematical expression tree, $D_m$ is the decimal value of the code of node $p_{D_m,l_m}$, $l_m$ is the length of the binary code of $D_m$, and $\ll$ denotes the left-shift operator.
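The branch-based encoding and the shifted relative position can be sketched as follows. This is a minimal illustration with my own helper names; a node is carried as the pair $(D, l)$ of decimal value and binary code length described above.

```python
N = 3  # branching factor; the root is the all-zero code of N bits

def child_code(parent, branch):
    """Shift the parent's code left by N bits and add the one-hot code
    of the 1-based branch index; nodes are (D, l) pairs."""
    D, l = parent
    return (D << N) | (1 << (branch - 1)), l + N

def tree_relative_position(node_n, node_m):
    """PE_T: align the shallower code by left shift, then subtract."""
    D_n, l_n = node_n
    D_m, l_m = node_m
    return D_n - (D_m << (l_n - l_m))

root = (0, N)  # (000)
```

With the 3-ary example of fig. 4, the first-, second-, and third-branch children of the root come out as (000001), (000010), and (000100), and the relative positions quoted in the description (20 for both node pairs) are reproduced.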
In a specific implementation, step S2 position-encodes the mathematical expression with its tree structure and ensures that this tree-structured position encoding has relative position translation invariance. The relative position reflects the relation between words: in a linear sequence it is defined as the difference of absolute positions, and translation invariance of the relative position means that whatever the absolute positions of two words, the same relative position always corresponds to the same offset. Because the relative position is invariant, a word can always be semantically associated with the words at fixed relative positions wherever it appears, so the training process can learn the semantic relationship.
To ensure that the position code of the tree structure has relative position translation invariance, the invention adopts a branch-based coding scheme and introduces a shift operation when computing the relative position.
As shown in fig. 4, assume the mathematical expression tree is a 3-ary tree and the root node is defined as (000). For a first-branch child node, the root code is shifted left by 3 bits and the one-hot code (001) is added, so this child node is represented as (000001); similarly, the 2nd-branch child node is (000010) and the 3rd-branch child node is (000100).
Any node in the expression tree can be represented as $p_{D_n,l_n}$, wherein $D_n$ is the decimal representation of the absolute position code n and $l_n$ is the length of the binary code of $D_n$. When the relative position of nodes in the tree is computed with the formula above, the relative position between any two nodes is independent of their absolute positions. As shown in fig. 4, node $p_{20,9}$ and node $p_{0,3}$ have a relative position value of 20, and the relative position of node $p_{84,12}$ and node $p_{1,6}$ is also 20.
In one embodiment, step S3 includes:
relative position coding is performed for the natural language text with linear sequence features, wherein the relative position between words is defined as the difference of absolute positions, expressed as $PE_S(a,b)=a-b$, a and b being absolute positions and $PE_S$ the relative position calculation function of the sequence. The positions of the linear sequence are coded with negative integers, represented in two's complement, and the length of the linear-sequence position code is $L_S=n_T\times l_T$, wherein $n_T$ is the maximum branching factor of the tree structure and $l_T$ is the maximum number of layers of the tree structure;
taking the root node of the tree structure as the head node of the linear sequence realizes a unified position coding of the two structures, wherein the relative position under the unified coding is computed as:

$$PE(n_1,n_2)=\begin{cases} PE_S(n_a,n_b), & \text{both nodes in the linear sequence}\\ PE_T(n_a,n_b), & \text{both nodes in the tree}\\ PE_S(n_a,\mathrm{root})+PE_T(\mathrm{root},n_b), & n_a \text{ in the sequence},\ n_b \text{ in the tree}\\ PE_T(n_a,\mathrm{root})+PE_S(\mathrm{root},n_b), & n_a \text{ in the tree},\ n_b \text{ in the sequence}\end{cases}$$

wherein $PE(n_1,n_2)$ denotes the relative position calculation function between any two nodes, $PE_S(n_a,n_b)$ the relative position calculation function between nodes in the linear sequence (S denotes the sequence), $PE_T(n_a,n_b)$ the relative position calculation function between nodes in the tree structure, $PE_S(n_a,\mathrm{root})$ that between a node in the linear sequence and the root node, $PE_T(\mathrm{root},n_b)$ that between the root node and a node in the tree structure, $PE_T(n_a,\mathrm{root})$ that between a node in the tree structure and the root node, and $PE_S(\mathrm{root},n_b)$ that between the root node and a node in the linear sequence.
Specifically, step S3 fuses the linear sequence of the contextual natural language with the coding model of S2. For natural language text with a linear sequence, the relative position between words is defined as the difference of absolute positions, $PE_S(a,b)=a-b$, a and b being absolute positions, which trivially satisfies translation invariance of the relative position.
To unify the position coding of the linear sequence and the tree structure, the positions of the linear sequence are coded with negative integers represented in two's complement, so that the highest bit of the position code of every natural language word is 1 while the highest bit of the position code of every tree-structure node is 0. The length of the linear-sequence position code is $L_S=n_T\times l_T$, wherein $n_T$ is the maximum branching factor of the tree structure and $l_T$ is the maximum number of layers of the tree structure. Finally, the root node of the tree structure serves as the head node of the linear sequence, giving a unified representation of the two structures, as shown in fig. 5. The relative position under the unified coding is computed with the formula given above.
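Under the stated conventions (negative sequence positions in two's complement, a shared root at position 0, and cross-structure relative positions routed through that root), the unified coding can be sketched as follows. The routing through the root for mixed pairs, and the sign conventions, are my reading of the piecewise formula, not the patent's reference code; the 3-bit root is taken from the 3-ary example.

```python
def twos_complement(pos, bits):
    """Two's-complement representation of a (negative) sequence position;
    for pos < 0 the highest of the `bits` bits is 1."""
    return pos & ((1 << bits) - 1)

def pe_seq(a, b):
    return a - b  # linear sequence: difference of absolute positions

def pe_tree(n, m):
    (D_n, l_n), (D_m, l_m) = n, m
    return D_n - (D_m << (l_n - l_m))

def unified_pe(a, b, root=(0, 3)):
    """a, b are ('S', position) for sequence tokens (position <= 0, the
    shared root sits at 0) or ('T', (D, l)) for tree nodes."""
    (ka, va), (kb, vb) = a, b
    if ka == 'S' and kb == 'S':
        return pe_seq(va, vb)
    if ka == 'T' and kb == 'T':
        return pe_tree(va, vb)
    if ka == 'S':                                # sequence -> root -> tree
        return pe_seq(va, 0) - pe_tree(vb, root)
    return pe_tree(va, root) - pe_seq(vb, 0)     # tree -> root -> sequence
```

With a 9-bit sequence code, every word position (a negative integer) indeed carries a leading 1 bit, while tree codes keep a leading 0, which is how the two structures stay distinguishable in one address space.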
In one embodiment, when the relative position codes of any two nodes in the tree structure and the linear sequence are fed into the attention module of the BERT pre-training model for training, the relative position code acts on the attention as follows:
$$\tilde\alpha_{A,B}^{\,l} = \frac{\left(x_A^{l} W^{Q,l}\right)\left(x_B^{l} W^{K,l} + p_{A,B}^{l}\right)^{\top}}{\sqrt{d}}$$

wherein $p_{A,B}^{l}$ is the relative position embedding vector of positions A and B in the l-th Transformer layer of the BERT model, $x_A^{l}$ is the A-th word embedding vector of the l-th layer, $x_B^{l}$ is the B-th word embedding vector of the l-th layer, $W^{Q,l}$ is the Query matrix of the l-th layer, $W^{K,l}$ is the Key matrix of the l-th layer, d is the word vector dimension, and $\tilde\alpha_{A,B}^{\,l}$ is the unnormalized attention value.
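The role of the relative-position embedding in the attention score can be illustrated with a small NumPy sketch. The exact placement of $p_{A,B}$ inside the score (added to the key before the dot product, in the style of Shaw et al.'s relative position representations) is my reconstruction, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                          # word-vector dimension, token count
x = rng.normal(size=(n, d))          # layer-l word embeddings x^l
W_Q = rng.normal(size=(d, d))        # Query matrix W^{Q,l}
W_K = rng.normal(size=(d, d))        # Key matrix W^{K,l}
p = rng.normal(size=(n, n, d))       # relative-position embeddings p^l_{A,B}

q, k = x @ W_Q, x @ W_K
# unnormalized attention: each query dotted with (key + relative position)
alpha = np.einsum('ad,abd->ab', q, k[None, :, :] + p) / np.sqrt(d)
```

Because p is indexed by the pair (A, B), the same relative position code contributes the same bias wherever the pair occurs, which is exactly the translation invariance the encoding was designed to preserve.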
In one embodiment, the final mixed word embedding expression is calculated by:
$$x_A^{l+1} = \sum_{B=1}^{n_1} \mathrm{softmax}\left(\tilde\alpha_{A,B}^{\,l}\right) x_B^{l} W^{V,l}$$

wherein $\mathrm{softmax}(\tilde\alpha_{A,B}^{\,l})$ denotes the normalization of $\tilde\alpha_{A,B}^{\,l}$, $x_B^{l}$ denotes the B-th word embedding vector of the l-th layer, $W^{V,l}$ is the Value matrix of the l-th layer, and $n_1$ is the total number of words; when layer l is the last layer, the word embedding $x_A^{l+1}$ of the A-th word is taken as the final mixed word embedding expression of the expression.
Specifically, through the foregoing steps the learning resources are converted into a data set in which the context sequence $C_{i,j}$ and the mathematical expression, after word segmentation, are coded with the unified position coding and fed into the BERT pre-training model. As shown in fig. 6, the input of the BERT pre-training model consists of three parts: the context sequence, the LaTeX sequence of the mathematical expression, and the depth-first traversal sequence of the mathematical expression's OPT tree. The relative position codes are then fed into the attention module of the BERT pre-training model for training, where the relative position coding acts as described above, and the final mixed word embedding expression is obtained by the above calculation formula.
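The softmax-weighted aggregation that produces the mixed embedding can be sketched as follows, assuming an already-computed unnormalized score matrix; the helper name is my own.

```python
import numpy as np

def mixed_embedding(alpha, x, W_V):
    """x_A^{l+1} = sum_B softmax(alpha^l)_{A,B} * (x_B^l W^{V,l}):
    normalize the attention scores row-wise, then aggregate the values."""
    w = np.exp(alpha - alpha.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ (x @ W_V)
```

With all-zero scores the softmax weights are uniform, so each output row is simply the mean of the value vectors, a quick sanity check on the normalization.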
After the pre-training model is adapted, the two standard pre-training tasks Masked Language Model (MLM) and Next Sentence Prediction (NSP) are used to obtain the mixed word embedding expression of natural language text and mathematical language text.
In the MLM task, 15% of the words are randomly sampled from the three input sequences; of the sampled words, 80% are replaced by the [ MASK ] label, 10% are replaced by random other words, and 10% are left unchanged. The MLM task uses cross entropy as its loss function, which in this embodiment is expressed as:

$$L_{MLM} = -\sum_{x\in\mathcal{M}} p(x)\,\log \hat p(x)$$

wherein $\mathcal{M}$ is the set of masked words, $\hat p(x)$ is the estimated probability of the masked word x after linear classification and Softmax regression, and $p(x)$ is its original distribution, i.e., its one-hot vector.
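The 15% / 80-10-10 masking scheme described above can be sketched as follows; the helper name and the use of Python's `random` module are illustrative.

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", p_select=0.15, seed=None):
    """Select ~15% of positions; of those, 80% -> [MASK], 10% -> a random
    other word, 10% left unchanged. Returns the corrupted sequence and the
    original tokens at the selected positions (the MLM targets)."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p_select:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return out, targets
```

The 10% of unchanged selections still contribute to the loss: the model must predict the original word at every selected position, whether or not the input was corrupted there.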
In the NSP task, 50% of the sub-expressions are randomly selected and their context $C_{i,j}$ is replaced with the context of a random formula. The NSP task likewise uses cross entropy as its loss function, which in this embodiment is expressed as:

$$L_{NSP} = -\left[p\,\log \hat p + (1-p)\,\log(1-\hat p)\right]$$

wherein p = 1 if the context is not replaced and p = 0 if it is replaced, and $\hat p$ is the estimated probability that the context matches the formula.
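Written out for a single example, the NSP cross entropy of this embodiment is the usual binary form; `nsp_loss` is my own name.

```python
import math

def nsp_loss(p, p_hat, eps=1e-12):
    """Binary cross entropy: p = 1 if the context was not replaced,
    p = 0 if it was; p_hat is the estimated match probability."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -(p * math.log(p_hat) + (1 - p) * math.log(1.0 - p_hat))
```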
In summary, compared with the prior art, the invention performs position coding on the tree structure of the mathematical expression, ensures that the position coding of the tree structure has relative position translation invariance, unifies the position coding of the text with linear structure characteristics and the mathematical expression with tree structure characteristics, and uses the unified position coding for the BERT pre-training model to further extract the word embedding expression, so that the final word embedding expression contains more information and is more accurate in expression.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications may be made in addition to or substituted for those described in the detailed description by those skilled in the art without departing from the spirit of the invention or exceeding the scope of the claims set forth below.
Claims (7)
1. A method for embedding mixed words of a natural language text and a mathematical language text, comprising:
s1: preprocessing learning resources including a natural language text and a mathematical language text to obtain a mathematical resource data set, wherein the mathematical language text is a mathematical expression with a tree structure, and the natural language text is a context with linear sequence characteristics;
s2: absolute position coding is carried out on a mathematical expression with a tree structure by adopting a position coding mode based on branches, and relative position coding of two nodes in the tree structure is calculated according to an absolute position coding result;
s3: adopting negative integer position coding, represented in two's complement, for the context with linear sequence characteristics, then taking the root node of the tree structure as the head node of the linear sequence to realize a unified position coding of the mathematical expression and the context, and then calculating the relative position code of any two nodes in the tree structure and the linear sequence from the unified position coding;
s4: inputting the mathematical resource data set obtained in step S1 into a BERT pre-training model provided with a position coding module and an attention module, inputting the unified position code obtained in step S3 into the position coding module, feeding the relative position codes of any two nodes in the tree structure and the linear sequence calculated in step S3 into the attention module of the BERT pre-training model for training, and pre-training on the mathematical resources with the two standard pre-training tasks of masked language model and next sentence prediction to obtain a trained word embedding model;
s5: and processing the natural language text and the mathematical language text by using the trained word embedding model to obtain the final mixed word embedding expression.
2. The mixed word embedding method of the natural language text and the mathematical language text as claimed in claim 1, wherein the step S1 of preprocessing the learning resource containing the natural language text and the mathematical language text includes:
processing a learning resource containing natural language text and mathematical language text into a symbol sequence, wherein the mathematical expression is in LaTeX format and the mathematical resource data set is the set $L=\{L_1,L_2,\ldots,L_i,\ldots,L_{N'}\}$, $L_i$ denoting the i-th mathematical resource.
3. The method of mixed word embedding of natural language text and mathematical language text as claimed in claim 2, wherein processing a learning resource containing natural language text and mathematical language text into a sequence of symbols comprises:
performing word segmentation on the mathematical expression in LaTeX format with the im2markup word segmentation tool to obtain the symbol sequence of the segmented expression, converting the LaTeX expression into an operator (OPT) tree with the Tangent-S tool, and performing a depth-first traversal of the OPT tree to obtain the symbol sequence of the traversal of the expression's tree structure, wherein the j-th mathematical expression of the i-th mathematical resource is denoted $M_{i,j}$, with the n'-th symbol of the j-th expression after LaTeX word segmentation and the k-th symbol obtained by depth-first traversal of the OPT tree of the j-th expression denoted accordingly; each mathematical resource consists of natural language text and mathematical expressions, the natural language text being the context of the mathematical expressions, and the context of a mathematical expression $M_{i,j}$ is $C_{i,j}=\{t_z \mid t_z\in L_i,\ |z-p_{ij}|\le R\}$, wherein $t_z$ denotes the z-th natural language word, $p_{ij}$ is the position of the mathematical expression $M_{i,j}$ as a whole in the sequence, and R is at most 64;
obtaining the expression of each mathematical resource from the natural language and the symbolic form of the mathematical expressions, wherein the i-th mathematical resource is expressed with $N_T$ being the total length of the natural language text;
when a mathematical expression $M_{i,j}$ consists of several consecutive equalities or inequalities, segmenting it into sub-expressions at the equality and inequality signs, and obtaining the mathematical resource data set from the expressions of all mathematical resources as the pre-training data set, wherein i is the learning resource index, j is the mathematical expression index, and w is the sub-expression index.
5. The method for embedding mixed words of natural language text and mathematical language text according to claim 1, wherein S2 introduces a shift operation when performing absolute position coding; the mathematical expression is an N-ary tree whose root node is defined as the all-zero code, and for any subsequent child node the coding is as follows:
s2.1: represent the child nodes of all branches with one-hot codes of N bits; for the child node of the r-th branch, the r-th bit from the right of the one-hot code is 1 and the remaining bits are 0; s2.2: shift the position code of the parent node left by N bits and add the one-hot code of the branch child node, obtaining the final absolute position code of that node; any node in the final expression tree is represented as $p_{D_n,l_n}$, wherein n is the node's absolute position code, $D_n$ is the decimal representation of the absolute position code, and $l_n$ is the length of the binary code of $D_n$; the relative position of nodes in the tree is computed as:

$$PE_T\left(p_{D_n,l_n},\ p_{D_m,l_m}\right) = D_n - \left(D_m \ll (l_n - l_m)\right)$$

wherein PE denotes the relative position calculation function, T denotes the tree, $PE_T(p_{D_n,l_n}, p_{D_m,l_m})$ denotes the relative position calculation function of node $p_{D_n,l_n}$ and node $p_{D_m,l_m}$ in the mathematical expression tree, $D_m$ is the decimal value of the code of node $p_{D_m,l_m}$, $l_m$ is the length of the binary code of $D_m$, and $\ll$ denotes the left-shift operator.
5. The mixed word embedding method of natural language text and mathematical language text as claimed in claim 1, wherein the step S3 includes:
relative position coding is performed for the natural language text with linear sequence features, wherein the relative position between words is defined as the difference of absolute positions, expressed as $PE_S(a,b)=a-b$, a and b being absolute positions and $PE_S$ the relative position calculation function of the sequence; the positions of the linear sequence are coded with negative integers, represented in two's complement, and the length of the linear-sequence position code is $L_S=n_T\times l_T$, wherein $n_T$ is the maximum branching factor of the tree structure and $l_T$ is the maximum number of layers of the tree structure;
taking the root node of the tree structure as the head node of the linear sequence realizes a unified position coding of the two structures, wherein the relative position under the unified coding is computed as:

$$PE(n_1,n_2)=\begin{cases} PE_S(n_a,n_b), & \text{both nodes in the linear sequence}\\ PE_T(n_a,n_b), & \text{both nodes in the tree}\\ PE_S(n_a,\mathrm{root})+PE_T(\mathrm{root},n_b), & n_a \text{ in the sequence},\ n_b \text{ in the tree}\\ PE_T(n_a,\mathrm{root})+PE_S(\mathrm{root},n_b), & n_a \text{ in the tree},\ n_b \text{ in the sequence}\end{cases}$$

wherein $PE(n_1,n_2)$ denotes the relative position calculation function between any two nodes, $PE_S(n_a,n_b)$ the relative position calculation function between nodes in the linear sequence (S denotes the sequence), $PE_T(n_a,n_b)$ the relative position calculation function between nodes in the tree structure, $PE_S(n_a,\mathrm{root})$ that between a node in the linear sequence and the root node, $PE_T(\mathrm{root},n_b)$ that between the root node and a node in the tree structure, $PE_T(n_a,\mathrm{root})$ that between a node in the tree structure and the root node, and $PE_S(\mathrm{root},n_b)$ that between the root node and a node in the linear sequence.
6. The method of mixed word embedding of natural language text and mathematical language text as claimed in claim 1, wherein when the relative position codes of any two nodes in the tree structure and the linear sequence are fed into the attention module of the BERT pre-training model for training, the action expressions of the relative position codes are as follows:
$$\tilde\alpha_{A,B}^{\,l} = \frac{\left(x_A^{l} W^{Q,l}\right)\left(x_B^{l} W^{K,l} + p_{A,B}^{l}\right)^{\top}}{\sqrt{d}}$$

wherein $p_{A,B}^{l}$ is the relative position embedding vector of positions A and B in the l-th Transformer layer of the BERT model, $x_A^{l}$ is the A-th word embedding vector of the l-th layer, $x_B^{l}$ is the B-th word embedding vector of the l-th layer, $W^{Q,l}$ is the Query matrix of the l-th layer, $W^{K,l}$ is the Key matrix of the l-th layer, d is the word vector dimension, and $\tilde\alpha_{A,B}^{\,l}$ is the unnormalized attention weight.
7. The method of claim 6, wherein the final mixed-word embedding expression is calculated by:
$$x_A^{l+1} = \sum_{B=1}^{n_1} \mathrm{softmax}\left(\tilde\alpha_{A,B}^{\,l}\right) x_B^{l} W^{V,l}$$

wherein $\mathrm{softmax}(\tilde\alpha_{A,B}^{\,l})$ denotes the normalization of $\tilde\alpha_{A,B}^{\,l}$, $x_B^{l}$ denotes the B-th word embedding vector of the l-th layer, $W^{V,l}$ is the Value matrix of the l-th layer, and $n_1$ is the total number of words; when layer l is the last layer, the word embedding $x_A^{l+1}$ of the A-th word is taken as the final mixed word embedding expression of the expression.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210469691.4A CN114818698B (en) | 2022-04-28 | 2022-04-28 | Mixed word embedding method for natural language text and mathematical language text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114818698A true CN114818698A (en) | 2022-07-29 |
CN114818698B CN114818698B (en) | 2024-04-16 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20200040652A (en) * | 2018-10-10 | 2020-04-20 | 고려대학교 산학협력단 | Natural language processing system and method for word representations in natural language processing |
CN111444709A (en) * | 2020-03-09 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Text classification method, device, storage medium and equipment |
US20210012199A1 (en) * | 2019-07-04 | 2021-01-14 | Zhejiang University | Address information feature extraction method based on deep neural network model |
CN113239700A (en) * | 2021-04-27 | 2021-08-10 | 哈尔滨理工大学 | Text semantic matching device, system, method and storage medium for improving BERT |
Non-Patent Citations (1)
Title |
---|
Yang Chen; Song Xiaoning; Song Wei: "SentiBERT: a pre-trained language model combining sentiment information", Journal of Frontiers of Computer Science and Technology, no. 09, pages 1563-1570 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||