CN114818698B - Mixed word embedding method for natural language text and mathematical language text - Google Patents

Mixed word embedding method for natural language text and mathematical language text

Info

Publication number
CN114818698B
CN114818698B (application CN202210469691.4A)
Authority
CN
China
Prior art keywords
mathematical
expression
language text
node
relative position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210469691.4A
Other languages
Chinese (zh)
Other versions
CN114818698A (en)
Inventor
董石
唐家玉
陶雪云
王志锋
田元
陈加
陈迪
左明章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202210469691.4A priority Critical patent/CN114818698B/en
Publication of CN114818698A publication Critical patent/CN114818698A/en
Application granted granted Critical
Publication of CN114818698B publication Critical patent/CN114818698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a mixed word embedding method for natural language text and mathematical language text, which comprises the following steps: identifying and preprocessing the mixed text to obtain a mathematical resource data set consisting of text and mathematical expressions; position-coding the mathematical expressions, which have a tree structure, while preserving the translation invariance of relative positions within the tree; applying a unified position coding to the text, which has a linear structure, and the mathematical expressions, which have a tree structure; and feeding the relative position codes into the attention module of a pre-training model, pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction, so that after pre-training each symbol obtains an embedded vector representation rich in contextual information.

Description

Mixed word embedding method for natural language text and mathematical language text
Technical Field
The invention relates to the technical field of natural language processing, in particular to a mixed word embedding method of natural language text and mathematical language text.
Background
Mathematical text refers to natural language text containing mathematical expressions; it exhibits ambiguity and polymorphism and is widely used in STEM subjects and higher education. Natural language text has linear structural features, while mathematical expressions have tree structural features, and word embedding representations of such mixed text play a crucial role in tasks involving mathematical text. Conventional word embedding techniques are suited to text with linear features and have difficulty processing mathematical expressions with tree-structured features.
A mathematical expression can be represented by two important tree structures. One is the Symbol Layout Tree (SLT), which is constructed according to the writing layout of the expression and captures its appearance; the other is the Operator Tree (OPT), which is constructed from the operator hierarchy of the expression and captures its semantics. In 2021, Peng et al. of Peking University proposed MathBERT, a BERT-based pre-training model for mathematical expressions that can obtain word embedding representations of mixed text. The authors feed the LaTeX sequence of the mathematical expression, the traversal sequence of the OPT, and the context text sequence into the BERT model, and extract the structural information of the OPT with an attention masking matrix, so that nodes adjacent in the tree structure are visible to each other in the masking matrix. Finally, a masked structure prediction task is added to the masked language model and context prediction tasks to train the BERT model. However, this method artificially limits the scope of the attention computation and has difficulty capturing long-range dependencies in the word embeddings. In the same year, Shen et al. of the University of Pennsylvania proposed a MathBERT model oriented to mathematics education, innovatively fine-tuning BERT with an automatic scoring task and a knowledge tracing prediction task. However, the authors use a simple linear sequence of the mathematical text as input, ignoring the tree structure of mathematical expressions, so the word embeddings lack mathematical semantic information.
Disclosure of Invention
Mathematical text is widespread but ambiguous, polymorphic, and heavily dependent on context, and existing methods have difficulty extracting long-range semantic relations of mathematical expressions, so their word embedding representations are neither comprehensive nor accurate. To address these technical problems, the invention position-codes the mathematical expression, which has a tree structure, according to the positional representation principle of the mathematical structure and the structural characteristics of mixed natural language and mathematical language text; applies a unified position coding to the text, which has linear sequence characteristics, and the mathematical expression, which has tree structure characteristics; and then obtains the word embedding representation of the mixed natural language and mathematical language text by fine-tuning a pre-training model on mathematical language processing tasks.
In order to achieve the above object, the present invention provides a mixed word embedding method for natural language text and mathematical language text, comprising:
S1: preprocessing learning resources containing natural language texts and mathematical language texts to obtain a mathematical resource data set, wherein the mathematical language texts are mathematical expressions with tree structures, and the natural language texts are contexts with linear sequence characteristics;
S2: absolute position coding is carried out on the mathematical expression with the tree structure by adopting a position coding mode based on branches, and the relative position coding of two nodes in the tree structure is calculated according to the absolute position coding result;
S3: coding the positions of the context, which has linear sequence characteristics, with negative integers represented in two's complement; then taking the root node of the tree structure as the first node of the linear sequence to achieve a unified position coding of the mathematical expression and the context; and calculating the relative position code of any two nodes across the tree structure and the linear sequence from the unified position coding;
S4: inputting the mathematical resource data set obtained in step S1 into a BERT pre-training model equipped with a position coding module and an attention module, feeding the unified position codes obtained in step S3 into the position coding module and the relative position codes of any two nodes in the tree structure and the linear sequence calculated in step S3 into the attention module for training, and pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction, to obtain a trained word embedding model;
S5: and processing the natural language text and the mathematical language text by using the trained word embedding model to obtain the final mixed word embedding expression.
In one embodiment, step S1 pre-processes a learning resource containing natural language text and mathematical language text, including:
processing the learning resources containing natural language text and mathematical language text into symbol sequences, wherein the mathematical expressions are in LaTeX format and the mathematical resource data set is the set of mathematical resources, expressed as L = {L_1, L_2, …, L_i, …, L_{N'}}, where L_i represents the i-th mathematical resource.
In one embodiment, processing a learning resource containing natural language text and mathematical language text into a sequence of symbols includes:
Tokenizing each LaTeX-format mathematical expression with the im2markup tokenizer to obtain the symbol sequence of its LaTeX tokenization result, converting the LaTeX-format mathematical expression into an operator tree (OPT) with the Tangent-S tool, and performing a depth-first traversal of the OPT to obtain the symbol sequence of its tree-structure traversal result, so that the j-th mathematical expression M_{i,j} of the i-th mathematical resource is represented by two symbol sequences, one listing its symbols after LaTeX tokenization and the other listing the operators and operands obtained by the depth-first traversal of its OPT; each mathematical resource consists of natural language text and mathematical expressions, where the natural language text is the context of the mathematical expressions, the context of the mathematical expression M_{i,j} is C_{i,j} = {t_z | t_z ∈ L_i, |z − p_{ij}| ≤ R}, t_z denotes the z-th natural language word, p_{ij} is the position of the mathematical expression M_{i,j}, taken as a whole, in the sequence, and R is at most 64;
The representation of each mathematical resource is obtained from the symbol representations of the natural language and the mathematical expressions, the i-th mathematical resource being the interleaved sequence of its natural language words and mathematical expressions;
N_T is the total length of the natural language text;
When the mathematical expression M_{i,j} consists of several chained equations or inequalities, it is split into sub-expressions using the equality and inequality signs as delimiters, and the resulting mathematical resource data set serves as the pre-training data set, where i is the learning resource number, j is the mathematical expression number, and w is the sub-expression number.
In one embodiment, a shift operation is introduced when S2 performs absolute position coding; the mathematical expression tree is an N-ary tree, the root node is coded as all zeros, and any subsequent child node is coded as follows:
S2.1: the child node of each branch is represented by a one-hot code of N bits; for the child node of the r-th branch, the r-th bit of the one-hot code, counted from the right, is 1 and the remaining bits are 0; S2.2: the position code of the parent node is shifted left by N bits and the one-hot code of the branch child node is added to it, giving the final absolute position code of that node; any node in the expression tree is then written as p_{D_n, l_n}, where n is the absolute position code of the node, D_n is the decimal value of the absolute position code, and l_n is the binary code length of D_n; the relative position of two nodes in the tree is computed as:
PE_T(p_{D_n, l_n}, p_{D_m, l_m}) = D_n − (D_m << (l_n − l_m))
where PE denotes the relative position calculation function, T denotes the tree, PE_T(p_{D_n, l_n}, p_{D_m, l_m}) is the relative position between node p_{D_n, l_n} and node p_{D_m, l_m} in the mathematical expression tree, D_m is the absolute code value of node p_{D_m, l_m}, l_m is the binary code length of D_m, and << denotes the left-shift operator.
In one embodiment, step S3 includes:
For the natural language text with linear sequence characteristics, relative position coding is performed: the relative position between two words with absolute positions a and b is defined as the difference of the absolute positions, PE_S(a, b) = a − b; the adopted relative position coding mode codes the positions of the linear sequence as negative integers represented in two's complement, and the length of the linear-sequence position code is L_S = n_T × l_T, where n_T denotes the maximum branching factor of the tree structure and l_T denotes the maximum number of layers of the tree structure;
The root node of the tree structure is then taken as the first node of the linear sequence to achieve the unified position coding of the two structures, and the relative position under the unified position coding is computed as:
PE(x_a, x_b) = PE_S(x_a, x_b), if x_a and x_b are both in the linear sequence;
PE(x_a, x_b) = PE_T(x_a, x_b), if x_a and x_b are both in the tree structure;
PE(x_a, x_b) = PE_S(x_a, root) + PE_T(root, x_b), if x_a is in the linear sequence and x_b is in the tree structure;
PE(x_a, x_b) = PE_T(x_a, root) + PE_S(root, x_b), if x_a is in the tree structure and x_b is in the linear sequence;
where PE(x_a, x_b) denotes the relative position calculation function between any two symbols x_a and x_b, PE_S denotes the relative position calculation function within the linear sequence (S for sequence), PE_T denotes the relative position calculation function within the tree structure, and the cross-structure cases are composed through the root node shared by the two structures.
In one embodiment, when the attention module of the BERT pre-training model is trained by sending the relative position codes of any two nodes in the tree structure and the linear sequence, the functional expression of the relative position codes is as follows:
e^l_{A,B} = (x^l_A · W_{Q,l}) (x^l_B · W_{K,l} + r^l_{A,B})^T / √d
where r^l_{A,B} is the relative position embedding vector between words A and B at the l-th Transformer layer of the BERT model, x^l_A is the embedding vector of word A at the l-th layer, x^l_B is the embedding vector of word B at the l-th layer, W_{Q,l} is the Query matrix of the l-th layer, W_{K,l} is the Key matrix of the l-th layer, d is the word vector dimension, and e^l_{A,B} is the unnormalized attention weight.
In one embodiment, the final mixed word embedding expression is calculated by:
x^{l+1}_A = Σ_{B=1}^{n_1} ê^l_{A,B} (x^l_B · W_{V,l})
where ê^l_{A,B} denotes the softmax normalization of e^l_{A,B}, x^l_B denotes the embedding vector of the B-th word at the l-th layer, W_{V,l} is the Value matrix of the l-th layer, n_1 denotes the total number of words, and when the l-th layer is the last layer, x^{l+1}_A is taken as the final mixed word embedding representation of the A-th word.
The above technical solutions in the embodiments of the present application at least have one or more of the following technical effects:
The invention provides a mixed word embedding method for natural language text and mathematical language text, which comprises the following steps: identifying and preprocessing the mixed text to obtain a mathematical resource data set consisting of text and mathematical expressions; applying absolute position coding to the mathematical expressions, which have a tree structure, and computing relative position codes from the absolute position codes, so that the relative positions of the tree structure are translation invariant; applying a unified position coding to the text, which has a linear structure, and the mathematical expressions, which have a tree structure; and feeding the relative position codes into the attention module of a pre-training model, pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction, so that after pre-training each symbol obtains an embedded vector representation rich in contextual information, making the final word embedding representation richer in information and more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for embedding mixed words of natural language text and mathematical language text provided by an embodiment of the present invention;
FIG. 2 is a flow chart of data preprocessing of a word embedding method in an embodiment of the present invention;
FIG. 3 is a schematic representation of a tree of mathematical expressions in an embodiment of the invention;
FIG. 4 is a schematic diagram of tree position coding in an embodiment of the invention;
FIG. 5 is a diagram of unified position coding in an embodiment of the invention;
FIG. 6 is a schematic diagram of a pre-training model in an embodiment of the invention.
Detailed Description
The invention provides a mixed word embedding method for natural language text and mathematical language text, which comprises the following steps: identifying and preprocessing the mixed text to obtain a mathematical resource data set consisting of text and mathematical expressions; position-coding the mathematical expressions, which have a tree structure, while preserving the translation invariance of relative positions within the tree; applying a unified position coding to the text, which has a linear structure, and the mathematical expressions, which have a tree structure; and feeding the relative position codes into the attention module of a pre-training model, pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction, so that after pre-training each symbol obtains an embedded vector representation rich in contextual information.
Compared with the prior art, the method position-codes the tree structure of the mathematical expression while guaranteeing that the tree-structure position coding is translation invariant with respect to relative positions, applies a unified position coding to the text with linear structural characteristics and the mathematical expression with tree structural characteristics, and uses them in a BERT pre-training model from which the word embedding representation is then extracted.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides a mixed word embedding method of natural language text and mathematical language text, which comprises the following steps:
S1: preprocessing learning resources containing natural language texts and mathematical language texts to obtain a mathematical resource data set, wherein the mathematical language texts are mathematical expressions with tree structures, and the natural language texts are contexts with linear sequence characteristics;
S2: absolute position coding is carried out on the mathematical expression with the tree structure by adopting a position coding mode based on branches, and the relative position coding of two nodes in the tree structure is calculated according to the absolute position coding result;
S3: coding the positions of the context, which has linear sequence characteristics, with negative integers represented in two's complement; then taking the root node of the tree structure as the first node of the linear sequence to achieve a unified position coding of the mathematical expression and the context; and calculating the relative position code of any two nodes across the tree structure and the linear sequence from the unified position coding;
S4: inputting the mathematical resource data set obtained in step S1 into a BERT pre-training model equipped with a position coding module and an attention module, feeding the unified position codes obtained in step S3 into the position coding module and the relative position codes of any two nodes in the tree structure and the linear sequence calculated in step S3 into the attention module for training, and pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction, to obtain a trained word embedding model;
S5: and processing the natural language text and the mathematical language text by using the trained word embedding model to obtain the final mixed word embedding expression.
Referring to fig. 1, a flowchart of a method for embedding mixed words of natural language text and mathematical language text is provided in an embodiment of the present invention.
Specifically, S1 preprocesses the learning resources, S2 encodes the mathematical expression with its tree structure, S3 position-codes the context and achieves the unified position coding of the mathematical expression and the context, S4 trains the BERT pre-training model using the unified position coding, and S5 applies the trained word embedding model.
In one embodiment, step S1 pre-processes a learning resource containing natural language text and mathematical language text, including:
processing the learning resources containing natural language text and mathematical language text into symbol sequences, wherein the mathematical expressions are in LaTeX format and the mathematical resource data set is the set of mathematical resources, expressed as L = {L_1, L_2, …, L_i, …, L_{N'}}, where L_i represents the i-th mathematical resource.
N' represents the total number of mathematical resources.
In one embodiment, processing a learning resource containing natural language text and mathematical language text into a sequence of symbols includes:
Tokenizing each LaTeX-format mathematical expression with the im2markup tokenizer to obtain the symbol sequence of its LaTeX tokenization result, converting the LaTeX-format mathematical expression into an operator tree (OPT) with the Tangent-S tool, and performing a depth-first traversal of the OPT to obtain the symbol sequence of its tree-structure traversal result, so that the j-th mathematical expression M_{i,j} of the i-th mathematical resource is represented by two symbol sequences, one listing its symbols after LaTeX tokenization and the other listing the operators and operands obtained by the depth-first traversal of its OPT; each mathematical resource consists of natural language text and mathematical expressions, where the natural language text is the context of the mathematical expressions, the context of the mathematical expression M_{i,j} is C_{i,j} = {t_z | t_z ∈ L_i, |z − p_{ij}| ≤ R}, t_z denotes the z-th natural language word, p_{ij} is the position of the mathematical expression M_{i,j}, taken as a whole, in the sequence, and R is at most 64;
The representation of each mathematical resource is obtained from the symbol representations of the natural language and the mathematical expressions, the i-th mathematical resource being the interleaved sequence of its natural language words and mathematical expressions;
N_T is the total length of the natural language text;
When the mathematical expression M_{i,j} consists of several chained equations or inequalities, it is split into sub-expressions using the equality and inequality signs as delimiters, and the resulting mathematical resource data set serves as the pre-training data set, where i is the learning resource number, j is the mathematical expression number, and w is the sub-expression number.
In a specific implementation, as shown in Fig. 2, the data are preprocessed to obtain the mathematical resource data set. Learning resources containing natural language and mathematical language, including their solution processes, are processed into symbol sequences through the Mathpix OCR interface to obtain the set of mathematical resources. The Mathpix OCR interface extracts text and mathematical expressions from the images and converts the mathematical formulas to LaTeX format. The entire set of mathematical resources is denoted L = {L_1, L_2, …, L_i, …, L_{N'}}, where L_i denotes the i-th mathematical resource.
The mathematical expressions of each mathematical resource are tokenized with the im2markup tokenizer to obtain the LaTeX symbol sequence of each expression. Each LaTeX-format mathematical expression is converted into an operator tree (OPT) by the Tangent-S tool; the OPT of a mathematical expression is shown in Fig. 3. A depth-first traversal of the OPT then yields the tree-traversal symbol sequence of the expression. Thus one mathematical expression yields two symbol sequences, and the j-th mathematical expression M_{i,j} of the i-th mathematical resource is represented by this pair of sequences, one listing its symbols after LaTeX tokenization and the other listing the operators and operands obtained by the depth-first traversal of its OPT.
Each mathematical resource consists of text and mathematical expressions, where the text is the context of the mathematical expressions; it may contain descriptive and explanatory information about an expression and is the key to the semantic association between mathematical symbols and natural language. Since the length of the input data is limited, the context of a mathematical expression must be bounded. The context of the mathematical expression M_{i,j} is defined as C_{i,j} = {t_z | t_z ∈ L_i, |z − p_{ij}| ≤ R}, where t_z denotes the z-th natural language word, p_{ij} is the position of the mathematical expression M_{i,j}, taken as a whole, in the sequence, and R is at most 64; that is, the 64 words before and the 64 words after the expression, 128 natural language text symbols in total, are taken as the context;
In summary, the i-th mathematical resource L_i is represented as the interleaved sequence of its natural language words and mathematical expressions, where N_T is the total length of the natural language text.
In learning resources the mathematical expression M_{i,j} often contains multiple derivation steps, i.e., several chained equations or inequalities together form one expression. Such an expression is further split into sub-expressions using the equality and inequality signs as delimiters, each sub-expression containing exactly one equality or inequality sign, i.e., one derivation step, and the sub-expressions of a multi-step expression share the same context.
Finally, all learning resources are processed in this way to form the pre-training data set, whose samples are indexed by the learning resource number i, the mathematical expression number j, and the sub-expression number w; a minimal sketch of this preprocessing is given below.
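Purely as an illustration, the following Python sketch shows the two preprocessing operations described above: the context window of radius R around an expression, and the splitting of a chained derivation into sub-expressions at equality and inequality signs. The whitespace handling and all function names are assumptions of the sketch; the actual pipeline relies on Mathpix OCR, im2markup and Tangent-S.

```python
# A minimal sketch of the preprocessing described above (illustrative only).
import re

R = 64  # context window on each side of the expression

def context_window(tokens, expr_pos, r=R):
    """tokens: the resource as a token list; expr_pos: index of the expression."""
    return tokens[max(0, expr_pos - r):expr_pos] + tokens[expr_pos + 1:expr_pos + 1 + r]

def split_sub_expressions(latex_expr):
    """Split a chained derivation on =, <, >, \\le, \\ge, \\ne so that each
    sub-expression keeps exactly one relation sign (one derivation step)."""
    parts = re.split(r'(=|<|>|\\le|\\ge|\\ne)', latex_expr)
    subs = []
    for i in range(1, len(parts) - 1, 2):
        subs.append(parts[i - 1].strip() + ' ' + parts[i] + ' ' + parts[i + 1].strip())
    return subs

print(split_sub_expressions(r"x^2 - 1 = (x-1)(x+1) \ge 0"))
# ['x^2 - 1 = (x-1)(x+1)', '(x-1)(x+1) \\ge 0']
```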
In one embodiment, a shift operation is introduced when S2 performs absolute position coding; the mathematical expression tree is an N-ary tree, the root node is coded as all zeros, and any subsequent child node is coded as follows:
S2.1: the child node of each branch is represented by a one-hot code of N bits; for the child node of the r-th branch, the r-th bit of the one-hot code, counted from the right, is 1 and the remaining bits are 0;
S2.2: the position code of the parent node is shifted left by N bits and the one-hot code of the branch child node is added to it, giving the final absolute position code of that node; any node in the expression tree is then written as p_{D_n, l_n}, where n is the absolute position code of the node, D_n is the decimal value of the absolute position code, and l_n is the binary code length of D_n; the relative position of two nodes in the tree is computed as:
PE_T(p_{D_n, l_n}, p_{D_m, l_m}) = D_n − (D_m << (l_n − l_m))
where PE denotes the relative position calculation function, T denotes the tree, PE_T(p_{D_n, l_n}, p_{D_m, l_m}) is the relative position between node p_{D_n, l_n} and node p_{D_m, l_m} in the mathematical expression tree, D_m is the absolute code value of node p_{D_m, l_m}, l_m is the binary code length of D_m, and << denotes the left-shift operator.
In a specific implementation, step S2 position-codes the mathematical expression with the tree structure and ensures that the tree-structure position coding is translation invariant with respect to relative positions. The relative position reflects the relationship between words; in a linear sequence it is defined as the difference of absolute positions, and translation invariance means that as long as the relative position is the same, the positional offset is the same regardless of the absolute positions of the words. Because of this invariance, words at a fixed relative offset can be semantically associated no matter where they occur, and the training process can learn that semantic relationship.
To ensure that the tree-structure position coding has this relative-position translation invariance, the invention adopts a branch-based coding mode and introduces a shift operation when computing the relative position.
As shown in Fig. 4, assume the mathematical expression tree is a 3-ary tree and the root node is coded (000). For the child node on the first branch, the root's code is shifted left by 3 bits and the one-hot code (001) is added, so that child is coded (000001); similarly, the child on the 2nd branch is coded (000010) and the child on the 3rd branch is coded (000100).
Any node in the expression tree can be represented as p_{D_n, l_n}, where D_n is the decimal value of the absolute position code n and l_n is the binary code length of D_n. The relative position of two nodes in the tree is computed as PE_T(p_{D_n, l_n}, p_{D_m, l_m}) = D_n − (D_m << (l_n − l_m)), from which it can be seen that the relative position between any two nodes in the tree is independent of their absolute positions. As shown in Fig. 4, the relative position value of node p_{20,9} with respect to node p_{0,3} is 20, and the relative position value of node p_{84,12} with respect to node p_{1,6} is also 20. A small sketch of this coding follows.
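As an illustration, the following Python sketch implements the branch-based absolute coding and the shift-based relative position described above, and reproduces the two example values of 20 from Fig. 4; the function names and the 3-ary setting are assumptions of the sketch, and the relative-position formula is the reconstruction given above.

```python
# A minimal sketch of the branch-based tree position coding (illustrative only).
N_ARY = 3  # maximum branching factor; each tree level adds N_ARY bits

def child_code(parent_code, parent_len, branch):
    """Absolute code of the child on branch `branch` (1-based): shift the parent
    left by N_ARY bits, then add the branch's one-hot code."""
    one_hot = 1 << (branch - 1)          # branch r -> r-th bit from the right
    return (parent_code << N_ARY) | one_hot, parent_len + N_ARY

def relative_position(d_n, l_n, d_m, l_m):
    """Relative position of node (d_n, l_n) w.r.t. node (d_m, l_m);
    independent of where the pair sits in the tree."""
    return d_n - (d_m << (l_n - l_m))

root = (0, N_ARY)                                 # p_{0,3}
b2, l2 = child_code(*root, branch=2)              # second-branch child of the root
p20 = child_code(b2, l2, branch=3)                # p_{20,9}
b1, l1 = child_code(*root, branch=1)              # p_{1,6}
b12, l12 = child_code(b1, l1, branch=2)
p84 = child_code(b12, l12, branch=3)              # p_{84,12}

print(relative_position(*p20, *root))             # 20
print(relative_position(*p84, b1, l1))            # 20 (same relative offset)
```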
In one embodiment, step S3 includes:
For the natural language text with linear sequence characteristics, relative position coding is performed: the relative position between two words with absolute positions a and b is defined as the difference of the absolute positions, PE_S(a, b) = a − b; the adopted relative position coding mode codes the positions of the linear sequence as negative integers represented in two's complement, and the length of the linear-sequence position code is L_S = n_T × l_T, where n_T denotes the maximum branching factor of the tree structure and l_T denotes the maximum number of layers of the tree structure;
The root node of the tree structure is then taken as the first node of the linear sequence to achieve the unified position coding of the two structures, and the relative position under the unified position coding is computed as:
PE(x_a, x_b) = PE_S(x_a, x_b), if x_a and x_b are both in the linear sequence;
PE(x_a, x_b) = PE_T(x_a, x_b), if x_a and x_b are both in the tree structure;
PE(x_a, x_b) = PE_S(x_a, root) + PE_T(root, x_b), if x_a is in the linear sequence and x_b is in the tree structure;
PE(x_a, x_b) = PE_T(x_a, root) + PE_S(root, x_b), if x_a is in the tree structure and x_b is in the linear sequence;
where PE(x_a, x_b) denotes the relative position calculation function between any two symbols x_a and x_b, PE_S denotes the relative position calculation function within the linear sequence (S for sequence), PE_T denotes the relative position calculation function within the tree structure, and the cross-structure cases are composed through the root node shared by the two structures.
Specifically, step S3 fuses the linear sequence of the contextual natural language with the coding scheme of S2. For natural language text, which is a linear sequence, the relative position between two words with absolute positions a and b is defined as the difference of those positions, PE_S(a, b) = a − b, which trivially satisfies translation invariance: PE_S(a + k, b + k) = PE_S(a, b) for any offset k.
To unify the linear sequence and the tree structure, the positions of the linear sequence are coded as negative integers represented in two's complement, so that the most significant bit of every natural language word's position code is 1 while the most significant bit of every tree node's position code is 0. The length of the linear-sequence position code is L_S = n_T × l_T, where n_T is the maximum branching factor of the tree structure and l_T is the maximum number of layers of the tree structure. Finally, the root node of the tree structure serves as the first node of the linear sequence, unifying the representation of the two structures, as shown in Fig. 5. The relative position under the unified position coding is computed by the piecewise formula given above. A small sketch of the unified coding follows.
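The unified coding can be sketched as follows; the bit width, helper names, and in particular the composition of a cross-structure relative position through the root are assumptions based on the description above, not the patent's exact formulas.

```python
# A sketch of the unified position coding (illustrative assumptions only):
# context words take negative positions stored as two's complement over
# L_S = n_T * l_T bits (most significant bit 1), tree nodes keep the
# non-negative branch codes from the previous sketch (most significant bit 0),
# and a cross-structure relative position is composed through the shared root.
N_T, L_T = 3, 4                    # assumed maximum branching factor and depth
L_S = N_T * L_T                    # bit width of the linear-sequence position code

def seq_code(k):
    """Two's-complement code of the k-th context word, which sits at position -k."""
    return (-k) & ((1 << L_S) - 1)

def is_tree_code(code):
    """Tree nodes keep a leading 0; sequence words have a leading 1."""
    return ((code >> (L_S - 1)) & 1) == 0

def rel_seq(a, b):
    return a - b                    # PE_S: difference of (signed) absolute positions

def rel_seq_to_tree(word_pos, tree_rel_to_root):
    """Word at signed position word_pos vs. a tree node: go from the word to the
    root (sequence position 0), then from the root into the tree."""
    return rel_seq(word_pos, 0) + tree_rel_to_root

print(bin(seq_code(1)), bin(seq_code(5)))             # leading bit is 1 in both codes
print(is_tree_code(20), is_tree_code(seq_code(5)))    # True False
print(rel_seq_to_tree(-3, 20))                        # word 3 steps before the root vs. p_{20,9}
```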
In one embodiment, when the attention module of the BERT pre-training model is trained by sending the relative position codes of any two nodes in the tree structure and the linear sequence, the functional expression of the relative position codes is as follows:
e^l_{A,B} = (x^l_A · W_{Q,l}) (x^l_B · W_{K,l} + r^l_{A,B})^T / √d
where r^l_{A,B} is the relative position embedding vector between words A and B at the l-th Transformer layer of the BERT model, x^l_A is the embedding vector of word A at the l-th layer, x^l_B is the embedding vector of word B at the l-th layer, W_{Q,l} is the Query matrix of the l-th layer, W_{K,l} is the Key matrix of the l-th layer, d is the word vector dimension, and e^l_{A,B} is the unnormalized attention weight.
In one embodiment, the final mixed word embedding expression is calculated by:
x^{l+1}_A = Σ_{B=1}^{n_1} ê^l_{A,B} (x^l_B · W_{V,l})
where ê^l_{A,B} denotes the softmax normalization of e^l_{A,B}, x^l_B denotes the embedding vector of the B-th word at the l-th layer, W_{V,l} is the Value matrix of the l-th layer, n_1 denotes the total number of words, and when the l-th layer is the last layer, x^{l+1}_A is taken as the final mixed word embedding representation of the A-th word.
Specifically, through the foregoing steps the learning resources are converted into a data set in which the context sequence C_{i,j} and the sub-expressions of each mathematical expression are tokenized, coded with the unified position coding, and sent to the BERT pre-training model. As shown in Fig. 6, the input of the BERT pre-training model consists of three parts: the context sequence, the LaTeX sequence of the mathematical expression, and the depth-first traversal sequence of the expression's OPT. The relative position codes are then fed into the attention module of the BERT pre-training model for training; within the attention module the relative position codes act through the attention-score expression given above, and the final mixed word embedding representation is obtained through the weighted-sum formula given above. A sketch of this attention computation is given below.
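As an illustration of how the relative position codes could enter the attention computation, the following numpy sketch adds a per-pair relative position embedding to the keys before the dot product, in the spirit of the score formula reconstructed above; all array shapes, names, and the random embeddings are assumptions of the sketch, not the patent's implementation.

```python
# A hedged sketch of relative-position attention for one head (illustrative only).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relative_attention(x, W_q, W_k, W_v, rel_emb):
    """x: (n, d) token embeddings of one layer; rel_emb: (n, n, d) relative
    position embeddings r_{A,B}; returns the updated (n, d) representations."""
    n, d = x.shape
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # score e_{A,B} = q_A . (k_B + r_{A,B}) / sqrt(d)
    scores = np.einsum('ad,abd->ab', q, k[None, :, :] + rel_emb) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, d = 6, 16
x = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
rel = rng.normal(size=(n, n, d))           # would come from the relative position codes
out = relative_attention(x, W_q, W_k, W_v, rel)
print(out.shape)                           # (6, 16)
```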
Once the pre-training model is set up, the mixed word embedding representation of the natural language text and the mathematical language text is obtained by adopting the two standard pre-training tasks of Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
In the MLM task, 15% of the words are randomly selected from the three input sequences; of the selected words, 80% are replaced by the [MASK] token, 10% are replaced by random other words, and 10% are left unchanged. The MLM task uses cross entropy as its loss function, denoted in this embodiment as:
L_MLM = − Σ_x p(x) log p̂(x)
where p̂(x) is the estimated probability of the masked word x after linear classification and softmax regression, and p(x) is its original distribution, i.e., its one-hot vector. A sketch of this masking scheme is shown below.
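The 80/10/10 replacement scheme can be illustrated with the following sketch; the token ids, the [MASK] id, the vocabulary size and the label convention (-100 for positions that are not predicted) are assumptions of the example.

```python
# A minimal sketch of the MLM masking scheme described above (illustrative only).
import random

MASK_ID = 103          # assumed id of the [MASK] token
VOCAB_SIZE = 30000     # assumed vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = not predicted
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                          # the masked position is predicted
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID                  # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random other token
        # remaining 10%: keep the original token
    return inputs, labels

ids, labels = mask_tokens([5, 42, 7, 99, 12, 64, 8, 77], seed=0)
print(ids, labels)
```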
In the NSP task, 50% of the samples are randomly selected and their context C_{i,j} is replaced by the context of a randomly chosen other formula. The NSP task also employs cross entropy as its loss function, denoted in this embodiment as:
L_NSP = − [ p·log p̂ + (1 − p)·log(1 − p̂) ]
where p = 1 if the context has not been replaced, p = 0 if it has been replaced, and p̂ is the estimated probability that the context matches the formula. A sketch of this pair construction follows.
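The following short sketch illustrates how such context-replaced pairs could be constructed; the data layout and label convention are assumptions of the example.

```python
# A hedged sketch of building NSP pairs: for half of the (context, expression)
# samples the context is swapped with that of another, randomly chosen sample.
import random

def build_nsp_pairs(samples, seed=0):
    """samples: list of (context_tokens, expression_tokens)."""
    rng = random.Random(seed)
    pairs = []
    for context, expr in samples:
        if rng.random() < 0.5:
            other_context, _ = rng.choice(samples)   # may occasionally pick itself
            pairs.append((other_context, expr, 0))   # label 0: context replaced
        else:
            pairs.append((context, expr, 1))         # label 1: matching context
    return pairs
```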
In summary, compared with the prior art, the method position-codes the tree structure of the mathematical expression while guaranteeing that the tree-structure position coding is translation invariant with respect to relative positions, applies a unified position coding to the text with linear structural characteristics and the mathematical expression with tree structural characteristics, and uses them in a BERT pre-training model from which the word embedding representation is extracted, so that the final word embedding representation contains richer information and is more accurate.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Various modifications may be made to the particular embodiments described, or equivalents may be substituted, by those skilled in the art without departing from the spirit of the invention or exceeding the scope of the invention as defined by the appended claims.

Claims (7)

1. A method for embedding mixed words of natural language text and mathematical language text, comprising:
S1: preprocessing learning resources containing natural language texts and mathematical language texts to obtain a mathematical resource data set, wherein the mathematical language texts are mathematical expressions with tree structures, and the natural language texts are contexts with linear sequence characteristics;
S2: absolute position coding is carried out on the mathematical expression with the tree structure by adopting a position coding mode based on branches, and the relative position coding of two nodes in the tree structure is calculated according to the absolute position coding result;
S3: coding the positions of the context, which has linear sequence characteristics, with negative integers represented in two's complement; then taking the root node of the tree structure as the first node of the linear sequence to achieve a unified position coding of the mathematical expression and the context; and calculating the relative position code of any two nodes across the tree structure and the linear sequence from the unified position coding;
S4: inputting the mathematical resource data set obtained in step S1 into a BERT pre-training model equipped with a position coding module and an attention module, feeding the unified position codes obtained in step S3 into the position coding module and the relative position codes of any two nodes in the tree structure and the linear sequence calculated in step S3 into the attention module for training, and pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next sentence prediction, to obtain a trained word embedding model;
S5: processing the natural language text and the mathematical language text by using the trained word embedding model to obtain a final mixed word embedding expression;
wherein the masked language model task in S4 uses cross entropy as its loss function, expressed as:
L_MLM = − Σ_x p(x) log p̂(x)
where p̂(x) is the estimated probability of the masked word x after linear classification and softmax regression, and p(x) is its original distribution,
and the next sentence prediction task adopts cross entropy as its loss function, expressed as:
L_NSP = − [ p·log p̂ + (1 − p)·log(1 − p̂) ]
where p = 1 if the context is not replaced, p = 0 if the context is replaced, and p̂ is the estimated probability that the context matches the formula.
2. The mixed word embedding method of natural language text and mathematical language text as claimed in claim 1, wherein the step S1 of preprocessing the learning resource containing the natural language text and the mathematical language text comprises:
processing the learning resources containing natural language text and mathematical language text into symbol sequences, wherein the mathematical expressions are in LaTeX format and the mathematical resource data set is the set of mathematical resources, expressed as L = {L_1, L_2, …, L_i, …, L_{N'}}, where L_i represents the i-th mathematical resource.
3. The method of mixed word embedding of natural language text and mathematical language text as claimed in claim 2, wherein processing a learning resource containing natural language text and mathematical language text as a symbol sequence includes:
Tokenizing each LaTeX-format mathematical expression with the im2markup tokenizer to obtain the symbol sequence of its LaTeX tokenization result, converting the LaTeX-format mathematical expression into an operator tree (OPT) with the Tangent-S tool, and performing a depth-first traversal of the OPT to obtain the symbol sequence of its tree-structure traversal result, so that the j-th mathematical expression M_{i,j} of the i-th mathematical resource is represented by two symbol sequences, one listing its symbols after LaTeX tokenization and the other listing the operators and operands obtained by the depth-first traversal of its OPT; each mathematical resource consists of natural language text and mathematical expressions, where the natural language text is the context of the mathematical expressions, the context of the mathematical expression M_{i,j} is C_{i,j} = {t_z | t_z ∈ L_i, |z − p_{ij}| ≤ R}, t_z denotes the z-th natural language word, p_{ij} is the position of the mathematical expression M_{i,j}, taken as a whole, in the sequence, and R is at most 64;
The representation of each mathematical resource is obtained from the symbol representations of the natural language and the mathematical expressions, the i-th mathematical resource being the interleaved sequence of its natural language words and mathematical expressions;
N_T is the total length of the natural language text;
When the mathematical expression M_{i,j} consists of several chained equations or inequalities, it is split into sub-expressions using the equality and inequality signs as delimiters, and the resulting mathematical resource data set serves as the pre-training data set, where i is the learning resource number, j is the mathematical expression number, and w is the sub-expression number.
4. The method for embedding mixed words of natural language text and mathematical language text according to claim 1, wherein S2 introduces a shift operation when performing absolute position coding, the mathematical expression tree is an N-ary tree, the root node is coded as all zeros, and any subsequent child node is coded as follows:
S2.1: the child node of each branch is represented by a one-hot code of N bits; for the child node of the r-th branch, the r-th bit of the one-hot code, counted from the right, is 1 and the remaining bits are 0; S2.2: the position code of the parent node is shifted left by N bits and the one-hot code of the branch child node is added to it, giving the final absolute position code of that node; any node in the expression tree is then written as p_{D_n, l_n}, where n is the absolute position code of the node, D_n is the decimal value of the absolute position code, and l_n is the binary code length of D_n; the relative position of two nodes in the tree is computed as:
PE_T(p_{D_n, l_n}, p_{D_m, l_m}) = D_n − (D_m << (l_n − l_m))
where PE denotes the relative position calculation function, T denotes the tree, PE_T(p_{D_n, l_n}, p_{D_m, l_m}) is the relative position between node p_{D_n, l_n} and node p_{D_m, l_m} in the mathematical expression tree, D_m is the absolute code value of node p_{D_m, l_m}, l_m is the binary code length of D_m, and << denotes the left-shift operator.
5. The mixed word embedding method of natural language text and mathematical language text as claimed in claim 1, wherein the step S3 includes:
For the natural language text with linear sequence characteristics, relative position coding is performed: the relative position between two words with absolute positions a and b is defined as the difference of the absolute positions, PE_S(a, b) = a − b; the adopted relative position coding mode codes the positions of the linear sequence as negative integers represented in two's complement, and the length of the linear-sequence position code is L_S = n_T × l_T, where n_T denotes the maximum branching factor of the tree structure and l_T denotes the maximum number of layers of the tree structure;
The root node of the tree structure is then taken as the first node of the linear sequence to achieve the unified position coding of the two structures, and the relative position under the unified position coding is computed as:
PE(x_a, x_b) = PE_S(x_a, x_b), if x_a and x_b are both in the linear sequence;
PE(x_a, x_b) = PE_T(x_a, x_b), if x_a and x_b are both in the tree structure;
PE(x_a, x_b) = PE_S(x_a, root) + PE_T(root, x_b), if x_a is in the linear sequence and x_b is in the tree structure;
PE(x_a, x_b) = PE_T(x_a, root) + PE_S(root, x_b), if x_a is in the tree structure and x_b is in the linear sequence;
where PE(x_a, x_b) denotes the relative position calculation function between any two symbols x_a and x_b, PE_S denotes the relative position calculation function within the linear sequence (S for sequence), PE_T denotes the relative position calculation function within the tree structure, and the cross-structure cases are composed through the root node shared by the two structures.
6. The method for embedding mixed words of natural language text and mathematical language text according to claim 1, wherein when the attention module of the BERT pre-training model is trained by feeding the relative position codes of any two nodes in the tree structure and the linear sequence, the functional expression of the relative position codes is as follows:
e^l_{A,B} = (x^l_A · W_{Q,l}) (x^l_B · W_{K,l} + r^l_{A,B})^T / √d
where r^l_{A,B} is the relative position embedding vector between words A and B at the l-th Transformer layer of the BERT model, x^l_A is the embedding vector of word A at the l-th layer, x^l_B is the embedding vector of word B at the l-th layer, W_{Q,l} is the Query matrix of the l-th layer, W_{K,l} is the Key matrix of the l-th layer, d is the word vector dimension, and e^l_{A,B} is the unnormalized attention weight.
7. The method for embedding mixed words of natural language text and mathematical language text according to claim 6, wherein the final mixed word embedding expression is calculated by:
x^{l+1}_A = Σ_{B=1}^{n_1} ê^l_{A,B} (x^l_B · W_{V,l})
where ê^l_{A,B} denotes the softmax normalization of e^l_{A,B}, x^l_B denotes the embedding vector of the B-th word at the l-th layer, W_{V,l} is the Value matrix of the l-th layer, n_1 denotes the total number of words, and when the l-th layer is the last layer, x^{l+1}_A is taken as the final mixed word embedding representation of the A-th word.
CN202210469691.4A 2022-04-28 2022-04-28 Mixed word embedding method for natural language text and mathematical language text Active CN114818698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210469691.4A CN114818698B (en) 2022-04-28 2022-04-28 Mixed word embedding method for natural language text and mathematical language text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210469691.4A CN114818698B (en) 2022-04-28 2022-04-28 Mixed word embedding method for natural language text and mathematical language text

Publications (2)

Publication Number Publication Date
CN114818698A CN114818698A (en) 2022-07-29
CN114818698B true CN114818698B (en) 2024-04-16

Family

ID=82508716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210469691.4A Active CN114818698B (en) 2022-04-28 2022-04-28 Mixed word embedding method for natural language text and mathematical language text

Country Status (1)

Country Link
CN (1) CN114818698B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200040652A (en) * 2018-10-10 2020-04-20 고려대학교 산학협력단 Natural language processing system and method for word representations in natural language processing
CN111444709A (en) * 2020-03-09 2020-07-24 腾讯科技(深圳)有限公司 Text classification method, device, storage medium and equipment
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200040652A (en) * 2018-10-10 2020-04-20 고려대학교 산학협력단 Natural language processing system and method for word representations in natural language processing
CN111444709A (en) * 2020-03-09 2020-07-24 腾讯科技(深圳)有限公司 Text classification method, device, storage medium and equipment
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SentiBERT: A Pre-trained Language Model Combining Sentiment Information; Yang Chen; Song Xiaoning; Song Wei; Journal of Frontiers of Computer Science and Technology (No. 09); pp. 1563-1570 *

Also Published As

Publication number Publication date
CN114818698A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN113128229B (en) Chinese entity relation joint extraction method
CN113642330B (en) Rail transit standard entity identification method based on catalogue theme classification
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN107291693B (en) Semantic calculation method for improved word vector model
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
CN109145190B (en) Local citation recommendation method and system based on neural machine translation technology
CN110196913A (en) Multiple entity relationship joint abstracting method and device based on text generation formula
CN109684642B (en) Abstract extraction method combining page parsing rule and NLP text vectorization
CN110110054A (en) A method of obtaining question and answer pair in the slave non-structured text based on deep learning
CN111079431A (en) Entity relation joint extraction method based on transfer learning
CN111832293B (en) Entity and relation joint extraction method based on head entity prediction
CN111209749A (en) Method for applying deep learning to Chinese word segmentation
CN114169312A (en) Two-stage hybrid automatic summarization method for judicial official documents
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
Bilgin et al. Sentiment analysis with term weighting and word vectors
CN110222338A (en) A kind of mechanism name entity recognition method
CN114153971A (en) Error-containing Chinese text error correction, identification and classification equipment
CN111222329B (en) Sentence vector training method, sentence vector model, sentence vector prediction method and sentence vector prediction system
CN113012822A (en) Medical question-answering system based on generating type dialogue technology
CN111967267A (en) XLNET-based news text region extraction method and system
CN115935957A (en) Sentence grammar error correction method and system based on syntactic analysis
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant