CN114818698A - Mixed word embedding method of natural language text and mathematical language text - Google Patents

Mixed word embedding method of natural language text and mathematical language text Download PDF

Info

Publication number
CN114818698A
CN114818698A (application CN202210469691.4A)
Authority
CN
China
Prior art keywords
mathematical
language text
expression
relative position
natural language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210469691.4A
Other languages
Chinese (zh)
Other versions
CN114818698B (en)
Inventor
董石
唐家玉
陶雪云
王志锋
田元
陈加
陈迪
左明章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202210469691.4A priority Critical patent/CN114818698B/en
Publication of CN114818698A publication Critical patent/CN114818698A/en
Application granted granted Critical
Publication of CN114818698B publication Critical patent/CN114818698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a mixed word embedding method for natural language text and mathematical language text. The method comprises: recognizing and preprocessing the mixed text to obtain a mathematical resource data set consisting of text and mathematical expressions; position-encoding the tree-structured mathematical expressions so that relative positions in the tree are translation-invariant; applying a unified position encoding to the linearly structured text and the tree-structured mathematical expressions; and feeding the relative position codes into the attention module of a pre-training model, which is pre-trained on the mathematical resources with the two standard pre-training tasks of masked language modeling and next-sentence prediction. After pre-training, each symbol obtains an embedding vector rich in context information.

Description

Mixed word embedding method of natural language text and mathematical language text
Technical Field
The invention relates to the technical field of natural language processing, in particular to a mixed word embedding method of a natural language text and a mathematical language text.
Background
Mathematical text refers to natural language text containing mathematical expressions; it is ambiguous and polymorphic and appears widely in STEM disciplines and higher education. Natural language text has a linear structure, whereas mathematical expressions have a tree structure, and word embedding representations of such mixed text play a vital role in fields related to mathematical text. Traditional word embedding techniques are suited to processing text with linear characteristics and have difficulty handling mathematical expressions with tree-structured characteristics.
A mathematical expression can be represented by two principal tree structures. One is the Symbol Layout Tree (SLT), which is constructed from the written layout of the expression and captures its appearance information; the other is the Operator Tree (OPT), which is constructed from the operator hierarchy of the expression and captures its mathematical semantic information. In 2021, Peng et al. of Peking University proposed MathBERT, a BERT-based pre-training model for mathematical expressions that can obtain word embeddings of mixed text. The authors take the LaTeX sequence of the expression, the in-order traversal sequence of its OPT, and the context text sequence as BERT inputs, and extract the structural information of the OPT with an attention masking matrix so that adjacent nodes of the tree are mutually visible in the mask. A masked structure prediction task is added to the masked language model and context prediction tasks to train the BERT model. However, this approach artificially limits the range of the attention computation and makes it difficult to capture word embedding information that depends on long distances. In the same year, Shen et al. of Penn State University proposed a MathBERT model for mathematics education and fine-tuned BERT with an automatic scoring task and a knowledge-tracing prediction task. However, the authors use a simple linear sequence of the mathematical text as input, ignoring the tree structure of mathematical expressions, so the resulting word embeddings lack mathematical semantic information.
Disclosure of Invention
Aiming at the technical problems that mathematical text is widely ambiguous and polymorphic depending on context, that existing methods yield word embeddings that are not sufficiently comprehensive or accurate, and that they struggle to extract long-distance semantic relations within mathematical expressions, the invention position-encodes tree-structured mathematical expressions according to the positional representation principle of mathematical structures and the structural characteristics of mixed natural language and mathematical language text, unifies the position encoding of the linearly ordered text and the tree-structured mathematical expressions, and obtains word embeddings of the mixed natural language and mathematical language text by fine-tuning a pre-training model on mathematical language processing tasks.
In order to achieve the above object, the present invention provides a mixed word embedding method of a natural language text and a mathematical language text, comprising:
S1: preprocessing learning resources containing natural language text and mathematical language text to obtain a mathematical resource data set, wherein the mathematical language text consists of mathematical expressions with a tree structure and the natural language text is context with linear sequence characteristics;
S2: performing absolute position encoding of the tree-structured mathematical expressions with a branch-based position encoding scheme, and computing the relative position code of two nodes in the tree structure from the absolute position encoding result;
S3: encoding the context having linear sequence characteristics with negative integer positions represented in two's complement, then taking the root node of the tree structure as the first node of the linear sequence to realize a unified position encoding of the mathematical expression and its context, and then computing the relative position code of any two nodes of the tree structure and the linear sequence from the unified position encoding;
S4: inputting the mathematical resource data set obtained in step S1 into a BERT pre-training model provided with a position encoding module and an attention module, inputting the unified position codes obtained in step S3 into the position encoding module, feeding the relative position codes of any two nodes of the tree structure and the linear sequence computed in step S3 into the attention module of the BERT pre-training model for training, and pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next-sentence prediction to obtain a trained word embedding model;
S5: processing the natural language text and the mathematical language text with the trained word embedding model to obtain the final mixed word embedding representation.
In one embodiment, the step S1 of preprocessing the learning resource containing the natural language text and the mathematical language text includes:
processing a learning resource containing natural language text and mathematical language text into a symbol sequence, wherein the mathematical expressions are in LaTeX format and the mathematical resource data set is the set of mathematical resources $L = \{L_1, L_2, \ldots, L_i, \ldots, L_{N'}\}$, with $L_i$ denoting the i-th mathematical resource.
In one embodiment, processing a learning resource containing natural language text and mathematical language text into a sequence of symbols includes:
segmenting each LaTeX-format mathematical expression with the im2markup segmentation tool to obtain the symbol sequence of the expression's segmentation result, converting the LaTeX-format expression into an operator tree (OPT) with the Tangent-S tool, and performing a depth-first traversal of the OPT to obtain the symbol sequence of the expression's tree-structure traversal result, the j-th mathematical expression of the i-th mathematical resource being represented as
$M_{i,j} = \{s^{L}_{j,1}, \ldots, s^{L}_{j,n'}, \ldots;\; s^{O}_{j,1}, \ldots, s^{O}_{j,k}, \ldots\}$,
where $s^{L}_{j,n'}$ denotes the n'-th symbol of the j-th mathematical expression after LaTeX-format segmentation and $s^{O}_{j,k}$ denotes the k-th symbol obtained by the depth-first traversal of the OPT of the j-th mathematical expression; each mathematical resource consists of natural language text and mathematical expressions, the natural language text being the context of the mathematical expressions, and the context of the mathematical expression $M_{i,j}$ is $C_{i,j} = \{t_z \mid t_z \in L_i,\; |z - p_{ij}| \le R\}$, where $t_z$ denotes the z-th natural language word, $p_{ij}$ is the position of the mathematical expression $M_{i,j}$ as a whole in the sequence, and R is at most 64;
obtaining the representation of each mathematical resource from the symbolic forms of the natural language and the mathematical expressions, the i-th mathematical resource being represented as
$L_i = \{t_1, \ldots, M_{i,1}, \ldots, t_z, \ldots, M_{i,j}, \ldots, t_{N_T}\}$,
where $N_T$ is the total length of the natural language text;
when the mathematical expression $M_{i,j}$ consists of several consecutive equalities or inequalities, it is split at the equality and inequality signs into sub-expressions $M_{i,j} = \{E_{i,j,1}, \ldots, E_{i,j,w}, \ldots\}$;
obtaining the mathematical resource data set from the representation of each mathematical resource as the pre-training model data set $D = \{(C_{i,j}, E_{i,j,w})\}$, where i is the learning resource index, j is the mathematical expression index, and w is the sub-expression index.
In one embodiment, S2 introduces a shift operation into the absolute position encoding; the mathematical expression is an N-ary tree and the root node is defined as $p_{0,N}$ (the all-zero code of N bits). Any subsequent child node is encoded as follows:
S2.1: the child nodes of all branches are represented by one-hot codes of N bits; for the child node of the r-th branch, the r-th bit of the one-hot code counted from the right is 1 and the remaining bits are 0;
S2.2: the position code of the parent node is shifted left by N bits and then added to the one-hot code of the branch child node to obtain the final absolute position code of that node; any node of the expression tree is finally represented as $p_{D_n, l_n}$, where n is the absolute position code of the node, $D_n$ is the decimal representation of the absolute position code, and $l_n$ is the length of the binary code of $D_n$; the relative position of nodes in the tree is computed as
$PE_T(p_{D_m, l_m}, p_{D_n, l_n}) = D_m - (D_n \ll (l_m - l_n))$,
where PE denotes the relative position calculation function, T denotes the tree, $PE_T(\cdot,\cdot)$ is the relative position calculation function of node $p_{D_m,l_m}$ and node $p_{D_n,l_n}$ in the mathematical expression tree, $D_m$ is the decimal value of the absolute position code of node $p_{D_m,l_m}$, $l_m$ is the length of the binary code of $D_m$, and $\ll$ denotes the left shift operator.
In one embodiment, step S3 includes:
for natural language text with linear sequence characteristics, relative position encoding is used, wherein the relative position between two words is defined as the difference of their absolute positions, $PE_S(a, b) = b - a$, a and b being the absolute positions and $PE_S(\cdot,\cdot)$ the relative position calculation function of the sequence; the positions of the linear sequence are encoded with negative integers represented in two's complement, and the length of the linear-sequence position code is $L_S = n_T \times l_T$, where $n_T$ denotes the maximum branching factor of the tree structure and $l_T$ the maximum number of layers of the tree structure;
the root node of the tree structure is taken as the head node of the linear sequence so that the two structures share a unified position encoding, and the relative position under the unified encoding is computed as
$PE(p^m, p^n) = \begin{cases} PE_S(p^m, p^n), & p^m, p^n \in \text{linear sequence} \\ PE_T(p^m, p^n), & p^m, p^n \in \text{tree structure} \\ PE_S(p^m, p^{root}) + PE_T(p^{root}, p^n), & p^m \in \text{linear sequence},\; p^n \in \text{tree structure} \\ PE_T(p^m, p^{root}) + PE_S(p^{root}, p^n), & p^m \in \text{tree structure},\; p^n \in \text{linear sequence} \end{cases}$
where $PE(p^m, p^n)$ denotes the relative position calculation function between any two nodes $p^m$ and $p^n$, $PE_S(\cdot,\cdot)$ denotes the relative position calculation function between nodes of the linear sequence (S denotes the sequence), $PE_T(\cdot,\cdot)$ denotes the relative position calculation function between nodes of the tree structure (T denotes the tree), and $p^{root}$ denotes the root node joining the linear sequence and the tree structure.
In one embodiment, when the relative position codes of any two nodes of the tree structure and the linear sequence are fed into the attention module of the BERT pre-training model for training, relative position encoding acts as follows:
$\tilde{\alpha}^{l}_{A,B} = \dfrac{x^{l}_{A} W^{Q,l} \left( x^{l}_{B} W^{K,l} + r^{l}_{A,B} \right)^{\mathsf{T}}}{\sqrt{d}}$,
where $r^{l}_{A,B}$ is the relative position embedding vector of positions A and B in the l-th Transformer layer of the BERT model, $x^{l}_{A}$ is the A-th word embedding vector of the l-th layer, $x^{l}_{B}$ is the B-th word embedding vector of the l-th layer, $W^{Q,l}$ is the Query matrix of the l-th layer, $W^{K,l}$ is the Key matrix of the l-th layer, d is the word vector dimension, and $\tilde{\alpha}^{l}_{A,B}$ is the unnormalized attention weight.
In one embodiment, the final mixed word embedding is computed as
$x^{l+1}_{A} = \sum_{B=1}^{n_1} \operatorname{softmax}\!\left(\tilde{\alpha}^{l}_{A,B}\right)\, x^{l}_{B} W^{V,l}$,
where $\operatorname{softmax}(\tilde{\alpha}^{l}_{A,B})$ denotes the normalization of $\tilde{\alpha}^{l}_{A,B}$, $x^{l}_{B}$ denotes the B-th word embedding vector of the l-th layer, $W^{V,l}$ is the Value matrix of the l-th layer, $n_1$ is the total number of words, and when layer l is the last layer, $x^{l+1}_{A}$, the embedding of the A-th word output by that layer, is taken as the final mixed word embedding of the expression.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a mixed word embedding method of a natural language text and a mathematical language text, which comprises the following steps: identifying and preprocessing the mixed text to obtain a mathematical resource data set consisting of the text and a mathematical expression; absolute position coding is carried out on a mathematical expression with a tree structure, relative position coding is calculated according to the absolute position coding result, and the relative position translation of the tree structure is kept unchanged; carrying out unified position coding on a text with linear structure characteristics and a mathematical expression with tree structure characteristics; and the relative position codes are sent to an attention module of a pre-training model, a masking language model and a next sentence prediction two standard pre-training tasks are adopted to pre-train mathematical resources, and after the pre-training is finished, each symbol can be represented by an embedded vector rich in context information, so that the information contained in the final word embedded expression is richer, and the expression is more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a mixed word embedding method for natural language text and mathematical language text according to an embodiment of the present invention;
FIG. 2 is a flow chart of data preprocessing for a word embedding method in an embodiment of the present invention;
FIG. 3 is a diagram of a mathematical expression tree in an embodiment of the present invention;
FIG. 4 is a schematic diagram of tree position coding in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a unified position code in an embodiment of the present invention;
FIG. 6 is a diagram of a pre-training model in an embodiment of the invention.
Detailed Description
The invention provides a mixed word embedding method for natural language text and mathematical language text, comprising: recognizing and preprocessing the mixed text to obtain a mathematical resource data set consisting of text and mathematical expressions; position-encoding the tree-structured mathematical expressions so that relative positions in the tree are translation-invariant; applying a unified position encoding to the linearly structured text and the tree-structured mathematical expressions; and feeding the relative position codes into the attention module of a pre-training model, which is pre-trained on the mathematical resources with the two standard pre-training tasks of masked language modeling and next-sentence prediction. After pre-training, each symbol obtains an embedding vector rich in context information.
Compared with the prior art, the invention position-encodes the tree structure of the mathematical expression, ensures that the tree-structure position codes are translation-invariant with respect to relative position, applies a unified position encoding to the text with linear structure characteristics and the mathematical expressions with tree structure characteristics, and uses it in a BERT pre-training model to extract the word embedding representation.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a mixed word embedding method of a natural language text and a mathematical language text, which comprises the following steps:
S1: preprocessing learning resources containing natural language text and mathematical language text to obtain a mathematical resource data set, wherein the mathematical language text consists of mathematical expressions with a tree structure and the natural language text is context with linear sequence characteristics;
S2: performing absolute position encoding of the tree-structured mathematical expressions with a branch-based position encoding scheme, and computing the relative position code of two nodes in the tree structure from the absolute position encoding result;
S3: encoding the context having linear sequence characteristics with negative integer positions represented in two's complement, then taking the root node of the tree structure as the first node of the linear sequence to realize a unified position encoding of the mathematical expression and its context, and then computing the relative position code of any two nodes of the tree structure and the linear sequence from the unified position encoding;
S4: inputting the mathematical resource data set obtained in step S1 into a BERT pre-training model provided with a position encoding module and an attention module, inputting the unified position codes obtained in step S3 into the position encoding module, feeding the relative position codes of any two nodes of the tree structure and the linear sequence computed in step S3 into the attention module of the BERT pre-training model for training, and pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next-sentence prediction to obtain a trained word embedding model;
S5: processing the natural language text and the mathematical language text with the trained word embedding model to obtain the final mixed word embedding representation.
Please refer to fig. 1, which is a flowchart illustrating a method for embedding mixed words of natural language text and mathematical language text according to an embodiment of the present invention.
Specifically, S1 preprocesses the learning resources; S2 encodes the mathematical expressions that have a tree structure; S3 position-encodes the context and realizes the unified position encoding of the mathematical expressions and their context; S4 trains the BERT pre-training model with the unified position encoding; and S5 applies the trained word embedding model.
In one embodiment, the step S1 of preprocessing the learning resource containing the natural language text and the mathematical language text includes:
processing a learning resource containing natural language text and mathematical language text into a symbol sequence, wherein the mathematical expressions are in LaTeX format and the mathematical resource data set is the set of mathematical resources $L = \{L_1, L_2, \ldots, L_i, \ldots, L_{N'}\}$, with $L_i$ denoting the i-th mathematical resource.
N' represents the total number of mathematical resources.
In one embodiment, processing a learning resource containing natural language text and mathematical language text into a sequence of symbols includes:
segmenting each LaTeX-format mathematical expression with the im2markup segmentation tool to obtain the symbol sequence of the expression's segmentation result, converting the LaTeX-format expression into an operator tree (OPT) with the Tangent-S tool, and performing a depth-first traversal of the OPT to obtain the symbol sequence of the expression's tree-structure traversal result, the j-th mathematical expression of the i-th mathematical resource being represented as
$M_{i,j} = \{s^{L}_{j,1}, \ldots, s^{L}_{j,n'}, \ldots;\; s^{O}_{j,1}, \ldots, s^{O}_{j,k}, \ldots\}$,
where $s^{L}_{j,n'}$ denotes the n'-th symbol of the j-th mathematical expression after LaTeX-format segmentation and $s^{O}_{j,k}$ denotes the k-th symbol obtained by the depth-first traversal of the OPT of the j-th mathematical expression; each mathematical resource consists of natural language text and mathematical expressions, the natural language text being the context of the mathematical expressions, and the context of the mathematical expression $M_{i,j}$ is $C_{i,j} = \{t_z \mid t_z \in L_i,\; |z - p_{ij}| \le R\}$, where $t_z$ denotes the z-th natural language word, $p_{ij}$ is the position of the mathematical expression $M_{i,j}$ as a whole in the sequence, and R is at most 64;
obtaining the representation of each mathematical resource from the symbolic forms of the natural language and the mathematical expressions, the i-th mathematical resource being represented as
$L_i = \{t_1, \ldots, M_{i,1}, \ldots, t_z, \ldots, M_{i,j}, \ldots, t_{N_T}\}$,
where $N_T$ is the total length of the natural language text;
when the mathematical expression $M_{i,j}$ consists of several consecutive equalities or inequalities, it is split at the equality and inequality signs into sub-expressions $M_{i,j} = \{E_{i,j,1}, \ldots, E_{i,j,w}, \ldots\}$;
obtaining the mathematical resource data set from the representation of each mathematical resource as the pre-training model data set $D = \{(C_{i,j}, E_{i,j,w})\}$, where i is the learning resource index, j is the mathematical expression index, and w is the sub-expression index.
In a specific implementation, as shown in Fig. 2, the data are preprocessed to obtain the mathematical resource data set. The learning resources containing natural language and mathematical language, including answer processes, are processed into symbol sequences through the Mathpix OCR interface to obtain the set of mathematical resources. The Mathpix OCR interface extracts text and mathematical expressions from images and converts the mathematical formulas into LaTeX format. The whole set of mathematical resources is denoted $L = \{L_1, L_2, \ldots, L_i, \ldots, L_{N'}\}$, where $L_i$ denotes the i-th mathematical resource.
For the mathematical expressions of each resource, the LaTeX-format expression is segmented with the im2markup segmentation tool to obtain the LaTeX symbol sequence of the expression. With the Tangent-S tool, the LaTeX-format expression is converted into an operator tree (OPT), as shown in Fig. 3, and a depth-first traversal of the OPT yields the tree-traversal symbol sequence. One mathematical expression therefore yields two symbol sequences, and the j-th expression of the i-th resource can be expressed as
$M_{i,j} = \{s^{L}_{j,1}, \ldots, s^{L}_{j,n'}, \ldots;\; s^{O}_{j,1}, \ldots, s^{O}_{j,k}, \ldots\}$,
where $s^{L}_{j,n'}$ denotes the n'-th symbol of the j-th mathematical expression after LaTeX-format segmentation and $s^{O}_{j,k}$ denotes the k-th operator or operand obtained by the depth-first traversal of the OPT of the j-th mathematical expression.
Each mathematical resource consists of text and mathematical expressions, the text being the context of the expressions; the context may contain description and interpretation of an expression and is the key to the semantic association between mathematical symbols and natural language. Since the length of the input data is limited, the context of an expression must be bounded. The context of expression $M_{i,j}$ is defined as $C_{i,j} = \{t_z \mid t_z \in L_i,\; |z - p_{ij}| \le R\}$, where $t_z$ denotes the z-th natural language word, $p_{ij}$ is the position of expression $M_{i,j}$ as a whole in the sequence, and R is at most 64; that is, the 64 natural language symbols before and the 64 after the expression, 128 symbols in total, are taken as its context.
in summary, for the ith mathematical resource, it can be expressed as:
Figure BDA0003621858900000094
N T is the total length of the natural language text.
In learning resources, a mathematical expression $M_{i,j}$ often contains a multi-step derivation, i.e., several consecutive equalities or inequalities together form one expression. Such an expression is further split, taking the equality and inequality signs as markers, into sub-expressions $M_{i,j} = \{E_{i,j,1}, \ldots, E_{i,j,w}, \ldots\}$, where each sub-expression $E_{i,j,w}$ contains exactly one equality or inequality sign, i.e., exactly one derivation step, and the sub-expressions of a multi-step derivation share one context.
Finally, all learning resources are processed as above to form the pre-training model data set $D = \{(C_{i,j}, E_{i,j,w})\}$, where i is the learning resource index, j is the mathematical expression index, and w is the sub-expression index.
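A minimal sketch of the context windowing and sub-expression splitting described above is given below, assuming a simple token list with an expression placeholder; the helper names, the placeholder convention, and the splitting regex are illustrative assumptions rather than the patented preprocessing pipeline.

```python
import re
from typing import List, Tuple

def context_window(tokens: List[str], expr_pos: int, radius: int = 64) -> List[str]:
    """Context C_{i,j}: up to `radius` natural-language symbols on each side of the
    expression's position in the resource (at most 128 symbols in total)."""
    lo = max(0, expr_pos - radius)
    hi = min(len(tokens), expr_pos + radius + 1)
    return tokens[lo:expr_pos] + tokens[expr_pos + 1:hi]

def split_subexpressions(latex: str) -> List[str]:
    """Split a multi-step derivation at (in)equality signs so that every sub-expression
    keeps exactly one relation symbol; the sub-expressions share one context."""
    parts = re.split(r"(=|\\leq|\\geq|<|>|\\neq)", latex)
    subs, left = [], parts[0]
    for rel, right in zip(parts[1::2], parts[2::2]):
        subs.append(left + rel + right)
        left = right
    return subs or [latex]

# Hypothetical resource: token list with one expression placeholder at index 3.
tokens = ["Solve", "the", "equation", "<EXPR>", "for", "x", "."]
latex = r"x+1=2=3-1"
dataset: List[Tuple[List[str], str]] = [
    (context_window(tokens, 3), sub) for sub in split_subexpressions(latex)
]
print(dataset)
```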
In one embodiment, S2 introduces a shift operation into the absolute position encoding; the mathematical expression is an N-ary tree and the root node is defined as $p_{0,N}$ (the all-zero code of N bits). Any subsequent child node is encoded as follows:
S2.1: the child nodes of all branches are represented by one-hot codes of N bits; for the child node of the r-th branch, the r-th bit of the one-hot code counted from the right is 1 and the remaining bits are 0;
S2.2: the position code of the parent node is shifted left by N bits and then added to the one-hot code of the branch child node to obtain the final absolute position code of that node; any node of the expression tree is finally represented as $p_{D_n, l_n}$, where n is the absolute position code of the node, $D_n$ is the decimal representation of the absolute position code, and $l_n$ is the length of the binary code of $D_n$; the relative position of nodes in the tree is computed as
$PE_T(p_{D_m, l_m}, p_{D_n, l_n}) = D_m - (D_n \ll (l_m - l_n))$,
where PE denotes the relative position calculation function, T denotes the tree, $PE_T(\cdot,\cdot)$ is the relative position calculation function of node $p_{D_m,l_m}$ and node $p_{D_n,l_n}$ in the mathematical expression tree, $D_m$ is the decimal value of the absolute position code of node $p_{D_m,l_m}$, $l_m$ is the length of the binary code of $D_m$, and $\ll$ denotes the left shift operator.
In a specific implementation, step S2 position-encodes the mathematical expressions having a tree structure and ensures that the tree-structure position codes are translation-invariant with respect to relative position. Relative position reflects the relation between words: in a linear sequence it is defined as the difference of absolute positions, and translation invariance means that whatever the absolute positions of two words, their positional offset is the same as long as their relative position is the same. Because the relative position is unchanged, a word can always be semantically associated with the words at fixed relative positions no matter where it appears, so the training process can capture the semantic relation.
To ensure that the position codes of the tree structure are translation-invariant with respect to relative position, the invention adopts a branch-based encoding scheme and introduces a shift operation when computing relative positions.
As shown in Fig. 4, assume the mathematical expression tree is a 3-ary tree and the root node is defined as (000). For the child node of the first branch, the root code is shifted left by 3 bits and the one-hot code (001) is added, so that this child node is finally represented as (000001); similarly, the child node of the 2nd branch is (000010) and that of the 3rd branch is (000100).
Any node of the expression tree can be represented as $p_{D_n, l_n}$, where $D_n$ is the decimal representation of its absolute position code n and $l_n$ is the length of its binary code. The relative position of two nodes $p_{D_m,l_m}$ and $p_{D_n,l_n}$ in the tree is computed as
$PE_T(p_{D_m, l_m}, p_{D_n, l_n}) = D_m - (D_n \ll (l_m - l_n))$,
and by this formula the relative position between any two nodes of the tree is independent of their absolute positions. As shown in Fig. 4, the relative position of node $p_{20,9}$ and node $p_{0,3}$ is 20, and that of node $p_{84,12}$ and node $p_{1,6}$ is also 20.
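The branch-based encoding and the relative position computation can be sketched in Python as follows; the (D, l) tuple layout and the direction convention of the relative-position function are assumptions consistent with the example above, not a definitive implementation.

```python
from typing import Tuple

# A tree position is stored as (D, l): D is the decimal value of the binary
# position code and l is its length in bits; the root of an N-ary tree is (0, N).

def child_position(parent: Tuple[int, int], branch: int, n_ary: int) -> Tuple[int, int]:
    """Shift the parent's code left by N bits, then add the one-hot code of the
    r-th branch (bit r counted from the right set to 1)."""
    d, length = parent
    return (d << n_ary) | (1 << (branch - 1)), length + n_ary

def tree_relative_position(m: Tuple[int, int], n: Tuple[int, int]) -> int:
    """PE_T as reconstructed above: align the shorter code to the longer one by a
    left shift, then take the difference (assumed direction: m is the deeper node)."""
    d_m, l_m = m
    d_n, l_n = n
    return d_m - (d_n << (l_m - l_n))

root = (0, 3)                                    # (000) for a 3-ary tree
c1 = child_position(root, 1, 3)                  # (000 001) -> (1, 6)
print(c1)
print(tree_relative_position((20, 9), (0, 3)))   # 20, as in the example above
print(tree_relative_position((84, 12), (1, 6)))  # also 20: translation invariance
```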
In one embodiment, step S3 includes:
for natural language text with linear sequence characteristics, relative position encoding is used, wherein the relative position between two words is defined as the difference of their absolute positions, $PE_S(a, b) = b - a$, a and b being the absolute positions and $PE_S(\cdot,\cdot)$ the relative position calculation function of the sequence; the positions of the linear sequence are encoded with negative integers represented in two's complement, and the length of the linear-sequence position code is $L_S = n_T \times l_T$, where $n_T$ denotes the maximum branching factor of the tree structure and $l_T$ the maximum number of layers of the tree structure;
the root node of the tree structure is taken as the head node of the linear sequence so that the two structures share a unified position encoding, and the relative position under the unified encoding is computed as
$PE(p^m, p^n) = \begin{cases} PE_S(p^m, p^n), & p^m, p^n \in \text{linear sequence} \\ PE_T(p^m, p^n), & p^m, p^n \in \text{tree structure} \\ PE_S(p^m, p^{root}) + PE_T(p^{root}, p^n), & p^m \in \text{linear sequence},\; p^n \in \text{tree structure} \\ PE_T(p^m, p^{root}) + PE_S(p^{root}, p^n), & p^m \in \text{tree structure},\; p^n \in \text{linear sequence} \end{cases}$
where $PE(p^m, p^n)$ denotes the relative position calculation function between any two nodes $p^m$ and $p^n$, $PE_S(\cdot,\cdot)$ denotes the relative position calculation function between nodes of the linear sequence (S denotes the sequence), $PE_T(\cdot,\cdot)$ denotes the relative position calculation function between nodes of the tree structure (T denotes the tree), and $p^{root}$ denotes the root node joining the linear sequence and the tree structure.
Specifically, step S3 fuses the linear sequence of the contextual natural language with the encoding model of S2. For natural language text with a linear sequence, the relative position between words is defined as the difference of their absolute positions, $PE_S(a, b) = b - a$, where a and b are absolute positions; translation invariance of the relative position is then satisfied trivially, since $PE_S(a + k, b + k) = PE_S(a, b)$ for any offset k.
To unify the position codes of the linear sequence and the tree structure, the positions of the linear sequence are encoded with negative integers represented in two's complement, so that the highest bit of every natural language word's position code is 1 while the highest bit of every tree-node position code is 0. The length of the linear-sequence position code is $L_S = n_T \times l_T$, where $n_T$ is the maximum branching factor of the tree structure and $l_T$ its maximum number of layers. Finally, the root node of the tree structure is taken as the head node of the linear sequence to obtain a unified representation of the two structures, as shown in Fig. 5. The relative position under the unified encoding is computed by the formula above.
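The following sketch illustrates one possible reading of the unified relative position computation, routing mixed sequence/tree pairs through the shared root node; the ("seq", index) / ("tree", (D, l)) position layout and the case ordering are assumptions made for illustration, not the patent's exact data format.

```python
from typing import Tuple, Union

Position = Tuple[str, Union[int, Tuple[int, int]]]
ROOT_CODE = (0, 3)  # root of a 3-ary expression tree, also sequence position 0

def pe_seq(a: int, b: int) -> int:
    """Relative position in the linear sequence: difference of absolute positions."""
    return b - a

def pe_tree(m: Tuple[int, int], n: Tuple[int, int]) -> int:
    """Relative position inside the tree (see the earlier sketch)."""
    (d_m, l_m), (d_n, l_n) = m, n
    return d_m - (d_n << (l_m - l_n))

def pe_unified(p_m: Position, p_n: Position) -> int:
    """Case analysis of the unified relative position: pairs that mix the linear
    sequence and the tree are routed through the root node joining the two."""
    kind_m, pos_m = p_m
    kind_n, pos_n = p_n
    if kind_m == "seq" and kind_n == "seq":
        return pe_seq(pos_m, pos_n)
    if kind_m == "tree" and kind_n == "tree":
        return pe_tree(pos_m, pos_n)
    if kind_m == "seq":  # sequence word and tree node, via the root
        return pe_seq(pos_m, 0) + pe_tree(pos_n, ROOT_CODE)
    return pe_tree(pos_m, ROOT_CODE) + pe_seq(0, pos_n)

print(pe_unified(("seq", -3), ("seq", -1)))        # 2: purely linear case
print(pe_unified(("seq", -3), ("tree", (20, 9))))  # mixed case, crosses the root
```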
In one embodiment, when the relative position codes of any two nodes of the tree structure and the linear sequence are fed into the attention module of the BERT pre-training model for training, relative position encoding acts as follows:
$\tilde{\alpha}^{l}_{A,B} = \dfrac{x^{l}_{A} W^{Q,l} \left( x^{l}_{B} W^{K,l} + r^{l}_{A,B} \right)^{\mathsf{T}}}{\sqrt{d}}$,
where $r^{l}_{A,B}$ is the relative position embedding vector of positions A and B in the l-th Transformer layer of the BERT model, $x^{l}_{A}$ is the A-th word embedding vector of the l-th layer, $x^{l}_{B}$ is the B-th word embedding vector of the l-th layer, $W^{Q,l}$ is the Query matrix of the l-th layer, $W^{K,l}$ is the Key matrix of the l-th layer, d is the word vector dimension, and $\tilde{\alpha}^{l}_{A,B}$ is the unnormalized attention weight.
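Assuming the relative-position attention takes the form reconstructed above, a small NumPy sketch of the unnormalized attention computation might look as follows; the tensor shapes and random initialization are placeholders, not the model's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8  # tiny sizes for illustration

x = rng.normal(size=(n_tokens, d))            # layer-l word embeddings x_A^l
W_Q = rng.normal(size=(d, d))                 # Query matrix W^{Q,l}
W_K = rng.normal(size=(d, d))                 # Key matrix W^{K,l}
r = rng.normal(size=(n_tokens, n_tokens, d))  # relative position embeddings r_{A,B}^l

q = x @ W_Q  # queries
k = x @ W_K  # keys

# Unnormalized attention weights: a content term plus a relative-position term,
# the assumed form of the formula above; scores[A, B] corresponds to alpha~_{A,B}^l.
scores = (q @ k.T + np.einsum("ad,abd->ab", q, r)) / np.sqrt(d)
print(scores.shape)  # (4, 4)
```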
In one embodiment, the final mixed word embedding is computed as
$x^{l+1}_{A} = \sum_{B=1}^{n_1} \operatorname{softmax}\!\left(\tilde{\alpha}^{l}_{A,B}\right)\, x^{l}_{B} W^{V,l}$,
where $\operatorname{softmax}(\tilde{\alpha}^{l}_{A,B})$ denotes the normalization of $\tilde{\alpha}^{l}_{A,B}$, $x^{l}_{B}$ denotes the B-th word embedding vector of the l-th layer, $W^{V,l}$ is the Value matrix of the l-th layer, $n_1$ is the total number of words, and when layer l is the last layer, $x^{l+1}_{A}$, the embedding of the A-th word output by that layer, is taken as the final mixed word embedding of the expression.
Specifically, through the foregoing steps the learning resources are converted into the data set $D = \{(C_{i,j}, E_{i,j,w})\}$, in which the context sequence $C_{i,j}$ and the mathematical expression $E_{i,j,w}$ are segmented, encoded with the unified position encoding, and fed into the BERT pre-training model. As shown in Fig. 6, the input of the BERT pre-training model consists of three parts: the context sequence, the LaTeX sequence of the mathematical expression, and the depth-first traversal sequence of the expression's OPT. The relative position codes are then fed into the attention module of the pre-training model BERT for training; in the attention module, relative position encoding acts as shown above, and the final mixed word embedding is obtained by the calculation formula given above.
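A corresponding sketch of the value aggregation that yields the mixed word embedding of each symbol is given below, again with placeholder random inputs standing in for the trained BERT parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8

x = rng.normal(size=(n_tokens, d))              # layer-l word embeddings x_B^l
W_V = rng.normal(size=(d, d))                   # Value matrix W^{V,l}
scores = rng.normal(size=(n_tokens, n_tokens))  # unnormalized weights alpha~_{A,B}^l

# Softmax-normalize the weights over B and aggregate the value-projected embeddings;
# row A of `z` is the mixed embedding of the A-th symbol, and at the last layer it is
# taken as the final mixed word embedding.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
z = weights @ (x @ W_V)
print(z.shape)  # (4, 8): one d-dimensional embedding per symbol
```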
After the pre-training model has been adapted in this way, the two standard pre-training tasks, masked language modeling (MLM) and next sentence prediction (NSP), are used to obtain the mixed word embeddings of the natural language text and the mathematical language text.
In the MLM task, 15% of the symbols are randomly drawn from the three input sequences; of the drawn symbols, 80% are replaced by the [MASK] label, 10% are replaced by random other symbols, and 10% are left unchanged. The MLM task uses cross entropy as its loss function, expressed in this embodiment as
$\mathcal{L}_{MLM} = -\sum_{x \in \mathcal{M}} p(x) \log \hat{p}(x)$,
where $\mathcal{M}$ is the set of masked symbols, $\hat{p}(x)$ is the estimated probability of the masked symbol x after linear classification and Softmax regression, and $p(x)$ is its original distribution, i.e., its one-hot vector.
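A minimal sketch of this 15% / 80-10-10 masking scheme, with a toy vocabulary assumed for the random-replacement case, is shown below; it illustrates the corruption step only, not the loss computation.

```python
import random

random.seed(0)
TOY_VOCAB = ["x", "y", "+", "=", "2", "the", "equation", "solve"]  # assumed vocabulary

def mask_for_mlm(tokens, mask_rate=0.15):
    """Draw ~15% of the symbols; of those, replace 80% with [MASK], 10% with a random
    other symbol, and leave 10% unchanged. Returns the corrupted sequence and the
    prediction targets (original symbol at selected positions, None elsewhere)."""
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_rate:
            continue
        targets[i] = tok
        roll = random.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = random.choice(TOY_VOCAB)
        # else: the symbol is kept as is
    return corrupted, targets

print(mask_for_mlm(["solve", "the", "equation", "x", "+", "2", "=", "y"]))
```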
In the NSP task, 50% of the sub-expressions $E_{i,j,w}$ are randomly selected and their context $C_{i,j}$ is replaced by the context of a randomly chosen formula. The NSP task likewise uses cross entropy as its loss function, expressed in this embodiment as
$\mathcal{L}_{NSP} = -\left[ p \log \hat{p} + (1 - p) \log (1 - \hat{p}) \right]$,
where p = 1 if the context has not been replaced, p = 0 if it has been replaced, and $\hat{p}$ is the estimated probability that the context matches the formula.
In summary, compared with the prior art, the invention position-encodes the tree structure of mathematical expressions, ensures that the tree-structure position codes are translation-invariant with respect to relative position, unifies the position encoding of the text with linear structure characteristics and the mathematical expressions with tree structure characteristics, and uses it in a BERT pre-training model to extract the word embedding representation, so that the final word embedding carries more information and is expressed more accurately.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications, additions, or substitutions to the described embodiments without departing from the spirit of the invention or exceeding the scope of the appended claims.

Claims (7)

1. A method for embedding mixed words of a natural language text and a mathematical language text, comprising:
S1: preprocessing learning resources containing natural language text and mathematical language text to obtain a mathematical resource data set, wherein the mathematical language text consists of mathematical expressions with a tree structure and the natural language text is context with linear sequence characteristics;
S2: performing absolute position encoding of the tree-structured mathematical expressions with a branch-based position encoding scheme, and computing the relative position code of two nodes in the tree structure from the absolute position encoding result;
S3: encoding the context having linear sequence characteristics with negative integer positions represented in two's complement, then taking the root node of the tree structure as the first node of the linear sequence to realize a unified position encoding of the mathematical expression and its context, and then computing the relative position code of any two nodes of the tree structure and the linear sequence from the unified position encoding;
S4: inputting the mathematical resource data set obtained in step S1 into a BERT pre-training model provided with a position encoding module and an attention module, inputting the unified position codes obtained in step S3 into the position encoding module, feeding the relative position codes of any two nodes of the tree structure and the linear sequence computed in step S3 into the attention module of the BERT pre-training model for training, and pre-training on the mathematical resources with the two standard pre-training tasks of masked language modeling and next-sentence prediction to obtain a trained word embedding model;
S5: processing the natural language text and the mathematical language text with the trained word embedding model to obtain the final mixed word embedding representation.
2. The mixed word embedding method of the natural language text and the mathematical language text as claimed in claim 1, wherein the step S1 of preprocessing the learning resource containing the natural language text and the mathematical language text includes:
processing a learning resource containing natural language text and mathematical language text into a symbol sequence, wherein the mathematical expressions are in LaTeX format and the mathematical resource data set is the set of mathematical resources $L = \{L_1, L_2, \ldots, L_i, \ldots, L_{N'}\}$, with $L_i$ denoting the i-th mathematical resource.
3. The method of mixed word embedding of natural language text and mathematical language text as claimed in claim 2, wherein processing a learning resource containing natural language text and mathematical language text into a sequence of symbols comprises:
segmenting each LaTeX-format mathematical expression with the im2markup segmentation tool to obtain the symbol sequence of the expression's segmentation result, converting the LaTeX-format expression into an operator tree (OPT) with the Tangent-S tool, and performing a depth-first traversal of the OPT to obtain the symbol sequence of the expression's tree-structure traversal result, the j-th mathematical expression of the i-th mathematical resource being represented as
$M_{i,j} = \{s^{L}_{j,1}, \ldots, s^{L}_{j,n'}, \ldots;\; s^{O}_{j,1}, \ldots, s^{O}_{j,k}, \ldots\}$,
wherein $s^{L}_{j,n'}$ denotes the n'-th symbol of the j-th mathematical expression after LaTeX-format segmentation and $s^{O}_{j,k}$ denotes the k-th symbol obtained by the depth-first traversal of the OPT of the j-th mathematical expression; each mathematical resource consists of natural language text and mathematical expressions, the natural language text being the context of the mathematical expressions, and the context of the mathematical expression $M_{i,j}$ is $C_{i,j} = \{t_z \mid t_z \in L_i,\; |z - p_{ij}| \le R\}$, wherein $t_z$ denotes the z-th natural language word, $p_{ij}$ is the position of the mathematical expression $M_{i,j}$ as a whole in the sequence, and R is at most 64;
obtaining the representation of each mathematical resource from the symbolic forms of the natural language and the mathematical expressions, the i-th mathematical resource being represented as
$L_i = \{t_1, \ldots, M_{i,1}, \ldots, t_z, \ldots, M_{i,j}, \ldots, t_{N_T}\}$,
wherein $N_T$ is the total length of the natural language text;
when the mathematical expression $M_{i,j}$ consists of several consecutive equalities or inequalities, it is split at the equality and inequality signs into sub-expressions $M_{i,j} = \{E_{i,j,1}, \ldots, E_{i,j,w}, \ldots\}$;
obtaining the mathematical resource data set from the representation of each mathematical resource as the pre-training model data set $D = \{(C_{i,j}, E_{i,j,w})\}$, wherein i is the learning resource index, j is the mathematical expression index, and w is the sub-expression index.
4. The mixed word embedding method of natural language text and mathematical language text as claimed in claim 1, wherein S2 introduces a shift operation into the absolute position encoding, the mathematical expression is an N-ary tree, and the root node is defined as $p_{0,N}$ (the all-zero code of N bits); any subsequent child node is encoded as follows:
S2.1: the child nodes of all branches are represented by one-hot codes of N bits; for the child node of the r-th branch, the r-th bit of the one-hot code counted from the right is 1 and the remaining bits are 0; S2.2: the position code of the parent node is shifted left by N bits and then added to the one-hot code of the branch child node to obtain the final absolute position code of that node; any node of the expression tree is finally represented as $p_{D_n, l_n}$, wherein n is the absolute position code of the node, $D_n$ is the decimal representation of the absolute position code, and $l_n$ is the length of the binary code of $D_n$; the relative position of nodes in the tree is computed as
$PE_T(p_{D_m, l_m}, p_{D_n, l_n}) = D_m - (D_n \ll (l_m - l_n))$,
wherein PE denotes the relative position calculation function, T denotes the tree, $PE_T(\cdot,\cdot)$ is the relative position calculation function of node $p_{D_m,l_m}$ and node $p_{D_n,l_n}$ in the mathematical expression tree, $D_m$ is the decimal value of the absolute position code of node $p_{D_m,l_m}$, $l_m$ is the length of the binary code of $D_m$, and $\ll$ denotes the left shift operator.
5. The mixed word embedding method of natural language text and mathematical language text as claimed in claim 1, wherein the step S3 includes:
for natural language text with linear sequence characteristics, relative position encoding is used, wherein the relative position between two words is defined as the difference of their absolute positions, $PE_S(a, b) = b - a$, a and b being the absolute positions and $PE_S(\cdot,\cdot)$ the relative position calculation function of the sequence; the positions of the linear sequence are encoded with negative integers represented in two's complement, and the length of the linear-sequence position code is $L_S = n_T \times l_T$, wherein $n_T$ denotes the maximum branching factor of the tree structure and $l_T$ the maximum number of layers of the tree structure;
the root node of the tree structure is taken as the head node of the linear sequence so that the two structures share a unified position encoding, and the relative position under the unified encoding is computed as
$PE(p^m, p^n) = \begin{cases} PE_S(p^m, p^n), & p^m, p^n \in \text{linear sequence} \\ PE_T(p^m, p^n), & p^m, p^n \in \text{tree structure} \\ PE_S(p^m, p^{root}) + PE_T(p^{root}, p^n), & p^m \in \text{linear sequence},\; p^n \in \text{tree structure} \\ PE_T(p^m, p^{root}) + PE_S(p^{root}, p^n), & p^m \in \text{tree structure},\; p^n \in \text{linear sequence} \end{cases}$
wherein $PE(p^m, p^n)$ denotes the relative position calculation function between any two nodes $p^m$ and $p^n$, $PE_S(\cdot,\cdot)$ denotes the relative position calculation function between nodes of the linear sequence (S denotes the sequence), $PE_T(\cdot,\cdot)$ denotes the relative position calculation function between nodes of the tree structure (T denotes the tree), and $p^{root}$ denotes the root node joining the linear sequence and the tree structure.
6. The method of mixed word embedding of natural language text and mathematical language text as claimed in claim 1, wherein when the relative position codes of any two nodes in the tree structure and the linear sequence are fed into the attention module of the BERT pre-training model for training, the action expressions of the relative position codes are as follows:
$\tilde{\alpha}^{l}_{A,B} = \dfrac{x^{l}_{A} W^{Q,l} \left( x^{l}_{B} W^{K,l} + r^{l}_{A,B} \right)^{\mathsf{T}}}{\sqrt{d}}$,
wherein $r^{l}_{A,B}$ is the relative position embedding vector of positions A and B in the l-th Transformer layer of the BERT model, $x^{l}_{A}$ is the A-th word embedding vector of the l-th layer, $x^{l}_{B}$ is the B-th word embedding vector of the l-th layer, $W^{Q,l}$ is the Query matrix of the l-th layer, $W^{K,l}$ is the Key matrix of the l-th layer, d is the word vector dimension, and $\tilde{\alpha}^{l}_{A,B}$ is the unnormalized attention weight.
7. The method of claim 6, wherein the final mixed-word embedding expression is calculated by:
$x^{l+1}_{A} = \sum_{B=1}^{n_1} \operatorname{softmax}\!\left(\tilde{\alpha}^{l}_{A,B}\right)\, x^{l}_{B} W^{V,l}$,
wherein $\operatorname{softmax}(\tilde{\alpha}^{l}_{A,B})$ denotes the normalization of $\tilde{\alpha}^{l}_{A,B}$, $x^{l}_{B}$ denotes the B-th word embedding vector of the l-th layer, $W^{V,l}$ is the Value matrix of the l-th layer, $n_1$ is the total number of words, and when layer l is the last layer, $x^{l+1}_{A}$, the embedding of the A-th word output by that layer, is taken as the final mixed word embedding of the expression.
CN202210469691.4A 2022-04-28 2022-04-28 Mixed word embedding method for natural language text and mathematical language text Active CN114818698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210469691.4A CN114818698B (en) 2022-04-28 2022-04-28 Mixed word embedding method for natural language text and mathematical language text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210469691.4A CN114818698B (en) 2022-04-28 2022-04-28 Mixed word embedding method for natural language text and mathematical language text

Publications (2)

Publication Number Publication Date
CN114818698A true CN114818698A (en) 2022-07-29
CN114818698B CN114818698B (en) 2024-04-16

Family

ID=82508716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210469691.4A Active CN114818698B (en) 2022-04-28 2022-04-28 Mixed word embedding method for natural language text and mathematical language text

Country Status (1)

Country Link
CN (1) CN114818698B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200040652A (en) * 2018-10-10 2020-04-20 고려대학교 산학협력단 Natural language processing system and method for word representations in natural language processing
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
CN111444709A (en) * 2020-03-09 2020-07-24 腾讯科技(深圳)有限公司 Text classification method, device, storage medium and equipment
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Chen; Song Xiaoning; Song Wei: "SentiBERT: A Pre-trained Language Model Combining Sentiment Information", Journal of Frontiers of Computer Science and Technology, No. 09, pp. 1563-1570 *

Also Published As

Publication number Publication date
CN114818698B (en) 2024-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant