CN107153640A

CN107153640A - A word segmentation method for the elementary mathematics domain

Info

Publication number: CN107153640A
Application number: CN201710317698.3A
Authority: CN
Inventors: 林辉
Original assignee: Chengdu Foresight Nehology Science And Technology Ltd
Current assignee: Lin Hui
Priority date: 2017-05-08
Filing date: 2017-05-08
Publication date: 2017-09-12

Abstract

The invention discloses a word segmentation method for the elementary mathematics domain. First, for the segmentation model required by Chinese word segmentation of elementary mathematics text, formulas, variables, and symbols are defined as words on the basis of a conventional Chinese word segmentation standard, and each is assigned a part of speech according to its class. Next, a mathematics corpus annotated with word segmentation and part-of-speech tags is used to adapt an already trained model, yielding a domain-specific word segmentation and part-of-speech tagging model. Finally, the method judges whether the segmentation result conforms to the conventions of the elementary mathematics domain: if it does, segmentation succeeds; if not, a segmentation post-processor re-segments the text. The method targets the mathematics domain and handles natural language containing symbols, mathematical formulas, figures, and similar elements well. It can thereby advance research on and application of key artificial intelligence technologies, such as natural language processing, image semantic understanding, and machine learning, in the mathematics domain.

Description

A word segmentation method for the elementary mathematics domain

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a word segmentation method for the elementary mathematics domain.

Background art

With the development of information technology and the continuing maturation of artificial intelligence, natural language processing (NLP) has been widely applied, and the related theory and technology have advanced considerably. However, most current research on natural language processing and image semantic recognition concentrates on domains such as news, forums, and blogs; research on specialized domains is scarce, and work that handles symbols, mathematical formulas, and the like is scarcer still. Text in the mathematics domain contains not only natural language but also symbols and mathematical formulas, and the natural language it contains differs in certain respects from the language of everyday communication.

Existing natural language processing algorithms cannot be applied directly to the mathematics domain. For a computer to solve elementary mathematics problems automatically and generate a human-like solution process, it must process natural language that contains symbols, mathematical formulas, figures, and similar elements, which requires combining and extending research on natural language processing and image semantic understanding.

Summary of the invention

In view of the above problems, it is necessary to propose a word segmentation method for the elementary mathematics domain. The method targets the mathematics domain and handles natural language containing symbols, mathematical formulas, figures, and similar elements well. It can thereby advance research on and application of key artificial intelligence technologies, such as natural language processing, image semantic understanding, and machine learning, in the mathematics domain.

The technical solution of the invention is as follows:

A word segmentation method for the elementary mathematics domain comprises the following steps:

S1: for the segmentation model required by Chinese word segmentation of elementary mathematics text, words are defined according to a Chinese word segmentation standard, while formulas, variables, and symbols are also defined as words, each assigned a part of speech according to its class;

S2: a mathematics corpus annotated with word segmentation and part-of-speech tags is used to adapt an already trained model, yielding a domain-specific word segmentation and part-of-speech tagging model;

S3: it is judged whether the segmentation result conforms to the conventions of the elementary mathematics domain; if so, segmentation succeeds; if not, a segmentation post-processor re-segments the text.

In terms of basic framework, the present invention uses a large-scale data processing framework and a feature learning method based on deep learning: a feature set is constructed from a large-scale unannotated corpus, and the feature set is combined with structured machine learning methods to complete the processing tasks. For the specific tasks, the invention develops analysis methods for the mathematics domain based on the textual characteristics of that domain and on research results for the basic problems of general natural language processing.

For the segmentation model required by Chinese word segmentation of elementary mathematics text, the present invention builds on a conventional Chinese word segmentation standard and additionally defines formulas, variables, symbols, and the like as words, assigning each a part of speech according to its class. It then applies a model domain-adaptation method developed by the inventors: a small mathematics corpus annotated with word segmentation and part-of-speech tags is used to adapt a model trained on a news corpus. This approach makes full use of the information in the existing training corpus and, combined with a small amount of annotated data, yields a domain-specific word segmentation and part-of-speech tagging model. Word segmentation is then treated as a position classification problem over characters: B denotes a word-initial character, E a word-final character, M a word-internal character, and S a single-character word; the characters from each B through its E, and each S character, form the segmented words. When the segmentation result does not conform to the conventions of the elementary mathematics domain, a segmentation post-processor re-segments the text, so that statistical and rule-based methods are used together.
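The B/M/E/S grouping rule described above can be illustrated with a minimal Python sketch. The function name `tags_to_words` and its tolerance of malformed tag sequences are assumptions made for illustration; the patent does not specify an implementation.

```python
from typing import List


def tags_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Group characters into words from per-character B/M/E/S tags.

    Hypothetical helper: B starts a word, M continues it, E ends it,
    and S marks a single-character word.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    buffer = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if buffer:  # tolerate a word that was never closed by E
                words.append(buffer)
                buffer = ""
            words.append(ch)
        elif tag == "B":
            if buffer:
                words.append(buffer)
            buffer = ch
        elif tag == "M":
            buffer += ch
        elif tag == "E":
            words.append(buffer + ch)
            buffer = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if buffer:
        words.append(buffer)
    return words
```

For example, the characters of "求方程的最大值" tagged S, B, E, S, B, M, E are grouped into the four words 求, 方程, 的, and 最大值.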

As a further optimization of the above solution, step S1 specifically includes the following step:

Before Chinese word segmentation is performed, elements of the mathematics domain that suffer from data sparsity are converted into corresponding Chinese words according to their class.

On the basis of a conventional Chinese word segmentation standard, formulas, variables, symbols, and the like are also defined as words, each assigned a part of speech according to its class. Under this annotation standard, elements specific to the mathematics domain may suffer from data sparsity; for example, most formulas occur only rarely in the corpus. Therefore, before Chinese word segmentation is performed, these domain-specific elements are first converted into corresponding Chinese words according to their class, which benefits the subsequent Chinese word segmentation and improves its accuracy.
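One possible reading of this class-based conversion is sketched below. The two classes, their regular expressions, and the placeholder words "区间" (interval) and "公式" (formula) are all assumptions chosen for illustration; the patent does not list its classes or the Chinese words it substitutes.

```python
import re
from typing import List, Tuple

# Hypothetical classes and placeholder words; not taken from the patent.
CLASS_PATTERNS = [
    # interval such as [1,2]
    ("区间", re.compile(r"[\[\(]\s*-?\d+\s*,\s*-?\d+\s*[\]\)]")),
    # simple formula such as y=3x²+2x
    ("公式", re.compile(r"[A-Za-z]\s*=\s*[0-9A-Za-z²³+\-*/^().]+")),
]


def replace_math_elements(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace sparse math elements with a class word before segmentation.

    Returns the rewritten text and the (class word, original) pairs so the
    original elements can be restored after segmentation.
    """
    originals: List[Tuple[str, str]] = []
    for placeholder, pattern in CLASS_PATTERNS:
        def substitute(match, placeholder=placeholder):
            originals.append((placeholder, match.group(0)))
            return placeholder
        text = pattern.sub(substitute, text)
    return text, originals
```

Because every interval collapses to the same frequent word, the segmenter sees one common token instead of many rare surface forms, which is the data-sparsity benefit the paragraph above describes.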

As a further optimization of the above solution, step S2 specifically includes the following steps:

S21: to supply the unannotated corpus required by the deep-learning-based feature learning method, elementary mathematics problems and their answer texts are collected and used to train the initial character vector representation;

S22: the training corpus is preprocessed with 4-tag annotation, in which the letter "B" denotes a word-initial character, "E" a word-final character, "M" a word-internal character, and "S" a single-character word; each mathematical expression or special symbol is identified as a single word;

S23: training is performed with the language-model maximization method, with information about the document containing each sentence added.
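The 4-tag preprocessing of step S22 can be sketched as follows, assuming the training corpus is already split into words and that each mathematical expression arrives as one word. The function name `words_to_tags` is hypothetical.

```python
from typing import List, Tuple


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Convert a segmented sentence into (character, tag) training pairs.

    A one-character word is tagged S; a longer word is tagged B, then M for
    each inner character, then E. A mathematical expression passed in as one
    word is therefore kept together as a single B...E span.
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # skip empty strings
        if len(word) == 1:
            pairs.append((word, "S"))
            continue
        pairs.append((word[0], "B"))
        for ch in word[1:-1]:
            pairs.append((ch, "M"))
        pairs.append((word[-1], "E"))
    return pairs
```

Passing `["求", "方程", "[1,2]"]` tags the interval as one five-character word (B, M, M, M, E), which is how an expression is marked as a single word in the training data.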

Deep learning, as used in this solution, is a newer area of machine learning research. Its motivation is to build neural networks that simulate the way the human brain analyzes and learns, imitating the mechanisms of the human brain to interpret data such as images, sound, and text.

To address the need of the deep-learning-based feature learning method for a massive unannotated corpus, the present invention first collects more than 10,000 elementary mathematics problems and their answer texts from online exam-question banks and uses them to train the initial character vector representation. It then preprocesses the training corpus with the traditional 4-tag annotation scheme, in which B denotes a word-initial character, E a word-final character, M a word-internal character, and S a single-character word. Training then proceeds with the language-model maximization method, with information about the document containing each sentence added to improve the accuracy of character vector learning. Because learning character vectors is computationally expensive, a large-scale data processing framework is used to learn in parallel.

When the result after segmentation does not conform to elementary mathematics conventions, the corresponding segmentation post-processing is performed: a mis-segmented sentence is re-segmented based on its context and on mathematical knowledge, so that the segmentation result is well suited to the mathematics domain.

As a further optimization of the above solution, step S3 specifically includes the following steps:

S31: based on context and mathematical rules, the mis-segmented sentence is re-segmented; the characters of the word segment preceding the segmentation error are pushed onto a stack one by one;

S32: the characters are then popped while being matched against the characters of the word segment following the segmentation error;

S33: when a pair of special mathematical symbols is found to match, the original sentence was processed incorrectly, and the word segments preceding and following the segmentation error must be merged into a single word.

In this solution, a mis-segmented sentence is re-segmented based on its context and on mathematical knowledge. The characters of the word segment preceding the segmentation error are pushed onto a stack one by one and are then popped while being matched against the characters of the word segment following the error. If a pair of special mathematical symbols ("()", "{}", "[]") is found to match, the original sentence was processed incorrectly, and the preceding and following word segments must be merged into a single word.
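A minimal sketch of this stack-based repair is shown below. It is one interpretation rather than the patent's implementation: the function names are hypothetical, and it only merges two adjacent segments, so a bracket pair split across three or more segments is not repaired.

```python
from typing import List

CLOSERS = {")": "(", "}": "{", "]": "["}  # closing bracket -> its opener
OPENERS = set(CLOSERS.values())


def unclosed_openers(segment: str) -> List[str]:
    """Push openers onto a stack, pop on a matching closer, return the rest."""
    stack: List[str] = []
    for ch in segment:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in CLOSERS and stack and stack[-1] == CLOSERS[ch]:
            stack.pop()
    return stack


def closes_previous(previous: str, following: str) -> bool:
    """True if `following` closes a bracket left open in `previous`."""
    pending = unclosed_openers(previous)
    local: List[str] = []  # brackets opened inside `following` itself
    for ch in following:
        if ch in OPENERS:
            local.append(ch)
        elif ch in CLOSERS:
            if local and local[-1] == CLOSERS[ch]:
                local.pop()
            elif pending and pending[-1] == CLOSERS[ch]:
                return True
    return False


def merge_split_brackets(segments: List[str]) -> List[str]:
    """Merge adjacent segments that split a bracket pair."""
    merged: List[str] = []
    for segment in segments:
        if merged and closes_previous(merged[-1], segment):
            merged[-1] += segment
        else:
            merged.append(segment)
    return merged
```

Applied to the segments 区间, "[1,", "2]", 上, the sketch rejoins the interval into "[1,2]" and leaves the other segments untouched.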

As a further optimization of the above solution, the segmentation method performs word segmentation using an open-source conditional random field (CRF) tool.

The conditional random field (CRF) used in this solution is a statistical modeling method commonly used in pattern recognition and machine learning, mainly for structured prediction. A CRF is a discriminative undirected probabilistic graphical model, generally used to label or parse sequential data such as natural language text or biological sequences; in computer vision, CRFs are often used for object recognition and image segmentation. An ordinary classifier predicts the label of a single sample without considering "adjacent" samples, whereas a CRF can take context into account; for example, a linear-chain CRF (popular in natural language processing) predicts a sequence of labels for a sequence of input samples.
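For reference, the standard textbook form of a linear-chain CRF is given below. The feature functions $f_k$ and weights $\lambda_k$ are generic symbols, not taken from the patent; in this setting each label $y_t$ would range over the four position tags B, M, E, and S, and $\mathbf{x}$ would be the character sequence of the problem text.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

The dependence of each term on both $y_{t-1}$ and $y_t$ is what lets the model take neighboring labels into account, in contrast to a classifier that labels each character independently.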

The beneficial effects of the invention are as follows:

1. The present invention trains with the language-model maximization method while adding information about the document containing each sentence to improve the accuracy of character vector learning; the text can then be recognized by a computer, which facilitates the shared use of resources.

2. On the basis of a conventional Chinese word segmentation standard, the present invention also defines formulas, variables, symbols, and the like as words, assigning each a part of speech according to its class, and develops a model domain-adaptation method: a small mathematics corpus annotated with word segmentation and part-of-speech tags is used to adapt a model trained on a news corpus. This makes full use of the information in the existing training corpus and, combined with a small amount of annotated data, yields a domain-specific word segmentation and part-of-speech tagging model.

3. The present invention preprocesses the training corpus with the traditional 4-tag annotation scheme, in which B denotes a word-initial character, E a word-final character, M a word-internal character, and S a single-character word. When the result after segmentation does not conform to elementary mathematics conventions, the corresponding segmentation post-processing is performed: a mis-segmented sentence is re-segmented based on its context and on mathematical knowledge, so that the segmentation result is well suited to the mathematics domain.

Brief description of the drawings

Fig. 1 is a flow chart of the word segmentation method for the elementary mathematics domain described in the embodiment of the present invention;

Fig. 2 is a flow chart of Chinese word segmentation without the post-processor, corresponding to Table 2 of the embodiment of the present invention;

Fig. 3 is a flow chart of Chinese word segmentation with the post-processor, corresponding to Table 3 of the embodiment of the present invention.

Embodiment

Embodiments of the invention are described in detail below in conjunction with the accompanying drawings.

Embodiment

As shown in Fig. 1, a word segmentation method for the elementary mathematics domain comprises the following steps:

In one embodiment, step S1 specifically includes the following step:

Formulas, variables, symbols, and the like are also defined as words, each assigned a part of speech according to its class. Under this annotation standard, elements specific to the mathematics domain may suffer from data sparsity; for example, most formulas occur only rarely in the corpus. Therefore, before Chinese word segmentation is performed, these domain-specific elements are first converted into corresponding Chinese words according to their class, which benefits the subsequent Chinese word segmentation and improves its accuracy.

In another embodiment, step S2 specifically includes the following steps:

In another embodiment, step S3 specifically includes the following steps:

S32: the characters are then popped while being matched against the characters of the word segment following the segmentation error;

In another embodiment, the segmentation method performs word segmentation using an open-source conditional random field (CRF) tool.

In the Chinese word segmentation flow of the present invention, shown in Fig. 2 and Fig. 3, word segmentation is treated as a position classification problem over characters: typically B denotes a word-initial character, E a word-final character, M a word-internal character, and S a single-character word, and the characters from each B through its E, and each S character, form the segmented words. Segmentation combines statistical and rule-based methods; the specific steps are as follows:

A. An elementary mathematics problem is input;

B. The trained model performs position tagging on the problem (B denotes a word-initial character, E a word-final character, M a word-internal character, and S a single-character word);

C. The segmentation result (the characters from each B through its E, and each S character, formed into words) is saved in a prepared data structure for later use;

D. If, while relations and data are being extracted from the segmented problem, the segmentation result is found not to conform to elementary mathematics conventions (that is, an error occurred during segmentation), the segmentation post-processing of step E is performed;

E. Parts of the sentence that were segmented in violation of elementary mathematics conventions (mathematical expressions, brackets, and the like) are re-segmented, with a stack used to match brackets so that a pair of brackets ("()", "{}", "[]") is not split apart.
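The check in step D can be approximated by flagging every segment whose brackets do not balance on their own. This is a simplified sketch under assumptions: the patent does not define the full conformance test, the function name `unbalanced_segments` is hypothetical, and a half-open interval such as "[1,2)" would also be flagged by this rule.

```python
from typing import List

CLOSERS = {")": "(", "}": "{", "]": "["}  # closing bracket -> its opener
OPENERS = set(CLOSERS.values())


def unbalanced_segments(segments: List[str]) -> List[int]:
    """Return the 0-based indices of segments with unbalanced brackets."""
    flagged: List[int] = []
    for index, segment in enumerate(segments):
        stack: List[str] = []
        balanced = True
        for ch in segment:
            if ch in OPENERS:
                stack.append(ch)
            elif ch in CLOSERS:
                if stack and stack[-1] == CLOSERS[ch]:
                    stack.pop()
                else:
                    balanced = False  # closer without a matching opener
                    break
        if stack:
            balanced = False  # opener that was never closed
        if not balanced:
            flagged.append(index)
    return flagged
```

An empty result means no bracket was split and step E can be skipped; otherwise the flagged indices point at the segments to hand to the post-processor.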

The flow of the word segmentation method for the elementary mathematics domain is described in detail below with an example:

A problem is selected as input; the problem text is:

Find the maximum of the equation y=3x²+2x on the interval [1,2].

1. The trained CRF model performs position tagging (the first column is the sequence number, the second column is the problem text, and the third column is the position tag); the result is shown in Table 1:

Table 1

2. The characters from each B through its E, and each S character, are formed into words; the segmentation result is shown in Table 2 (the first column is the sequence number, the second column is the position tags, and the third column is the segmentation result):

Table 2

3. Because trained models vary, their position tagging also varies, and the segmentation result cannot be guaranteed to meet actual needs; the problem above may therefore also be segmented as shown in Table 3:

No.   Position tags   Segmentation result
1     S               Find
2     BE              equation
3     BMMMMME         y=3x²+2x
4     S
5     BE              interval
6     BME             [1,
7     BE              2]
8     S               on
9     S               of
10    BME             maximum
11    S               。

Table 3

Clearly, this segmentation result (see Nos. 6 and 7) does not conform to the conventions of the mathematics domain, because an interval has been split apart, so the result must be re-segmented. The characters of the segment at No. 6 are pushed onto a stack one by one and are then popped while being matched against the characters of the segment at No. 7. The "[" in the segment at No. 6 pairs with the "]" in the segment at No. 7, which shows that the original sentence was processed incorrectly; the segments at Nos. 6 and 7 must therefore be merged into one longer segment (the correct segmentation result is shown in Table 2).

The embodiments described above express only specific implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of the present invention.

Claims

1. A word segmentation method for the elementary mathematics domain, characterized by comprising the following steps:

S1: for the segmentation model required by Chinese word segmentation of elementary mathematics text, words are defined according to a Chinese word segmentation standard, while formulas, variables, and symbols are also defined as words, each assigned a part of speech according to its class;

S2: a mathematics corpus annotated with word segmentation and part-of-speech tags is used to adapt an already trained model, yielding a domain-specific word segmentation and part-of-speech tagging model;

S3: it is judged whether the segmentation result conforms to the conventions of the elementary mathematics domain; if so, segmentation succeeds; if not, a segmentation post-processor re-segments the text.

2. The word segmentation method for the elementary mathematics domain according to claim 1, characterized in that step S1 specifically includes the following step:

Before Chinese word segmentation is performed, elements of the mathematics domain that suffer from data sparsity are converted into corresponding Chinese words according to their class.

3. The word segmentation method for the elementary mathematics domain according to claim 1, characterized in that step S2 specifically includes the following steps:

S22: the training corpus is preprocessed with 4-tag annotation, in which the letter "B" denotes a word-initial character, "E" a word-final character, "M" a word-internal character, and "S" a single-character word; each mathematical expression or special symbol is identified as a single word;

4. The word segmentation method for the elementary mathematics domain according to claim 1, characterized in that step S3 specifically includes the following steps:

S31: based on context and mathematical rules, the mis-segmented sentence is re-segmented; the characters of the word segment preceding the segmentation error are pushed onto a stack one by one;

S32: the characters are then popped while being matched against the characters of the word segment following the segmentation error;

S33: when a pair of special mathematical symbols is found to match, the original sentence was processed incorrectly, and the word segments preceding and following the segmentation error must be merged into a single word.

5. The word segmentation method for the elementary mathematics domain according to any one of claims 1-4, characterized in that the segmentation method performs word segmentation using an open-source conditional random field (CRF) tool.