WO2012077979A2

WO2012077979A2 - Method for extracting semantic distance from mathematical sentences and classifying mathematical sentences by semantic distance, device therefor, and computer readable recording medium

Info

Publication number: WO2012077979A2
Application number: PCT/KR2011/009439
Authority: WO
Inventors: 박근태; 박용길; 최형인; 위남숙; 이두석; 손정교; 김행문; 이동학
Original assignee: 에스케이텔레콤 주식회사; 주식회사 아이싸이랩
Priority date: 2010-12-07
Filing date: 2011-12-07
Publication date: 2012-06-14
Also published as: WO2012077979A3

Abstract

An embodiment of the present invention relates to a method for extracting a semantic distance from mathematical sentences and classifying the mathematical sentences by the semantic distance, a device therefor, and a computer readable recording medium. The embodiment of the present invention provides a method for extracting a semantic distance from mathematical sentences and classifying the mathematical sentences by the semantic distance, a device therefor, and a computer readable recording medium, wherein the method comprises: a user query input step for receiving a query from a user; a query parsing step for extracting a keyword that is included in the inputted user query; and a semantic distance extracting step for obtaining similarity by measuring a semantic distance between said extracted keyword and semantic information, in reference to information in which a natural language token that contains the semantic information and a mathematical formula token are indexed.

Description

Method for extracting semantic distance from mathematical sentences and classifying mathematical sentences by semantic distance, device therefor, and computer readable recording medium

Embodiments of the present invention relate to a method for extracting a semantic distance from mathematical sentences and classifying the mathematical sentences by the semantic distance, an apparatus therefor, and a computer-readable recording medium. More specifically, they relate to a method, an apparatus, and a computer-readable recording medium that extract the semantic distance of a mathematical sentence composed of natural language and standardized formulas, so that the similarity between an input mathematical sentence and stored mathematical content can be obtained when the input sentence is used for a search.

The contents described in this section merely provide background information on the embodiments of the present invention and do not constitute a prior art.

Human languages are rich and complex, with enormous vocabularies, intricate grammar, and context-dependent meanings, whereas machines and software applications generally require data to be entered according to fixed formats or rules. Natural language input can nevertheless be used in almost any software application that interacts with a person. A typical natural language processing procedure separates the natural language into tokens and maps them to one or more pieces of action information provided by the software application, and each software application is set up with its own unique set of action information. In other words, software developers write code that interprets natural language input and maps it to the appropriate action information for each application.

However, such a natural language processing method not only fails to recognize formulas, but also cannot provide search results by determining how similar a query for a mathematical sentence is to a stored mathematical sentence.

In order to solve this problem, a main object of an embodiment of the present invention is to automatically extract the semantic information contained in a mathematical sentence composed of natural language and standardized formulas.

In order to achieve the above object, an embodiment of the present invention provides an apparatus for extracting a semantic distance from mathematical sentences and classifying the mathematical sentences by the semantic distance, the apparatus comprising: a user query input unit for receiving a query from a user; a query parsing unit for extracting a keyword included in the input user query; an index information unit for storing indexed natural language tokens and mathematical formula tokens that contain semantic information; and a semantic distance extraction unit for obtaining a similarity by measuring a semantic distance between the extracted keyword and the indexed semantic information.

The apparatus may further include: an information input unit configured to receive a compound sentence containing natural language and a formula; and a semantic parsing unit configured to separate the natural language and the formula from the compound sentence, analyze the pieces of configuration information that make up the separated natural language and formula to generate semantic information, and generate natural language tokens and mathematical formula tokens.

The semantic parser may generate semantic information after converting the compound sentence into a logical combination of simple sentences.

The semantic parsing unit may generate natural language tokens by tokenizing the natural language, generate stop word filtering data by filtering stop words out of the natural language tokens, generate deduplication filtering data by removing duplicate words from the stop word filtering data, and extract the semantic information by matching the deduplication filtering data with action information to which a predefined meaning has been assigned.

The semantic parsing unit may convert the formula into a tree, traverse the formula converted into the tree, generate mathematical formula tokens by tokenizing the traversed formula, and extract the semantic information from them.

The semantic information may be extracted with reference to rules, each of which pairs a combination of natural language and formula with the action information corresponding to that combination, and may include the action information of the compound sentence extracted by comparing the natural language tokens and the mathematical formula tokens with the rules.

The action information may include the structural meaning of the natural language token, the directionality of the natural language token, and the point up to which the influence of the natural language token extends.

The directionality may indicate whether the action information is associated with the formula before the natural language token, with the formula after the natural language token, or is independent of any formula.

The semantic information may include a mathematical object generated by matching a natural language token with the mathematical formula token that it refers to.

The query parsing unit may separate the natural language and the formula from the user query, analyze the pieces of configuration information that make up the separated natural language and formula to generate semantic information, and thereby extract a keyword that includes natural language tokens and mathematical formula tokens.

The semantic distance may be generated as a value proportional to the number of semantic elements common to the semantic elements of the extracted keyword and the semantic elements of the indexed semantic information.

A weight may be assigned to each semantic element.

The semantic distance may be inversely proportional to the sum of the weights of the semantic elements present in both the extracted keyword and the indexed semantic information, and proportional to the sum of the weights of all semantic elements included in the extracted keyword and the indexed semantic information.

In addition, to achieve another object, an embodiment of the present invention provides a method for extracting a semantic distance from mathematical sentences and classifying the mathematical sentences by the semantic distance, the method comprising: a user query input step of receiving a query from a user; a query parsing step of extracting a keyword included in the input user query; and a semantic distance extraction step of obtaining a similarity by measuring a semantic distance between the extracted keyword and indexed semantic information, with reference to information in which natural language tokens and mathematical formula tokens containing the semantic information are indexed.

In addition, to achieve another object, an embodiment of the present invention provides a computer-readable recording medium on which is recorded a program for executing each step of the method for extracting a semantic distance from mathematical sentences and classifying the mathematical sentences by the semantic distance.

As described above, according to an embodiment of the present invention, the semantic distance of a mathematical sentence (simple or compound) composed of natural language and standardized formulas is extracted, so that the similarity between an input mathematical sentence and stored mathematical content can be obtained when the input sentence is used for a search.

In addition, by converting an input compound sentence into a logical combination of simple sentences before generating semantic information, the semantic information can be extracted efficiently. Furthermore, by defining representative keywords that describe the action information expressed in mathematical sentences, the action information of a mathematical sentence can be extracted by matching against the representative keywords even when the action is worded in various ways.

In addition, because formulas that conventional natural language processing cannot recognize are recognized, the similarity between a query for a mathematical sentence and a stored mathematical sentence can be determined and search results can be provided. This provides an effective search environment for mathematical content that cannot be searched by conventional search methods.

FIG. 1 is a block diagram schematically illustrating an apparatus for classifying a mathematical sentence according to an embodiment of the present invention.

FIG. 2 is an exemplary view showing a tree form representation of a compound sentence according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating an XML representation of "(S ₁ ∩S ₂ ) => (~ S ₃ 3S ₄ )".

FIG. 4 is a diagram illustrating a primitive sentence structure of a mathematical sentence expression method.

FIG. 5 is a diagram illustrating an example in which a mathematical sentence is expressed with an action information and a semantic description.

FIG. 6 is a diagram illustrating an example of comparing two expressions expressed by action information and semantic description.

FIG. 7 is a flowchart illustrating a classification method of mathematical sentences according to an embodiment of the present invention.

FIG. 8 illustrates a Boolean value set for each semantic element for an indexed mathematical sentence.

The classification apparatus 100 for mathematical sentences according to an embodiment of the present invention may include an information input unit 110, a semantic parsing unit 120, an index information unit 130, a user query input unit 140, a query parsing unit 150, a semantic distance extractor 160, and a result providing unit 170.

The information input unit 110 receives combination data (a compound sentence) composed of a combination of natural language and formulas. The combination data may be entered directly through a user's manipulation or command, but is not limited thereto; document data composed of a combination of natural language and formulas may also be received from a separate external server.

As shown in FIG. 2, when the structure that a single piece of mathematical content can have is represented as a tree, the child nodes of the mathematical content (the root node) preserve word order information, which is one of the important carriers of meaning, and each child node is either natural language or a formula. Each piece of natural language also takes on a particular meaning depending on the order in which the sentence parts are connected. That is, much mathematical content has a structure in which formulas are woven together around natural language. For example, a formula that follows a piece of natural language may be attached as a specific condition, or the following formula may be a definition. The overall meaning can therefore be extracted by combining the meaning of the natural language at each node with the connection relationships between nodes. In other words, in order to classify action information, such as whether the mathematical content asks for a formula to be solved or explained, the natural language tokens can be grouped and interpreted as a whole. Here, the directionality indicates whether a natural language token in the mathematical content is associated with the formula before it, with the formula after it, or is independent of any formula.

The semantic parsing unit 120 separates the natural language and the formulas from the combination data, analyzes the pieces of configuration information that make up the separated natural language and formulas to generate semantic information, and generates natural language tokens and mathematical formula tokens. Here, the semantic information may include action information and a mathematical object. In detail, when combination data composed of natural language and formulas is input through the information input unit 110, the semantic parsing unit 120 separates and recognizes the natural language and the formulas included in the combination data. The semantic parsing unit 120 then analyzes the configuration information of the separated natural language to generate natural language tokens, generates stop word filtering data by filtering stop words out of the natural language tokens, generates deduplication filtering data by removing duplicates from the stop word filtering data, and matches the deduplication filtering data with action information to which a predefined meaning has been assigned. Here, a token is a unit that can be distinguished within a continuous sentence, and tokenization is the process of dividing natural language into word-level units that the classification apparatus 100 can understand.

Referring to tokenization in more detail, in one embodiment of the present invention tokenization is broadly divided into natural language tokenization and formula tokenization. Natural language tokenization is the process of splitting the natural language included in the combination data (a math problem or compound sentence) on spaces and recognizing each resulting word as a natural language token. Formula tokenization is the process of parsing a formula included in the combination data and recognizing each individual unit of information obtained as a formula token.

[Example 1] Find the function value of 9y³ + 8y² - 4y - 9 with y = -1

For example, in [Example 1], the natural language tokens are 'Find', 'the', 'function', 'value', and 'with', and the formula tokens are the values returned after parsing the formula, such as polynomial (Polynomial), highest degree (Maxdegree = 3), number of terms (Numofterm = 4), and condition (y = -1).
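The two kinds of tokenization described for [Example 1] can be illustrated with a minimal sketch. It assumes that the natural language and the formula have already been separated, and that the formula is given as a plain string such as `9y^3 + 8y^2 - 4y - 9` rather than MathML. The function names and the dictionary layout are hypothetical and are not taken from the embodiment.

```python
import re

def tokenize_natural_language(text):
    """Split natural language on whitespace; each word becomes one token."""
    return text.split()

def tokenize_polynomial(formula, variable="y"):
    """Summarize a single-variable polynomial given as plain text.

    Assumption: terms look like '9y^3', '8y^2', '4y' or '9' and are joined
    by '+' or '-'. Real input would be MathML, not a flat string.
    """
    compact = formula.replace(" ", "")
    # Split into signed terms, e.g. ['9y^3', '+8y^2', '-4y', '-9'].
    terms = re.findall(r"[+-]?[^+-]+", compact)
    degrees = []
    for term in terms:
        match = re.search(re.escape(variable) + r"(?:\^(\d+))?", term)
        if match is None:
            degrees.append(0)                      # constant term
        else:
            degrees.append(int(match.group(1) or 1))
    return {"type": "Polynomial",
            "Maxdegree": max(degrees),
            "Numofterm": len(terms)}

natural_tokens = (tokenize_natural_language("Find the function value")
                  + tokenize_natural_language("with"))
formula_tokens = tokenize_polynomial("9y^3 + 8y^2 - 4y - 9")
formula_tokens["condition"] = "y = -1"   # the condition is kept as its own token
```

For the polynomial of [Example 1] this sketch yields a highest degree of 3 and a term count of 4, matching the values listed in the text.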

Regarding stop word filtering, stop words are a predefined set of words used to remove tokens that are not needed when analyzing a sentence or formula. Among the words of [Example 1], a word such as 'the' (as well as 'a', 'to', and so on) is a stop word and is predefined in the system in dictionary form, where a dictionary means a list containing a set of words. After generating the natural language tokens, the semantic parsing unit 120 removes the stop words that are not needed for analysis. When a math problem is long (for example, a word problem), stop word filtering prevents too many tokens from entering the analysis and thus speeds up processing. Regarding deduplication filtering, consider the math problem "One solution of this equation is three; find another solution that the equation has." Extracting the natural language tokens yields two 'equation' tokens and two 'solution' tokens, and deduplication filtering removes one of each duplicated pair.
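Stop word filtering followed by deduplication filtering might look like the following sketch. The stop word dictionary here holds only the three words named in the text ('the', 'a', 'to'); a real dictionary would be larger. Treating tokens case-insensitively is an assumption made for this illustration.

```python
# Hypothetical stop word dictionary; only 'the', 'a' and 'to' come from the text.
STOP_WORDS = {"the", "a", "to"}

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

def deduplicate(tokens):
    """Keep the first occurrence of each token and drop later repeats."""
    seen = set()
    unique = []
    for token in tokens:
        key = token.lower()
        if key not in seen:
            seen.add(key)
            unique.append(token)
    return unique

tokens = ["Find", "the", "function", "value", "with"]
filtered = filter_stop_words(tokens)     # 'the' is removed
repeated = ["solution", "equation", "solution", "equation"]
unique = deduplicate(repeated)           # one 'solution' and one 'equation' remain
```

Preserving the order of first occurrence keeps the word order information that the description treats as meaningful.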

The semantic parsing unit 120 performs deduplication filtering, which selects and removes duplicated data from the stop word filtering data so that no duplicate elements remain among the natural language tokens, and may match the data corresponding to predicates in the resulting deduplication filtering data with action information to which a predefined meaning has been assigned. Here, action information is summary information that can be extracted from the natural language tokens or the formula tokens. For example, in [Example 1], the action information 'Solve' may be extracted from the natural language tokens or the formula tokens. The data corresponding to predicates is matched and stored in this way in order to capture the representative action of the whole sentence when the combination data (math problem) is defined as a schema, so that it can later serve as an aid when searching for problems or analyzing the similarity between them.

In addition, the semantic parsing unit 120 may convert a formula into a tree, traverse the formula converted into the tree, and tokenize the traversed formula. The semantic parsing unit 120 may convert a formula written in MathML (Mathematical Markup Language) into an XML tree and then into a DOM (Document Object Model). The semantic parsing unit 120 may perform the traversal as a depth-first search, in which the configuration information that makes up the formula is passed step by step from the lowest nodes up to the higher nodes. In more detail, a formula is generally written in MathML, which has a tree structure, and the process of visiting the nodes of the tree to extract information from it is called traversal; a depth-first search can be used to perform it. A depth-first traversal starts at the root of the tree, descends into the child nodes, and returns to a parent node only after all of its child nodes have been visited, so all information held by the child nodes is passed to the parent node. It is efficient because the search only needs to cover as many steps as there are edges, which are the connecting lines between nodes. Although depth-first search is described here, the present invention is not limited thereto.
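The depth-first traversal described above, in which each child passes its information up to its parent, can be sketched with Python's standard `xml.etree.ElementTree` module. The sketch assumes presentation MathML without an XML namespace, and the summary fields (`variables`, `max_degree`, `has_equals`) are hypothetical names chosen for this illustration; it also treats the exponent of any `<msup>` as a degree, which is a simplification.

```python
import xml.etree.ElementTree as ET

# Presentation MathML for x^2 + 2x - 3 = 0 (namespace omitted for brevity).
MATHML = """
<math>
  <mrow>
    <msup><mi>x</mi><mn>2</mn></msup>
    <mo>+</mo>
    <mn>2</mn><mi>x</mi>
    <mo>-</mo>
    <mn>3</mn>
    <mo>=</mo>
    <mn>0</mn>
  </mrow>
</math>
"""

def traverse(node):
    """Post-order depth-first traversal.

    Children are visited first, and the information they collected is
    merged into the summary of their parent node.
    """
    info = {"variables": set(), "max_degree": 0, "has_equals": False}
    for child in node:
        child_info = traverse(child)
        info["variables"] |= child_info["variables"]
        info["max_degree"] = max(info["max_degree"], child_info["max_degree"])
        info["has_equals"] = info["has_equals"] or child_info["has_equals"]

    text = (node.text or "").strip()
    if node.tag == "mi":
        info["variables"].add(text)
        info["max_degree"] = max(info["max_degree"], 1)
    elif node.tag == "mo" and text == "=":
        info["has_equals"] = True
    elif node.tag == "msup" and len(node) == 2 and node[1].tag == "mn":
        # Simplification: the numeric exponent is taken as the degree.
        info["max_degree"] = max(info["max_degree"], int(node[1].text))
    return info

root = ET.fromstring(MATHML.strip())
summary = traverse(root)
```

Each element is visited exactly once, so the work grows with the number of edges in the tree, which is the efficiency argument made in the text.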

In addition, the semantic parser 120 may generate the semantic information after converting the mathematical content into a logical combination of simple sentences.

The semantic parsing unit 120 may express the mathematical content mixed with a mathematical expression and a natural language as a combination of simple sentences, and give meaning through semantic parsing of a portion indicated by C-MathML.

For example, suppose you have a mathematical sentence of the form "find a root that satisfies x ² > 1 for equation x ² + 2x-3 = 0".

The above sentence can be expressed as a combination of simple sentences as follows.

(Example 2)

Solve ((x ² + 2x-3 = 0) ∩ (x ² > 1))

Solve (Square root of quadratic equation ∩ x is greater than 1)

As seen in Example 2 above, every compound sentence can be decomposed into simple sentences joined by logical connectives such as (and), (or), (not), and (if). Although the compound sentence is described here as being divided into simple sentences joined by logical connectives, the present invention is not limited thereto, and a compound sentence may be divided into a plurality of simple sentences in various other ways.

For example, in the XML representation of a compound sentence, a <SentenceRel> tag, which denotes a relationship between sentences, can be used to describe the logical connection between simple sentences. It can be used in the same way as the <apply> tag of MathML.
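As a rough illustration of how a `<SentenceRel>` element could mirror the operator-first layout of `<apply>`, the following sketch builds such an element for Example 2 with `xml.etree.ElementTree`. Only the tag name `SentenceRel` comes from the text. The `<and>` operator child and the `<Sentence>` wrapper are assumptions made for this sketch, and the actual schema may differ.

```python
import xml.etree.ElementTree as ET

def sentence_rel(operator, *operands):
    """Build <SentenceRel> with the operator first, followed by its operands.

    Operands may be plain strings (wrapped in a hypothetical <Sentence>
    element) or already-built elements, which allows nesting.
    """
    rel = ET.Element("SentenceRel")
    ET.SubElement(rel, operator)           # e.g. <and/>, as in <apply><and/>...</apply>
    for operand in operands:
        if isinstance(operand, str):
            ET.SubElement(rel, "Sentence").text = operand
        else:
            rel.append(operand)
    return rel

# Example 2: (x^2 + 2x - 3 = 0) AND (x^2 > 1)
tree = sentence_rel("and", "x^2 + 2x - 3 = 0", "x^2 > 1")
xml_text = ET.tostring(tree, encoding="unicode")
```

Because operands may themselves be `SentenceRel` elements, nested combinations such as an (if) over two (and) groups can be built the same way.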

Meanwhile, the keywords corresponding to the action information extracted as semantic information may be specified in advance. For example, by extracting the action information 'Solve' from the natural language tokens and formula tokens of the mathematical content in Example 1 above, the schema that defines a math problem can carry information about the representative action of the whole problem. Depending on the author, however, "Find the root of (x ² + 2x-3 = 0)" or "Answer (x ² + 2x-3 = 0)" may be written instead of "Solve (x ² + 2x-3 = 0)", so that various terms such as Find and Answer are used as keywords. The keywords are therefore chosen uniquely so that their meanings do not overlap. For example, expressions such as "Find the root of", "Find the solution", "Answer", "Calculate", and "What is the value of" are unified into the action information Solve on the basis of their association with the formula that follows. In addition to Solve, there can be various other kinds of action information, such as Evaluate, Integrate, Differentiate, Factorize, and Expand.

Therefore, if the various input terms that may be used are designated in advance for each keyword corresponding to action information, the same action information can be extracted from differently worded inputs that share a single meaning.
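The unification of differently worded requests into one representative keyword can be sketched as a lookup table. The phrases listed for Solve are the ones named in the text. The single-word entries for the other actions, and the use of simple substring matching, are assumptions made for this illustration and ignore the association with the following formula that the embodiment also takes into account.

```python
# Representative keyword -> input phrases that should map to it.
# The Solve phrases come from the text; the remaining entries are hypothetical.
ACTION_KEYWORDS = {
    "Solve": ["solve", "find the root of", "find the solution",
              "answer", "calculate", "what is the value of"],
    "Integrate": ["integrate"],
    "Differentiate": ["differentiate"],
    "Factorize": ["factorize"],
    "Expand": ["expand"],
}

def extract_action(sentence):
    """Return the representative action keyword for a sentence, or None."""
    lowered = sentence.lower()
    for action, phrases in ACTION_KEYWORDS.items():
        if any(phrase in lowered for phrase in phrases):
            return action
    return None

first = extract_action("Find the root of x^2 + 2x - 3 = 0")
second = extract_action("Answer x^2 + 2x - 3 = 0")
```

Both calls return the same keyword, Solve, so two problems worded differently end up with identical action information.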

In addition, the extracted action information may include the structural meaning of the natural language token, the directionality of the natural language token, and the point up to which the influence of the natural language token extends. Here, the directionality may indicate whether the action information is associated with the formula before the natural language token, with the formula after the natural language token, or is independent of any formula.

Meanwhile, the semantic parsing unit 120 may express the semantic information obtained from a formula. For example, (x ² + 2x-3 = 0) may be expressed as "Action (quadratic equation)" or "Action (Polynomial (degree = 2))", but the present invention is not limited thereto.

FIG. 4 is a diagram illustrating the primitive sentence structures of the mathematical sentence expression method. The sentence expression formats listed in FIG. 4 are representative, and more complicated forms may be added through the analysis of math problems.

The semantic information of the math problem may include motion information and a math object.

Action information represents the basic purpose that the mathematical sentence asks to be fulfilled. For example, it is information extracted from a problem that tells the person solving it what action to take, such as whether the sentence asks for a problem to be solved or a concept to be explained. This information is returned by applying predefined rules to the natural language tokens and formula tokens.

The semantic parsing unit 120 may include, as semantic information, a mathematical object generated by matching a natural language token with the mathematical formula token that it refers to.

To extract and automatically express the actual meaning of a mathematical sentence composed of a compound sentence containing natural language and formulas, the semantic parsing unit 120 may perform the following steps:

1. Constructing the rule relationships between formulas and natural language

2. Finding the action information of the sentence by reading the sentence expressed in natural language and formulas

3. Constructing a mathematical object

Mathematical objects represent each of the subdivided entities included in a math problem. In other words, they can indicate which techniques or facts are needed to solve the problem and which types of functions the problem contains. This object concept supports extensibility to a wide variety of math problems. Mathematical object information may be obtained both from the natural language and from the formulas.

A mathematical object can hold information corresponding to knowledge such as techniques, definitions, and theorems. This information is extensible: if problem analysis shows that more information is needed, categories of the desired type can be created and added.

The semantic information of math problems has a very broad range of applications. For example, if someone wants to practice solving quadratic equations, the desired problems can be provided quickly on the basis of the semantic information, instead of comparing the natural language of every stored math problem, parsing all of the XML in MathML form, and checking whether each problem contains the desired information. The semantic information can also be used to determine the ranking of the retrieved problems, which helps the user obtain optimal search results.

The operation information and the math object of the acquired math problem can be stored in various forms according to the storage device, which can be expressed in parallel, serial, nested form, and the like.

The semantic description of a mathematical object, which is represented in c-MathML as a component of a simple sentence, can be composed as shown in [Table 1] and [Table 2]. Mathematical objects represented in c-MathML are delimited by the <MathObj> tag, and relationships among mathematical objects can be expressed with the <MathRel> tag, as shown in [Table 1] and [Table 2].

Table 1

Table 2

The index information unit 130 stores information obtained by indexing the semantic information extracted by the semantic parsing unit 120. That is, the index information unit 130 indexes the semantic information received from the semantic parsing unit 120 and stores the result. The index information unit 130 may generate semantic index information by indexing the semantic information, and may generate query index information that matches keyword information to the semantic index information.
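The text does not specify how the index is organized. One common choice, used here purely as an assumption, is an inverted index that maps each semantic element to the stored problems containing it. The class and method names below are hypothetical.

```python
from collections import defaultdict

class SemanticIndex:
    """Hypothetical inverted index: semantic element -> ids of stored problems."""

    def __init__(self):
        self._postings = defaultdict(set)   # element -> set of problem ids
        self._elements = {}                 # problem id -> set of elements

    def add(self, problem_id, semantic_elements):
        """Index one stored problem under each of its semantic elements."""
        self._elements[problem_id] = set(semantic_elements)
        for element in semantic_elements:
            self._postings[element].add(problem_id)

    def candidates(self, query_elements):
        """Return ids of problems sharing at least one element with the query."""
        found = set()
        for element in query_elements:
            found |= self._postings.get(element, set())
        return found

    def elements(self, problem_id):
        """Return the semantic elements stored for a problem."""
        return self._elements[problem_id]

index = SemanticIndex()
index.add("p1", {"Solve", "quadratic equation"})
index.add("p2", {"Integrate", "quadratic equation"})
hits = index.candidates({"quadratic equation"})
```

With such a structure only the candidate problems need to be passed on for semantic distance measurement, rather than every stored problem.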

The user query input unit 140 receives a query from the user and transfers the input user query to the query parser 150. Here, the user query is a kind of search query and includes a keyword inputted by the user to search.

The user query input unit 140 may perform a similar operation to that of the information input unit 110, and may receive a combination data (complex sentence) composed of a combination of natural language and expression. Combination data consisting of a combination of natural language and formula may be directly input by a user's manipulation or command, but is not necessarily limited thereto, and may receive document data consisting of a combination of natural language and formula from a separate external server.

The query parsing unit 150 extracts a keyword included in the input user query. The extracted keyword may include semantic information, and the query parsing unit 150 may semantically parse the input user query to extract a keyword that includes semantic information. The query parsing unit 150 may operate similarly to the semantic parsing unit 120. That is, the query parsing unit 150 separates the natural language and the formulas from the compound sentence input through the user query input unit 140, analyzes the pieces of configuration information that make up the separated natural language and formulas to generate semantic information, and thereby generates a keyword that includes natural language tokens and mathematical formula tokens. The sentence input through the user query input unit 140 may contain only natural language or only a formula. If the input sentence contains only natural language, the generated keyword contains only natural language tokens; if it contains only a formula, the generated keyword contains only mathematical formula tokens.

The semantic distance extractor 160 obtains the similarity between the semantic information included in the keyword extracted by the query parsing unit 150 and the semantic information of the indexed information generated by the semantic parsing unit 120 and stored in the index information unit 130, by measuring the semantic distance that represents that similarity.

Suppose there is a general sentence such as Example 3.

Example 3: Find two roots of x ² + 2x-3 = 0

If the above mathematical sentence is expressed with action information and a semantic description, it can be represented as shown in FIG. 5.

In Example 3, the formula (x ² + 2x-3 = 0) of the general sentence cannot itself be the target of a query; the semantic description information, 'quadratic equation', becomes the query target instead. Without a defined schema a semantic query cannot be processed, so the semantic description information serves as the means by which the semantic distance extractor 160 performs semantic processing.

In addition to a simple semantic description such as 'quadratic equation' in Example 3, a schema can be defined and expressed by combining the various semantic descriptions obtained through problem structuring (topics, problems, solutions, and so on).

Table 3, Table 4, Table 5, and Table 6 show examples of XML description of one equation.

Table 3

Table 4

Table 5

Table 6

As shown in Tables 3, 4, 5, and 6, mathematical content expressed in natural language and standardized formulas is converted into a form that the classification apparatus 100 can understand, and semantic information is extracted on the basis of the meaning of the natural language and the formulas and structured into an XML tree.

The semantic distance extractor 160 obtains the similarity by measuring the semantic distance between the extracted keyword and the semantic information.

Here, the semantic distance means the distance between the semantic descriptions assigned in the process of converting compound sentences composed of formulas and natural language.

For example, suppose there are two types of sentences, such as Examples 4 and 5.

(Example 4) "Calculate two roots of equation x ² + 2x-3 = 0."

(Example 5) "Calculate the integral for the quadratic x ² + 3x + 5."

The above two expressions can be expressed as 6A and 6B of FIG. 6 when expressed by action information and semantic description.

As shown in FIG. 6, the general sentences containing (x ² + 2x-3 = 0) and (x ² + 3x + 5) pose completely different problems, namely finding the roots of a quadratic expression and finding its indefinite integral, but their semantic descriptions can both be determined to be 'quadratic expression'. Therefore, once a semantic distance is defined, the semantic distance between various sentences can easily be measured. For example, semantic distances can be scored by defining the distance between the problem of finding the roots of a quadratic expression and the problem of finding its indefinite integral as 2, and the distance between the indefinite integral and the derivative as 1.
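The scoring example above amounts to a table of pairwise distances between kinds of action. A minimal sketch follows. The two distances come from the text, while the action names, the symmetric lookup, and the distance of 0 between identical actions are assumptions made for this illustration.

```python
# Pairwise distances between actions. The values 2 and 1 come from the text;
# using frozenset keys makes the lookup symmetric, which is an assumption.
ACTION_DISTANCE = {
    frozenset({"Solve", "Integrate"}): 2,
    frozenset({"Integrate", "Differentiate"}): 1,
}

def action_distance(first, second):
    """Return the defined distance between two actions.

    Identical actions are assumed to have distance 0. A pair that has not
    been defined raises KeyError rather than guessing a value.
    """
    if first == second:
        return 0
    return ACTION_DISTANCE[frozenset({first, second})]

roots_vs_integral = action_distance("Solve", "Integrate")               # 2
integral_vs_derivative = action_distance("Differentiate", "Integrate")  # 1
```

Raising an error for undefined pairs keeps the table explicit: every distance used in ranking has to be defined by whoever maintains the schema.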

In obtaining the semantic distance, the semantic distance extractor 160 may determine it as a value proportional to the number of semantic elements common to the semantic elements of the extracted keyword and the semantic elements of the index information stored in the index information unit 130. Although the semantic distance is described here as a value proportional to the number of common semantic elements, the formula implementing it may take various forms; for example, the values of the semantic elements may be multiplied together to obtain the number of common semantic elements, and the semantic distance may be generated in proportion to the resulting value.

In addition, the semantic distance may be implemented such that the more common semantic elements there are relative to the total semantic elements of the two formula problems, the shorter the semantic distance. Alternatively, without considering the total number of semantic elements of the two problems, the semantic distance may be implemented to be shorter the more semantic elements the two problems share and longer the fewer they share.

As an example for defining a semantic distance based on the correlation between the semantic information of the keyword input by the user's query and the indexed and stored semantic information, the semantic distance extractor 160 may use the cosine similarity shown in Equation 1.

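Equation 1

$$\cos(q, p) = \frac{\sum_{i=1}^{v} q_i \, p_i}{\sqrt{\sum_{i=1}^{v} q_i^{2}} \; \sqrt{\sum_{i=1}^{v} p_i^{2}}}$$

(p: problem vector, q: query vector, v: number of elements in the vector)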

The semantic information of each of the first and second mathematical sentences whose semantic distance is to be measured may be expressed as a Boolean vector. For example, as shown in FIG. 8, p_i is a Boolean value indicating whether semantic element i is present in the first mathematical sentence p, and q_i is a Boolean value indicating whether semantic element i is present in the second mathematical sentence q. In other words, the semantic elements of a mathematical sentence may include polynomial, function, argument, factor, problem solving, evaluating, number of variables, and degree; if a Boolean value is set for each semantic element of each mathematical sentence, every mathematical sentence can be represented by a Boolean vector of semantic elements, as shown in FIG. 8.

If there are six sentences whose semantic information is indexed and stored as shown in FIG. 8, the Boolean vectors of the mathematical problems are: problem 1 = (1,1,1,0,1,1,1), problem 2 = (1,1,1,1,0,0,0), problem 3 = (0,0,0,0,0,1,1), ...

In this case, if, for example, the Boolean vector of the semantic elements included in the keyword extracted from the user query is (1,1,1,1,1,1,1), the semantic distance can be obtained by applying Equation 1 to all of the indexed and stored semantic information.

Applying Equation 1 to the Boolean vector of the user query and the Boolean vector of problem 1 gives 6 / (√7 · √6), approximately 0.926, and applying it to the Boolean vector of the user query and the Boolean vector of problem 2 gives 4 / (√7 · √4), approximately 0.756. Similarly, Equation 1 can be applied to all indexed problems to obtain their semantic distance from the Boolean vector of the user query.
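The calculation above can be reproduced with a short script. This is a minimal sketch that assumes Equation 1 is the ordinary cosine similarity the text names; the function name `cosine_similarity` and the dictionary `problems` are illustrative, and returning 0 for an all-zero vector is an assumption the source does not address. The vectors are the ones listed in the text.

```python
import math


def cosine_similarity(q, p):
    """Cosine similarity between two equal-length Boolean (0/1) vectors."""
    if len(q) != len(p):
        raise ValueError("vectors must have the same length")
    dot = sum(qi * pi for qi, pi in zip(q, p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    if norm_q == 0 or norm_p == 0:
        return 0.0  # assumption: an all-zero vector shares nothing with any other
    return dot / (norm_q * norm_p)


query = (1, 1, 1, 1, 1, 1, 1)
problems = {
    "problem 1": (1, 1, 1, 0, 1, 1, 1),
    "problem 2": (1, 1, 1, 1, 0, 0, 0),
    "problem 3": (0, 0, 0, 0, 0, 1, 1),
}

for name, vector in problems.items():
    print(name, round(cosine_similarity(query, vector), 4))
```

With the all-ones query, the value depends only on how many semantic elements each problem has, so problem 1 (six elements) scores higher than problem 2 (four elements).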

In Equation 1, cos(q, p), which represents the semantic distance, takes a value from 0 to 1, and the closer the value is to 1, the higher the semantic similarity between the two sentences. If cos(q, p) is 1, the semantic information of the two sentences p and q is exactly the same. If cos(q, p) is 0, there is no corresponding semantic information, that is, there is no semantic similarity between the two sentences p and q.

When the semantic distance extractor 160 generates the semantic distance as a value proportional to the number of semantic elements common to the semantic elements of the extracted keyword and the semantic elements of the index information stored in the index information unit 130, a weight may be set for each semantic element.

As another example for defining a semantic distance based on the correlation between the semantic information of the keyword input by the user's query and the indexed and stored semantic information, the semantic distance extractor 160 may also use the weighted cosine similarity shown in Equation 2.

Equation 2

(p: problem vector, q: query vector, w_i: weight, v: number of elements in the vector)

That is, the semantic distance may be calculated by assigning a weight w_i to each semantic element. In this case, the more weighted semantic elements match, the closer the semantic distance between the two mathematical sentences.

As with Equation 1, if cos_w(q, p) in Equation 2 has a value of "0", there is no corresponding semantic information or it is unrelated to the problem. On the other hand, if cos_w(q, p) has a value of "1", there is corresponding semantic information. When the weights w_i are assigned according to the hierarchical relation or the importance among the semantic information, the cosine of the angle between the mathematical sentence vector p and the query vector q can be obtained using the matrix.

In Equation 2, cos_w(q, p), which represents the semantic distance, takes a value from 0 to 1, and the closer the value is to 1, the higher the semantic similarity between the two sentences. If cos_w(q, p) is 1, the semantic information of the two sentences p and q is exactly the same, and if cos_w(q, p) is 0, there is no semantic similarity between the two sentences p and q.
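As an illustration of how per-element weights change the result, the sketch below uses one common form of weighted cosine similarity, in which each term of the dot product and of the norms is multiplied by w_i. This form is an assumption: Equation 2 itself is not reproduced in this text and may be defined differently (for example with a weight matrix). The name `weighted_cosine_similarity` and the sample weights are hypothetical.

```python
import math


def weighted_cosine_similarity(q, p, w):
    """Weighted cosine similarity with non-negative per-element weights w.

    Assumed form: sum(w*q*p) / (sqrt(sum(w*q*q)) * sqrt(sum(w*p*p))).
    """
    if not (len(q) == len(p) == len(w)):
        raise ValueError("q, p and w must have the same length")
    dot = sum(wi * qi * pi for wi, qi, pi in zip(w, q, p))
    norm_q = math.sqrt(sum(wi * qi * qi for wi, qi in zip(w, q)))
    norm_p = math.sqrt(sum(wi * pi * pi for wi, pi in zip(w, p)))
    if norm_q == 0 or norm_p == 0:
        return 0.0  # assumption: nothing weighted is present in one of the vectors
    return dot / (norm_q * norm_p)


query = (1, 1, 1, 1, 1, 1, 1)
problem_2 = (1, 1, 1, 1, 0, 0, 0)

uniform = (1, 1, 1, 1, 1, 1, 1)        # reduces to the unweighted cosine similarity
first_heavier = (2, 1, 1, 1, 1, 1, 1)  # hypothetical: first element counts double

print(round(weighted_cosine_similarity(query, problem_2, uniform), 4))
print(round(weighted_cosine_similarity(query, problem_2, first_heavier), 4))
```

Raising the weight of an element that both vectors contain moves the value toward 1, which matches the statement that matching weighted elements bring two sentences closer.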

Meanwhile, the semantic distance between two equations may also be defined to be inversely proportional to the sum of the weights of the semantic elements common to the two equations and proportional to the sum of the weights of all the semantic elements included in the two equations.

For example, let the union of the semantic elements of equation A and the semantic elements of equation B be S = {s₁, s₂, ..., s_N}, and let the set of weights corresponding to each of the N elements of the union be W = {w₁, w₂, ..., w_N}.

In this case, over the elements s_m (m = 1, ..., N) of S, the weights of those semantic elements s_m that are present in both equation A and equation B are added together to calculate the sum of the common semantic weights (E).

Therefore, the semantic distance D between equations A and B can be calculated by Equation 3, that is, D = Sum(w_m) / E.

Equation 3

$$D = \frac{\sum_{m=1}^{N} w_m}{E}$$

As shown in Equation 3, the semantic distance between the two equations is inversely proportional to the sum E of the weights of the semantic elements common to the two equations, and proportional to the sum of the weights of all the semantic elements included in the two equations, Sum(w_m).

In this case, the weight value w_m may be the same for all semantic elements (e.g., 1), or may differ for each semantic element according to the relative importance of the semantic elements.

For example, suppose there are three problems (A, B, and C):

1. Problem A: Solve the equation x² + 2x + 1 = 0.

2. Problem B: Solve the equation x² - 4 = 0.

3. Problem C: Solve the equation x³ - 1 = 0 (where x > 0).

Also, suppose the semantic information extracted from the above problems is as follows:

The semantic elements of problem A: action (solve), order (quadratic equation), number of terms (3).

The semantic elements of problem B: action (solve), order (quadratic equation), number of terms (2).

The semantic elements of problem C: action (solve), order (cubic equation), number of terms (2), conditional inequality.

If the weight of every semantic element is assumed to be 1, Equation 3 gives the following results. For problems A and B, the sum of the weights of all semantic elements (Sum(w_m)) is 3, and the common semantic elements are the action ('solve') and the order (quadratic equation), so the sum of the weights of the common semantic elements (E) is 2 and D = 3/2 = 1.5. For problems A and C, Sum(w_m) is 4 and the only common semantic element is the action ('solve'), so E is 1 and D = 4/1 = 4. For problems B and C, Sum(w_m) is 4 and the common semantic elements are the action ('solve') and the number of terms (2), so E is 2 and D = 4/2 = 2.

If the order of the equation is regarded as the most important information, so that the weight for the order is 2 and the weight for each remaining element is 1, the results change as follows. For problems A and B, Sum(w_m) is 4 and the common semantic elements are the action ('solve') and the order (quadratic equation), so E is 3 and D = 4/3 ≈ 1.33. For problems A and C, Sum(w_m) is 5 and E is 1, so D = 5/1 = 5. For problems B and C, Sum(w_m) is 5 and E is 2, so D = 5/2 = 2.5.

With the semantic distance values between mathematical problems obtained as above, a small value indicates that the similarity between the two problems is high and a large value indicates that it is low, and this information can be used accordingly.
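The worked example for Equation 3 can be checked with a few lines of Python. This is a sketch under stated assumptions: each problem is modeled as a dictionary from a semantic element name to its value, an element counts as common only when both problems have it with the same value, and a pair with no common element is given an infinite distance. The names `semantic_distance`, `A`, `B`, `C`, and the dictionary keys are illustrative, not taken from the source.

```python
def semantic_distance(a, b, weights=None, default_weight=1.0):
    """Equation 3: D = Sum(w_m) / E for two problems given as {element: value}."""
    weights = weights or {}
    elements = set(a) | set(b)  # union S of the semantic elements of both problems
    total = sum(weights.get(s, default_weight) for s in elements)  # Sum(w_m)
    common = sum(
        weights.get(s, default_weight)
        for s in elements
        if s in a and s in b and a[s] == b[s]  # present in both with the same value
    )  # E
    if common == 0:
        return float("inf")  # assumption: nothing in common means unbounded distance
    return total / common


A = {"action": "solve", "order": 2, "terms": 3}
B = {"action": "solve", "order": 2, "terms": 2}
C = {"action": "solve", "order": 3, "terms": 2, "condition": "inequality"}

# All weights equal to 1
print(semantic_distance(A, B), semantic_distance(A, C), semantic_distance(B, C))

# Order weighted 2, everything else 1
w = {"order": 2.0}
print(semantic_distance(A, B, w), semantic_distance(A, C, w), semantic_distance(B, C, w))
```

Under these assumptions the script yields the same six distances as the text: 1.5, 4, and 2 with uniform weights, and 4/3, 5, and 2.5 when the order is weighted 2.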

The result provider 170 may provide a ranking result page of the query index information scored based on the similarity calculated by measuring the semantic distance. Here, the ranking result page may be provided to the server or terminal that requested it, but is not necessarily limited thereto. When the mathematical sentence classification apparatus 100 is implemented as a stand-alone device, it may include a display unit and display the ranking result page itself.

That is, the user query input through the user query input unit 140 is parsed by the query parsing unit 150 and transmitted to the semantic distance extractor 160. The result provider 170 then scores the indexed and stored mathematical content by comparing, on the basis of the semantic distance, the correlation between its index and the user query, and outputs the ranking on the user result page.
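The scoring and ranking flow described above might look like the following sketch. It is not the apparatus's actual implementation: the index contents, the identifiers `p1` to `p3`, and the functions `similarity` and `rank` are hypothetical. The score is the cosine similarity of Boolean vectors, written on sets of element names, where it equals the number of shared elements divided by the square root of the product of the two set sizes.

```python
import math


def similarity(query_elems, problem_elems):
    """Cosine similarity of two Boolean vectors, expressed on sets of element names."""
    if not query_elems or not problem_elems:
        return 0.0
    shared = len(query_elems & problem_elems)
    return shared / math.sqrt(len(query_elems) * len(problem_elems))


def rank(query_elems, index):
    """Return (problem_id, score) pairs sorted from most to least similar."""
    scored = [(pid, similarity(query_elems, elems)) for pid, elems in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical index of stored problems and the semantic elements extracted from each
index = {
    "p1": {"polynomial", "degree", "solve"},
    "p2": {"polynomial", "integral"},
    "p3": {"function", "evaluate"},
}
query = {"polynomial", "degree", "solve"}

for pid, score in rank(query, index):
    print(pid, round(score, 3))
```

Sorting in descending order of similarity puts the stored problem whose semantic elements best match the query at the top of the result page.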

According to an embodiment of the present invention, a method of classifying a mathematical sentence includes: an information input step (S710) of receiving a compound sentence including natural language and a formula; a semantic parsing step (S720) of separating the natural language and the formula from the compound sentence, generating semantic information by analyzing the constituent information of each of the separated natural language and formula, and generating natural language tokens and mathematical tokens; an index information step (S730) of indexing and storing the extracted semantic information; a user query input step (S740) of receiving a query from a user; a query parsing step (S750) of extracting a keyword included in the input user query; a semantic distance extraction step (S760) of obtaining a similarity by measuring the semantic distance between the extracted keyword and the semantic information; and a result providing step (S770) of providing a ranking result page of the query index information scored by the similarity calculated from the measured semantic distance.

Here, the information input step (S710) corresponds to the operation of the information input unit 110, the semantic parsing step (S720) to the operation of the semantic parsing unit 120, the index information step (S730) to the operation of the index information unit 130, the user query input step (S740) to the operation of the user query input unit 140, the query parsing step (S750) to the operation of the query parsing unit 150, the semantic distance extraction step (S760) to the operation of the semantic distance extractor 160, and the result providing step (S770) to the operation of the result provider 170, so detailed descriptions thereof are omitted.

As described above, the mathematical sentence classification method according to an embodiment of the present invention described with reference to FIG. 7 may be implemented as a program and recorded in a computer-readable recording medium. A computer-readable recording medium on which a program for implementing the method of classifying a mathematical sentence according to an embodiment of the present invention is recorded includes all kinds of recording devices that store data readable by a computer system. Examples of such computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over network-coupled computer systems so that the computer-readable code is stored and executed in a distributed fashion. In addition, functional programs, code, and code segments for implementing an embodiment of the present invention may be easily deduced by programmers in the technical field to which the embodiment belongs.

In the above description, all the components constituting the embodiments of the present invention are described as being combined into one or operating in combination, but the present invention is not necessarily limited to these embodiments. That is, within the scope of the present invention, one or more of the components may be selectively combined and operated, and those skilled in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Therefore, the embodiments disclosed in the present invention are intended to describe, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The protection scope of the present invention should be interpreted by the following claims, and all technical ideas within the equivalent scope should be interpreted as being included in the scope of the present invention.

As described above, according to an embodiment of the present invention, the semantic distance included in a mathematical sentence composed of natural language and standardized formulas is extracted so that a similarity to the stored mathematical content can be assigned when an input mathematical sentence is searched, thereby providing a user search environment. The invention is therefore highly industrially applicable.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application claims priority under 35 U.S.C. §119(a) to Korean Patent Application No. 10-2010-0124384, filed in Korea on December 7, 2010, and Korean Patent Application No. 10-2011-0130024, filed in Korea on December 7, 2011, the entire contents of which are incorporated by reference into this patent application. In addition, if this patent application claims priority for the same reason in countries other than the United States, all of those contents are likewise incorporated into this patent application by reference.

Claims

An apparatus for extracting a semantic distance of mathematical sentences and classifying mathematical sentences by semantic distance, comprising:

a user query input unit for receiving a query from a user;

a query parsing unit for extracting a keyword included in the input user query;

an index information unit for indexing natural language tokens and mathematical tokens including semantic information; and

a semantic distance extracting unit for obtaining a similarity by measuring the semantic distance between the extracted keyword and the indexed semantic information.
The apparatus of claim 1, further comprising:

an information input unit for receiving a compound sentence including natural language and a formula; and

a semantic parsing unit for separating the natural language and the formula from the compound sentence, generating semantic information by analyzing the constituent information of each of the separated natural language and formula, and generating natural language tokens and mathematical tokens.
The apparatus of claim 2, wherein the semantic parsing unit generates the semantic information after converting the compound sentence into a logical combination of simple sentences.
The apparatus of claim 2, wherein the semantic parsing unit generates natural language tokens by tokenizing the natural language, generates stop-word filtering data by filtering stop words from the natural language tokens, generates deduplication filtering data by performing deduplication filtering on the stop-word filtering data, and extracts the semantic information by matching action information, to which a predefined meaning is assigned, to the deduplication filtering data.
The apparatus of claim 2, wherein the semantic parsing unit converts the formula into a tree form, performs a traversal of the formula converted into the tree form, generates mathematical tokens by tokenizing the traversed formula, and extracts the mathematical tokens as the semantic information.
The apparatus of claim 1, wherein the semantic information includes action information of the compound sentence, the action information being extracted by referring to rules that store combinations of natural language and formulas together with the action information corresponding to each combination, and by comparing the natural language tokens and the mathematical tokens against the rules.
The apparatus of claim 6, wherein the action information includes the structural meaning of the natural language token, the directionality of the natural language token, and the influence of the natural language token.
The apparatus of claim 7, wherein the directionality indicates whether the action information relates to a formula preceding the natural language token, relates to a formula following the natural language token, or is independent.
The apparatus of claim 1, wherein the semantic information includes a mathematical object generated by matching, among the mathematical tokens, the formula that is the target of the natural language token.
The apparatus of claim 1, wherein the query parsing unit separates the natural language and the formula from the user query, generates semantic information by analyzing the constituent information of each of the separated natural language and formula, and extracts a keyword including natural language tokens and mathematical tokens.
The apparatus of claim 1, wherein the semantic distance is generated as a value proportional to the number of semantic elements common to the semantic elements of the extracted keyword and the semantic elements of the indexed semantic information.
The apparatus of claim 11, wherein a weight is set for each semantic element.
The apparatus of claim 1, wherein the semantic distance becomes shorter as the number of common semantic elements increases relative to the total semantic elements of the extracted keyword and the indexed semantic information, and becomes longer as the number of common semantic elements decreases relative to the total semantic elements.
The apparatus of claim 1, wherein the semantic distance is inversely proportional to the sum of the weights of the semantic elements present in both the extracted keyword and the indexed semantic information, and proportional to the sum of the weights of all the semantic elements included in the extracted keyword and the indexed semantic information.
A method for extracting a semantic distance of mathematical sentences and classifying mathematical sentences by semantic distance, comprising:

a user query input step of receiving a query from a user;

a query parsing step of extracting a keyword included in the input user query; and

a semantic distance extraction step of obtaining a similarity by measuring the semantic distance between the extracted keyword and the indexed semantic information, with reference to information in which natural language tokens and mathematical tokens including semantic information are indexed.
The method of claim 15, wherein the semantic information is generated by:

an information input step of receiving a compound sentence including natural language and a formula; and

a semantic parsing step of separating the natural language and the formula from the compound sentence, generating semantic information by analyzing the constituent information of each of the separated natural language and formula, and generating natural language tokens and mathematical tokens.
The method of claim 16, wherein the semantic parsing step generates the semantic information after converting the compound sentence into a logical combination of simple sentences.
The method of claim 16, wherein the semantic parsing step generates natural language tokens by tokenizing the natural language, generates stop-word filtering data by filtering stop words from the natural language tokens, generates deduplication filtering data by performing deduplication filtering on the stop-word filtering data, and extracts the semantic information by matching action information, to which a predefined meaning is assigned, to the deduplication filtering data.
The method of claim 16, wherein the semantic parsing step converts the formula into a tree form, performs a traversal of the formula converted into the tree form, generates mathematical tokens by tokenizing the traversed formula, and extracts the mathematical tokens as the semantic information.
The method of claim 16, wherein the semantic information includes action information of the compound sentence, the action information being extracted by referring to rules that store combinations of natural language and formulas together with the action information corresponding to each combination, and by comparing the natural language tokens and the mathematical tokens against the rules.
The method of claim 20, wherein the action information includes the structural meaning of the natural language token, the directionality of the natural language token, and the influence of the natural language token.
The method of claim 16, wherein the semantic information includes a mathematical object generated by matching, among the mathematical tokens, the formula that is the target of the natural language token.
The method of claim 15, wherein the query parsing step separates the natural language and the formula from the user query, generates semantic information by analyzing the constituent information of each of the separated natural language and formula, and extracts a keyword including natural language tokens and mathematical tokens.
The method of claim 15, wherein the semantic distance is generated as a value proportional to the number of semantic elements common to the semantic elements of the extracted keyword and the semantic elements of the indexed semantic information.
The method of claim 15, wherein the semantic distance becomes shorter as the number of common semantic elements increases relative to the total semantic elements of the extracted keyword and the indexed semantic information, and becomes longer as the number of common semantic elements decreases relative to the total semantic elements.
A computer-readable recording medium having recorded thereon a program for executing each step of the method for extracting a semantic distance of mathematical sentences and classifying mathematical sentences by semantic distance according to any one of claims 15 to 25.