WO2015023031A1

WO2015023031A1 - Method for supporting search in specialist fields and apparatus therefor

Info

Publication number: WO2015023031A1
Application number: PCT/KR2013/011920
Authority: WO
Inventors: 이수원; 백종범
Original assignee: 숭실대학교산학협력단
Priority date: 2013-08-14
Filing date: 2013-12-20
Publication date: 2015-02-19
Also published as: KR101515413B1; KR20150019474A

Abstract

A method and an apparatus for supporting a search in specialist fields are disclosed. A method for supporting a search in specialist fields comprises the steps of: (a) collecting query-answer data from web pages; (b) extracting words from the query-answer data by analyzing and classifying the query-answer data into a query part and an answer part; (c) calculating a general term-technical term mapping probability by analyzing the correlation between words respectively included in the query part and the answer part and creating a general term-technical term mapping table; and (d) extracting and providing a technical term including the word included in a query using the term mapping table.

Description

Specialized field search support method and device

The present invention relates to a method and an apparatus for supporting search in specialized fields. Mapping probabilities between general terms and technical terms or statutes are learned from Q&A data or precedent data collected from websites, and are then used to predict and provide technical terms or statutes for a query.

It is very difficult for ordinary people without prior specialist knowledge to search for and use knowledge in specialized fields. For example, it is almost impossible for a person without medical knowledge to search for and understand medical information about his or her physical condition. It is likewise very difficult for a person without legal knowledge to find and use the laws and regulations that apply to his or her difficulties without the help of a lawyer. In the legal field in particular, statutes consist almost entirely of legal terminology that is rarely used in everyday life. For this reason, legal information retrieval is limited when it relies only on queries written by members of the general public who do not know legal terminology.

The BEST-Project, conducted in the Netherlands since 2005, is a representative research effort to support statute or precedent search by the general public. The BEST-Project is developing a case search system in which a member of the general public enters a description of his or her current situation, and the system maps that description using a user ontology of general terms to provide the corresponding case search results. However, constructing and maintaining such general-term and terminology ontologies requires a great deal of time and money. In addition, in Korean it is difficult to draw a distinct line between a general term and a technical term.

An object of the present invention is to provide a method and an apparatus for supporting specialized field search that learn mapping probabilities between general terms and technical terms or statutes from Q&A data or precedent data collected from websites, and then use those probabilities to predict and provide technical terms or statutes for a query.

According to an aspect of the present invention, a method is provided that learns mapping probabilities between general terms and technical terms or statutes using Q&A data or precedent data collected from websites, and then predicts and provides a technical term or a law for a query.

According to one embodiment of the invention, a method is provided comprising: (a) collecting question-answer data from web documents; (b) extracting words by analyzing the question part and the answer part of the question-answer data; (c) generating a general term-terminology mapping table by calculating a general term-terminology mapping probability through analysis of correlations between the words included in the question part and in the answer part; and (d) extracting and providing a technical term corresponding to a word included in a query using the term mapping table.

In the step (c), the general term-terminology mapping probability may be calculated using the frequency of the words that appear simultaneously in the question unit and the answer unit.

The general term-terminology mapping probability is calculated using pointwise mutual information (PMI). The quantities appearing in the formula are: the set of general terms; the set of statutory words; the set of technical terms; a word that is included in the general term set but not in the statutory word set; and a word that is included in the technical term set and also in the statutory word set.

In step (d), mapping probabilities may be calculated using the mapping table, and the n (n being a natural number) technical terms that match the words included in the query may be predicted.

In step (d), the mapping probability is calculated using a naïve Bayesian classifier.

The mapping probability is calculated by a formula whose terms are the mapping probabilities included in the mapping table, the technical term, and the query, denoted X.

According to another embodiment of the present invention, a method is provided comprising: (a) analyzing case data to extract words; (b) generating a word-statute mapping table by calculating mapping probabilities between words and statutes using the extracted words; and (c) predicting a law for a query using the word-law mapping table.

The word-law mapping table includes a reliability for each mapping between a word and a law. The reliability is calculated by a formula whose quantities are: the set of words that appear in the case data; the set of statute names; a word in the case-data word set that is not a statute name; and a statute that appears in the case data and is included in the set of statute names.

According to another aspect of the present invention, an apparatus is provided that learns mapping probabilities between general terms and technical terms or statutes using Q&A data or precedent data collected from websites, and uses them to predict and provide a technical term or a statute for a query.

According to an embodiment of the present invention, an apparatus is provided comprising: a collection unit that collects question-answer data from web documents; an extraction unit that extracts words by analyzing the question unit and the answer unit of the question-answer data; a mapping table generator that generates a general term-terminology mapping table by calculating a general term-terminology mapping probability through analysis of correlations between the words included in the question unit and the answer unit; and a prediction unit that extracts and provides a technical term corresponding to a word included in a query using the term mapping table.

According to another embodiment of the present invention, an apparatus is provided comprising: an extraction unit that extracts words by analyzing case data; a mapping table generator that generates a word-statute mapping table by calculating mapping probabilities between words and laws using the extracted words; and a prediction unit that predicts a law for a query using the word-law mapping table.

With the method and apparatus for supporting specialized field search according to an embodiment of the present invention, mapping probabilities between general terms and technical terms or statutes are learned from Q&A data or precedent data collected from websites, and are then used to predict technical terms or statutes for a query.

As a result, the present invention has the advantage of making search more convenient for users who have relatively little knowledge of technical terms or statutes.

FIG. 1 is a diagram schematically illustrating the configuration of a specialized field search support apparatus according to an embodiment of the present invention.

FIG. 2 illustrates a mapping table according to an embodiment of the present invention.

FIG. 3 is a block diagram schematically illustrating the internal configuration of a specialized field search support apparatus according to another embodiment of the present invention.

FIG. 4 illustrates a word-statute mapping table according to another embodiment of the present invention.

FIG. 5 is a view showing the laws predicted for a query according to an embodiment of the present invention.

FIG. 6 is a flowchart illustrating a method by which a specialized field search support apparatus according to an embodiment of the present invention provides a technical term for a query.

FIG. 7 is a flowchart illustrating a method by which a specialized field search support apparatus according to another embodiment of the present invention provides a statute for a query.

As the invention allows for various changes and numerous embodiments, particular embodiments will be illustrated in the drawings and described in detail in the written description. However, this is not intended to limit the present invention to specific embodiments, it should be understood to include all transformations, equivalents, and substitutes included in the spirit and scope of the present invention. In the following description of the present invention, if it is determined that the detailed description of the related known technology may obscure the gist of the present invention, the detailed description thereof will be omitted.

Terms such as first and second may be used to describe various components, but the components should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another.

The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to indicate the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and do not exclude in advance the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The present invention is intended to provide a statute search service to members of the general public who have relatively little legal knowledge, and does so by mapping the general terms that such users commonly use to technical terms.

In addition, the present invention may generate a word-statute mapping table based on the probability of mapping between words and laws based on the precedent data, and use this to predict and provide a law for a query.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram schematically illustrating the structure of a specialized field search support apparatus according to an embodiment of the present invention, and FIG. 2 is a diagram illustrating a mapping table according to an embodiment of the present invention. With reference to FIG. 1, an apparatus is described that collects Q&A data from a website, divides the Q&A data into a question unit and an answer unit, extracts words from each, and generates a mapping table from them.

Referring to FIG. 1, the specialized field search support apparatus 100 according to an embodiment of the present invention includes a collector 110, an extractor 115, a mapping table generator 120, a predictor 125, a memory 130, and a controller 135.

The collection unit 110 is a means for collecting question-and-answer data through a communication network.

For example, the collection unit 110 may collect Q&A data (e.g., from Naver Knowledge iN) from web documents and store it in a database.

The extraction unit 115 is a means for analyzing the Q & A data to classify the question unit and the answer unit, and extracting words from the question unit and the answer unit, respectively.

For example, the extractor 115 may morphologically analyze the Q&A data and extract each word in morpheme units. Suppose, for instance, that the question in the Q&A data is "I ask about the image copyright of the product." The extraction unit 115 can then extract "product", "image", "copyright", and "question" as words.
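The extraction step can be sketched roughly as follows. This is only an illustration under stated assumptions: the source relies on morphological analysis (of Korean text), whereas the sketch uses a simple regular-expression tokenizer and a hypothetical stop-word list as a stand-in, so it does not normalize "ask" to "question" as the example above does. The names `extract_words`, `split_and_extract`, and `STOP_WORDS` are not from the source.

```python
import re

# Hypothetical stop-word list; a real system would use a morphological analyzer.
STOP_WORDS = {"i", "the", "of", "a", "an", "about", "is", "to", "and", "in"}

def extract_words(text):
    """Lower-case the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def split_and_extract(qa_record):
    """Return (question_words, answer_words) for one Q&A record."""
    return (extract_words(qa_record["question"]),
            extract_words(qa_record["answer"]))

# Toy record (illustrative data only).
record = {
    "question": "I ask about the image copyright of the product.",
    "answer": "Copyright infringement is governed by the Copyright Act.",
}
question_words, answer_words = split_and_extract(record)
print(question_words)
print(answer_words)
```

Keeping the question words and answer words in separate lists matters for the later steps, because the two sides are treated as candidate general terms and candidate technical terms respectively.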

In addition, owing to the nature of Q&A data, the people who answer questions are more likely than the questioners to be experts in the field, and certain websites operate services in which questions are answered by experts in the field, such as doctors, lawyers, and patent attorneys.

Accordingly, in this specification, the words extracted from the question unit of the Q&A data may be referred to as general terms, and the words extracted from the answer unit may be referred to as technical terms.

The mapping table generator 120 is a means for calculating mapping probabilities by analyzing correlations between the words extracted from the Q&A data, and for generating a term mapping table based on those mapping probabilities.

To this end, the mapping table generator 120 may calculate mapping probabilities using the frequencies with which the words extracted from the question unit and the answer unit appear in the question and answer units, respectively.

For example, the mapping table generator 120 may calculate a mapping probability by computing the pointwise mutual information (PMI) between general terms and technical terms.

The mutual information (PMI) may be calculated using Equation 1. The quantities appearing in Equation 1 are: the set of words occurring in the questions; the set of words occurring in the statutes; the set of words occurring in the answers; a word that is included in the question word set but not in the statute word set; and a word that is included in the answer word set and also in the statute word set.
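For orientation, the standard definition of pointwise mutual information can be written with symbols chosen to match the quantities described for Equation 1. The symbols $W_Q$, $W_L$, $W_A$, $w_q$, and $w_a$ are assumptions introduced here for illustration; the exact notation and form of Equation 1 in the original publication may differ.

```latex
\mathrm{PMI}(w_q, w_a) = \log \frac{P(w_q, w_a)}{P(w_q)\,P(w_a)},
\qquad w_q \in W_Q \setminus W_L, \quad w_a \in W_A \cap W_L
```

Here $W_Q$ is the set of words occurring in questions, $W_L$ the set of words occurring in statutes, and $W_A$ the set of words occurring in answers. A positive value indicates that the question word and the answer word co-occur more often than would be expected if they were independent.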

FIG. 2 illustrates a mapping table according to an embodiment of the present invention. Referring to FIG. 2, the mapping table lists, for each word included in a question unit, the corresponding words included in the answer unit, the number of times the two words occur together, and the mapping probability calculated from that count.
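A minimal sketch of how such a mapping table might be built is given below. It assumes the standard PMI definition, estimates probabilities as fractions of Q&A pairs, and uses toy data; the function name `build_mapping_table`, the table layout, and the sample words are illustrative assumptions rather than the implementation described in the source.

```python
import math
from collections import Counter
from itertools import product

def build_mapping_table(qa_pairs, statute_words):
    """qa_pairs: list of (question_words, answer_words) tuples.
    Returns {(general_term, technical_term): {"cooccurrences", "pmi"}}."""
    n_pairs = len(qa_pairs)
    q_count, a_count, pair_count = Counter(), Counter(), Counter()
    for q_words, a_words in qa_pairs:
        generals = set(q_words) - statute_words     # question words not in statutes
        technicals = set(a_words) & statute_words   # answer words also in statutes
        q_count.update(generals)
        a_count.update(technicals)
        pair_count.update(product(generals, technicals))
    table = {}
    for (g, t), c in pair_count.items():
        p_joint = c / n_pairs
        p_g = q_count[g] / n_pairs
        p_t = a_count[t] / n_pairs
        table[(g, t)] = {"cooccurrences": c, "pmi": math.log(p_joint / (p_g * p_t))}
    return table

# Toy data (illustrative only).
statute_words = {"lease", "eviction", "copyright", "infringement"}
qa_pairs = [
    (["landlord", "deposit", "return"], ["lease", "deposit", "refund"]),
    (["landlord", "evict"], ["lease", "eviction"]),
    (["photo", "copy"], ["copyright", "infringement"]),
]
table = build_mapping_table(qa_pairs, statute_words)
print(table[("landlord", "lease")])
```

Each entry stores both the co-occurrence count and the derived score, mirroring the two columns described for FIG. 2.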

The prediction unit 125 is a means for calculating technical-term mapping probabilities for a query input by a user using the mapping table, and for predicting and providing related technical terms. By referring to the mapping table, the prediction unit 125 may extract and provide the top n (n being a natural number) technical terms with the highest mapping probability among the technical terms corresponding to the words included in the query.
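The top-n lookup can be sketched as follows. The aggregation rule is an assumption: when several query words map to the same technical term, this sketch keeps the highest score, which the source does not specify. The function name `top_n_terms` and the scores in the toy table are hypothetical.

```python
def top_n_terms(query_words, mapping_table, n=3):
    """Return the n technical terms with the highest mapping score
    among those mapped from any word in the query."""
    scores = {}
    for (general, technical), score in mapping_table.items():
        if general in query_words:
            # Assumption: keep the best score when several query words
            # map to the same technical term.
            scores[technical] = max(scores.get(technical, float("-inf")), score)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

# Toy mapping table with made-up scores (illustrative only).
mapping_table = {
    ("landlord", "lease"): 0.41,
    ("landlord", "eviction"): 0.20,
    ("deposit", "lease"): 0.35,
    ("photo", "copyright"): 1.10,
}
result = top_n_terms({"landlord", "deposit"}, mapping_table, n=2)
print(result)
```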

For example, the predictor 125 may use a naive Bayesian classifier to calculate the probability that the query input by the user relates to a given law, which will be described in more detail with reference to FIG. 3 below.

The memory 130 is a means for storing various algorithms, mapping tables, and the like necessary for operating the specialty field search support apparatus 100 according to an embodiment of the present invention.

The controller 135 is a means for controlling the internal components of the specialized field search support apparatus 100 according to an embodiment of the present invention (e.g., the collector 110, the extractor 115, the mapping table generator 120, the predictor 125, and the memory 130).

FIG. 1 has been used to describe an embodiment in which the mapping probability between general terms and technical terms is calculated and a term mapping table is generated. With reference to FIG. 3, an apparatus according to another embodiment of the present invention is described that predicts and provides a law for a query based on the mapping probability between the query and the laws.

FIG. 3 is a block diagram schematically illustrating the internal configuration of a specialized field search support apparatus according to another embodiment of the present invention, FIG. 4 is a diagram illustrating a word-law mapping table according to another embodiment of the present invention, and FIG. 5 is a diagram illustrating the laws predicted for a query according to an embodiment of the present invention.

Referring to FIG. 3, the specialized field search support apparatus 300 according to another embodiment of the present invention includes a collector 310, an extractor 315, a mapping table generator 320, a predictor 325, a memory 330, and a controller 335.

The collection unit 310 is a means for collecting case data.

For example, the collector 310 may collect case data from a specific website that provides case data.

The extraction unit 315 is a means for analyzing the case data and extracting words from it.

For example, the extractor 315 may extract the words by analyzing the case data in morpheme units. Since the method itself for extracting words in morpheme units from a specific sentence is already known to those skilled in the art, a separate description thereof will be omitted.

The mapping table generator 320 is a means for generating a mapping table by calculating mapping probabilities between the words extracted from the precedent data and the laws. Hereinafter, to distinguish it from the mapping table described with reference to FIG. 1, this table is referred to as the word-law mapping table.

For example, the mapping table generator 320 may use the reliability as a measure for calculating the mapping probability between the words extracted from the precedent data and the law.

The mapping table generator 320 may calculate the mapping probability between a word and a law using Equation 2. The quantities appearing in Equation 2 include the set of words occurring in the case data and the set of statute names.

FIG. 4 illustrates a word-law mapping table. As shown in FIG. 4, the word-law mapping table includes each word, a law mapped to the word, and the reliability of that mapping. That is, the higher the reliability value, the more trustworthy the law mapped to the word is for that word.
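One plausible reading of the reliability measure is the confidence used in association-rule mining, that is, the fraction of cases containing a word that also cite a given law. This interpretation is an assumption, since Equation 2 is not reproduced here. The sketch below follows that reading with toy data; `build_word_law_table` and the placeholder names "Law A" and "Law B" are hypothetical.

```python
from collections import Counter

def build_word_law_table(cases):
    """cases: list of (words, cited_laws) tuples.
    Returns {(word, law): reliability}, where reliability is assumed to be
    count(cases with word and law) / count(cases with word)."""
    word_count, pair_count = Counter(), Counter()
    for words, laws in cases:
        law_set = set(laws)
        plain_words = set(words) - law_set   # words that are not statute names
        word_count.update(plain_words)
        for w in plain_words:
            for law in law_set:
                pair_count[(w, law)] += 1
    return {(w, law): c / word_count[w] for (w, law), c in pair_count.items()}

# Toy case data (illustrative only).
cases = [
    (["deposit", "lease", "tenant"], ["Law A"]),
    (["deposit", "loan"], ["Law B"]),
    (["tenant", "eviction"], ["Law A", "Law B"]),
]
word_law_table = build_word_law_table(cases)
print(word_law_table[("tenant", "Law A")])
print(word_law_table[("deposit", "Law A")])
```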

The prediction unit 325 is a means for predicting a legal mapping probability for the user's input query using the keyword-statute table.

For example, the prediction unit 325 may calculate a probability of what the law implies by the user's input query using the naive Bayesian classifier.

In general, a single query is often associated with several laws. The prediction unit 325 may therefore calculate the query-law mapping probability using Equation 3, which is the naive Bayesian classifier with the MAX function removed, so that a probability is obtained for each candidate law rather than only for the single most probable one.

In Equation 3, the word-law term denotes the mapping probability given by Equation 2. To prevent the query-law mapping probability from becoming excessively small, the prediction unit 325 may refer to the keyword-law table and use only the top n keyword-law mappings that have the highest probability of being mapped to the law C_i when calculating the mapping probability. This is expressed as Equation 4.
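The scoring described here can be sketched as below, under stated assumptions: each law is scored by the product of its word-law probabilities for the query words, limited to the n largest factors, and no prior term is included because the source text does not say whether one is used. The function name `law_scores` and the toy table are hypothetical.

```python
def law_scores(query_words, word_law_table, n=2):
    """Score every candidate law instead of returning only the best one
    (naive Bayes with the max step removed). Only the n largest word-law
    probabilities per law are multiplied, to avoid vanishing products."""
    per_law = {}
    for (word, law), p in word_law_table.items():
        if word in query_words:
            per_law.setdefault(law, []).append(p)
    scores = {}
    for law, probs in per_law.items():
        score = 1.0
        for p in sorted(probs, reverse=True)[:n]:
            score *= p
        scores[law] = score
    # Highest mapping probability first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy word-law table (illustrative only).
word_law_table = {
    ("tenant", "Law A"): 1.0,
    ("deposit", "Law A"): 0.5,
    ("eviction", "Law A"): 0.2,
    ("deposit", "Law B"): 0.5,
    ("tenant", "Law B"): 0.5,
}
ranking = law_scores({"tenant", "deposit", "eviction"}, word_law_table, n=2)
print(ranking)
```

Without the top-n limit, "Law A" would score 1.0 × 0.5 × 0.2 = 0.1 in this toy table, which illustrates how multiplying many small factors drives the product down.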

FIG. 5 shows the laws and the mapping probabilities predicted corresponding to the query statements.

As illustrated in FIG. 5, the prediction unit 325 calculates the mapping probability of each law for a query using the word-law mapping table, and then predicts and provides the laws for the query in descending order of the calculated mapping probability.

The memory 330 is a means for storing various algorithms, mapping tables, etc. required for operating the specialty field search support apparatus 300 according to an embodiment of the present invention.

The controller 335 is a means for controlling the internal components of the specialized field search support apparatus 300 according to an embodiment of the present invention (e.g., the collector 310, the extractor 315, the mapping table generator 320, the predictor 325, and the memory 330).

FIG. 6 is a flowchart illustrating a method of providing a specialized term for a query by the apparatus for supporting a specialized field search according to an embodiment of the present invention.

In operation 610, the specialized field search support apparatus collects Q&A data from a website and stores it in a database.
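The storage half of this operation could look like the following sketch. The source does not name a database or schema, so the use of SQLite, the table name `qa`, and the function `store_qa_pairs` are assumptions; the collection step itself (fetching pages from a website) is omitted, and the stored pair is toy data.

```python
import sqlite3

def store_qa_pairs(conn, pairs):
    """Create the (hypothetical) qa table if needed and insert the pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qa ("
        "id INTEGER PRIMARY KEY, question TEXT NOT NULL, answer TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO qa (question, answer) VALUES (?, ?)", pairs)
    conn.commit()

# In-memory database and toy data for illustration only.
conn = sqlite3.connect(":memory:")
store_qa_pairs(conn, [
    ("Can my landlord keep my deposit?",
     "Under the lease, the deposit is generally refundable."),
])
rows = conn.execute("SELECT question, answer FROM qa").fetchall()
print(rows)
```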

In operation 615, the specialized field search support apparatus divides the collected Q & A data into a question unit and an answer unit, and analyzes each to extract a word.

For example, as already described above, the specialized field search support apparatus may extract the words in the morpheme units by analyzing the question unit and the answer unit in the morpheme units. Since this is already apparent to those skilled in the art, a separate description thereof will be omitted.

In operation 620, the specialized field search support apparatus generates a mapping table through correlation analysis between words extracted from the question unit and the answer unit.

As described above, a mapping table may be generated by calculating a mapping probability through correlation analysis between words extracted from the question unit and the answer unit (words extracted from the question unit and words extracted from the answer unit).

In operation 625, the specialized field search support apparatus predicts and provides a term corresponding to a query using a mapping table.

For example, the specialized field search support apparatus may calculate a mapping probability of the input query based on the naive Bayesian classifier with reference to the mapping table, and then predict and provide the top n terminologies having a high mapping probability.

FIG. 7 is a flowchart illustrating a method of providing a law for a query by a specialized field search support apparatus according to another exemplary embodiment of the present invention.

In operation 710, the specialization search support device collects case data. For example, the specialty search support device may collect and store case data from a web site.

In operation 715, the specialized field search support apparatus extracts words by analyzing the case data. For example, the specialized field search support apparatus may extract words in morpheme units by morphologically analyzing the case data. Since this is already apparent to those skilled in the art, a separate description thereof will be omitted.

In operation 720, the specialized field search support apparatus generates a word-statute mapping table by calculating a mapping probability between words extracted from case data and the laws.

Since this is the same as described with reference to FIG. 3, redundant descriptions will be omitted.

In operation 725, the specialized field search support apparatus predicts a law for a query using a word-law mapping table.

That is, the specialized field search support apparatus may calculate the law mapping probability for a query by referring to the word-law mapping table, and predict a law for the query based on that probability. To this end, the specialized field search support apparatus may use a naive Bayesian classifier, which is a classification technique. This is the same as described using Equation 4 with reference to FIG. 3.

Meanwhile, the specialized field search support method according to an embodiment of the present invention may be implemented in the form of program instructions that can be executed by various electronic means for processing information, and may be recorded in a storage medium. The storage medium may include program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded in the storage medium may be those specially designed and constructed for the present invention, or may be known and available to those skilled in the software art. Examples of storage media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code generated by a compiler, but also high-level language code that can be executed, using an interpreter or the like, by a device that processes information electronically, for example a computer.

The hardware device described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art to which the present invention pertains will appreciate that modifications and variations can be made without departing from the spirit and scope of the present invention as set forth in the claims below.

Claims

(a) collecting question-answer data from web documents;
(b) extracting words by analyzing the question part and the answer part of the question-answer data;
(c) generating a general term-terminology mapping table by calculating a general term-terminology mapping probability through analysis of correlations between the words included in the question part and in the answer part; and
(d) extracting and providing a technical term corresponding to a word included in a query by using the term mapping table.
The method of claim 1, wherein step (c) comprises:
calculating the general term-terminology mapping probability using the frequency of the words that appear simultaneously in the question part and the answer part.
The method of claim 1,
wherein the general term-terminology mapping probability is calculated using pointwise mutual information (PMI),
and the quantities in the formula are: the set of general terms; the set of statutory words; the set of technical terms; a word that is included in the general term set but not in the statutory word set; and a word that is included in the technical term set and also in the statutory word set.
The method of claim 1,
wherein, in step (d),
mapping probabilities are calculated using the mapping table and the n (n being a natural number) technical terms matching the words included in the query are predicted.
The method of claim 4,
wherein, in step (d),
the mapping probability is calculated using a naive Bayesian classifier.
The method of claim 5,
wherein the mapping probability is calculated by a formula whose terms are the mapping probabilities included in the mapping table, the technical term, and the query, denoted X.
(a) extracting words by analyzing case data;
(b) generating a word-statute mapping table by calculating mapping probabilities between words and statutes using the extracted words; and
(c) predicting a law for a query using the word-law mapping table.
The method of claim 8,
wherein the word-law mapping table includes a reliability for each mapping between a word and a law, and the reliability is calculated by a formula whose quantities are: the set of words that appear in the case data; the set of statute names; a word in the case-data word set that is not a statute name; and a statute that appears in the case data and is included in the set of statute names.
A collection unit that collects question-answer data from web documents;
an extraction unit that extracts words by analyzing the question unit and the answer unit of the question-answer data;
a mapping table generator that generates a general term-terminology mapping table by calculating a general term-terminology mapping probability through analysis of correlations between the words included in the question unit and the answer unit; and
a prediction unit that extracts and provides a technical term corresponding to a word included in a query using the term mapping table.
An extraction unit that extracts words by analyzing case data;
a mapping table generator that generates a word-statute mapping table by calculating mapping probabilities between words and laws using the extracted words; and
a prediction unit that predicts a law for a query using the word-law mapping table.