CN102214189B - Data mining-based word usage knowledge acquisition system and method

Info

Publication number: CN102214189B
Application number: CN 201010147993
Authority: CN
Inventors: 方高林
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2010-04-09
Filing date: 2010-04-09
Publication date: 2013-04-24
Anticipated expiration: 2030-04-09
Also published as: CN102214189A

Abstract

The invention provides a data mining-based word usage knowledge acquisition system and method. The system comprises an input device, a search analysis device, a multi-input mode processing device, a webpage analysis device, a usage knowledge extraction device and an output device, wherein the input device is used for inputting a word or a phrase to be searched; the search analysis device analyzes a keyword in the word or phrase to be searched, and sends the word or phrase to be searched to the corresponding input mode processing device according to the analysis result; the multi-input mode processing device analyzes and expands the word or the phrase to be searched by utilizing semantic knowledge and dictionaries to form a search item, and searches the webpage information according to the search item so as to acquire a webpage related to the word or the phrase to be searched; the webpage analysis device analyzes the searched webpage, and converts the webpage into a candidate text; the usage knowledge extraction device processes the candidate text, and extracts context information and typical sentences of the word or the phrase to be searched; and the output device outputs the context information and the typical sentences. By adopting the system and the method, the word usage knowledge can be acquired accurately.

Description

System and method for acquiring word usage knowledge based on data mining

[ technical field ]

The invention relates to the technical field of computer information processing, in particular to a system and a method for acquiring word usage knowledge based on data mining.

[ background of the invention ]

When people read, write and translate foreign languages, they often encounter words and phrases that are not included in a dictionary, and the same word or phrase is translated differently in different contexts, so how to write idiomatic words and phrases is a problem for everyone who uses a foreign language. For Chinese students, the problem of how to write idiomatic sentences is more prominent because of the differences between Chinese and English culture and language style and the lack of knowledge about English collocations (such as adjective-noun, verb-noun and verb-preposition collocations).

The development of the internet provides unprecedentedly rich resources, including electronic documents, online periodicals, magazines, newspapers, scientific and technical literature and the like, and with the rapid development of networks and information technologies, network resources become richer and richer. Usage knowledge of a word or phrase can usually be found by web search. However, the results of a general search engine alone rarely provide the knowledge that is needed, because the search results only list the web pages that contain the words and do not consider whether the words or phrases are linguistically related. In addition, the large amount of redundant information in the search results makes it difficult for the user to find instances of correct word usage. Therefore, mining useful knowledge from a large number of resources has become an important issue for network applications. A Web-based word usage system obtains collocation information and example sentences of words from the Internet so as to help users write idiomatic foreign language text correctly.

[ summary of the invention ]

Based on this, there is a need for a system for obtaining word usage knowledge based on data mining that can more accurately obtain word usage knowledge.

A system for obtaining word usage knowledge based on data mining, the system comprising: the input device is used for inputting words or phrases to be searched; the query analysis device is used for analyzing the keywords in the words or phrases to be searched and sending the words or phrases to be searched into the corresponding input mode processing device for processing according to the analysis result; the multi-input mode processing device analyzes and expands the words or phrases to be searched by utilizing semantic knowledge and a dictionary to form query terms, and searches webpage information according to the query terms to obtain webpages related to the words or phrases to be searched; the webpage analysis device is used for analyzing the searched webpage and converting the webpage into a candidate text; the usage knowledge extraction device is used for processing the candidate text and extracting the context information and the typical example sentence of the word or phrase to be searched; and the output device outputs the context information and the typical example sentences.

Wherein the multiple input mode processing apparatus includes the following multiple input mode units: the system comprises a comparison mode unit, a category mode unit, a target language collocation mode unit and a single word mode unit, and also comprises a search engine retrieval module for retrieving a webpage;

the comparison mode unit uses logical words to combine words or phrases into query terms, the category mode unit analyzes and expands the input core word and category information to form query terms, the target language collocation mode unit translates and expands the input collocation words to form query terms, the single word mode unit forms query terms according to the input single word, and the search engine retrieval module retrieves webpage information according to the query terms to acquire webpages related to the input words or phrases.

The webpage analysis device can further analyze the searched webpage information, remove repeated webpages, analyze each webpage into a document model tree form, remove non-text labels in the webpages in the document model tree, and reserve useful labels, so that the webpages are converted into candidate texts in a text form.

The usage knowledge extraction device comprises a context information extraction unit, which processes the candidate text into single sentences through boundary identification, obtains candidate words in each single sentence through keyword search, counts each candidate text by using a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputs a candidate list of context information according to the occurrence frequency of the candidate words.

Further, the context extraction unit further ranks the candidate words according to the occurrence frequency of the candidate words, selects a preset number of candidate words according to the ranking, and removes functional words and non-semantic words according to a stop word list to obtain a candidate list containing context information of the selected candidate words.

Wherein, the usage knowledge extraction device further comprises a typical example sentence extraction unit, and the typical example sentence extraction unit comprises: the candidate example sentence extraction module is used for extracting sentences containing the context information in the webpage candidate texts as candidate example sentences; the clustering module is used for clustering the candidate example sentences by utilizing a sentence clustering method based on characteristics; and the typical example sentence extraction module selects a sentence which is taken as a clustering center from the clustered sentences as a typical example sentence.

In addition, it is necessary to provide a method for acquiring word usage knowledge based on data mining, which can acquire word usage knowledge more accurately.

A method for acquiring word usage knowledge based on data mining comprises the following steps: A. receiving a word or phrase to be searched input by a user; B. analyzing the keywords in the words or phrases to be searched, and sending the words or phrases to be searched into a corresponding input mode for processing according to the analysis result; C. analyzing and expanding the words or phrases to be searched by utilizing semantic knowledge and a dictionary to form query terms, and searching webpage information according to the query terms to obtain webpages related to the input words or phrases; D. analyzing the searched web page, and converting the web page into a candidate text; E. processing the candidate text, and extracting context information and typical example sentences of the words or phrases; F. and outputting the context information and the typical example sentence.

Wherein the input modes include one or more of the following modes: a comparison mode, a category mode, a target language collocation mode and a single word mode.

When the input mode is the comparison mode, step C may specifically be: combining the words or phrases into a query term by using logical words, retrieving webpage information according to the query term, and acquiring the webpages related to the input words or phrases.

When the input mode is a category mode, the step C may specifically be: analyzing and expanding the input central word and category information according to semantic knowledge to form a query term, retrieving webpage information according to the query term, and acquiring a webpage related to the input word or phrase.

When the input mode is the target language collocation mode, the step C may specifically be: analyzing and expanding the input collocation words according to the dictionary to form query terms, retrieving webpage information according to the query terms, and acquiring webpages related to the input words or phrases.

When the input mode is the single word mode, step C may specifically be: forming a query term according to the input single word, retrieving webpage information according to the query term, and acquiring the webpages related to the input word or phrase.

And the step D may specifically be: analyzing the searched webpage information, removing repeated webpages, and analyzing each webpage into a form of a document model tree; and in the document model tree, removing non-text labels in the webpage, reserving useful labels, and converting the webpage into candidate texts in a text form.

Wherein, step E includes: processing the candidate text into a single sentence through boundary identification, obtaining candidate words in the single sentence through keyword search, counting each candidate text by utilizing a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputting a candidate list of context information according to the occurrence frequency of the candidate words.

Step E may further comprise: sorting the candidate words according to their occurrence frequency, selecting a preset number of candidate words according to the sorting, and removing functional words and non-content words according to a stop word list to obtain a candidate list containing the context information of the selected candidate words.

Wherein, step E may further comprise: extracting sentences containing the context information from the single sentence as candidate example sentences; clustering the candidate example sentences by using a sentence clustering method based on characteristics; and selecting a sentence as a clustering center from the clustered sentences as a typical example sentence.

According to the system and the method for acquiring word usage knowledge based on data mining, the keywords of the words or phrases to be searched are analyzed and sent to the corresponding input mode processing device for processing, so that, compared with searching by a single word only, the information matching the words or phrases to be searched can be acquired more accurately; the retrieved web pages are converted into candidate texts, and the context information and the typical example sentences of the words or phrases to be searched are extracted after the candidate texts are processed. The extracted context information and typical example sentences effectively reflect the usage of the words, make it convenient to obtain word usage knowledge, and improve the user experience.

In addition, multiple input modes such as a comparison mode, a category mode, a target language collocation mode and the like can effectively limit retrieval conditions, so that more accurate word collocation knowledge can be mined under the condition of counting the same number of webpages; candidate example sentences are clustered through a sentence clustering method based on characteristics, and retrieved redundant example sentences are analyzed and clustered, so that the extracted typical example sentences are representative and can better meet the requirements of users.

[ description of the drawings ]

FIG. 1 is a block diagram that illustrates a system for obtaining word usage knowledge based on data mining, according to one embodiment;

FIG. 2 is a schematic diagram of a multi-input mode processing apparatus according to an embodiment;

FIG. 3 is a schematic diagram of the structure of a usage knowledge extraction apparatus in one embodiment;

FIG. 4 is a diagram illustrating an exemplary sentence extraction unit in accordance with one embodiment;

FIG. 5 is a flow diagram of a method for obtaining word usage knowledge based on data mining, under an embodiment;

FIG. 6 is a flow diagram of a method for processing multiple input modes in one embodiment;

FIG. 7 is a flow diagram of a method for extracting a representative example sentence in one embodiment;

FIG. 8 is a flow diagram of a clustering method based on key features in one embodiment.

[ detailed description ]

Fig. 1 shows a system for acquiring word usage knowledge based on data mining in one embodiment, which includes an input device 10, a query analysis device 20, a multiple input pattern processing device 30, a web page analysis device 40, a usage knowledge extraction device 50, and an output device 60. Wherein:

the input device 10 is used for inputting a word or phrase to be searched. In one embodiment, the word or phrase to be searched can be entered in several modes. For example, to search for usage knowledge of the word "solve", the search can be performed in a single word mode (e.g., "solve"), a target language collocation mode (e.g., "solve" followed by the Chinese word for "problem"), a category mode (e.g., "<solve> difficulty, thing" or "<solve> n."), a comparison mode (e.g., "solve problem/issue"), and the like.

The query analysis device 20 is configured to analyze the keywords in the word or phrase to be searched, and send the word or phrase to be searched into the corresponding input mode processing device for processing according to the analysis result. For the multiple input modes, the words or phrases input through different input modes are processed by corresponding different input mode processing devices, the query analysis device 20 analyzes the keywords in the input words or phrases, and when only a single word in the words or phrases is analyzed, the words or phrases are sent to a single word mode unit for processing; when the word or phrase contains the character "< >", the word or phrase is sent to a category mode unit for processing; when the words or phrases contain Chinese, the words or phrases are sent to a target language collocation mode unit for processing; when the word or phrase contains the character "/", it is sent to the comparison mode unit for processing.
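
The dispatch rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the mode labels, the function name `detect_input_mode`, the order in which the cues are tested, and the use of the CJK Unified Ideographs range as the test for "contains Chinese" are all assumptions made for the example.

```python
import re

# Mode labels are illustrative names, not identifiers from the patent.
SINGLE = "single_word"
CATEGORY = "category"
TARGET_COLLOCATION = "target_language_collocation"
COMPARISON = "comparison"

# Assumption: a character in the CJK Unified Ideographs block counts as Chinese.
_CHINESE = re.compile(r"[\u4e00-\u9fff]")


def detect_input_mode(query: str) -> str:
    """Choose an input mode from surface cues in the query string."""
    text = query.strip()
    if "<" in text and ">" in text:      # e.g. "<solve> n."
        return CATEGORY
    if _CHINESE.search(text):            # English keyword plus a Chinese collocate
        return TARGET_COLLOCATION
    if "/" in text:                      # e.g. "lay/make foundation"
        return COMPARISON
    if len(text.split()) == 1:           # a single word such as "solve"
        return SINGLE
    # The text does not say what happens when no cue matches; raising is an assumption.
    raise ValueError(f"no input mode matches {query!r}")
```

For instance, `detect_input_mode("lay/make foundation")` selects the comparison mode, while a bare `"solve"` selects the single word mode.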

The multi-input mode processing device 30 analyzes and expands the word or phrase to be searched by using semantic knowledge and a dictionary to form a query term, and searches the web page information according to the query term to obtain the web pages related to the word or phrase to be searched. In one embodiment, as shown in fig. 2, the multi-input mode processing device 30 includes the following input mode units: a comparison mode unit 301, a category mode unit 302, a target language collocation mode unit 303, and a single word mode unit 304, and further includes a search engine retrieval module 305 for retrieving web pages. The processing in each of these input modes is described below:

In the comparison mode, for example, when the user inputs "lay/make foundation", the comparison mode unit 301 needs to determine which of the phrases "lay foundation" and "make foundation" is more common (i.e., the more idiomatic usage). The comparison mode unit 301 first combines the words or phrases (i.e., the candidate words in the input) into a query term by using logical words, thereby forming a new query, and the search engine retrieval module 305 then searches for related web pages. For example, for "lay/make foundation", the new query term composed with logical words (OR, AND, etc.) is "(lay OR make) AND foundation". The query term is sent to the search engine retrieval module 305, which searches for and downloads the web pages that match the query term. In addition, the occurrence frequencies of the candidate words "lay", "make" and "foundation" can be counted, and the web pages can be sorted according to these frequencies. Since many web pages may be retrieved, a limit on the number of downloaded web pages may be preset; e.g., only the top 300 web pages in the sorted order are downloaded. The comparison mode obtains the statistics of several collocations with a single query and is particularly suitable when many combinations arise after semantic expansion. It can also discover new collocation information: for example, when searching for "solve issue/question", a word that often appears together with "issue" is counted as well, even though it was not part of the query. Because the retrieved web pages are ranked according to the frequency of the candidate words, the preset number of web pages that is selected is more representative.
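
As a rough sketch of the comparison mode, the snippet below builds the logical query from an input such as "lay/make foundation" and ranks already-downloaded page texts by how often the candidate words occur. The function names and the simple word-count score are assumptions for illustration; the actual query syntax depends on the search engine used.

```python
import re


def build_comparison_query(user_input: str) -> str:
    """Turn 'lay/make foundation' into '(lay OR make) AND foundation'."""
    parts = []
    for token in user_input.split():
        alternatives = [alt for alt in token.split("/") if alt]
        if len(alternatives) > 1:
            # Slash-separated alternatives become one OR group.
            parts.append("(" + " OR ".join(alternatives) + ")")
        elif alternatives:
            parts.append(alternatives[0])
    return " AND ".join(parts)


def rank_pages(pages, candidates, limit=300):
    """Sort page texts by total candidate-word frequency and keep the top `limit`."""
    def score(text):
        words = re.findall(r"[a-z]+", text.lower())
        return sum(words.count(candidate) for candidate in candidates)

    return sorted(pages, key=score, reverse=True)[:limit]
```

The default `limit=300` mirrors the example download limit mentioned above.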

In the category mode, the category mode unit 302 analyzes and expands the input core word and category information to form a query term. The category mode has two forms: one enters a core word and a part of speech, e.g., "<solve> n."; the other enters a core word and synonyms, e.g., "<solve> difficulty, thing". The part of speech and the synonyms indicate the category information of the candidate words that collocate with the core word. Because the candidate words are constrained by the category information, the candidate words that collocate with the core word can be obtained more accurately. Collocations are generally divided into two types: grammatical collocations and lexical collocations. A grammatical collocation is a combination of a core word (a noun, adjective or verb) with a preposition or another grammatical structure, and includes noun-infinitive, noun-clause, adjective-preposition, verb-infinitive and similar patterns. Lexical collocations typically include verb-noun, adjective-noun, verb-adverb, noun-preposition and verb-preposition patterns. The words in a collocation can generally be assigned to 5 parts of speech: adjectives, verbs, nouns, adverbs and prepositions; these 5 parts of speech may be used as category restrictions.

To describe the category information more precisely, synonyms can be used as a context vector to constrain the search and narrow the search results. Because the synonyms must be provided by the user, and the amount of information a user can provide is small, the synonyms can be expanded automatically by using the hypernym information in the WordNet semantic dictionary. WordNet is a lexical database that organizes words into a network of synonym sets, in which each link represents a relationship between sets, for example hypernymy, hyponymy, synonymy or part-whole relations. Based on the principle that words with similar meanings or belonging to the same class tend to occur together, words in a hypernym relation in WordNet are used to expand the query term so as to cover the possible meanings. For example, for "<solve> thing question", "thing question" serves as the context vector, and for expansion the hypernym "difficulty" of "question" is also added to the context vector, forming a new query term. The keyword "solve" and the context vector, which is composed of a group of related words and reflects detailed category information, are then sent to the search engine retrieval module 305 to retrieve related web pages.

In the target language collocation mode, the target language collocation mode unit 303 translates and expands the input collocation words to form a new query term, and the search engine retrieval module 305 retrieves web page information according to the query term to obtain the web pages related to the input word or phrase. In one embodiment, "solve" followed by the Chinese word for "problem" is input to search for usage knowledge of "solve", and the target language collocation mode unit 303 uses the Chinese collocation information as a restriction to obtain the relevant web pages. In this mode, the Chinese part is first translated according to a Chinese-English knowledge base. Because the translation options provided by a general Chinese-English dictionary are limited and cannot cover the semantic range of the Chinese word, synonym expansion is used to solve the problem. Therefore, after the Chinese part is translated, its synonyms are added to form as complete a feature word vector as possible. For example, after this input is translated and expanded with synonyms, the new query term is "solve AND (issue OR matter OR question)". The web pages retrieved by the search engine retrieval module 305 based on this query term are limited to the category of "problem". In addition, the query term can be further expanded with the WordNet semantic dictionary by adding words in a hypernym relation in WordNet. After the above query term is further expanded, the new query is "solve AND (issue OR matter OR question OR difficulty)", where "difficulty" is the hypernym of "issue".
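
The translate-then-expand step might be sketched as below. The three small dictionaries are toy stand-ins for a Chinese-English dictionary, a thesaurus and WordNet hypernym links; every entry, as well as the function names, is hypothetical and chosen only to make the example run.

```python
# Toy resources. All entries are illustrative, not taken from a real dictionary.
TRANSLATIONS = {"问题": ["problem"]}
SYNONYMS = {"problem": ["issue", "question"]}
HYPERNYMS = {"problem": ["difficulty"]}


def expand_collocate(chinese_word):
    """Translate the Chinese collocate, then add synonyms and hypernyms."""
    terms = []
    for translation in TRANSLATIONS.get(chinese_word, []):
        expanded = [translation]
        expanded += SYNONYMS.get(translation, [])
        expanded += HYPERNYMS.get(translation, [])
        for term in expanded:
            if term not in terms:        # keep first occurrence, preserve order
                terms.append(term)
    return terms


def build_collocation_query(keyword, chinese_word):
    """Combine the English keyword with the expanded collocate terms."""
    terms = expand_collocate(chinese_word)
    if not terms:
        return keyword                   # nothing to expand: fall back to the keyword
    return f"{keyword} AND ({' OR '.join(terms)})"
```

A real system would replace the toy dictionaries with lookups in a bilingual dictionary and a WordNet interface.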

In a single word mode, such as entering the single word "solve," the single word mode unit 304 forms a query term from the single word, and the search engine retrieval module 305 retrieves a web page containing the single word.

The web page analyzing device 40 is configured to analyze the searched web pages and convert the web pages into candidate texts. In one embodiment, the web page analyzing device 40 further analyzes the searched web pages to remove duplicate web pages, and analyzes each web page into a document model tree in which non-text labels in the web page are removed and useful labels (such as boundary symbols) are retained, thereby converting the web page into text candidates. The candidate text is used in a subsequent usage knowledge extraction process.
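
A simplified sketch of this conversion is shown below, using only the Python standard library. It is an assumption-laden stand-in: instead of building a full document model tree it streams over the tags with `html.parser`, it treats `script`, `style` and `noscript` as the non-text labels, it keeps block-level tags only as line boundaries, and it detects duplicate pages by hashing the extracted text.

```python
import hashlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    SKIPPED = {"script", "style", "noscript"}            # assumed non-text tags
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}  # kept as boundaries

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_to_text(html):
    """Strip markup and return one text line per block element."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).splitlines())
    return "\n".join(line for line in lines if line)


def pages_to_candidate_texts(pages):
    """Convert pages to text and drop pages whose text is an exact duplicate."""
    seen, texts = set(), []
    for html in pages:
        text = page_to_text(html)
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if text and digest not in seen:
            seen.add(digest)
            texts.append(text)
    return texts
```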

The usage knowledge extracting device 50 is used for processing the candidate text and extracting the context information and the typical example sentence of the word or phrase to be searched. In one embodiment, as shown in fig. 3, the usage knowledge extraction apparatus 50 includes a context information extraction unit 501 and a typical example sentence extraction unit 502, where:

the context information extraction unit 501 processes the candidate text into single sentences through boundary identification, obtains candidate words in each single sentence through keyword search, counts each candidate text with a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputs a candidate list of context information according to the occurrence frequency of the candidate words. In one embodiment, the candidate words searched for in a single sentence by the context information extraction unit 501 are the input word and the words that collocate with the input word or phrase, and after the occurrence frequency of the candidate words is counted by the statistical algorithm, the candidate words can be sorted by frequency. Only co-occurrence within one grammatical sentence is counted: a sentence contributes to the statistics only if the candidate words appear in it together, and candidate words that do not appear in the same sentence are ignored, which makes the statistical result more representative. After all the single sentences have been counted, the candidate words are ranked by their counted frequency, and a preset number of candidate words is selected according to the ranking; for example, the first 5 candidate words are kept and the low-frequency candidate words are discarded. Functional words (such as "a", "an", "the" and the like) and some non-content words are then removed according to the stop word list, giving a candidate list that contains the context information of the selected candidate words.

The candidate list can be divided according to the position of the candidate words before or after the searched word or phrase, and finally the preceding-context information (all possible words in front of the searched word or phrase) and the following-context information (all possible words behind the searched word or phrase) are output.
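
The counting procedure can be illustrated with the short sketch below. The stop word list, the naive regular-expression sentence splitter and the function names are assumptions; the order of operations (rank, keep the top candidates, then remove stop words, then divide by position) follows the description above.

```python
import re
from collections import Counter

# Illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "we", "for"}


def split_sentences(text):
    """Naive boundary identification: split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_candidates(texts, keyword, top_n=5):
    """Return (preceding, following) context words for `keyword`."""
    before, after = Counter(), Counter()
    for text in texts:
        for sentence in split_sentences(text):
            words = re.findall(r"[a-z']+", sentence.lower())
            if keyword not in words:
                continue                      # count co-occurrence within one sentence only
            position = words.index(keyword)
            for i, word in enumerate(words):
                if word == keyword:
                    continue
                (before if i < position else after)[word] += 1
    total = before + after
    # Rank by frequency, keep the top candidates, then drop stop words.
    selected = [w for w, _ in total.most_common(top_n) if w not in STOP_WORDS]
    preceding = [w for w in selected if before[w]]
    following = [w for w in selected if after[w]]
    return preceding, following
```

With the two sentences "We solve problems quickly." and "They solve problems daily.", the word "problems" heads the list of following-context words for "solve".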

The typical example sentence extraction unit 502 is used to extract typical example sentences. As shown in fig. 4, in one embodiment, the typical example sentence extraction unit 502 includes a candidate example sentence extraction module 5021, a clustering module 5022, and a typical example sentence extraction module 5023. Wherein: the candidate example sentence extraction module 5021 extracts sentences containing the context information from the webpage candidate texts as candidate example sentences; the clustering module 5022 clusters the candidate example sentences by using a feature-based sentence clustering method; and the typical example sentence extraction module 5023 selects the sentences that are cluster centers from the clustered sentences as the typical example sentences.

In one embodiment, the candidate example sentence extraction module 5021 parses the web page candidate text into single sentences. Specifically, a document may be divided into individual sentences according to sentence-ending punctuation marks (e.g., "."). To distinguish a sentence-ending period from the period that follows an abbreviation, a list of abbreviations may be constructed and rules may be specified to determine whether a given period ends a sentence. In addition, the length of a separated sentence can be limited; for example, only sentences containing more than 5 words and fewer than 30 words are used as candidate example sentences.
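
A minimal sketch of abbreviation-aware sentence splitting with the length filter follows. The abbreviation list and the single rule (a token found in the list never ends a sentence) are assumptions; a production splitter would need more rules.

```python
# Illustrative abbreviation list; entries are lowercase and include the period.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc.", "vs."}


def split_into_sentences(text):
    """Split on '.', '?' or '!' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "?", "!"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:                               # trailing text without end punctuation
        sentences.append(" ".join(current))
    return sentences


def candidate_sentences(text, min_words=5, max_words=30):
    """Keep sentences with more than `min_words` and fewer than `max_words` words."""
    return [s for s in split_into_sentences(text)
            if min_words < len(s.split()) < max_words]
```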

In one embodiment, the clustering module 5022 clusters the candidate example sentences by using a sentence clustering method based on features as follows:

(1) Initialization

All the candidate example sentences obtained above are taken as data segment samples, and the matching distance d(O_i, O_j) between every two data segment samples is calculated by a feature-distance-based method, forming a distance matrix. Whenever a distance is needed later, it can be obtained directly from the distance matrix by table lookup.

Using the stop word list, each sentence S is reduced to a sentence S' that contains only its main components: the words in the stop word list are removed, inflected word forms are restored to their base forms, and a synonym dictionary is used to merge words with similar semantics within the sentence, so that each sentence is represented by semantically unrelated features, similar to principal component analysis in pattern recognition. Let the two sentences after this analysis be expressed as O_1 = w_1 w_2 … w_m and O_2 = w_1 w_2 … w_n. The distance between them is defined as:

Here the semantic similarity between two words is defined as 1 if the two words are the same or semantically similar, and as 0 otherwise; m and n denote the numbers of main words in the sentences O_1 and O_2, respectively.
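
The table-lookup idea can be sketched as below. The exact distance formula is not shown in the text here, so `sentence_distance` is an assumed stand-in (one minus the fraction of matched main words); only the 0/1 word similarity and the precomputed symmetric matrix follow the description. The synonym sets are illustrative.

```python
# Illustrative synonym sets; a real system would use a synonym dictionary.
SYNONYM_SETS = [{"problem", "issue"}, {"solve", "resolve"}]


def word_similarity(a, b):
    """1 if the words are identical or listed as synonyms, otherwise 0."""
    if a == b:
        return 1
    return int(any(a in s and b in s for s in SYNONYM_SETS))


def sentence_distance(o1, o2):
    """Assumed stand-in distance between two lists of main words, in [0, 1]."""
    if not o1 or not o2:
        return 1.0
    matched_1 = sum(1 for w in o1 if any(word_similarity(w, v) for v in o2))
    matched_2 = sum(1 for v in o2 if any(word_similarity(w, v) for w in o1))
    # Taking the smaller count keeps the distance symmetric.
    return 1.0 - min(matched_1, matched_2) / max(len(o1), len(o2))


def distance_matrix(samples):
    """Precompute all pairwise distances so later steps only do lookups."""
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = sentence_distance(samples[i], samples[j])
    return matrix
```

Later clustering steps can then read `matrix[i][j]` instead of recomputing the distance.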

The expected number of clusters C, the inter-class distance threshold θ_C used for class merging, the minimum number of samples in each class θ_N, and the maximum number of iterations t_max are set in advance; c denotes the current number of classes, and t denotes the current number of iterations.

(2) Initializing cluster centers

Sentences containing many candidate words are selected from c web pages from different sources as the initial cluster centers. A threshold on the number of candidate words contained in a sentence may be set in advance; when the number of candidate words contained in a sentence reaches the threshold, that sentence serves as an initial cluster center.

(3) Sample classification

The data segment samples are assigned to the classes according to the minimum-distance principle, and the number of samples in each class is recorded. For any O ∈ Ω, if

d(O, m(\Gamma_j)) = \min_{1 \le i \le c} d(O, m(\Gamma_i))

then O ∈ Γ_j, where m(Γ_j) denotes the cluster center of class Γ_j, Ω is the space containing all sentences, j is the class index, and Γ_j is the sample space of the j-th class. The number of samples in each class is checked at the same time: if a class has fewer than θ_N samples, the class is discarded, c is set to c − 1, and the samples of that class are reassigned to the remaining classes.
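
The classification step can be sketched as follows, assuming a precomputed distance matrix `dist` and cluster centers that are themselves sample indices. The function name and the policy of discarding one undersized class at a time before reassigning are assumptions.

```python
def assign_samples(dist, centers, min_samples):
    """Assign every sample to its nearest center; drop undersized classes.

    dist        -- square matrix of pairwise sample distances
    centers     -- sample indices currently acting as cluster centers
    min_samples -- the minimum class size (theta_N in the text)
    Returns a dict mapping each surviving center to its member indices.
    """
    centers = list(centers)
    while True:
        classes = {center: [] for center in centers}
        for sample in range(len(dist)):
            nearest = min(centers, key=lambda center: dist[sample][center])
            classes[nearest].append(sample)
        too_small = [c for c, members in classes.items() if len(members) < min_samples]
        if not too_small or len(centers) == 1:
            return classes
        # Discard one undersized class (c = c - 1) and reassign all samples.
        centers.remove(too_small[0])
```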

(4) Recalculating cluster centers

Recalculate the cluster center m(Γ_j) of each class, j = 1, 2, …, c. The cluster center is calculated as follows:

Find the pseudo center O', which is the element of Γ_j for which the number of elements whose distance to it is less than a certain threshold is largest. Let \bar{d} and σ_d^2 be the mean and variance of d(O_k, O_l), where O_k, O_l ∈ Γ_j; then:

\bar{d} = \frac{2}{N_j (N_j - 1)} \sum_{k=1}^{N_j - 1} \sum_{l=k+1}^{N_j} d(O_k, O_l)

\sigma_d^2 = \frac{2}{N_j (N_j - 1)} \sum_{k=1}^{N_j - 1} \sum_{l=k+1}^{N_j} d^2(O_k, O_l) - \bar{d}^2

wherein, the threshold is defined as follows:

If only one element meets the above condition, that sample is taken as the pseudo center. If two or more elements satisfy the condition simultaneously, all samples in Γ_j whose matching distances are smaller than the threshold are taken as a subclass of the class, the average intra-class distance between each element of the subclass and the other elements is calculated, and the element with the minimum average intra-class distance is selected as the pseudo center. The pseudo center obtained in this way is the sample closest to the actual cluster center and can replace the actual cluster center.
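
The pairwise statistics and the pseudo center selection can be sketched as below. The mean and variance follow the two formulas for the pairwise distances within a class; the threshold is passed in by the caller because its definition is not shown in the text, and breaking ties only among the tied elements is a simplifying assumption.

```python
from itertools import combinations


def pairwise_stats(dist, members):
    """Mean and variance of d(O_k, O_l) over all unordered pairs in a class."""
    pairs = list(combinations(members, 2))       # N(N-1)/2 pairs
    if not pairs:
        return 0.0, 0.0
    mean = sum(dist[k][l] for k, l in pairs) / len(pairs)
    variance = sum(dist[k][l] ** 2 for k, l in pairs) / len(pairs) - mean ** 2
    return mean, variance


def pseudo_center(dist, members, threshold):
    """Pick the member with the most neighbours closer than `threshold`."""
    def neighbour_count(o):
        return sum(1 for x in members if x != o and dist[o][x] < threshold)

    best = max(neighbour_count(o) for o in members)
    tied = [o for o in members if neighbour_count(o) == best]
    if len(tied) == 1:
        return tied[0]

    # Tie: choose the tied element with the smallest average distance to the others.
    def average_distance(o):
        others = [x for x in tied if x != o]
        return sum(dist[o][x] for x in others) / len(others)

    return min(tied, key=average_distance)
```

One possible choice, again an assumption, is to derive `threshold` from the values returned by `pairwise_stats`.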

(5) If this is an even-numbered iteration or c ≥ 2C, go to step (8); otherwise continue.

(6) Calculating intra-class distance

Calculate the overall intra-class distance λ_j^Σ and the average intra-class distance of Γ_j, j = 1, 2, …, c.

(7) Class splitting

The class with the largest intra-class distance is split into two classes. The largest intra-class distance can be chosen in two ways: by the overall intra-class distance or by the average intra-class distance. Let the selected class be Γ_jmax. If |Γ_jmax| ≥ 2θ_N or c ≤ C/2, Γ_jmax is split as follows: find two samples O_p1 and O_p2 in the class such that, for any sample pair O_p3 and O_p4 in the class, d(O_p1, O_p2) ≥ d(O_p3, O_p4). O_p1 and O_p2 replace the original cluster center as two new cluster centers, c = c + 1, and the process goes to step (9).
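A short Python sketch of the splitting rule follows. The helper names and the parameter name `expected_c` (standing for C) are hypothetical; `dist` is any distance function on samples.

```python
from itertools import combinations


def should_split(class_size, c, theta_n, expected_c):
    """Split condition: the class is large enough, or there are too few classes."""
    return class_size >= 2 * theta_n or c <= expected_c / 2


def split_centers(members, dist):
    """Return the two mutually farthest samples, used as the two new centers."""
    return max(combinations(members, 2), key=lambda pair: dist(*pair))


print(should_split(class_size=6, c=3, theta_n=3, expected_c=4))
print(split_centers([1, 4, 9, 2], lambda a, b: abs(a - b)))
```

The exhaustive pair search is quadratic in the class size, which is acceptable for the small candidate sets considered here (a few hundred sentences at most).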

(8) Calculating inter-class distance

Using the principal-component-based feature distance calculation method, calculate the distance between every pair of cluster centers: d(m(Γ_i), m(Γ_j)), 1 ≤ i, j ≤ c.

(9) Class merging

Find the minimum value d(m(Γ_p), m(Γ_q)) among all d(m(Γ_i), m(Γ_j)). If d(m(Γ_p), m(Γ_q)) < θ_C, merge class Γ_p and class Γ_q, calculate the new cluster center using step (4), and let c = c − 1.
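The merging step can be sketched in Python as below. The names are hypothetical, and `recompute_center` stands in for the center calculation of step (4); in the usage line a plain average is used only so that the example is self-contained.

```python
from itertools import combinations


def merge_closest(classes, centers, dist, theta_c, recompute_center):
    """Merge the two classes whose centers are closest, if closer than theta_c."""
    if len(centers) < 2:
        return classes, centers
    p, q = min(combinations(range(len(centers)), 2),
               key=lambda pq: dist(centers[pq[0]], centers[pq[1]]))
    if dist(centers[p], centers[q]) >= theta_c:
        return classes, centers  # nothing close enough to merge
    merged = classes[p] + classes[q]
    kept = [i for i in range(len(centers)) if i not in (p, q)]
    new_classes = [classes[i] for i in kept] + [merged]
    new_centers = [centers[i] for i in kept] + [recompute_center(merged)]
    return new_classes, new_centers  # c has decreased by one


classes, centers = merge_closest(
    [[1, 2], [3, 4], [20, 21]], [1.5, 3.5, 20.5],
    lambda a, b: abs(a - b), theta_c=5,
    recompute_center=lambda members: sum(members) / len(members))
print(classes, centers)
```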

(10) Let t = t + 1. If t < t_max, go to step (3); otherwise, store the data related to the clustering, namely the number of clusters c, the cluster centers, and the samples closest to the cluster centers (i.e., the pseudo centers), and end.

After the cluster center and the sample closest to the cluster center (i.e., the pseudo center, which may also be used as the cluster center) are obtained through calculation, the typical example sentence extraction module 5023 extracts the sentences serving as the cluster centers (including the actual cluster center and the sample close to the actual cluster center) and outputs the sentences serving as the typical example sentences.

The output device 60 is used for outputting the obtained context information and the typical example sentence.

FIG. 5 shows a flow of a method for obtaining word usage knowledge based on data mining in an embodiment, which includes the following specific processes:

Step S10, receiving the word or phrase to be searched input by the user. In one embodiment, the word or phrase to be searched can be entered in several input modes. For example, to search for usage knowledge of the word "solution", any of the forms "solution", "solution question", "<solution> difficulty, that", "<solution> n.", or "solution problem/issue" can be entered.

Step S20, analyzing the keywords in the word or phrase to be searched, and sending the word or phrase to the corresponding input mode for processing according to the analysis result. Words or phrases entered in different input modes are processed by the corresponding input mode processing devices, and the keywords in the input are analyzed. When the analysis finds only a single word, the input is sent to the single word mode unit 304 for processing; when the input contains the characters "< >", it is sent to the category mode unit 302 for processing; when the input contains Chinese, it is sent to the target language collocation mode unit 303 for processing; when the input contains the character "/", it is sent to the comparison mode unit 301 for processing.
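The routing in step S20 can be illustrated with a small Python sketch. The mode labels, the function name, the order in which the markers are tested, and the fallback label for inputs that match no rule are all assumptions; the text only states which marker leads to which unit.

```python
import re

# Basic CJK Unified Ideographs range, used here as a simple test for Chinese text.
CJK = re.compile(r"[\u4e00-\u9fff]")


def detect_mode(query):
    """Return a label for the input mode that should process `query`."""
    q = query.strip()
    if "<" in q and ">" in q:
        return "category"                     # category mode unit 302
    if CJK.search(q):
        return "target_language_collocation"  # target language collocation mode unit 303
    if "/" in q:
        return "comparison"                   # comparison mode unit 301
    if len(q.split()) == 1:
        return "single_word"                  # single word mode unit 304
    return "unclassified"                     # assumption: not covered by the text


for example in ["solution", "<solution> n.", "lay/make foundation", "solution 问题"]:
    print(example, "->", detect_mode(example))
```

The last example, with a Chinese collocate, is a hypothetical input written for this sketch.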

Step S30, analyzing and expanding the word or phrase to be searched by semantic knowledge and dictionary to form a search term, searching the web page information according to the search term to obtain the web page related to the input word or phrase. In one embodiment, as shown in fig. 6, the specific process of step S30 is as follows:

In step S301, when the input mode is the comparison mode, the words or phrases are combined into a query term using logical words, forming a new query. For example, for "lay/make foundation", the new query term composed with logical words (OR, AND, etc.) is "(lay OR make) AND foundation". The query term is sent to the search engine retrieving module 305, which searches for web pages matching the query term and downloads them. In addition, the occurrence frequencies of the candidate words "lay", "make", and "foundation" can be counted, and the web pages can be sorted according to these frequencies. Since many web pages may be retrieved, a limit on the number of downloaded web pages may be preset, for example the top 300 web pages in the ranking. The comparison mode has several advantages. It obtains statistics for multiple collocations with a single query, which is particularly suitable when semantic expansion produces many combinations. It can find new collocation information: for example, when searching for "solvaissue/query", "recipe" can also be counted because it often appears together with "issue". Finally, because the retrieved web pages are ranked by the frequency of the candidate words, the preset number of selected web pages is more representative.
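As an illustration of step S301, the Python sketch below builds the logical query from a comparison-mode input and ranks page texts by how often the candidate words occur. The function names and the simple tokenization are assumptions; the actual search engine interface is not shown.

```python
import re


def build_comparison_query(query):
    """Turn 'lay/make foundation' into '(lay OR make) AND foundation'."""
    parts = []
    for token in query.split():
        alternatives = [a for a in token.split("/") if a]
        if len(alternatives) > 1:
            parts.append("(" + " OR ".join(alternatives) + ")")
        elif alternatives:
            parts.append(alternatives[0])
    return " AND ".join(parts)


def rank_pages(pages, candidates, limit=300):
    """Sort page texts by total candidate-word frequency and keep the top `limit`."""
    def score(text):
        words = re.findall(r"[a-z']+", text.lower())
        return sum(words.count(c) for c in candidates)
    return sorted(pages, key=score, reverse=True)[:limit]


print(build_comparison_query("lay/make foundation"))
pages = ["make a plan", "lay the foundation and make a foundation", "lay bricks"]
print(rank_pages(pages, ["lay", "make", "foundation"], limit=2))
```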

In step S302, when the input mode is the category mode, the input core word and the category information are analyzed and expanded to form a query term. Category patterns include two types, one is the entry of a core word and part-of-speech, e.g., "< solution > n."; one is to enter a core word and its synonyms, such as "< solution > difficulty, this". The part of speech and the synonym are used for indicating the category information of the candidate word collocated with the central word.

To describe the category information more precisely, synonyms can be used as context vectors to constrain the query and narrow the search results. Because the synonyms must be provided by the user, and the amount of information a user can provide is small, the synonyms can be expanded automatically using the hypernym information in the WordNet semantic dictionary. For example, for "<solution> this query", "this query" is used as the context vector, and for expansion the hypernym "difficulty" of "query" is also added to the context vector, forming a new query term.
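The category mode of step S302 can be sketched in Python as follows. The parser, the small set of part-of-speech tags, and the `TOY_HYPERNYMS` table are hypothetical; the table merely stands in for a WordNet lookup and contains only the pair mentioned in the example above.

```python
import re

POS_TAGS = {"n.", "v.", "adj.", "adv."}    # assumed tag set, for illustration
TOY_HYPERNYMS = {"query": ["difficulty"]}  # stand-in for WordNet hypernym lookups


def parse_category_query(query):
    """Split '<core> rest' into the core word and its category information."""
    match = re.fullmatch(r"\s*<\s*([^<>]+?)\s*>\s*(.*)", query)
    if match is None:
        raise ValueError("not a category-mode query: %r" % query)
    core, rest = match.group(1), match.group(2).strip()
    if rest in POS_TAGS:
        return {"core": core, "pos": rest, "context": []}
    context = [w for w in re.split(r"[,\s]+", rest) if w]
    return {"core": core, "pos": None, "context": context}


def expand_context(context, hypernyms=TOY_HYPERNYMS):
    """Append the hypernyms of each context word, without duplicates."""
    expanded = list(context)
    for word in context:
        for hyper in hypernyms.get(word, []):
            if hyper not in expanded:
                expanded.append(hyper)
    return expanded


parsed = parse_category_query("<solution> this query")
print(parsed["core"], expand_context(parsed["context"]))
```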

In step S303, when the input mode is the target language collocation mode, the input collocation is translated and expanded to form new query terms. In one embodiment, "solution question" is input in order to find usage knowledge of "solution", and the relevant web pages are obtained by using the Chinese collocation as a constraint. In this mode, the Chinese part is first translated according to the Chinese-English knowledge base. Because a general-purpose Chinese dictionary offers only a narrow set of translation options, which cannot meet the needs of Chinese semantic expansion, synonym expansion is used to address the problem: after the Chinese part is translated, its synonyms are expanded so that the feature word vector is as complete as possible. In addition, the query term can be further expanded with the WordNet semantic dictionary by adding words that stand in a hypernym relation in WordNet.

In step S304, when the input mode is a single word mode, a query term is formed from the single word.

In step S305, web page information is retrieved according to the generated query term, and a web page related to the input word or phrase is acquired.

Step S40, analyzing the web page obtained by the search, and converting the web page into a candidate text. In one embodiment, the specific process of step S40 is: analyzing the searched webpage information, removing repeated webpages, and analyzing each webpage into a form of a document model tree; and in the document model tree, removing non-text labels in the webpage, reserving useful labels, and converting the webpage into candidate texts in a text form.
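A minimal Python sketch of step S40 follows, using the standard-library `html.parser` module. Which tags count as non-text (here `script`, `style`, and `noscript`) and the choice to detect repeated pages by comparing the extracted text are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping the content of non-text tags."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def pages_to_texts(pages):
    """Convert HTML pages to candidate texts, dropping pages with identical text."""
    seen, texts = set(), []
    for html in pages:
        parser = TextExtractor()
        parser.feed(html)
        parser.close()
        text = " ".join(parser.chunks)
        if text and text not in seen:
            seen.add(text)
            texts.append(text)
    return texts


pages = [
    "<html><head><style>p{}</style></head><body><p>Lay the foundation.</p>"
    "<script>var x = 1;</script></body></html>",
    "<p>Lay the foundation.</p>",
]
print(pages_to_texts(pages))
```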

And step S50, processing the candidate text, and extracting context information and typical example sentences of the words or phrases. In one embodiment, step S50 includes extracting context information of a word or a phrase and extracting two parts of a typical example sentence of the word or the phrase, where the process of extracting the context information of the word or the phrase is specifically as follows: processing the candidate text into a single sentence through boundary identification, obtaining candidate words in the single sentence through keyword search, counting each candidate text by utilizing a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputting a candidate list of context information according to the occurrence frequency of the candidate words. In this embodiment, the candidate words may be further sorted according to the occurrence frequency of the candidate words, a preset number of candidate words are selected according to the sorting, and the functional words and the non-semantic words are removed according to the stop word list, so as to obtain a candidate list including context information of the selected candidate words.
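The context information extraction can be illustrated with the Python sketch below. The stop word list is a tiny illustrative sample, and the use of a fixed window of neighbouring words around the keyword is an assumption; the text only states that candidate words are found in each sentence by keyword search and then counted.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}  # illustrative only


def context_candidates(sentences, keyword, top_n=10, window=2, stop_words=STOP_WORDS):
    """Count words near `keyword`, keep the top_n, then remove stop words."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        for i, word in enumerate(words):
            if word == keyword:
                left = words[max(0, i - window):i]
                right = words[i + 1:i + 1 + window]
                counts.update(left + right)
    ranked = counts.most_common(top_n)  # sorted by frequency, highest first
    return [(w, n) for w, n in ranked if w not in stop_words]


sentences = ["We must solve the problem quickly.",
             "They solve the problem.",
             "Solve a puzzle."]
print(context_candidates(sentences, "solve", top_n=5))
```

Filtering stop words after truncating to the preset number follows the order given in the text; filtering first would keep more content words in the final list.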

In one embodiment, as shown in fig. 7, the process of extracting the typical example sentence is specifically as follows:

In step S501, sentences in the single sentences that include the context information are extracted as candidate example sentences. In one embodiment, the specific process of step S501 is as follows: the candidate texts of the web pages are split into single sentences. Specifically, a document may be divided into individual sentences according to sentence-ending punctuation marks (e.g., ".", etc.). To distinguish a sentence-ending period from the dot that follows an abbreviation, a list of abbreviations may be constructed and rules formulated to determine whether a dot is a period. In addition, the length of a separated single sentence can be limited; for example, a sentence containing more than 5 words and fewer than 30 words is used as a candidate example sentence.
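Step S501 can be sketched in Python as below. The abbreviation list is a small illustrative sample, the set of sentence-ending marks (".", "?", "!") is an assumption, and the function names are hypothetical.

```python
import re

ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}  # illustrative only


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on sentence-ending marks, ignoring dots that close an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def candidate_sentences(text, context_words, min_words=5, max_words=30):
    """Keep sentences of suitable length that contain at least one context word."""
    kept = []
    for sentence in split_sentences(text):
        words = [w.lower() for w in re.findall(r"[A-Za-z']+", sentence)]
        if min_words < len(words) < max_words and context_words & set(words):
            kept.append(sentence)
    return kept


text = ("Dr. Smith will solve the problem for us today. It works. "
        "Can they solve this hard problem before the deadline?")
print(candidate_sentences(text, {"problem"}))
```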

In step S502, the candidate example sentences are clustered using a feature-based sentence clustering method. In one embodiment, as shown in fig. 8, the specific process of step S502 is as follows:

The principal-component-based feature distance calculation reduces a sentence S to its main components only: words in the stop word list are removed, different word forms are restored to their base form, and semantically similar words within the sentence are removed using a synonym dictionary, so that each sentence is represented by features that are semantically unrelated to one another. This is similar to principal component analysis in pattern recognition. Let the two sentences after analysis be expressed as O₁ = w₁w₂…w_m and O₂ = w₁w₂…w_n. The distance between them is defined as:

wherein the similarity term represents the semantic similarity between two words, defined as 1 if the two words are the same or semantically similar and as 0 otherwise; m represents the number of main words in the first sentence, and n represents the number of main words in the second sentence.
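The reduction of a sentence to its main components can be illustrated in Python. Because the distance formula itself is not reproduced in this text, the `distance` function below is only an assumed stand-in (one minus the share of matched words, normalized by the longer sentence) and should not be read as the formula of the embodiment. The lemma and synonym tables are toy data, and all names are hypothetical.

```python
import re


def main_components(sentence, stop_words, lemmas, synonyms):
    """Drop stop words, restore base forms, and keep one word per synonym group."""
    seen, kept = set(), []
    for word in re.findall(r"[a-z']+", sentence.lower()):
        word = lemmas.get(word, word)
        if word in stop_words:
            continue
        group = synonyms.get(word, word)
        if group not in seen:
            seen.add(group)
            kept.append(word)
    return kept


def similarity(a, b, synonyms):
    """1 if the two words are identical or belong to the same synonym group, else 0."""
    return 1 if synonyms.get(a, a) == synonyms.get(b, b) else 0


def distance(o1, o2, synonyms):
    """Assumed distance: 1 - matched words / max(m, n). Not the patent's formula."""
    if not o1 and not o2:
        return 0.0
    matched = sum(1 for a in o1 if any(similarity(a, b, synonyms) for b in o2))
    return 1.0 - matched / max(len(o1), len(o2))


stop = {"the", "a", "to"}
lemmas = {"solved": "solve", "problems": "problem"}
synonyms = {"issue": "problem"}
s1 = main_components("They solved the problems", stop, lemmas, synonyms)
s2 = main_components("They solve the issue", stop, lemmas, synonyms)
print(s1, s2, distance(s1, s2, synonyms))
```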

The remaining steps are the same as those described above for the clustering module: (2) initializing cluster centers; (3) sample classification; (4) recalculating cluster centers, using the mean and variance of the pairwise distances within each class to determine the pseudo center; (5) the iteration check; (6) calculating intra-class distance; (7) class splitting; (8) calculating inter-class distance; (9) class merging; and (10) the termination check.

In step S503, the sentences serving as cluster centers are selected from the clustered sentences as typical example sentences. Specifically, the sentence at the actual cluster center and the pseudo center closest to the actual cluster center may be selected as typical example sentences.

And step S60, outputting the context information and the typical example sentence.

The above-mentioned embodiments express only several embodiments of the present invention, and although their description is specific and detailed, it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A system for obtaining word usage knowledge based on data mining, the system comprising:

the input device is used for inputting words or phrases to be searched;

the query analysis device is used for analyzing the keywords in the words or phrases to be searched and sending the words or phrases to be searched into the corresponding input mode processing device for processing according to the analysis result;

the multi-input mode processing device analyzes and expands the words or phrases to be searched by utilizing semantic knowledge and a dictionary to form query terms, and searches webpage information according to the query terms to obtain webpages related to the words or phrases to be searched;

the webpage analysis device is used for analyzing the searched webpage and converting the webpage into a candidate text;

the usage knowledge extraction device is used for processing the candidate text and extracting the context information and the typical example sentence of the word or phrase to be searched;

the output device outputs the context information and the typical example sentences;

the multiple input mode processing apparatus includes the following multiple input mode units: the system comprises a comparison mode unit, a category mode unit, a target language collocation mode unit and a single word mode unit, and also comprises a search engine retrieval module for retrieving a webpage;

2. The system for obtaining word usage knowledge based on data mining as claimed in claim 1, wherein the web page analysis means further analyzes the searched web page information to remove duplicate web pages, analyzes each web page in a form of a document model tree in which non-text tags in the web page are removed and useful tags are retained, thereby converting the web page into candidate text in a text form.

3. The system for obtaining word usage knowledge based on data mining as claimed in claim 1 or 2, wherein the usage knowledge extraction means comprises:

and the context information extraction unit is used for processing the candidate text into a single sentence through boundary identification, acquiring candidate words in the single sentence through keyword search, counting each candidate text by using a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputting a candidate list of context information according to the occurrence frequency of the candidate words.

4. The system for obtaining word usage knowledge based on data mining of claim 3, wherein the context information extraction unit further ranks the candidate words according to their frequency of occurrence, selects a preset number of candidate words according to the ranking, and removes functional words and non-semantic words according to a stop word list to obtain a candidate list containing context information of the selected candidate words.

5. The system for acquiring word usage knowledge based on data mining as claimed in claim 3, wherein the usage knowledge extraction device further comprises a typical example sentence extraction unit, the typical example sentence extraction unit comprising:

the candidate example sentence extraction module is used for extracting sentences containing the context information in the webpage candidate texts as candidate example sentences;

the clustering module is used for clustering the candidate example sentences by utilizing a sentence clustering method based on characteristics;

and the typical example sentence extraction module selects a sentence which is taken as a clustering center from the clustered sentences as a typical example sentence.

6. A method for acquiring word usage knowledge based on data mining comprises the following steps:

A. receiving a word or phrase to be searched input by a user;

B. analyzing the keywords in the words or phrases to be searched, and sending the words or phrases to be searched into a corresponding input mode for processing according to the analysis result;

C. analyzing and expanding the words or phrases to be searched by utilizing semantic knowledge and a dictionary to form query terms, and searching webpage information according to the query terms to obtain webpages related to the input words or phrases;

D. analyzing the searched web page, and converting the web page into a candidate text;

E. processing the candidate text, and extracting context information and typical example sentences of the words or phrases;

F. outputting the context information and the typical example sentence;

the input modes include one or more of the following modes: the comparison mode, the category mode, the target language collocation mode and the single word mode;

when the input mode is a comparison mode, the step C specifically includes: adopting logic words to combine words or phrases into query terms, retrieving webpage information according to the query terms, and acquiring webpages related to the input words or phrases;

when the input mode is a category mode, the step C specifically includes: analyzing and expanding the input central word and category information according to semantic knowledge to form a query term, retrieving webpage information according to the query term, and acquiring a webpage related to the input word or phrase;

when the input mode is a target language collocation mode, the step C specifically comprises the following steps: analyzing and expanding the input collocation words according to the dictionary to form query terms, retrieving webpage information according to the query terms, and acquiring webpages related to the input words or phrases;

when the input mode is a single word mode, the step C specifically includes: and forming a query term according to the input single word, retrieving webpage information according to the query term, and acquiring a webpage related to the input word or phrase.

7. The method for obtaining knowledge of word usage based on data mining of claim 6, wherein the step D is specifically:

analyzing the searched webpage information, removing repeated webpages, and analyzing each webpage into a form of a document model tree;

and in the document model tree, removing non-text labels in the webpage, reserving useful labels, and converting the webpage into candidate texts in a text form.

8. The method for obtaining knowledge of word usage based on data mining as claimed in claim 7, wherein said step E comprises:

processing the candidate text into a single sentence through boundary identification, obtaining candidate words in the single sentence through keyword search, counting each candidate text by utilizing a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputting a candidate list of context information according to the occurrence frequency of the candidate words.

9. The method for obtaining knowledge of word usage based on data mining as claimed in claim 8, wherein said step E further comprises:

and sorting the candidate words according to the occurrence frequency of the candidate words, selecting a preset number of candidate words according to the sorting, and removing functional words and non-semantic words according to a stop word list to obtain a candidate list containing the context information of the selected candidate words.

10. The method for obtaining knowledge of word usage based on data mining as claimed in claim 8, wherein said step E further comprises:

extracting sentences containing the context information from the single sentence as candidate example sentences;

clustering the candidate example sentences by using a sentence clustering method based on characteristics;

and selecting a sentence as a clustering center from the clustered sentences as a typical example sentence.