Method and system for extracting knowledge from a Chinese corpus
Technical Field
The invention relates to the field of word segmentation, and in particular to a method and a system for extracting knowledge from a source corpus written mainly in Chinese, for generating a Chinese ontology library through automatic word segmentation, part-of-speech (POS) tagging, Chinese noun phrase collocation, and term frequency calculation.
Background
In the information technology age, large amounts of data are uploaded to and downloaded from networks, enterprise computer networks, and other databases every day. Data users want to find the information they need in these sources, but the information returned is not always what they are looking for. An ontology library is a representation of the similarities and associations between different concepts, in which each concept has its own semantic information, and it can be used to improve the accuracy and predicted relevance of a search.
An ontology library may be generated from knowledge in different languages. Whatever the language, a corpus in that language must be processed and its key phrases extracted for ontology generation. Some languages, such as Chinese, have no explicit separators between words, so they may be more difficult or complex to process than English, and knowledge extraction may be difficult. An efficient method for segmenting a corpus of Chinese text into meaningful phrases is therefore hard to obtain.
Traditionally, text segmentation of a Chinese corpus is achieved with Conditional Random Fields (CRFs) or Hidden Markov Models (HMMs). Both are statistical modeling methods based on pattern recognition and prediction. However, the basic unit of these segmentation methods is a character or word rather than a phrase, so the Chinese phrases in a string of Chinese characters are segmented into characters or words before semantic similarity is derived. The prior-art algorithms thus unnecessarily increase the total number of units to be recognized and reduce the meaningful results available for generating the Chinese ontology library. For example, rather than being extracted as a whole phrase, a Chinese phrase such as "financial crisis" is segmented into "finance" and "crisis", and the most relevant information or knowledge may be lost because of the segmentation.
US 2009/0313243 A1 discloses a method to calculate relevance scores for phrases in semantic data sources of a domain and to calculate weights for the semantic data sources based on the relevance scores of the phrases. The relevance score is calculated from the frequency of a phrase in the domain corpus and the expected frequency of the phrase. This approach shares some features with the present invention, but it is inefficient and ineffective in processing Chinese phrases, which have no explicit separators or spaces between words.
CN 1011699780A discloses a search system based on a semantic ontology library. Its text index processing unit is a conventional processing unit that establishes a text index by analyzing text content and extracting keywords and file identification information. Semantic searching in this publication focuses on the relationships and attributes of keywords, without recognizing the importance of word segmentation, tagging, and term frequency weighting for identifying relevant information.
US 7680648 B2 discloses a method and system for improving text segmentation. A string of characters can be segmented into several combinations of segmented strings, and the disclosed method uses frequency of occurrence to identify and select the best operable segmentation result among them. The method segments search queries without explicit separators well, but it has no concept of collocation or noun phrase identification, and its benefit for Chinese sentences is not evident.
Thus, there is a need for a more efficient and accurate method and system, preferably an automated computer-implementable one, for extracting knowledge from a Chinese corpus to better support Chinese ontology library generation.
Disclosure of Invention
Since Chinese is written continuously without explicit separators or spaces between words, it is difficult for an automated computer system to perform text segmentation and extract relevant information for Chinese ontology library generation. The accuracy of knowledge extraction depends on how the sentence is segmented and which tokens are chosen for extraction. In a Chinese corpus, phrases and compound words containing two or more characters are often used to express a particular meaning, rather than an individual meaning for each character or word. This leads to complexity and ambiguity in the segmentation process. Traditional word segmentation methods can identify most characters or words in a corpus; for example, words such as "knowledge" and "property" can be identified rather than being split into "known", "produced" and "rights". However, the combination of these two words, "intellectual property", is difficult to identify. The invention aims to solve this problem and provides a method for extracting meaningful information from a corpus.
Embodiments of the present invention include methods and systems for improving Chinese word segmentation. They include a collocation module that uses a Chinese dictionary as a reference corpus to identify and collocate frequently co-occurring characters or words. The reference corpus may be built automatically by extracting article titles from a structured knowledge network, which is a database of structured information stored on a network. For example, several Chinese online encyclopedias, such as Baidu Baike and Chinese Wikipedia, hold general knowledge in millions of articles. They contain a large number of common phrases and compound words and can provide the resources needed to improve word segmentation.
Described below are a method, a system, and a computer-readable medium encoded with instructions that, when executed by a processor, perform the method, for automatic word segmentation and POS tagging of a corpus of Chinese text for Chinese ontology generation. The method comprises the following steps: obtaining character strings from source corpora, wherein each source corpus represents a concept; separating the character strings into segmented characters or words; applying POS tags to the segmented characters or words; collocating individual Chinese characters or words from the segmented characters or words into meaningful phrases or compound words; extracting Chinese noun phrases, characters or words from the collocated phrases, characters or words; calculating the term frequency of the extraction results; and storing the extraction results and the term frequency weighting vector of the corresponding concept for subsequent Chinese ontology library generation.
Preferably, the step of obtaining the character strings from the source corpora comprises receiving topics, titles, and main textual content from the source corpora, where each source corpus represents a concept. Titles and topics are very useful for determining the name of a concept, while the main text provides a description of the concept.
Preferably, the source corpus is written primarily in Chinese and sometimes also contains numeric characters, punctuation, English, and characters of other languages, without explicit separation between words. The source corpus includes electronic documents on networks and other systems such as the Internet, a WAN, a LAN, a private network, or a single computer.
Preferably, separating the character string into segmented characters or words comprises the step of determining the segmentation result by word segmentation, wherein the segmentation result may be a character or a word.
Further, separating the character string into segmented characters or words comprises the step of applying one or more word segmentation models, wherein the word segmentation models are Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs).
Preferably, applying POS tags to the segmented characters or words comprises the steps of: extracting POS information related to the segmented characters or words; and assigning POS tags to the segmented characters or words.
Preferably, the POS information related to the segmented characters or words is extracted from a mature POS tagging model for Chinese, wherein the mature POS tagging model for Chinese is the Chinese Treebank (CTB) obtained from the Linguistic Data Consortium.
Further, POS tags are assigned to the segmented characters or words by mapping the POS features to the segmented characters or words in a vector space, wherein the mapping can be done by building an index or table.
Preferably, collocating individual Chinese characters or words into meaningful phrases or compound words comprises the steps of: grouping co-occurring Chinese characters or words; finding potential Chinese phrases or compound words in the groups of Chinese characters or words; searching for the potential Chinese phrases or compound words in a reference corpus; storing the confirmed Chinese phrases or compound words with their POS tags; and removing the corresponding co-occurring Chinese characters or words.
Preferably, co-occurring Chinese noun characters or words are grouped by identifying a series of two or more adjacent Chinese characters or words tagged as nouns and treating the series as a noun group.
Preferably, potential Chinese phrases or compound words are found in the groups of Chinese characters or words by using an n-gram model, wherein the n-gram model determines a co-occurrence probability distribution for each potential combination of Chinese characters or words.
Preferably, the reference corpus is a commonly used Chinese dictionary that can be constructed by extracting frequently co-occurring words from a structured knowledge network, wherein the structured knowledge network is a Chinese online encyclopedia.
Preferably, the structured knowledge network uses public knowledge, with article titles extracted from an online encyclopedia, from Chinese Wikipedia, or from any other suitable online database.
Further, extracting Chinese noun phrases, characters or words includes the step of filtering out all numeric characters, punctuation, English, and characters of other languages.
Further, the term frequency of the extraction results is estimated by the following equation:
wherein the term frequency weight is greater than or equal to 0 and less than or equal to 1.
Preferably, storing the extraction results and the term frequency weighting vector of the corresponding concept for subsequent Chinese ontology library generation comprises the steps of: mapping the Chinese noun phrases, characters or words, together with the corresponding term frequency weighting results, in a web ontology language; and constructing an index of the term frequency weighting vector of the concept for subsequent Chinese ontology library generation.
Preferably, the web ontology language is RDF.
When the source corpus is large, an alternative method of extracting knowledge from the source corpus comprises the steps of: obtaining character strings from source corpora, wherein each source corpus represents a concept; separating the character strings into segmented characters or words; applying POS tags to the segmented characters or words; extracting Chinese noun characters or words from the segmented characters or words; collocating the extracted individual Chinese noun characters or words into meaningful phrases or compound words; calculating the term frequency of the extraction results; and storing the extraction results and the term frequency weighting vector of the corresponding concept for subsequent Chinese ontology library generation.
Drawings
FIG. 1 is a flow diagram illustrating the system and data content of a knowledge extraction system;
FIG. 2 is a flow diagram illustrating an alternative embodiment of the knowledge extraction system for a large source corpus;
FIG. 3 is a flow chart illustrating a term frequency weighting counter;
FIG. 4 is a flow diagram illustrating the data content of the knowledge extraction system, showing, in one embodiment, the steps for converting a character string into an index having a term frequency weighting vector;
FIG. 5 is a flow chart illustrating the data content of a Chinese phrase collocation unit, showing, in one embodiment, the steps for determining a Chinese noun phrase or compound word from characters or words.
Detailed Description
The present invention will now be described in detail with reference to exemplary embodiments thereof as illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout.
All figures and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily made without departing from the principles claimed herein, and should not be considered as limited to the embodiments described herein.
Embodiments of the systems, methods, and computer-readable media disclosed herein provide knowledge extraction from a corpus of Chinese text for Chinese ontology generation. The method comprises the following steps: obtaining character strings from source corpora, wherein each source corpus represents a concept; separating the character strings into segmented characters or words; applying POS tags to the segmented characters or words; collocating individual Chinese characters or words from the segmented characters or words into meaningful phrases or compound words; extracting Chinese noun phrases, characters or words from the collocated phrases, characters or words; calculating the term frequency of the extraction results; and storing the extraction results and the term frequency weighting vector of the corresponding concept for subsequent Chinese ontology library generation.
Referring now to the drawings, FIG. 1 is a flow diagram illustrating a knowledge extraction system 102 that extracts knowledge 103 for Chinese ontology generation from a source corpus 101. The system includes an acquisition module 111, a word segmenter 112, a mature POS tagger 113, a Chinese phrase collocation unit 114, a Chinese noun selector 115, and a term frequency weighting counter 116. In one embodiment, the knowledge extraction system 102 may be implemented according to the alternative flow shown in FIG. 2, which includes the acquisition module 111, the word segmenter 112, the mature POS tagger 113, the Chinese noun selector 115, the Chinese phrase collocation unit 114, and the term frequency weighting counter 116, in that order. FIG. 3 illustrates the structure of the term frequency weighting counter 116, which generates an index 149 with term frequency weighting vectors from the Chinese noun phrases, characters or words 148 as the knowledge 103 for Chinese ontology generation. FIG. 4 illustrates, in one embodiment, how knowledge is extracted from a source corpus together with term frequency weighting vectors. FIG. 5 illustrates, in one embodiment, how the Chinese phrase collocation unit 114 determines Chinese phrases or compound words from the POS-tagged characters or words 143.
The method for extracting knowledge from the source corpus 101 disclosed in the present application can be implemented according to the flow shown in FIG. 1, which includes the steps of: obtaining character strings from the source corpus; separating the character strings; applying POS tags; collocating individual Chinese characters or words; extracting Chinese noun phrases, characters or words from the collocated phrases, characters or words; calculating the term frequency; and storing the result.
In generating a Chinese ontology, a text processing method or system is needed to extract useful information from source corpora 101, where each source corpus 101 represents a concept. A concept is an abstract idea. Since nouns carry the most representative knowledge, one can understand a concept by extracting and browsing all relevant nouns in a corpus of text describing the concept, and thereby learn the events, people, things, places, times, features, and characteristics that are relevant to the concept. All of this information may be referred to as knowledge of the concept. A data user may gain further understanding of the source corpus 101 by identifying these important nouns through the corresponding term frequency weighting vectors.
The source corpus 101 may be an electronic document, such as an HTML page, a Portable Document Format (PDF) file, or other computer-readable content from the Internet, a WAN, a LAN, a private network, a single computer, or another transmitting device or channel. The electronic document is written primarily in Chinese and sometimes also contains numeric characters, punctuation, English, and characters of other languages, without explicit separation between words. The knowledge extraction system 102 is the core system of the present invention and can analyze textual content to determine the most important knowledge in the source corpus for Chinese ontology generation 103.
The acquisition module 111 obtains character strings 141 from the source corpus 101, wherein the character strings 141 may be obtained from topics, headings, body text, footers, and other textual content in computer-readable media. The characters may include Chinese, English, or other language characters, as well as CJK symbols, emoticons, and characters from Unicode, ASCII, or other character sets. In one embodiment, where the source corpus 101 is written primarily in Chinese or another Asian language with no explicit separation or spacing between words, the acquisition module 111 may obtain all characters from the source corpus 101 as input data from which meaningful knowledge is then extracted.
The word segmenter 112 segments the character string 141 into separate characters or words 142 by word segmentation, wherein each of the separate characters or words 142 may be a single character or a combination of Chinese characters (a Chinese word). Word segmentation is a common tokenization operation that determines the boundaries of words within a run of characters; the same characters may carry different meanings depending on how they are combined. In one embodiment, word segmentation may be performed by applying one or more word segmentation models, wherein the word segmentation models are Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs). The embodiment in FIG. 4 demonstrates the operation of the word segmenter 112: the title and character string 161 from the source corpus 101 are separated by slashes (/) into entities 162 of characters or words.
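As an illustration of HMM-based segmentation, the following minimal sketch runs Viterbi decoding over the common B/M/E/S character tags (begin, middle, end, single). The tag scheme, the hand-set probabilities, and the uniform emission score are assumptions made for the example and are not taken from the embodiments; a practical model would estimate all of them from a segmented training corpus.

```python
import math

# Tags: B = begin of word, M = middle, E = end, S = single-character word.
TAGS = "BMES"
NEG_INF = float("-inf")

# Toy, hand-set log-probabilities (assumptions for illustration only).
START = {"B": math.log(0.6), "S": math.log(0.4), "M": NEG_INF, "E": NEG_INF}
TRANS = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.3), "E": math.log(0.7)},
    "E": {"B": math.log(0.6), "S": math.log(0.4)},
    "S": {"B": math.log(0.6), "S": math.log(0.4)},
}


def emit(tag, ch):
    """Uniform toy emission score; a trained HMM would use log P(ch | tag)."""
    return 0.0


def viterbi_segment(text):
    """Return the most probable segmentation of text under the toy HMM."""
    if not text:
        return []
    # score[t] = best log-probability of any tag path ending in tag t.
    score = {t: START[t] + emit(t, text[0]) for t in TAGS}
    back = []
    for ch in text[1:]:
        new_score, pointers = {}, {}
        for t in TAGS:
            best_prev, best = None, NEG_INF
            for p in TAGS:
                s = score[p] + TRANS[p].get(t, NEG_INF)
                if s > best:
                    best_prev, best = p, s
            new_score[t] = best + emit(t, ch)
            pointers[t] = best_prev
        score = new_score
        back.append(pointers)
    # A valid tag path must end on E or S.
    tags = [max("ES", key=lambda t: score[t])]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    tags.reverse()
    # Close a word after every E or S tag.
    words, current = [], ""
    for ch, t in zip(text, tags):
        current += ch
        if t in "ES":
            words.append(current)
            current = ""
    return words


# Slash-separated output in the style of entities 162 in FIG. 4.
print("/".join(viterbi_segment("知识产权署")))
```

With uniform emissions the result reflects only the transition probabilities, so the sketch shows the decoding mechanics rather than realistic segmentation quality.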
The mature POS tagger 113 applies POS tags to the segmented characters or words 142 to determine the part of speech of each character or word. This module extracts POS information from a mature POS tagging model for Chinese. In one embodiment, the mature POS tagging model for Chinese is available from the Linguistic Data Consortium at the following link: https://catalog.ldc.upenn.edu/LDC2004T05. In embodiments of the mature POS tagging model, which are not limited to this arrangement, the POS tags are mapped together with the corresponding segmented characters or words 142 in a vector space, where the mapping can be done by building an index, table, database, array, or any other computer-readable indexing medium. In the embodiment of FIG. 4, step 163 demonstrates the determination of the POS tags and their storage with the characters or words.
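The table-based mapping can be sketched as a simple lookup. The example below is an assumption-laden simplification: the tag names follow Chinese Treebank conventions (NN for common noun, VV for verb, PU for punctuation), but the table entries, the `UNK` fallback, and the function name `tag_words` are hypothetical, and a real tagger would resolve tags from context rather than from a fixed table.

```python
# Hypothetical tag table; entries are made up for illustration.
POS_TABLE = {
    "知识": "NN",
    "产权": "NN",
    "署": "NN",
    "保护": "VV",
    "，": "PU",
}


def tag_words(words, table, default="UNK"):
    """Pair every segmented word with its POS tag from the table."""
    return [(word, table.get(word, default)) for word in words]


tagged = tag_words(["知识", "产权", "署", "保护"], POS_TABLE)
for word, tag in tagged:
    print(f"{word}/{tag}")
```

Keeping the word and its tag together as a pair mirrors the storage of tags alongside words described for step 163.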
In the prior art, it is difficult for an automated computer system to extract the relevant information for Chinese ontology library generation. Traditional word segmentation methods, including HMMs, CRFs, and word lattices, can identify a large percentage of the characters or words in a corpus, but because of word segmentation ambiguities they cannot effectively identify meaningful Chinese phrases or compound words. Advantageously, embodiments of the present invention implement a Chinese phrase collocation unit 114 that can identify potential Chinese phrases 145 from the POS-tagged characters or words 143 and by searching for commonly co-occurring characters or words in the reference corpus 123. FIG. 5 illustrates the internal modules of the Chinese phrase collocation unit 114.
From the POS-tagged characters or words 143, the grouping system can identify a series of two or more Chinese characters or words having the same POS tag and store the result as a co-occurring Chinese character or word group 144. The grouping system searches through the string of POS-tagged characters or words and determines the boundaries of the groups by storing adjacent characters or words together as a group if they have the same POS tag. Preferably, the grouping system may include an input counter to determine the number of characters or words in each group. If the input counter for a group returns 1, the group does not need any collocation, and the remaining collocation steps can be skipped. The input counter may also provide the number of iterations required for the n-gram model 121.
The n-gram model 121 is an exhaustive iterative method for identifying all potential Chinese phrases or compound words 145 from the co-occurring Chinese character or word group 144, wherein the n-gram model iterates over and joins together every run of "n" or fewer adjacent characters or words, based on the result of the input counter for each group. In the n-gram example presented in FIG. 5, group 1 in module 164 has three words:
"knowledge", "property right" and "department"
By applying the n-gram model, we obtain 6 potential Chinese noun phrases or compound words for group 1 in module 164, as follows:
"knowledge", "property right", "department", "intellectual property", "property right department", and "intellectual property department".
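The grouping and candidate generation steps can be sketched as follows. This is a minimal illustration, assuming the tagged input is a list of (word, tag) pairs; the function names and the three sample tokens are illustrative choices and are not taken from FIG. 5.

```python
from itertools import groupby


def group_by_pos(tagged):
    """Group adjacent (word, tag) pairs that share the same POS tag."""
    return [
        (tag, [word for word, _ in run])
        for tag, run in groupby(tagged, key=lambda pair: pair[1])
    ]


def ngram_candidates(words):
    """Join every run of 1..len(words) adjacent words, shortest runs first."""
    n = len(words)
    return [
        "".join(words[i:i + size])
        for size in range(1, n + 1)
        for i in range(n - size + 1)
    ]


tagged = [("知识", "NN"), ("产权", "NN"), ("署", "NN"), ("保护", "VV")]
for tag, words in group_by_pos(tagged):
    if len(words) == 1:
        continue  # a group of one word needs no collocation
    print(tag, ngram_candidates(words))
```

A group of k words yields k(k+1)/2 candidates, so a three-word group gives the six candidates counted in the example above.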
The Chinese phrase collocation module 122 can search a reference corpus 123, which contains a generally accepted Chinese dictionary, for each potential phrase or compound word 145, where the reference corpus 123 can be constructed by extracting frequently co-occurring characters or words from a structured knowledge network. In one embodiment, the structured knowledge network may be an online encyclopedia of public knowledge, or any other suitable online database. Each article in the encyclopedia is devoted to one topic. The Chinese phrase collocation module 122 may search for the potential phrases among the article titles of the encyclopedia to determine which co-occurring characters or words are commonly used adjacently. Preferably, computer-implemented mathematical methods may be used to determine the probability of occurrence of each identified phrase or compound word in the reference corpus 123 in order to determine the most appropriate collocation result. In one embodiment, if a phrase or compound word is also found elsewhere in the segmented text, that phrase or compound word is selected as a suitable result in addition to the potential Chinese phrases or compound words 145 determined by the n-gram model 121. The confirmed Chinese phrase or compound word 146 can be stored with its POS tag and replaces the respective co-occurring Chinese characters or words.
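One way to confirm candidates against the reference corpus and replace the co-occurring words is a greedy longest-match pass, sketched below. The greedy rule is an assumption chosen for brevity; the embodiments mention probability-based selection, which this sketch does not implement. The reference set stands in for a dictionary of encyclopedia article titles, and its contents are made up.

```python
def collocate(words, reference):
    """Replace runs of adjacent words by the longest join found in reference.

    words: one co-occurring group, in text order.
    reference: a set of known phrases (hypothetical stand-in for corpus 123).
    """
    result, i = [], 0
    while i < len(words):
        # Try the longest candidate starting at position i first.
        for j in range(len(words), i, -1):
            candidate = "".join(words[i:j])
            # A single word is always kept, so the loop always advances.
            if j - i == 1 or candidate in reference:
                result.append(candidate)
                i = j
                break
    return result


reference_titles = {"知识产权", "知识产权署"}
print(collocate(["知识", "产权", "署"], reference_titles))
```

Words that form no confirmed phrase pass through unchanged, so the output can be fed to the noun selection step without special handling.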
The Chinese noun selector 115 may extract the Chinese noun phrases, characters or words 148 from the POS-tagged phrases, characters or words 147. The source corpus 101 is written primarily in Chinese and sometimes also contains numeric characters, punctuation, English, and characters of other languages, with no explicit separation between words. The Chinese characters include both traditional and simplified Chinese characters. There are many ways to extract Chinese nouns. One approach is to filter out all characters or punctuation marks that are not encoded according to the national standard (GB), the BIG5 standard, or the CJK standard.
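A minimal sketch of such a selector is shown below. It assumes Unicode input and approximates the encoding check with the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which covers common traditional and simplified characters but ignores the extension blocks. The noun tag set (NN, NR, NT) follows Chinese Treebank conventions, and the function name is hypothetical.

```python
import re

# Basic CJK Unified Ideographs block only (an assumption; extensions ignored).
HAN_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def select_chinese_nouns(tagged, noun_tags=("NN", "NR", "NT")):
    """Keep words that are tagged as nouns and consist only of Han characters."""
    return [
        word
        for word, tag in tagged
        if tag in noun_tags and HAN_ONLY.fullmatch(word)
    ]


tagged = [
    ("知识产权署", "NN"),
    ("2015", "CD"),
    ("WTO", "NR"),
    ("，", "PU"),
    ("保护", "VV"),
]
print(select_chinese_nouns(tagged))
```

The proper noun written in Latin letters is dropped even though it carries a noun tag, which matches the stated aim of filtering out English and other non-Chinese characters.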
In an alternative embodiment, when the source corpus 101 is large, the knowledge extraction system 102 may be implemented according to the flow of FIG. 2, which includes the steps of: obtaining character strings from the source corpus; separating the character strings; applying POS tags; extracting Chinese noun characters or words; collocating the individual Chinese noun characters or words into noun phrases; calculating the term frequency; and storing the result. A large source corpus 101 has more characters or words, which produce significantly more iterations in the n-gram model 121 and more searches in the reference corpus 123. By placing the Chinese noun selector 115 before the Chinese phrase collocation unit 114, the number of iterations can be reduced, which reduces the time required for phrase collocation. In this embodiment, the Chinese noun characters or words are filtered by checking the encoding standard and are grouped into co-occurring Chinese character or word groups 144 at the same time. The n-gram model 121 can then be applied to these groups to determine potential Chinese phrases or compound words 145.
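The saving can be illustrated by counting candidates. A run of k adjacent same-tag words yields k(k+1)/2 n-gram candidates, so restricting collocation to noun runs removes the candidates that non-noun runs would otherwise contribute. The sketch below assumes that the FIG. 1 flow groups runs of any shared tag, as the description of the grouping system suggests; the sample sentence, its tags, and the function name are illustrative.

```python
from itertools import groupby


def candidate_count(tagged, keep=None):
    """Count n-gram candidates over runs of adjacent same-tag words.

    keep: optional set of tags; runs with other tags are skipped, which
    models selecting nouns while grouping. Runs of one word are skipped
    because they need no collocation.
    """
    total = 0
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        k = len(list(run))
        if keep is not None and tag not in keep:
            continue
        if k > 1:
            total += k * (k + 1) // 2
    return total


tagged = [
    ("知识", "NN"), ("产权", "NN"), ("署", "NN"),
    ("积极", "AD"), ("努力", "AD"),
    ("保护", "VV"), ("推广", "VV"),
    ("创新", "NN"),
]
print(candidate_count(tagged))               # all runs, FIG. 1 order
print(candidate_count(tagged, keep={"NN"}))  # noun runs only, FIG. 2 order
```

Skipping runs rather than deleting words keeps the original adjacency, so two nouns separated by a verb are never merged into one group.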
The term frequency weighting counter 116 may derive term frequency weighting vectors 169 from the Chinese noun phrases, characters or words 148 and store the results in an index for Chinese ontology generation 103. The term frequency (TF) weighting of the extracted Chinese nouns is calculated as follows:
wherein the term frequency weight is greater than or equal to 0 and less than or equal to 1.
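The sketch below assumes a relative-frequency form of the weighting, namely the number of occurrences of a noun divided by the total number of extracted noun occurrences for the concept. This is only one weighting that satisfies the stated bound of 0 to 1, and the exact equation used by the counter 116 may differ; the function name is hypothetical.

```python
from collections import Counter


def tf_weights(nouns):
    """Map each extracted noun to count / total count (assumed TF form)."""
    counts = Counter(nouns)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


extracted = ["知识产权署", "知识产权", "知识产权署", "专利"]
for term, weight in tf_weights(extracted).items():
    print(term, weight)
```

Under this assumed form the weights of one concept sum to 1, so a weight near 1 can only occur when a single noun dominates the extraction result.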
If the term frequency weight of an extracted Chinese noun is close to 1, the noun occurs frequently and is more representative knowledge of the source corpus 101. Conversely, if the term frequency weight of an extracted Chinese noun is close to 0, the noun occurs rarely and is less representative knowledge. Since nouns carry the most representative knowledge, the term frequency weighting vector helps to produce quantitative knowledge for subsequent Chinese ontology generation by identifying the most important noun phrases, characters or words.
The term frequency weighting results and the corresponding Chinese noun phrases, characters or words 133 are mapped in a web ontology language. The main ontology library may be encoded in a formal language such as OWL, RDF, or RDFS. Other ontology languages may also be used. In one embodiment, the Chinese noun phrases, characters or words and their term frequency weights are recorded in RDF triples. A visualization interface or user interface may further be used to display a table containing the RDF data. Other database implementations for storing the results may also be used without departing from the invention. The RDF-formatted index 134 provides the knowledge extraction results for Chinese ontology generation 103.
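As a rough illustration of recording the results in RDF triples, the sketch below serializes each term and its weight in N-Triples syntax using plain string formatting. The vocabulary IRI `http://example.org/vocab#`, the property names, the blank-node layout, and the sample concept IRI are all hypothetical and are not part of the embodiments.

```python
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(concept_iri, weights, vocab="http://example.org/vocab#"):
    """Emit three triples per term: concept link, label, and TF weight."""
    lines = []
    for i, (term, weight) in enumerate(sorted(weights.items())):
        node = f"_:t{i}"  # one blank node per (term, weight) entry
        lines.append(f"<{concept_iri}> <{vocab}hasTerm> {node} .")
        lines.append(f'{node} <{vocab}label> "{escape_literal(term)}"@zh .')
        lines.append(f'{node} <{vocab}weight> "{weight}"^^<{XSD_DOUBLE}> .')
    return "\n".join(lines)


print(to_ntriples("http://example.org/concept/ipd", {"知识产权署": 0.75}))
```

Attaching the weight to an intermediate node keeps the weight specific to one concept, so the same noun can carry a different weight under another concept.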
The invention has been described above with particular reference to exemplary embodiments and examples, but it will be understood that variations and modifications can be effected within the spirit and scope of the claims. The above embodiments illustrate the possible scope of the invention, but do not limit the scope of the invention.