CN108319583B - Method and system for extracting knowledge from Chinese language material library - Google Patents

Publication number
CN108319583B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810016373.6A
Other languages
Chinese (zh)
Other versions
CN108319583A (en
Inventor
李应樵
张英辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Marvel Digital Ai Ltd
Original Assignee
Marvel Digital Ai Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Marvel Digital Ai Ltd
Publication of CN108319583A
Application granted
Publication of CN108319583B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/221 Parsing markup language streams
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

A method, system and computer-readable medium are disclosed for extracting knowledge (103) from a source corpus (101) written primarily in Chinese, for use in generating a Chinese ontology library. The method comprises the following steps: obtaining character strings (141) from source corpora (101), wherein each source corpus (101) represents a concept; segmenting the character string (141) into segmented characters or words (142); applying part-of-speech (POS) tags (113) to the segmented characters or words (142); collocating individual Chinese characters or words from the segmented characters or words into meaningful phrases or compound words; extracting Chinese noun phrases, characters or words (148) from the segmented phrases, characters or words (142); calculating the word frequency of the extraction results; and storing the extraction results and the word frequency weighting vector (149) of the concept for further generation of the Chinese ontology library.

Description

Method and system for extracting knowledge from Chinese language material library
Technical Field
The invention relates to the field of word segmentation, and in particular to a method and a system for extracting knowledge from a source corpus written mainly in Chinese, for generating a Chinese ontology library through automatic word segmentation, part-of-speech (POS) tagging, Chinese noun phrase collocation and word frequency calculation.
Background
In the information technology age, large amounts of data are uploaded to and downloaded from networks, enterprise computer networks, and other databases every day. Data users want to find the information they need in these sources, but the information returned is often not what they are looking for. An ontology library is a representation of specific similarities and associations between different concepts, where each concept has its own unique semantic information, and it can be used to improve the accuracy and relevance of search results.
An ontology library may be generated from knowledge in different languages. Regardless of the language used, a corpus in that language must be processed and key phrases extracted for ontology generation. Some languages, such as Chinese, have no obvious separators between words, so language processing may be more difficult or complex than for English, which makes knowledge extraction difficult. It is therefore hard to find an efficient segmentation method that segments a Chinese text corpus into meaningful phrases.
Traditionally, text segmentation of Chinese text corpora is achieved with Conditional Random Fields (CRFs) or Hidden Markov Models (HMMs). Both are statistical modeling methods based on pattern recognition and prediction. However, the basic unit of these segmentation methods is a character or word rather than a phrase, so the Chinese phrases in every character string are split into characters or words before semantic similarity is derived. The prior-art algorithms therefore unnecessarily increase the total number of items to be recognized and yield fewer meaningful results for the further generation of the Chinese ontology library. For example, rather than being extracted as a whole phrase, a Chinese phrase such as "financial crisis" is segmented into "finance" and "crisis", so that the most relevant information or knowledge may be lost through the segmentation.
US20090313243 A1 discloses a method that calculates relevance scores for phrases in the semantic data sources of a domain and calculates weights for the semantic data sources based on those relevance scores. The relevance score is calculated from the frequency of a phrase in the domain corpus and the expected frequency of the phrase. This approach shares some features with the present invention, but it is inefficient and ineffective at processing Chinese phrases, which have no explicit separators or spaces between words.
CN 1011699780A discloses a search system based on a semantic ontology library. Its text index processing unit is a conventional processing unit that establishes a text index by analyzing text content and extracting keywords and file identification information. Semantic searching in this publication focuses on the relationships and attributes of keywords, without recognizing the importance of word segmentation, tagging, and word frequency weighting for identifying relevant information.
US 7680648 B2 discloses a method and system for improving text segmentation. A series of characters can be segmented into combinations of segmented strings, and the disclosed method uses frequency of occurrence to identify and select the best operable segmentation result among them. The method segments search queries without explicit separators well, but it has no concept of collocation or noun phrase identification, and its benefit for processing Chinese sentences is not evident.
Thus, there is a need for a more efficient and accurate method and system, preferably one that can be implemented automatically by a computer, for extracting knowledge from a Chinese corpus to better support Chinese ontology library generation.
Disclosure of Invention
Since Chinese is written continuously without explicit separators or spaces between words, it is difficult for an automated computer system to perform text segmentation and extract relevant information for Chinese ontology library generation. The accuracy of knowledge extraction always depends on how a sentence is segmented and which word tokens are chosen for extraction. In a Chinese corpus, phrases and compound words containing two or more characters are often used to express a particular meaning that differs from the individual meaning of each character or word. This leads to complexity and ambiguity in the segmentation process. Traditional word segmentation methods can identify most characters or words in a corpus; for example, words such as "knowledge" and "property" can be identified as words rather than being split into individual characters such as "known", "produced" and "rights". However, the combination of these two words, "intellectual property", is difficult to identify. The invention aims to solve this problem and provides a method for extracting meaningful information from a corpus.
Embodiments of the present invention include methods and systems for improving Chinese word segmentation. They include a collocation module that uses a Chinese dictionary as a reference corpus to identify and collocate frequently co-occurring words or phrases. The reference corpus may be built automatically by extracting article titles from a structured knowledge network, that is, a database of structured information stored on a network. For example, several Chinese online encyclopedias, such as Baidu Baike and Chinese Wikipedia, hold common basic knowledge in millions of articles. They contain a large number of common phrases and compound words and can provide the resources needed to improve word segmentation.
Described below are a method, a system, and a computer-readable medium encoded with instructions that, when executed by a processor, perform the method, for automatic word segmentation and POS tagging of a Chinese text corpus for Chinese ontology generation. The method comprises the following steps: obtaining character strings from source corpora, wherein each source corpus represents a concept; separating the character string into segmented characters or words; applying POS tags to the segmented characters or words; collocating individual Chinese characters or words from the segmented characters or words into meaningful phrases or compound words; extracting Chinese noun phrases, characters or words from the segmented phrases, characters or words; calculating the word frequency of the extraction result; and storing the extraction result and the word frequency weighting vector of the corresponding concept for further generation of the Chinese ontology library.
Preferably, the step of obtaining the character string from the source corpus comprises receiving topics, titles, and main textual content from source corpora, where each source corpus represents a concept. Titles and topics are very useful for determining the name of a concept, while the main text provides a description of the concept.
Preferably, the source corpus is written primarily in Chinese, sometimes also containing numeric characters, punctuation, English, and characters of other languages, without obvious separation between words. The source corpus includes electronic documents on networks and other systems such as the Internet, a WAN, a LAN, a private network, or a single computer.
Preferably, separating the character string into segmented characters or words comprises the step of determining the segmentation result by word segmentation, wherein the segmentation result may be a single character or a word.
Further, separating the character string into segmented characters or words comprises the step of applying one or more word segmentation models, wherein the word segmentation models are Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs).
Preferably, applying POS tags to the segmented characters or words comprises the steps of: extracting POS information related to the segmented characters or words; and assigning POS tags to the segmented characters or words.
Preferably, the POS information related to the segmented characters or words is extracted from a mature POS tagging model for Chinese, wherein the mature POS tagging model for Chinese is the "Chinese Treebank (CTB)" obtained from the Linguistic Data Consortium.
Further, POS tags are assigned to the segmented characters or words by mapping the POS features to the segmented characters or words in a vector space, wherein the mapping can be done by building an index or table.
Preferably, collocating individual Chinese characters or words into meaningful phrases or compound words comprises the steps of: grouping co-occurring Chinese characters or words; finding potential Chinese phrases or compound words in the groups of Chinese characters or words; searching for the potential Chinese phrases or compound words in a reference corpus; storing the confirmed Chinese phrases or compound words with their POS tags; and removing the corresponding co-occurring Chinese characters or words.
Preferably, co-occurring Chinese noun characters or words are grouped by identifying a series of two or more Chinese characters or words tagged as nouns and treating them as a group.
Preferably, potential Chinese phrases or compound words are found in the groups of Chinese characters or words by using an n-gram model, wherein the n-gram model determines a co-occurrence probability distribution for each potential combination of Chinese characters or words.
Preferably, the reference corpus is a commonly used Chinese dictionary that can be constructed by extracting frequently co-occurring words from a structured knowledge network, wherein the structured knowledge network is a Chinese online encyclopedia.
Preferably, the structured knowledge network draws on public knowledge, with article titles extracted from an encyclopedia, Chinese Wikipedia, or any other suitable online database.
Further, extracting Chinese noun phrases, characters or words includes the step of filtering out all numeric characters, punctuation, English and characters of other languages.
Further, the word frequency of the extraction result is estimated by the following equation:
Figure GDA0003060654280000041
wherein the word frequency weight is greater than or equal to 0 and less than or equal to 1.
Preferably, storing the extraction results and the word frequency weighting vector of the corresponding concept for further generation of the Chinese ontology library comprises the steps of: mapping the Chinese noun phrases, characters or words to their corresponding word frequency weighting results in a web ontology language; and constructing an index of the word frequency weighting vector of the concept for further generation of the Chinese ontology library.
Preferably, the web ontology language is RDF.
When the source corpus is large in scale, an alternative method of extracting knowledge from the source corpus comprises the steps of: obtaining character strings from source corpora, wherein each source corpus represents a concept; separating the character string into segmented characters or words; applying POS tags to the segmented characters or words; extracting Chinese noun characters or words from the segmented characters or words; collocating the individual extracted Chinese noun characters or words into meaningful phrases or compound words; calculating the word frequency of the extraction result; and storing the extraction result and the word frequency weighting vector of the corresponding concept for further generation of the Chinese ontology library.
Drawings
FIG. 1 is a flow diagram illustrating the system and data content of a knowledge extraction system;
FIG. 2 is a flow diagram illustrating an alternative embodiment of the knowledge extraction system for a large-scale source corpus;
FIG. 3 is a flow chart illustrating a word frequency weighting counter;
FIG. 4 is a flow diagram illustrating the data content of the knowledge extraction system, showing, in one embodiment, the steps for converting a character string into an index having a word frequency weighting vector;
FIG. 5 is a flow chart illustrating the data content of a Chinese phrase collocation unit, showing, in one embodiment, the steps for determining a Chinese noun phrase or compound word from characters or words.
Detailed Description
The present invention will now be described in detail with reference to exemplary embodiments thereof as illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout.
All figures and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily made without departing from the principles claimed herein, and should not be considered as limited to the embodiments described herein.
Embodiments of the systems, methods, and computer-readable media disclosed herein provide knowledge extraction from a Chinese text corpus for Chinese ontology generation. The method comprises the following steps: obtaining character strings from source corpora, wherein each source corpus represents a concept; separating the character string into segmented characters or words; applying POS tags to the segmented characters or words; collocating individual Chinese characters or words from the segmented characters or words into meaningful phrases or compound words; extracting Chinese noun phrases, characters or words from the segmented phrases, characters or words; calculating the word frequency of the extraction result; and storing the extraction result and the word frequency weighting vector of the corresponding concept for further generation of the Chinese ontology library.
Referring now to the drawings, FIG. 1 is a flow diagram illustrating a knowledge extraction system 102 for Chinese ontology generation that extracts knowledge 103 from a source corpus 101. The system includes an acquisition module 111, a word segmenter 112, a mature POS tagger 113, a Chinese phrase collocation unit 114, a Chinese noun selector 115, and a word frequency weighting counter 116. In one embodiment, the knowledge extraction system 102 may instead be implemented according to the alternative flowchart shown in FIG. 2, which includes an acquisition module 111, a word segmenter 112, a mature POS tagger 113, a Chinese noun selector 115, a Chinese phrase collocation unit 114, and a word frequency weighting counter 116. FIG. 3 is a flow chart illustrating the structure of the word frequency weighting counter 116, which generates an index 149 with word frequency weighting vectors from Chinese noun phrases, characters or words 148 as the knowledge 103 for Chinese ontology generation. FIG. 4 is a flow chart illustrating, in one embodiment, how knowledge is extracted from a source corpus using word frequency weighting vectors. FIG. 5 is a flow chart illustrating, in one embodiment, how the Chinese phrase collocation unit 114 determines Chinese phrases or compound words from POS-tagged characters or words 143.
The method for extracting knowledge from the source corpus 101 disclosed in the present application can be implemented according to the flowchart shown in FIG. 1, which includes the steps of: acquiring character strings from the source corpus; separating the character strings; applying POS tags; collocating individual Chinese characters or words; extracting Chinese noun phrases, characters or words; calculating the word frequency; and storing the result.
In generating a Chinese ontology, a text processing method or system is needed to extract useful information from source corpora 101, where each source corpus 101 represents a concept. A concept is an abstract idea. Since nouns carry the most representative knowledge, one can understand a concept by extracting and browsing all relevant nouns in a text corpus describing the concept, thereby learning of the events, people, things, places, times, features, and characteristics that are relevant to it. All of this information may be referred to as knowledge of the concept. The data user may gain further understanding of the source corpus 101 by identifying the important noun terms through their corresponding word frequency weighting vectors.
The source corpus 101 may be an electronic document, such as an HTML page, a Portable Document Format (PDF) file, or other computer-readable medium from the Internet, a WAN, a LAN, a private network, a single computer, or another transmitting device or channel. The electronic document is written primarily in Chinese and sometimes also contains numeric characters, punctuation, English and characters of other languages, without obvious separation between words. The knowledge extraction system 102 is the core system of the present invention and can perform textual content analysis to determine the most important knowledge in the source corpus for Chinese ontology generation 103.
The acquisition module 111 acquires character strings 141 from the source corpus 101, wherein the character strings 141 may be taken from topics, headings, body text, footers, and other textual content in computer-readable media. Examples of characters include Chinese, English, or other language characters, as well as CJK symbols, emoticons, and characters from Unicode, ASCII, or other character sets. In one embodiment, where the source corpus 101 is written primarily in Chinese or another Asian language with no obvious separation or spacing between words, the acquisition module 111 may acquire all characters from the source corpus 101 as input data from which meaningful knowledge is further extracted.
The word segmenter 112 segments the character string 141 into separate characters or words 142 by word segmentation, wherein each separate character or word 142 may be a single Chinese character or a combination of Chinese characters (a Chinese word). Word segmentation is a common operation for determining the boundaries of combined words, since the resulting characters may take on different meanings when put together. In one embodiment, word segmentation may be performed by applying one or more word segmentation models, wherein the word segmentation models are Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs). FIG. 4 demonstrates the operation of the word segmenter 112 in this embodiment: the title and character string 161 from the source corpus 101 are separated by slashes (/) into entities 162 of characters or words.
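The slash-separated output described for FIG. 4 can be illustrated with a minimal sketch. It does not implement the HMM or CRF models named above; as a simpler stand-in it uses dictionary-based forward maximum matching. The function name `forward_max_match`, the toy lexicon, and the sample string are assumptions made for illustration only.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon entry.

    Stand-in for the HMM/CRF segmenters; unknown characters become
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


# Hypothetical lexicon and input, not taken from the patent figures.
lexicon = {"金融", "危机", "影响", "经济"}
tokens = forward_max_match("金融危机影响经济", lexicon)
print("/".join(tokens))  # 金融/危机/影响/经济
```

Note that this toy segmenter splits "金融危机" (financial crisis) into two words, which is the behaviour the collocation step is meant to repair.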
The mature POS tagger 113 applies POS tags to the segmented characters or words 142 to determine the part of speech of each. This module extracts POS information from a mature POS tagging model for Chinese. In one embodiment, the mature POS tagging model for Chinese is available from the Linguistic Data Consortium at the following HTTP link: https://catalog.ldc.upenn.edu/LDC2004T05. Embodiments of the mature POS tagger include, but are not limited to, mapping the POS tags together with the corresponding segmented characters or words 142 in a vector space, where the mapping can be done by building an index, table, database, array, or any other computer-readable indexing medium. In the embodiment of FIG. 4, step 163 demonstrates determining the POS tags and storing them with the characters or words.
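The table-based mapping of tags to tokens can be sketched as a plain dictionary lookup. This is only an illustration of the index or table idea: the helper `tag_tokens`, the tiny hand-written table, and the `UNK` fallback tag are hypothetical, and a real system would obtain tags from a trained tagger rather than a fixed table. `NN` and `VV` follow the Chinese Treebank convention for common nouns and verbs.

```python
def tag_tokens(tokens, pos_table, unknown="UNK"):
    """Pair each token with its POS tag from a lookup table.

    Tokens missing from the table receive the placeholder tag `unknown`.
    """
    return [(token, pos_table.get(token, unknown)) for token in tokens]


# Hypothetical table; a trained tagger would supply these tags in practice.
pos_table = {"金融": "NN", "危机": "NN", "影响": "VV", "经济": "NN"}
tagged = tag_tokens(["金融", "危机", "影响", "经济"], pos_table)
print(tagged)
```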
In the prior art, it is difficult for an automated computer system to extract the relevant information for Chinese ontology library generation. Traditional word segmentation methods, including HMMs, CRFs, and word lattices, can identify only a large proportion of the characters or words in a corpus, and because of word segmentation ambiguities they cannot effectively identify meaningful Chinese phrases or compound words. Advantageously, embodiments of the present invention implement a Chinese phrase collocation unit 114 that can identify potential Chinese phrases 145 from POS-tagged characters or words 143 and by searching for commonly co-occurring characters or words in the reference corpus 123. FIG. 5 illustrates the internal modules of the Chinese phrase collocation unit 114. From the POS-tagged characters or words 143, the grouping system can identify a series of two or more Chinese characters or words having the same POS tag and store the result as a co-occurring Chinese character or word group 144. The grouping system searches through the string of POS-tagged characters or words and determines the boundaries of the groups by storing adjacent characters or words together as a group if they have the same POS tag. Preferably, the grouping system includes an input counter to determine the number of characters or words in each group. If the input counter for a group is 1, the group does not need any collocation, and the remaining collocation steps can be skipped. The input counter can also provide the number of iterations required by the n-gram model 121. The n-gram model 121 is an exhaustive iterative method for identifying all potential Chinese phrases or compound words 145 in the co-occurring Chinese character or word group 144: based on the input counter for each group, it iterates over and joins together every run of "n" or fewer adjacent characters or words. As with the n-gram model presented in FIG. 5, group 1 in module 164 has three characters or words:
intellectual property, property right and arrangement
By applying an n-gram model, we have 6 potential Chinese noun phrases or compound words for group 1 in module 164, as follows:
intellectual property, department, intellectual property, department of property, intellectual property arrangement.
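The grouping and exhaustive n-gram enumeration described above can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, and the three Chinese tokens are chosen only to mirror a three-word group, not copied from FIG. 5.

```python
from itertools import groupby


def group_same_pos(tagged):
    """Collect runs of adjacent tokens that share the same POS tag."""
    groups = []
    for pos, run in groupby(tagged, key=lambda pair: pair[1]):
        groups.append((pos, [token for token, _ in run]))
    return groups


def ngram_candidates(tokens):
    """Join every run of n adjacent tokens, for n = 1 .. len(tokens)."""
    size = len(tokens)  # plays the role of the "input counter"
    candidates = []
    for n in range(1, size + 1):
        for start in range(size - n + 1):
            candidates.append("".join(tokens[start:start + n]))
    return candidates


# Hypothetical tagged input: three adjacent nouns followed by a verb.
tagged = [("知识", "NN"), ("产权", "NN"), ("署", "NN"), ("成立", "VV")]
groups = group_same_pos(tagged)
noun_group = groups[0][1]
candidates = ngram_candidates(noun_group)
print(len(candidates), candidates)
```

A group of k tokens yields k(k+1)/2 candidates, so a three-token group gives 3 + 2 + 1 = 6, matching the count stated above, while a group of size 1 yields only the token itself and needs no collocation.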
The Chinese phrase collocation module 122 can search a reference corpus 123 containing a universally accepted Chinese dictionary for each potential phrase or compound word 145, where the reference corpus 123 can be constructed by extracting frequently co-occurring characters or words from a structured knowledge network. In one embodiment, the structured knowledge network may be an online encyclopedia of public knowledge, such as Baidu Baike, Chinese Wikipedia, or any other suitable online database. Each article in the encyclopedia has a topic. The Chinese phrase collocation module 122 may search for potential phrases among the article titles in the encyclopedia to determine which co-occurring characters or words are commonly used adjacently. Preferably, computer-implemented mathematical methods may be used to determine the probability of occurrence of each determined phrase or compound word in the reference corpus 123, in order to determine the most appropriate collocation result. In one embodiment, if a phrase or compound word is also found elsewhere in the segmented text, it is selected as a suitable result in addition to the potential Chinese phrases or compound words 145 determined by the n-gram model 121. The confirmed Chinese phrase or compound word 146 can be stored with its POS tag and replaces the respective co-occurring Chinese characters or words.
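The lookup-and-replace behaviour of the collocation step can be sketched as below. The sketch makes two assumptions that the text does not state: the reference corpus is modelled as a plain set of article titles, and the "most appropriate" result is chosen by longest match rather than by an occurrence probability. The function name `collocate` and the sample data are hypothetical.

```python
def collocate(tokens, reference_titles):
    """Merge adjacent tokens whose concatenation is a known title.

    Greedy longest-match: at each position, try the longest span first
    and fall back to the single token when nothing is confirmed.
    """
    result = []
    i = 0
    while i < len(tokens):
        match_len = 1
        for n in range(len(tokens) - i, 1, -1):
            if "".join(tokens[i:i + n]) in reference_titles:
                match_len = n
                break
        # The confirmed phrase replaces its constituent tokens.
        result.append("".join(tokens[i:i + match_len]))
        i += match_len
    return result


# Hypothetical reference corpus built from encyclopedia article titles.
reference_titles = {"知识产权", "金融危机"}
merged = collocate(["知识", "产权", "署"], reference_titles)
print(merged)  # ['知识产权', '署']
```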
The Chinese noun selector 115 may extract Chinese noun phrases, characters or words 148 from the POS-tagged phrases, characters or words 147. The source corpus 101 is written primarily in Chinese and sometimes contains numeric characters, punctuation, English and characters of other languages, with no obvious separation between words. The Chinese characters include both traditional and simplified Chinese characters. There are many ways to extract Chinese nouns. One approach is to filter out all characters or punctuation marks that are not encoded according to the national standard (GB), the BIG5 standard or the CJK standard.
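One way to picture the noun selector is a filter that keeps only noun-tagged tokens made up entirely of Chinese characters. The sketch below is an assumption-laden illustration: instead of checking GB or BIG5 encodings it tests the Unicode CJK Unified Ideographs range (U+4E00 to U+9FFF), and it treats the Chinese Treebank tags `NN`, `NR` and `NT` as the noun tags. The name `select_chinese_nouns` is hypothetical.

```python
import re

# Tokens consisting only of CJK Unified Ideographs (assumed stand-in for
# the GB / BIG5 / CJK encoding check described in the text).
CJK_ONLY = re.compile(r"[\u4e00-\u9fff]+")

# Chinese Treebank noun tags: common, proper, and temporal nouns.
NOUN_TAGS = {"NN", "NR", "NT"}


def select_chinese_nouns(tagged):
    """Keep noun-tagged tokens that contain only Chinese characters."""
    return [
        token
        for token, pos in tagged
        if pos in NOUN_TAGS and CJK_ONLY.fullmatch(token)
    ]


# Hypothetical tagged input mixing Chinese, digits, English and punctuation.
tagged = [("金融危机", "NN"), ("2008", "CD"), ("AI", "NN"), ("，", "PU"), ("发生", "VV")]
nouns = select_chinese_nouns(tagged)
print(nouns)  # ['金融危机']
```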
In an alternative embodiment, when the source corpus 101 is large in scale, the knowledge extraction system 102 may be implemented according to the flow of FIG. 2, which includes the steps of: acquiring character strings from the source corpus; separating the character strings; applying POS tags; extracting Chinese noun characters or words; collocating the individual Chinese noun characters or words into noun phrases; calculating the word frequency; and storing the result. A large-scale source corpus 101 has more characters or words, which produce significantly more iterations in the n-gram model 121 and more searches in the reference corpus 123. By placing the Chinese noun selector 115 before the Chinese phrase collocation unit 114, the number of iterations can be reduced, reducing the time required for phrase collocation. In this embodiment, the Chinese noun characters or words are filtered by checking the encoding standard and are grouped into co-occurring Chinese character or word groups 144 at the same time. Such groups may then be passed to the n-gram model 121 to determine potential Chinese phrases or compound words 145.
The word frequency weighting counter 116 may derive word frequency weighting vectors 169 from the Chinese noun phrases, characters or words 148 and store the results in an index for Chinese ontology generation 103. The word frequency (TF) weighting of the extracted Chinese nouns is calculated as follows:
Figure GDA0003060654280000101
wherein the word frequency weight is more than or equal to 0 and less than or equal to 1.
If the word frequency weight of an extracted Chinese noun is close to 1, the noun occurs frequently and represents more representative knowledge associated with the source corpus 101. Conversely, if the word frequency weight is close to 0, the noun occurs rarely and represents less representative knowledge. Since nouns carry the most representative knowledge, the word frequency weighting vector helps to generate quantitative knowledge for the subsequent Chinese ontology by identifying the most important noun phrases, characters or words.
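A weight in the range 0 to 1 can be illustrated with a short sketch. The equation itself appears only as an image that is not reproduced in this text, so the normalization used here, each noun's count divided by the total number of extracted nouns, is an assumption that merely satisfies the stated range; the patented formula may differ. The name `tf_weights` and the sample nouns are hypothetical.

```python
from collections import Counter


def tf_weights(nouns):
    """Relative frequency of each noun (assumed normalization).

    Every weight lies in [0, 1]; an empty input gives an empty vector.
    """
    counts = Counter(nouns)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


# Hypothetical extraction result for one concept.
weights = tf_weights(["金融危机", "银行", "金融危机", "市场"])
print(weights)  # {'金融危机': 0.5, '银行': 0.25, '市场': 0.25}
```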
The word frequency weighting results and the corresponding Chinese noun phrases and words 133 are mapped in a web ontology language. The main ontology library may be encoded in a formal language such as OWL, RDF or RDFS. Other ontology languages may also be used. In one embodiment, the Chinese noun phrases, characters or words and their word frequency weights are recorded in RDF triples. A further visualization interface or user interface may be used to display a table containing the RDF data. Other database implementations for storing the results may also be used without departing from the invention. The RDF-formatted index 134 provides the knowledge extraction results for Chinese ontology generation 103.
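Recording terms and weights as RDF triples can be sketched by serializing them in N-Triples syntax with the standard library only. The vocabulary is entirely assumed: the `http://example.org/ontology/` namespace, the `tfWeight` predicate, and the choice of one subject IRI per concept and term are hypothetical and not specified by the patent.

```python
from urllib.parse import quote

XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def to_ntriples(concept, weights, base="http://example.org/ontology/"):
    """Serialize (term, weight) pairs for one concept as N-Triples lines.

    Subject: an IRI built from the concept and term (percent-encoded).
    Predicate: a hypothetical `tfWeight` property.
    Object: the weight as an xsd:double literal.
    """
    predicate = f"<{base}tfWeight>"
    lines = []
    for term, weight in weights.items():
        subject = f"<{base}{quote(concept)}/{quote(term)}>"
        literal = f'"{weight}"^^<{XSD_DOUBLE}>'
        lines.append(f"{subject} {predicate} {literal} .")
    return lines


# Hypothetical concept name and weights.
triples = to_ntriples("经济", {"金融危机": 0.5, "银行": 0.25})
for line in triples:
    print(line)
```

Percent-encoding the Chinese text keeps the IRIs ASCII-only; a library such as rdflib could replace the manual string formatting in a fuller implementation.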
The invention has been described above with particular reference to exemplary embodiments and examples, but it will be understood that variations and modifications can be effected within the spirit and scope of the claims. The above embodiments illustrate the possible scope of the invention, but do not limit the scope of the invention.

Claims (41)

1. A method of extracting knowledge from a source corpus written in Chinese and/or non-Chinese for use in Chinese ontology generation, the method comprising the steps of:
obtaining character strings from source corpora, wherein each source corpus represents a concept;
separating the character string into segmented characters or words;
applying POS tags to the segmented characters or words;
collocating individual Chinese characters or words from the segmented characters or words into meaningful phrases or compound words;
extracting Chinese noun phrases, characters or words from the segmented phrases, characters or words;
calculating the word frequency of the extraction result; and
storing the extraction result and the word frequency weighting vector of the corresponding concept for further generation of the Chinese ontology library.
2. The method of claim 1, wherein the step of obtaining character strings from the source corpus comprises the step of: obtaining the subject, the title and the main text content from the source corpus.
3. The method of claim 2, wherein the source corpus is written in Chinese and/or non-Chinese, contains numeric characters, punctuation, English and other language characters with no apparent separation between words, and comprises electronic documents on the Internet, a WAN, a LAN, a private network or a single computer.
4. The method of claim 1, wherein the step of separating the character string into segmented words or phrases comprises the step of: confirming the segmentation result by word segmentation, wherein the segmentation result is a word or a series of words in the form of words.
5. The method of claim 4, wherein the word segmentation comprises the step of: applying one or more word segmentation models, wherein the word segmentation models are hidden Markov models and conditional random fields.
6. The method of claim 1, wherein the step of applying POS tags to the segmented words or phrases comprises the steps of:
extracting POS information related to the segmented words or phrases; and
assigning a POS tag to each segmented word or phrase.
7. The method of claim 6, wherein the step of extracting POS information related to the segmented words or phrases comprises extracting the POS information from a mature POS tagging model for Chinese, wherein the mature POS tagging model for Chinese is the "Chinese Treebank" obtained from the Linguistic Data Consortium.
8. The method of claim 6, wherein the step of assigning POS tags to the segmented words or phrases is accomplished by constructing an index or table and mapping the POS features to the segmented words or phrases in a vector space.
9. The method of claim 1, wherein the step of collocating individual Chinese words or characters into meaningful phrases or compound words comprises the steps of:
grouping co-occurring Chinese words or characters;
finding potential Chinese phrases or compound words from the groups of Chinese words or characters;
searching for the potential Chinese phrases or compound words in a reference corpus;
storing the confirmed Chinese noun phrases or compound words with POS tags; and
removing the corresponding co-occurring Chinese noun words or characters.
10. The method of claim 9, wherein the grouping of co-occurring Chinese noun words or characters is performed by identifying a series of two or more Chinese words or characters labeled as a noun group.
11. The method of claim 9, wherein the step of finding potential Chinese noun phrases or compound words from the groups of Chinese noun words or characters is performed by using an n-gram model to identify potential phrases, wherein the n-gram model determines a co-occurrence probability distribution for each potential combination of Chinese words or characters.
12. The method of claim 11, wherein the step of identifying potential phrases by using an n-gram model is performed by searching the results of the n-gram model.
13. The method of claim 9, wherein the reference corpus is a common Chinese dictionary constructed by extracting frequently co-occurring words from a structured knowledge network, wherein the structured knowledge network is a Chinese web-based encyclopedia.
14. The method of claim 13, wherein the structured knowledge network is an online database.
15. The method of claim 1, wherein the step of extracting Chinese noun phrases, words or characters comprises the step of: filtering out all numeric characters, punctuation, English and other language characters.
16. The method of claim 1, wherein the step of calculating the word frequency of the extraction result is performed by the following equation:
Figure FDA0003282437550000031
wherein 0 ≤ word frequency weight ≤ 1.
17. The method of claim 1, wherein the step of storing the extraction result and the word frequency weighting vector of the corresponding concept for generating another Chinese ontology library comprises the steps of:
mapping the Chinese noun phrases, words or characters, together with the corresponding word frequency weighting results, in a web ontology language; and
constructing an index of the word frequency weighting vectors of the concepts for generating another Chinese ontology library.
18. The method of claim 17, wherein the web ontology language is RDF.
19. An alternative method of extracting knowledge from a source corpus written in Chinese and/or non-Chinese for Chinese ontology generation, the method comprising the steps of:
obtaining character strings from source corpora, wherein each source corpus represents a concept;
separating the character strings into segmented words or phrases;
applying POS tags to the segmented words or phrases;
extracting Chinese noun phrases, words or characters from the segmented phrases, words or characters;
collocating individual Chinese words or characters from the segmented words or characters into meaningful phrases or compound words;
calculating the word frequency of the extraction result; and
storing the extraction result and the word frequency weighting vector of the corresponding concept for generating another Chinese ontology library.
20. A system for extracting knowledge from a source corpus for Chinese ontology generation, comprising:
an acquisition module for acquiring character strings from a source corpus;
a word segmenter for segmenting the character strings into segmented words or phrases;
a mature POS tagger for applying POS tags to the segmented words or phrases;
an n-gram model for finding potential Chinese noun phrases or compound words;
a Chinese phrase collocation module for collocating individual words or phrases into meaningful phrases or compound words;
a Chinese noun selector for extracting Chinese noun phrases, words or characters;
a word frequency weighting counter for calculating the word frequency of the extraction result; and
a database for storing the extraction result and the word frequency weighting vector of the corresponding concept in a web ontology language, for generating the Chinese ontology library.
21. The system of claim 20, wherein the source corpus comprises electronic documents on the Internet, a WAN, a LAN, a private network or a single computer.
22. The system of claim 20, further comprising a visualization interface for displaying a table containing the Chinese noun phrases, words or characters and the corresponding word frequency weighting vectors of the concepts.
23. The system of claim 20, wherein the extraction result and the word frequency weighting vector are encoded in RDF.
24. A computer readable medium encoded with instructions that, when executed by a processor, cause the processor to perform a method of extracting knowledge from a source corpus, the method comprising the steps of:
acquiring character strings from the source corpus;
separating the character strings into segmented words or phrases;
applying POS tags to the segmented words or phrases;
collocating individual Chinese words or characters from the segmented words or characters into meaningful phrases or compound words;
extracting Chinese noun phrases, words or characters from the segmented phrases, words or characters;
calculating the word frequency of the extraction result; and
storing the extraction result and the word frequency weighting vector of the corresponding concept for generating another Chinese ontology library.
25. The computer readable medium of claim 24, wherein the step of acquiring character strings from the source corpus comprises the step of: obtaining the subject, the title and the main text content from the source corpus.
26. The computer readable medium of claim 25, wherein the source corpus is written in Chinese and/or non-Chinese, contains numeric characters, punctuation, English and other language characters with no apparent separation between words, and comprises electronic documents on the Internet, a WAN, a LAN, a private network or a single computer.
27. The computer readable medium of claim 24, wherein the step of separating the character string into segmented words or phrases comprises the step of: confirming the segmentation result by word segmentation, wherein the segmentation result is a word or a character.
28. The computer readable medium of claim 27, further comprising the step of applying one or more word segmentation models, wherein the word segmentation models are hidden Markov models and conditional random fields.
29. The computer readable medium of claim 24, wherein the step of applying POS tags to the segmented words or phrases comprises the steps of:
extracting POS information related to the segmented words or phrases; and
assigning a POS tag to each segmented word or phrase.
30. The computer readable medium of claim 29, wherein the step of extracting POS information related to the segmented words or phrases comprises extracting the POS information from a mature POS tagging model for Chinese, wherein the mature POS tagging model for Chinese is the "Chinese Treebank" obtained from the Linguistic Data Consortium.
31. The computer readable medium of claim 29, wherein the step of assigning POS tags to the segmented words or phrases is accomplished by mapping the POS features to the segmented words or phrases in a vector space, wherein the mapping is accomplished by building an index or table.
32. The computer readable medium of claim 24, wherein the step of collocating individual Chinese words or characters into meaningful phrases or compound words comprises the steps of:
grouping co-occurring Chinese words or characters;
finding potential Chinese phrases or compound words from the groups of Chinese words or characters;
searching for the potential Chinese phrases or compound words in a reference corpus;
storing the confirmed Chinese noun phrases or compound words with POS tags; and
removing the corresponding co-occurring Chinese noun words or characters.
33. The computer readable medium of claim 32, wherein the grouping of co-occurring Chinese noun words or characters is performed by identifying a series of two or more Chinese words or characters labeled as a noun group.
34. The computer readable medium of claim 32, wherein the step of finding potential Chinese noun phrases or compound words from the groups of Chinese noun words or characters is performed by using an n-gram model to identify potential phrases, wherein the n-gram model determines a co-occurrence probability distribution for each potential combination of Chinese words or characters.
35. The computer readable medium of claim 32, wherein the reference corpus is a common Chinese dictionary constructed by extracting frequently co-occurring words from a structured knowledge network, wherein the structured knowledge network is a Chinese web-based encyclopedia.
36. The computer readable medium of claim 35, wherein the structured knowledge network is an online database.
37. The computer readable medium of claim 24, wherein the step of extracting Chinese noun phrases, words or characters comprises the step of: filtering out all numeric characters, punctuation, English and other language characters.
38. The computer readable medium of claim 24, wherein the step of calculating the word frequency of the extraction result is performed by the following equation:
Figure FDA0003282437550000061
wherein 0 ≤ word frequency weight ≤ 1.
39. The computer readable medium of claim 24, wherein the step of storing the extraction result and the word frequency weighting vector of the corresponding concept for generating another Chinese ontology library comprises the steps of:
mapping the Chinese noun phrases, words or characters, together with the corresponding word frequency weighting results, in a web ontology language; and
constructing an index of the word frequency weighting vectors of the concepts for generating another Chinese ontology library.
40. The computer readable medium of claim 39, wherein the web ontology language is RDF.
41. A computer readable medium encoded with instructions that, when executed by a processor, cause the processor to perform an alternative method of extracting knowledge from a source corpus, the method comprising the steps of:
acquiring character strings from a source corpus;
separating the character strings into segmented words or phrases;
applying POS tags to the segmented words or phrases;
extracting Chinese noun phrases, words or characters from the segmented phrases, words or characters;
collocating individual Chinese words or characters from the segmented words or characters into meaningful phrases or compound words;
calculating the word frequency of the extraction result; and
storing the extraction result and the word frequency weighting vector of the corresponding concept for generating another Chinese ontology library.
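The collocation steps recited in the claims (grouping co-occurring nouns, finding candidate compounds, and confirming them against a reference corpus) can be sketched as follows. This is a simplified illustration under stated assumptions: it enumerates candidate n-grams and confirms them by dictionary lookup, and it omits the co-occurrence probability distribution that the claimed n-gram model determines. The function `collocate_nouns`, the use of the tag "NN" for nouns, and the sample data are hypothetical.

```python
def collocate_nouns(tagged, reference, max_n=3):
    """Merge runs of adjacent nouns into compounds found in a reference set.

    tagged:    list of (token, pos_tag) pairs from segmentation and tagging
    reference: set of known compounds (stand-in for the reference corpus)
    max_n:     longest candidate n-gram to try (assumed value)
    """
    result = []
    i = 0
    while i < len(tagged):
        token, pos = tagged[i]
        if pos != "NN":
            result.append((token, pos))
            i += 1
            continue
        # Group the run of co-occurring nouns starting at position i.
        j = i
        while j < len(tagged) and tagged[j][1] == "NN":
            j += 1
        # Within the run, prefer the longest candidate confirmed by the
        # reference set; confirmed compounds replace their constituents.
        k = i
        while k < j:
            for n in range(min(max_n, j - k), 1, -1):
                candidate = "".join(t for t, _ in tagged[k:k + n])
                if candidate in reference:
                    result.append((candidate, "NN"))
                    k += n
                    break
            else:
                result.append(tagged[k])
                k += 1
        i = j
    return result


tagged = [("知识", "NN"), ("图谱", "NN"), ("的", "DEG"), ("构建", "NN")]
merged = collocate_nouns(tagged, reference={"知识图谱"})
```

In this sketch the two adjacent nouns are replaced by the single confirmed compound, while the unconfirmed noun is kept as a separate word.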
CN201810016373.6A 2017-01-06 2018-01-08 Method and system for extracting knowledge from Chinese language material library Active CN108319583B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
HK17100180.1 2017-01-06
HK17100180 2017-01-06

Publications (2)

Publication Number Publication Date
CN108319583A CN108319583A (en) 2018-07-24
CN108319583B true CN108319583B (en) 2021-11-26

Family

ID=62893215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810016373.6A Active CN108319583B (en) 2017-01-06 2018-01-08 Method and system for extracting knowledge from Chinese language material library

Country Status (3)

Country Link
CN (1) CN108319583B (en)
HK (1) HK1258818A1 (en)
TW (1) TWI656450B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020198855A1 (en) * 2019-03-29 2020-10-08 Knowtions Research Inc. Method and system for mapping text phrases to a taxonomy
CN113221553A (en) * 2020-01-21 2021-08-06 腾讯科技(深圳)有限公司 Text processing method, device and equipment and readable storage medium
EP3885962A1 (en) 2020-03-28 2021-09-29 Tata Consultancy Services Limited Method and system for extraction of key-terms and synonyms for the key-terms
CN112163421B (en) * 2020-10-09 2022-05-17 厦门大学 Keyword extraction method based on N-Gram
CN113268565B (en) * 2021-04-27 2022-03-25 山东大学 Method and device for quickly generating word vector based on concept text

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103257957A (en) * 2012-02-15 2013-08-21 深圳市腾讯计算机系统有限公司 Chinese word segmentation based text similarity identifying method and device
CN104484433A (en) * 2014-12-19 2015-04-01 东南大学 Book body matching method based on machine learning
CN106066866A (en) * 2016-05-26 2016-11-02 同方知网(北京)技术有限公司 A kind of automatic abstracting method of english literature key phrase and system
CN107480155A (en) * 2016-06-08 2017-12-15 北京新岸线网络技术有限公司 A kind of video searching system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040024755A1 (en) * 2002-08-05 2004-02-05 Rickard John Terrell System and method for indexing non-textual data
CN101261623A (en) * 2007-03-07 2008-09-10 国际商业机器公司 Word splitting method and device for word border-free mark language based on search
CN101377770B (en) * 2007-08-27 2017-03-01 微软技术许可有限责任公司 The method and system of Chinese Text Chunking
CN102193912B (en) * 2010-03-12 2013-11-06 富士通株式会社 Phrase division model establishing method, statistical machine translation method and decoder
CN102479191B (en) * 2010-11-22 2014-03-26 阿里巴巴集团控股有限公司 Method and device for providing multi-granularity word segmentation result
TWI563478B (en) * 2015-06-05 2016-12-21 Shu-Ming Hsieh Method of displaying architecture of English sentence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Word-Representation-Based Method for Extracting Organizational Events from Online Media";Jun-Qiang Zhang 等;《Journal of Electronic Science and Technology》;20170505;第15卷(第4期);第407-412页 *
"基于本体分割的本体映射算法";李志明 等;《模式识别与人工智能》;20110415;第24卷(第2期);第15-20页 *

Also Published As

Publication number Publication date
TWI656450B (en) 2019-04-11
TW201826145A (en) 2018-07-16
CN108319583A (en) 2018-07-24
HK1258818A1 (en) 2019-11-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1258818

Country of ref document: HK

CB02 Change of applicant information

Address after: Room 311-313, 3 / F, block 12W, 12 science and technology Avenue West, Science Park, Shatin, New Territories, Hong Kong, China

Applicant after: World wide digital intelligence Co., Ltd

Address before: Room 311-313, 3 / F, block 12W, 12 science and technology Avenue West, Science Park, Shatin, New Territories, Hong Kong, China

Applicant before: Light News Network Technology Co., Ltd.

GR01 Patent grant