WO2015043066A1 - Keyword expansion method and system, and classification corpus annotation method and system - Google Patents

Keyword expansion method and system, and classification corpus annotation method and system

Info

Publication number
WO2015043066A1
WO2015043066A1 PCT/CN2013/088586 CN2013088586W WO2015043066A1 WO 2015043066 A1 WO2015043066 A1 WO 2015043066A1 CN 2013088586 W CN2013088586 W CN 2013088586W WO 2015043066 A1 WO2015043066 A1 WO 2015043066A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
search
keywords
current
word
Prior art date
Application number
PCT/CN2013/088586
Other languages
English (en)
French (fr)
Inventor
叶茂
汤帜
徐剑波
雷超
金立峰
Original Assignee
北大方正集团有限公司
北京方正阿帕比技术有限公司
北京大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北大方正集团有限公司, 北京方正阿帕比技术有限公司, 北京大学
Priority to JP2016518124A (published as JP6231668B2)
Priority to EP13894407.9A (published as EP3051431A4)
Priority to US15/025,573 (published as US20160232211A1)
Publication of WO2015043066A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3322 Query formulation using system suggestions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion

Definitions

  • Keyword expansion method and system, and classification corpus annotation method and system
  • The present invention relates to a keyword expansion method and an automatic labeling method for classification corpora, and belongs to the technical field of electrical digital data processing.
  • Background
  • A keyword is a concentrated expression of a class of related terms; to make the expressed content more comprehensive, a keyword generally covers several meanings.
  • For a search, a set of initial keywords is generally specified.
  • Each keyword is then expanded to obtain several related words, and the search is performed with all of them at the same time.
  • In one prior-art scheme, a keyword expansion method is provided. First, a database is established that includes keywords, vocabulary, and identification codes; each keyword corresponds to at least one vocabulary item, and related keywords correspond to one identification code.
  • The related keywords are looked up from the keywords input by the user, and the vocabulary corresponding to each related keyword is then queried.
  • The automatic keyword expansion query method in that scheme depends on a preset word library.
  • Corpus annotation records the category characteristics of a corpus and is the main content of shallow corpus analysis. It can be applied to information retrieval, machine translation, subject content analysis, and text processing.
  • The accuracy of corpus annotation directly affects the correctness of text analysis and text processing.
  • Supervised text categorization techniques, such as SVM (Support Vector Machine), are used for classification.
  • The labeling of a classification corpus is usually done manually; that is, annotators judge the category to which a text belongs according to their own knowledge.
  • Manual labeling of a corpus usually has the following problems: first, the labor cost is high; second, the manual labeling period is long; third, manual labeling is subjective, so different people may assign the same text to different categories; fourth, because the corpus is huge, annotators become fatigued, which easily causes labeling errors.
  • The prior art discloses a corpus annotation system based on a BP neural network, comprising a corpus memory, a buffer memory for the corpus to be labeled, a labeling result comparator, and a BP neural network processing unit.
  • The BP neural network processing unit labels the corpus to be labeled from the corpus memory and stores the labeling results in the buffer memory, and the labeling result comparator compares the results in the buffer.
  • The BP neural network processing unit includes at least two classification processors; a text is marked as labeled and stored in the corpus memory only when the labeling results of at least two classification processors for that text satisfy a set comparison coefficient.
  • That scheme is based on the BP neural network algorithm, which is complex, computationally expensive, and slow to converge, so processing a large amount of corpus information takes a long time. At least two classification processors are required, which takes up a large amount of memory. In addition, training the neural network requires a large annotated corpus to be prepared in advance, so the cost of preparing the annotated corpus remains high.
  • Summary of the invention
  • One technical problem to be solved by the present invention is that keyword expansion in the prior art is subjective, building the lexicon is laborious, and the accuracy of keyword expansion is low; an objective, simple, and accurate keyword expansion scheme is therefore proposed.
  • Another technical problem to be solved by the present invention is that the prior art uses a corpus annotation method based on the BP neural network algorithm, which is complex, computationally expensive, slow to converge, and memory-hungry, and which requires a large annotated corpus to be prepared manually in advance to train the classification processors, so the cost of preparing that corpus remains high. The invention therefore provides a machine-assisted method for automatically labeling a classification corpus.
  • A keyword expansion method includes: searching according to a predetermined initial keyword to obtain the current keywords; using the current keywords obtained by the search as the basis of the next search, and searching cyclically by keyword iteration; and, when the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold, ending the search and taking the keywords obtained by the current search as the expanded keywords.
  • In the keyword expansion method, the current keywords are obtained as follows: the number of occurrences of each word obtained by the search is counted, and the words whose count is greater than a preset threshold are taken as the current keywords.
  • Alternatively, the current keywords are obtained as follows: the words obtained by the search and the number of occurrences of each word are counted, the words are sorted in descending order of count, and a certain proportion of the top-ranked words are taken as the current keywords.
  • The words obtained by the search are produced as follows: a search is performed in the article library with a preset keyword to obtain highly relevant articles, those articles are segmented into words, and the segmentation result is used as the words obtained by the search.
  • In the keyword expansion method, stop words are further removed after word segmentation, co-occurrence words that appear together with the preset keyword are then obtained, and the co-occurrence words are used as the words obtained by the search.
  • In the keyword expansion method, the error between the keywords obtained by the current search and those obtained by the previous search is the number of keywords in the current search that differ from those of the previous search, as a proportion of the number of keywords in the current search.
  • In the keyword expansion method, the first n keywords of the current search and of the previous search are taken for the error statistics, where 5 ≤ n ≤ 10.
  • The preset error threshold is less than 20%.
  • When the keywords obtained by the current search are the same as those obtained by the previous search, the keywords obtained by the current search are taken as the expanded keywords.
  • A classification corpus annotation method using the keyword expansion method includes: determining one or more initial core keywords for each category; obtaining the extended keywords of each category from the initial core keywords by the keyword expansion method; and searching with the extended keywords of each category, selecting the classification corpus from the search results, and labeling it.
  • A keyword expansion system comprises: an obtaining unit, which searches according to a predetermined initial keyword to obtain the current keywords; a loop search unit, which uses the current keywords obtained by the search as the basis of the next search and searches cyclically by keyword iteration; and a judging unit, which judges whether the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold and, if so, ends the loop search of the loop search unit and takes the keywords obtained by the current search as the expanded keywords.
  • The obtaining unit includes: a search term obtaining module, configured to search the article library with a preset keyword, obtain highly relevant articles, segment those articles into words, and use the segmentation result as the words obtained by the search; and a search-comparison keyword obtaining module, which counts the words obtained by the search and the number of occurrences of each word, sorts the words in descending order of count, and takes a certain proportion of the top-ranked words as the current keywords obtained by the search.
  • In the keyword expansion system, the search term obtaining module searches the article library with the preset keyword, obtains highly relevant articles, segments those articles into words, and then removes stop words from the segmentation result.
  • In the keyword expansion system, the error between the keywords obtained by the current search and those obtained by the previous search is the number of keywords in the current search that differ from those of the previous search, as a proportion of the number of keywords in the current search.
  • In the keyword expansion system, the first n keywords of the current search and of the previous search are taken for the error statistics, where 5 ≤ n ≤ 10.
  • The preset error threshold is less than 20%.
  • In the keyword expansion system, when the keywords obtained by the current search are the same as those obtained by the previous search, the keywords obtained by the current search are taken as the expanded keywords.
  • A classification corpus annotation system using the keyword expansion system comprises: a keyword determining unit, which determines one or more initial core keywords for each category; a keyword expansion unit, which obtains the extended keywords of each category from the initial core keywords by means of the keyword expansion system; and a labeling unit, which searches with the extended keywords of each category, selects the classification corpus from the search results, and labels it.
  • In an embodiment of the keyword expansion method according to the present disclosure, a search is performed with an initial keyword, the retrieved keywords are used as the basis of the next search, and the search is iterated on the keywords. When the keyword error between two successive searches is within a certain range, the retrieved keywords are taken as the extended keywords of the initial keyword.
  • In this way, the various expressions and senses of the initial keyword are obtained, and the initial keyword is expanded effectively and reasonably. This removes the prior-art need to build a thesaurus manually, and gives a keyword expansion method that is convenient and accurate.
  • In the keyword expansion method, the number of occurrences of each word obtained by the search is counted, and the words whose count is greater than a preset threshold are taken as the keywords obtained by the search; or the words and their occurrence counts are sorted in descending order of count, and a certain proportion of the top-ranked words are taken as the keywords.
  • Keywords obtained in this way have statistical significance, and the various words related to the keyword are easy to find.
  • In the keyword expansion method, the words obtained by the search are produced by searching the article library, obtaining highly relevant articles, segmenting them into words, removing stop words, and obtaining co-occurrence words. This step-by-step filtering removes unnecessary words and leaves the useful ones.
  • In the keyword expansion method, when the keywords obtained by the current search are the same as those obtained by the previous search, the keywords obtained by the current search are taken as the expanded keywords; the expanded keywords obtained at this point are more accurate.
  • The present invention also provides a classification corpus annotation method that searches with the expanded keywords to obtain the classification corpus, which improves the efficiency and accuracy of classification corpus annotation.
  • The automatic classification corpus labeling method described above avoids the drawbacks of the prior-art corpus labeling method based on the BP neural network algorithm: a complex algorithm, a large amount of computation, slow convergence, and long processing times for large amounts of corpus information.
  • It also avoids the need for at least two classification processors, which occupy a large amount of memory.
  • FIG. 1 is a flow chart of one embodiment of the keyword expansion method of the present invention
  • Fig. 3 is a structural diagram of an embodiment of a keyword expansion system of the present invention
  • Fig. 4 is a structural diagram of an embodiment of a classification corpus annotation system of the present invention.
  • This embodiment provides a keyword expansion method, whose flowchart is shown in FIG. 1, and which includes the following steps.
  • Step 102: A search is performed according to a predetermined initial keyword, and the current keywords are obtained from the search.
  • The initial keyword is used to search the article library to obtain highly relevant articles; those articles are then segmented into words, and the segmentation result is used as the words obtained by the search.
  • The number of occurrences of each word obtained by the search is counted, and the words that occur more than a preset threshold of 50 times are taken as the keywords obtained by the search (the threshold is set according to the size of the article library and how commonly the retrieved keywords are used). Keywords obtained in this way have statistical significance, and the various words related to the keyword are easy to find.
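The count-threshold selection in step 102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `select_by_count` and its parameters are hypothetical, and the input is assumed to be a flat list of words already produced by searching and segmenting the articles.

```python
from collections import Counter


def select_by_count(words, min_count=50):
    """Keep the words that occur more than min_count times, most frequent first.

    `words` is the flat list of segmented words from the retrieved articles
    (an assumption); min_count=50 mirrors the example threshold in the text.
    """
    counts = Counter(words)
    # most_common() yields (word, count) pairs in descending order of count.
    return [word for word, count in counts.most_common() if count > min_count]
```

A word that occurs exactly `min_count` times is dropped, because the text requires the count to be greater than the threshold.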
  • Step 104: The keywords obtained by the search are used as the basis of the next search, and the search is repeated by keyword iteration.
  • The search process is similar to the process in step 102.
  • The keywords obtained by the previous search are used as the keywords of the current search, the keywords obtained by the current search are used as the keywords of the next search, and the search is iterated in this way.
  • Step 106: After each search, if the error between the keywords obtained by the current search and those obtained by the previous search is within a preset threshold, the loop ends and the keywords obtained by the current search are taken as the expanded keywords. For example, the keywords obtained by the current search are compared with those obtained by the previous search; when the two sets of keywords are the same, the keywords obtained by the current search are taken as the expanded keywords, and the expanded keywords obtained in this case are more accurate.
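Steps 102 to 106 together form a fixed-point iteration, which the sketch below illustrates. It is written under stated assumptions: `search` stands for any callable that maps a list of keywords to the keywords retrieved with them, `max_rounds` is a safety cap that the text does not mention, and the toy `related` table is invented purely for the usage example.

```python
def expand_keywords(initial, search, max_error=0.0, max_rounds=20):
    """Iterate the search until two successive keyword lists agree closely enough."""
    previous = search(list(initial))          # step 102: first search
    for _ in range(max_rounds):               # step 104: keyword iteration
        current = search(previous)
        differing = [k for k in current if k not in previous]
        error = len(differing) / len(current) if current else 0.0
        if error <= max_error:                # step 106: convergence test
            return current
        previous = current
    return previous                           # cap reached without convergence


# Toy stand-in for the article-library search (hypothetical data).
related = {
    "cup": ["water", "teacup"],
    "water": ["water", "teacup", "kettle"],
    "teacup": ["water", "teacup", "kettle"],
    "kettle": ["water", "teacup", "kettle"],
}


def toy_search(keywords):
    found = []
    for keyword in keywords:
        for word in related.get(keyword, []):
            if word not in found:
                found.append(word)
    return found


expanded = expand_keywords(["cup"], toy_search)
```

With `max_error=0.0` the loop stops only when the current search returns no keyword that was missing from the previous one; a looser setting such as `max_error=0.2` corresponds to stopping at a 20% error.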
  • In this embodiment, a search is performed with the initial keyword, the retrieved keywords are used as the basis of the next search, and the search is iterated on the keywords.
  • When the keyword error between two successive searches is within a certain range, the retrieved keywords are taken as the extended keywords of the initial keyword.
  • In this way, the various expressions and senses of the initial keyword are obtained, and the initial keyword is expanded effectively and reasonably. This removes the prior-art need to build a thesaurus manually, and gives a keyword expansion method that is convenient to implement and highly accurate.
  • Alternatively, the keywords obtained by the current search are compared with those obtained by the previous search, and when the proportion of differing keywords is less than a predetermined threshold, such as 20%, the keywords obtained by the current search are taken as the expanded keywords.
  • The search method is as follows: the article library is searched with the preset keywords to obtain highly relevant articles, the articles are segmented into words, stop words are removed after segmentation, and the co-occurrence words that appear together with the preset keyword are then obtained. The co-occurrence words can be obtained with a sliding-window method, and they are used as the words obtained by the search.
  • The retrieved words are thus obtained through word segmentation, stop-word removal, and co-occurrence extraction. This step-by-step filtering removes unnecessary words and leaves the useful ones.
  • The words obtained by the search and the number of occurrences of each word are counted and sorted in descending order of count, and the top-ranked words, for example the top 50% (the ratio can be set according to specific conditions), are taken as the keywords obtained by the search. For example, if the search yields 100 words, the top 20% by count are taken as the keywords obtained by the search.
  • The counts can be normalized in advance.
  • The normalization works as follows: for the sequence of words obtained by the search, the cumulative count sum over all words is calculated, and the count t of a word divided by sum, t/sum, is that word's normalized value. The words are then sorted in descending order of normalized value, and a certain number or a certain proportion of the top-ranked words are taken as the keywords.
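The normalization and top-proportion selection described above can be sketched as follows. The function name `top_proportion` is hypothetical, and keeping at least one word when the proportion rounds down to zero is an assumption of this sketch rather than something the text states.

```python
from collections import Counter


def top_proportion(words, proportion):
    """Normalize each word count as t / sum and keep the top share of words.

    Returns (word, normalized_value) pairs in descending order of value.
    """
    counts = Counter(words)
    total = sum(counts.values())              # "sum" in the text
    normalized = {word: t / total for word, t in counts.items()}
    ranked = sorted(normalized, key=normalized.get, reverse=True)
    keep = max(1, int(len(ranked) * proportion))   # assumption: keep at least one
    return [(word, normalized[word]) for word in ranked[:keep]]
```

Because every count is divided by the same total, normalization does not change the ranking; it only makes the values comparable across searches that return different amounts of text.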
  • The error between the keywords obtained by the current search and those obtained by the previous search is defined as follows:
  • the number of keywords in the current search that differ from the keywords obtained in the previous search, as a proportion of the number of keywords in the current search.
  • When this error is within the preset threshold, the search is considered complete, and the keywords obtained by the current search are the expanded keywords.
  • Alternatively, only the first n keywords may be compared when calculating the error, for example the first 5 or the first 10 keywords. When the error is less than 20%, the search is considered finished and the extended keywords are obtained.
  • Taking the expanded keywords as soon as the iterative search converges to within the required error speeds up processing and improves efficiency.
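The error definition above, including the optional restriction to the first n keywords, can be written as a small function. The name `keyword_error` and the sample word lists are hypothetical; treating an empty current list as zero error is an assumption of this sketch.

```python
def keyword_error(current, previous, n=None):
    """Share of keywords in `current` that do not appear in `previous`.

    If n is given, only the first n keywords of each list are compared.
    """
    if n is not None:
        current, previous = current[:n], previous[:n]
    if not current:
        return 0.0                            # assumption: nothing to compare
    previous_set = set(previous)
    differing = sum(1 for keyword in current if keyword not in previous_set)
    return differing / len(current)


# Hypothetical keyword lists from two successive searches.
previous = ["water", "kettle", "teacup", "drink", "juice"]
current = ["water", "teacup", "kettle", "bottle", "bucket"]
error = keyword_error(current, previous)      # two of five keywords are new
```

The comparison ignores order, so two searches that return the same keywords in a different ranking have zero error.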
  • FIG. 3 is a block diagram showing an embodiment of a keyword expansion system of the present invention.
  • a keyword expansion system as shown in FIG. 3 includes:
  • (1) Obtaining unit 31: a search is performed according to a predetermined initial keyword, and the current keywords are obtained from the search.
  • The obtaining unit includes: a search term obtaining module, configured to search the article library with a preset keyword, obtain highly relevant articles, segment those articles into words, and use the segmentation result as the words obtained by the search; and a keyword obtaining module, which counts the number of occurrences of each word obtained by the search and takes the words whose count is greater than a preset threshold as the current keywords obtained by the search.
  • Alternatively, the obtaining unit includes: a search term obtaining module, configured as above; and a search-comparison keyword obtaining module, which counts the words obtained by the search and the number of occurrences of each word, sorts the words in descending order of count, and takes a certain proportion of the top-ranked words as the current keywords obtained by the search.
  • (2) Loop search unit 32: the current keywords obtained by the search are used as the basis of the next search, and the search is repeated by keyword iteration.
  • The search process is as follows: the article library is searched with the preset keywords to obtain highly relevant articles, the articles are segmented into words, and the segmentation result is used as the words obtained by the search.
  • The keyword expansion system further removes stop words after word segmentation, then obtains the co-occurrence words that appear together with the preset keyword, and uses the co-occurrence words as the words obtained by the search. The keyword obtaining module or the search-comparison keyword obtaining module then counts the resulting words to obtain the keywords of the search.
  • (3) Judging unit 33: judges whether the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold (for example, less than 10%) and, if so, ends the loop search of the loop search unit and takes the keywords obtained by the current search as the expanded keywords.
  • The error between the keywords obtained by the current search and those obtained by the previous search is the number of keywords in the current search that differ from those of the previous search, as a proportion of the number of keywords in the current search.
  • The first n keywords of each search can be taken for the error statistics, for example 5 ≤ n ≤ 10.
  • When the error is within the threshold, the keywords obtained by the current search are taken as the expanded keywords.
  • As an example, the word "cup" is used to search an article library of 500 articles; with the search method and keyword selection method described above, a series of keywords is obtained: water, kettle, teacup, drink ⁇ drink.
  • The search is performed again using the keywords obtained by the above search.
  • The keywords obtained after this search are: water, teacup, kettle, kettle, bucket.
  • Comparing the two results gives an error of 40%, so the search continues with these results as the keywords.
  • The results of that search are: water, teacup, water glass, glass, kettle.
  • The error is again 40%, which does not satisfy the threshold, so the search continues: the above keywords are searched again, and the results obtained are water, teacup, ⁇ Mf, Glass, water ⁇
  • Embodiment 5: This embodiment provides a method for labeling a corpus using the keyword expansion method; the flowchart is as shown in FIG. The method includes:
  • Step 202: Determine one or more initial core keywords for each category.
  • Step 204: Obtain the extended keywords of each category from the initial core keywords by the keyword expansion method.
  • Step 206: Search with the extended keywords of each category, select the classification corpus from the search results, and label it.
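Step 206 can be illustrated with a small labeling sketch. Everything here is an assumption made for illustration: the text does not say how the search ranks articles, so this sketch scores an article by how many times the category's extended keywords occur in it (plain substring counting), and `label_corpus`, `top_k`, and the sample data are hypothetical names and values.

```python
def label_corpus(articles, category_keywords, top_k=100):
    """Assign articles to categories by searching with each category's keywords.

    articles: list of article texts.
    category_keywords: dict mapping a category name to its extended keywords.
    Returns a dict mapping each category to the indices of its labeled articles.
    """
    labeled = {}
    for category, keywords in category_keywords.items():
        scored = []
        for index, text in enumerate(articles):
            # Assumed relevance score: total occurrences of the keywords.
            score = sum(text.count(keyword) for keyword in keywords)
            if score > 0:
                scored.append((score, index))
        # Highest score first; ties are broken by article order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        labeled[category] = [index for _, index in scored[:top_k]]
    return labeled


# Hypothetical sample data.
articles = ["refugee war conflict", "football match score", "war war"]
keywords = {"military": ["war", "refugee"], "sports": ["football"]}
labels = label_corpus(articles, keywords)
```

In a real system the scoring line would be replaced by a call to the actual search engine over the article library; the surrounding structure (one search per category, then labeling the returned articles) is what the three steps describe.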
  • A classification corpus annotation system using the keyword expansion system includes:
  • Keyword determining unit 41: determines one or more initial core keywords for each category. Keyword expansion unit 42: obtains the extended keywords of each category from the initial core keywords by means of the keyword expansion system, and includes:
  • An obtaining subunit: searches according to a predetermined initial core keyword and obtains the current keywords.
  • A loop search subunit: uses the current keywords obtained by the search as the basis of the next search and repeats the search by keyword iteration. A judging subunit: judges whether the error between the keywords obtained by the current search and those obtained by the previous search is within the preset error threshold and, if so, ends the loop search of the loop search subunit and takes the keywords obtained by the current search as the expanded keywords.
  • Labeling unit 43: searches with the extended keywords of each category, selects the classification corpus from the search results, and labels it.
  • Example 7: The classification corpus annotation method using the keyword expansion method described above is illustrated with an application example.
  • S1: Determine one or more initial core keywords for each category.
  • S2: Use an iterative method to search repeatedly and obtain the extended keywords of each category, including the following steps:
  • S21: Take an initial core keyword of a category and obtain the candidate extended keywords of the category by searching.
  • The number of articles retrieved is n, where n ≥ 2, n is an integer, and 30 ≤ n ≤ 2000.
  • n can take different values such as 50, 100, 500, 700, 1200, 1700, or 2000, chosen according to the user's needs and the category characteristics of the classified information.
  • The NLPIR tokenizer is used for word segmentation and stop-word removal on the n articles; stop words can be filtered with a stop-word dictionary after segmentation.
  • The NLPIR tokenizer provides Chinese word segmentation, part-of-speech tagging, named entity recognition, user dictionaries, microblog word segmentation, new word discovery, and keyword extraction, and supports GBK, UTF8, BIG5, and other encodings.
  • The tokenizer is fully featured, fast, stable, and reliable.
  • Alternatively, word segmentation and stop-word removal for the n articles use the CJK tokenizer or the IK tokenizer; stop words can be filtered with a stop-word dictionary after segmentation.
  • The CJK tokenizer can be selected.
  • This tokenizer is designed for processing Chinese documents and is fast, stable, and reliable.
  • With the IK tokenizer, stop words can be filtered with a stop-word dictionary or by configuring the tokenizer's own stop-word dictionary. The IK tokenizer performs dictionary-based segmentation, with forward and reverse full segmentation and forward and reverse maximum-matching segmentation.
  • This tokenizer optimizes dictionary storage, takes up less memory, and is fast, stable, and reliable.
  • S213: Use a sliding-window method with a window size of 7 to obtain the words near the keyword as the candidate extended keywords. The three words before the core keyword, the three words after it, and the core keyword itself are taken as the candidate extended keywords; if there are fewer than three words before or after the core keyword, all the words before or after it are taken.
  • Alternatively, the window may be split differently around the core keyword: the four words before and the two words after the core keyword, together with the core keyword itself, may be taken as the candidate extended keywords; or the two words before and the four words after the core keyword, together with the core keyword itself, may be taken.
  • If the number of words before or after the core keyword is less than the number to be taken, all the words before or after the core keyword are taken.
  • The sliding window size is S, where S ≥ 2 and S is an integer.
  • The sliding window size S satisfies 3 ≤ S ≤ 10.
  • S can take different values such as 4, 5, 6, 8, 9, or 10, chosen according to the user's needs.
  • The automatic classification corpus labeling method of the present invention obtains keywords with a sliding-window method, which is controlled by limiting the maximum number of words the window can hold; the algorithm is simple, fast, and accurate.
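The sliding-window extraction of S213 can be sketched as follows. The function name `window_words` and its parameters are hypothetical, and the input is assumed to be a list of tokens that has already been segmented and had its stop words removed.

```python
def window_words(tokens, keyword, before=3, after=3):
    """Collect the words inside a window around every occurrence of `keyword`.

    The window holds `before` words ahead of the keyword, the keyword itself,
    and `after` words behind it, so before=3 and after=3 give a window of 7.
    Near the start or end of the text the window simply holds fewer words.
    """
    found = []
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - before)        # clamp at the start of the text
            found.extend(tokens[start:i + after + 1])
    return found


# Hypothetical token sequence.
tokens = "a b c d cup e f g h".split()
symmetric = window_words(tokens, "cup")                    # 3 before, 3 after
asymmetric = window_words(tokens, "cup", before=4, after=2)
```

The collected words would then be counted and ranked to select the new core keywords; overlapping windows contribute a word more than once, which raises its count.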
  • S22: Use the candidate extended keywords obtained in each round to derive new core keywords and search again, until the candidate extended keywords obtained no longer change; save them as the keyword set.
  • S221: Count the occurrences of the candidate extended keywords and sort the candidates in descending order of count.
  • S222: Take the first 10 candidate extended keywords as the new core keywords. In other embodiments, the first m candidate extended keywords are taken, where m ≥ 2 is an integer in the range 5 ≤ m ≤ 30; m can take values such as 5, 7, 13, 17, 25, 27, or 30, chosen according to the user's needs and the category characteristics of the classified information.
  • S223: Return to step S211 and search with the new core keywords until the new core keywords no longer change, converging to a specific keyword set.
  • For the category military, the 10 keywords obtained by expanding the initial core keywords are the extended keywords derived by iteration from the initial core keywords: {refugee, Iraq, war, Africa, homeland, forced, Afghanistan, Jordan, conflict, reception}.
  • S23: Check the keyword set and delete the keywords that do not fit the category; the remaining keywords are the extended keywords of the category.
  • The keywords that do not fit the category, {homeland, reception}, can be deleted from the set.
  • Checking the keyword set in this way makes the resulting set of extended keywords more accurate.
  • S3: Search with the extended keywords of the category, select the classified corpus from the results, and annotate it. This includes the following steps:
  • In other embodiments, the first K articles are checked, where K ≥ 10 is a positive integer in the range 100 ≤ K ≤ 2000.
  • K can take values such as 1500, 1700, or 2000, chosen according to the corpus characteristics of the category.
  • The automatic classified-corpus annotation method limits the number of articles obtained after each search, which reduces the number of articles to process and speeds up processing; it also filters out some articles of low relevance, so that the new core keywords obtained are more accurate.
  • In the automatic classified-corpus annotation method of the present invention, every search is a full-text search that matches against the full text of each article, so recall is high and the annotated corpus obtained is highly accurate.
  • The automatic classified-corpus annotation method checks the corpus retrieved with the extended keywords, deletes the articles that do not fit the category, and annotates the remaining articles as the corpus of that category, which makes the annotated corpus more accurate.
  • Embodiment 8: This embodiment provides another specific implementation of the classified corpus annotation method.
  • First step: suppose the classification system has three categories {military, economy, sports}, and one or more core keywords are determined manually for each category. Taking military as an example, the initial core keywords are {war, refugee, casualty}. A full-text article library is built, in which every article comes from a newspaper and periodical database.
  • Second step: for the category military, a full-text search is performed with the core keywords {war, refugee, casualty}, and the first 1000 articles are obtained.
  • Third step: the 1000 articles obtained are segmented into words and stop words are removed.
  • Fourth step: the keywords inside a window of size 6 around the keyword are obtained by the sliding-window method.
  • Fifth step: the occurrences of the keywords are counted, and the keywords are sorted in descending order of count.
  • Sixth step: the first 10 keywords are taken as the new core keywords.
  • The 10 keywords obtained are the extended keywords derived by iteration from the initial core keywords: {refugee, Iraq, war, Africa, homeland, forced, Afghanistan, Jordan, conflict, reception}.
  • Eighth step: the extended keywords are checked manually, and the keywords that do not fit the category, {homeland, reception}, are deleted.
  • Ninth step: the extended keywords of the category, {refugee, Iraq, war, Africa, forced, Afghanistan, Jordan, conflict}, are used to search the full-text library. The first 1000 articles are obtained and serve as the candidate corpus for this category.
  • Tenth step: the 1000 articles are checked manually and the classified corpus is selected from them. Eleventh step: steps two through ten are repeated for all categories. This yields an annotated corpus for each category in the classification system.
  • Also provided herein are one or more computer-readable media having computer-executable instructions that, when executed by a computer, perform a keyword expansion method, the method comprising: performing a search based on a predetermined initial keyword to obtain current keywords; using the current keywords obtained as the basis of the next search and searching in a loop by keyword iteration; and, if the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold, ending the loop search and determining the keywords obtained by the current search as the extended keywords.
  • Also provided herein are one or more computer-readable media having computer-executable instructions that, when executed by a computer, perform the classified corpus annotation method described above.
  • Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, a system, or a computer program product.
  • Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
  • Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • The computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

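A keyword expansion method and system are provided. The method performs a search with an initial keyword, uses the keywords obtained from that search as the basis of the next search, and repeats the search by iterating on the keywords; when the error between the keywords of two successive searches falls within a given range, the keywords obtained are taken as the extended keywords of the initial keyword. In this way, multiple expressions and multiple senses of the initial keyword are obtained, the initial keyword is expanded effectively and reasonably, and the prior-art need to build a lexicon manually is removed; the method is easy to implement and highly accurate. An automatic classified-corpus annotation method and system are also provided: one or more initial core keywords are determined for each category; the extended keywords of each category are obtained by expanding the initial core keywords; and a search is performed with the extended keywords of a category, from whose results the classified corpus is selected and annotated.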

Description

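Keyword Expansion Method and System, and Classified Corpus Annotation Method and System
Technical Field
The present invention relates to a keyword expansion method and an automatic classified-corpus annotation method, and belongs to the technical field of electrical digital data processing.
Background Art
A keyword is generally a word that expresses a class of related terms in concentrated form. To make its coverage more complete, a keyword usually has several related expressions; to raise the hit rate of a keyword search, a given initial keyword is usually expanded into several related words, which are then searched together. The prior art provides a keyword expansion method that first builds a database containing keywords, vocabulary, and identification codes; then associates each keyword with at least one vocabulary item; then associates related keywords with an identification code. The keyword entered by the user is mapped to the related keywords, and the related keywords are used to look up the vocabulary corresponding to each of them. This automatic keyword expansion and query method depends on a preset lexicon. When the lexicon is built poorly, the accuracy of keyword expansion suffers badly; moreover, building the lexicon requires a great deal of manual experience and is highly subjective, which also affects classification accuracy. Corpus annotation records the category feature information of a corpus and is the main part of shallow corpus analysis. It can be applied to information retrieval, machine translation, topic analysis, text processing, and other fields, and its accuracy directly determines the correctness of text analysis and text processing.
In supervised text classification, for example classification with an SVM (Support Vector Machine), once the classification system has been fixed, an annotated corpus must be prepared for each category in order to train the classification model. At present, classified corpora are usually annotated manually: annotators judge the category of each item from their own knowledge. When the corpus to be annotated is very large, manual judgment typically suffers from four problems: first, high labor cost; second, a long annotation cycle; third, subjective influence, in that different people may assign the same item to different categories; fourth, with so many items, annotator fatigue easily leads to annotation errors.
The prior art discloses a corpus annotation system based on a BP neural network, comprising a corpus memory, a buffer memory for the corpus to be annotated, an annotation result comparator, and a BP neural network processing unit. During annotation, the BP neural network processing unit annotates the items to be annotated in the corpus memory and stores the results in the buffer memory, and the comparator compares the results in the buffer. In that scheme, the BP neural network processing unit contains at least two classification processors; an item is annotated and stored in the corpus memory only when the results of at least two classification processors satisfy a set comparison coefficient. The scheme is based on the BP neural network algorithm, which is complex, computationally heavy, and slow to converge, so processing a large corpus takes a long time; it also requires at least two classification processors, which consumes much memory; and training the neural network requires a large annotated corpus prepared in advance, so the cost of preparing annotated data remains high.
Summary of the Invention
One technical problem to be solved by the present invention is that keyword expansion in the prior art is highly subjective, requires heavy lexicon-building work, and has low accuracy; an objective, simple, and accurate keyword expansion scheme is therefore proposed. Another technical problem to be solved is that the prior-art corpus annotation method based on the BP neural network algorithm is complex, computationally heavy, slow to converge, and memory-hungry, and that a large annotated corpus must be prepared manually in advance to train the classification processors, so the cost of preparing the required annotated corpus remains high; a scheme for automatic, machine-assisted annotation of classified corpora is therefore provided.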
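To solve the above technical problems, the present invention is implemented through the following technical solutions:
A keyword expansion method, comprising: performing a search based on a predetermined initial keyword to obtain current keywords; using the current keywords obtained as the basis of the next search and searching in a loop by keyword iteration; and, when the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold, ending the search and determining the keywords obtained by the current search as the extended keywords.
Optionally, the current keywords are obtained as follows: the occurrences of each word obtained by the search are counted, and the words whose counts exceed a preset threshold are taken as the current keywords.
Optionally, the current keywords are obtained as follows: the number of words obtained by the search and the occurrences of each word are counted, the words are sorted in descending order of count, and a certain leading proportion of them is taken as the current keywords.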
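The two optional selection rules above can be illustrated with a minimal Python sketch. The function names `select_by_threshold` and `select_by_proportion` are hypothetical, and the input is assumed to be a flat list of words already produced by the search; the patent does not prescribe this interface.

```python
import math
from collections import Counter


def select_by_threshold(words, min_count):
    """Keep the words whose occurrence count is strictly greater than min_count."""
    counts = Counter(words)
    # most_common() yields (word, count) pairs in descending order of count.
    return [word for word, count in counts.most_common() if count > min_count]


def select_by_proportion(words, proportion):
    """Sort distinct words by count (descending) and keep the leading fraction."""
    counts = Counter(words)
    ranked = [word for word, _ in counts.most_common()]
    # Assumption: round the cut-off up and keep at least one word.
    keep = max(1, math.ceil(len(ranked) * proportion))
    return ranked[:keep]
```

Rounding the cut-off up is only one possible reading of "a certain leading proportion"; a different rounding rule would be equally consistent with the text.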
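Optionally, the words are obtained from the search as follows: a preset keyword is used to search an article library to obtain highly relevant articles, these articles are then segmented into words, and the segmentation result is taken as the words obtained by the search.
Optionally, in the keyword expansion method, stop words are also removed after segmentation, the co-occurring words that appear together with the preset keyword are then obtained, and the co-occurring words are taken as the words obtained by the search.
Optionally, in the keyword expansion method, the error between the keywords obtained by the current search and those obtained by the previous search is the ratio of the number of keywords that differ between the two searches to the number of keywords obtained by the current search.
Optionally, in the keyword expansion method, the first n keywords of the current search and of the previous search are taken for the error calculation, where 5≤n≤10.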
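The error measure just defined can be sketched in a few lines of Python. The function name `keyword_error` is hypothetical, and counting "differing" keywords as those in the current list that are absent from the previous list is an interpretation of the text, not something it states explicitly.

```python
def keyword_error(current, previous, n=None):
    """Fraction of the current keywords that do not appear in the previous keywords.

    If n is given, only the first n keywords of each list are compared.
    """
    if n is not None:
        current, previous = current[:n], previous[:n]
    if not current:
        return 0.0  # assumption: an empty current list counts as no change
    previous_set = set(previous)
    changed = sum(1 for keyword in current if keyword not in previous_set)
    return changed / len(current)
```

A loop would then stop once `keyword_error(...)` falls within the preset threshold, for example below 0.2.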
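Optionally, in the keyword expansion method, the preset error threshold is less than 20%. In the keyword expansion method, if the keywords obtained by the current search are the same as those obtained by the previous search, the keywords obtained by the current search are determined as the extended keywords.
A classified corpus annotation method using the keyword expansion method comprises the following steps: determining one or more initial core keywords for each category; obtaining the extended keywords of each category from the initial core keywords by the keyword expansion method described above; and searching with the extended keywords of a category, selecting the classified corpus from the results, and annotating it. A keyword expansion system, comprising: an acquisition unit, which performs a search based on a predetermined initial keyword to obtain current keywords; a loop search unit, which uses the current keywords obtained as the basis of the next search and searches in a loop by keyword iteration; and a judgment unit, which judges whether the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold and, if so, ends the loop search of the loop search unit and determines the keywords obtained by the current search as the extended keywords.
Optionally, the acquisition unit comprises: a search-word obtaining module, configured to search an article library with a preset keyword to obtain highly relevant articles, segment those articles into words, and take the segmentation result as the words obtained by the search; and a keyword obtaining module, which counts the occurrences of each word obtained by the search and takes the words whose counts exceed a preset threshold as the current keywords.
Optionally, the acquisition unit comprises: a search-word obtaining module, configured to search an article library with a preset keyword to obtain highly relevant articles, segment those articles into words, and take the segmentation result as the words obtained by the search; and a comparison-based keyword obtaining module, which counts the number of words obtained by the search and the occurrences of each word, sorts the words in descending order of count, and takes a certain leading proportion of them as the current keywords. Optionally, in the keyword expansion system, the search-word obtaining module searches the article library with a preset keyword to obtain highly relevant articles, segments those articles into words, removes stop words after segmentation, then obtains the co-occurring words that appear together with the preset keyword, and takes the co-occurring words as the words obtained by the search. Optionally, in the keyword expansion system, the error between the keywords obtained by the current search and those obtained by the previous search is the ratio of the number of keywords that differ between the two searches to the number of keywords obtained by the current search.
Optionally, in the keyword expansion system, the first n keywords of the current search and of the previous search are taken for the error calculation, where 5≤n≤10.
Optionally, in the keyword expansion system, the preset error threshold is less than 20%. Optionally, in the keyword expansion system, when the keywords obtained by the current search are the same as those obtained by the previous search, the keywords obtained by the current search are determined as the extended keywords. A classified corpus annotation system using the keyword expansion system comprises: a keyword determination unit, which determines one or more initial core keywords for each category; a keyword expansion unit, which obtains the extended keywords of each category from the initial core keywords by means of the keyword expansion system described above; and an annotation unit, which searches with the extended keywords of a category, selects the classified corpus from the results, and annotates it.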
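Compared with the prior art, the technical solutions herein have one or more of the following advantages:
(1) In one embodiment of the keyword expansion method of this disclosure, a search is performed with an initial keyword, the keywords obtained are used as the basis of the next search, and the search is repeated by keyword iteration; when the error between the keywords of two successive searches is within a given range, the keywords obtained are taken as the extended keywords of the initial keyword. In this way, multiple expressions and multiple senses of the initial keyword are obtained, the initial keyword is expanded effectively and reasonably, and the prior-art need to build a lexicon manually is removed; the method is easy to implement and highly accurate.
(2) In the keyword expansion method, the occurrences of the words obtained by the search are counted and the words whose counts exceed a preset threshold are taken as the keywords obtained by the search; alternatively, the number of words and the occurrences of each word are counted, the words are sorted in descending order of count, and a certain leading proportion is taken as the keywords. Keywords obtained in this way have statistical significance, which makes it easier to find words related to the various meanings of the keyword.
(3) In the keyword expansion method, the words are obtained by searching the article library for highly relevant articles, segmenting them, removing stop words, and obtaining co-occurring words. This step-by-step filtering removes superfluous words and leaves the useful ones.
(4) In the keyword expansion method, the search is considered finished when the error between the keywords of the current search and those of the previous search is within a given range, at which point the extended keywords have been obtained. Obtaining the required keywords through the convergence of iterative searches speeds up processing and improves efficiency.
(5) In the keyword expansion method, when the keywords obtained by the current search are the same as those obtained by the previous search, the keywords of the current search are determined as the extended keywords, and the extended keywords are then more accurate.
(6) The present invention also provides a classified corpus annotation method in which a search is performed with the extended keywords to obtain the classified corpus, improving the efficiency and accuracy of annotation. This automatic annotation method avoids the problems of the prior-art corpus annotation method based on the BP neural network algorithm: a complex algorithm, heavy computation, slow convergence, and long processing times for large corpora; the need for at least two classification processors and the resulting high memory use; and the need to prepare a large annotated corpus manually in advance to train the classification processors, which keeps the cost of preparing annotated data high.
Brief Description of the Drawings
To make the content of the present invention easier to understand, the invention is described in further detail below with reference to the accompanying drawings, in which: FIG. 1 is a flowchart of one embodiment of the keyword expansion method of the present invention; FIG. 2 is a flowchart of one embodiment of the automatic classified-corpus annotation method of the present invention; FIG. 3 is a structural diagram of one embodiment of the keyword expansion system of the present invention; FIG. 4 is a structural diagram of one embodiment of the classified corpus annotation system of the present invention.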
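Detailed Description of the Embodiments
Embodiment 1:
This embodiment provides a keyword expansion method whose flowchart is shown in FIG. 1 and which comprises the following steps:
Step 102: perform a search based on a predetermined initial keyword to obtain current keywords. In this example, the initial keyword is used to search an article library to obtain highly relevant articles, these articles are segmented into words, and the segmentation result is taken as the words obtained by the search. The occurrences of each word are counted, and the words occurring more than a preset threshold of 50 times (the count is set according to the size of the article library and how commonly the searched keyword is used) are taken as the keywords obtained by the search. Keywords obtained in this way have statistical significance, which makes it easier to find words related to the various meanings of the keyword.
Step 104: use the keywords obtained as the basis of the next search and search in a loop by keyword iteration. The search proceeds much as in step 102. In this step, the keywords obtained by the previous search serve as the keywords of the current search, and the keywords obtained by the current search serve in turn as the keywords of the next search.
Step 106: after each search, if the error between the keywords obtained by the current search and those obtained by the previous search is within a preset threshold, the loop search ends and the keywords obtained by the current search are taken as the extended keywords. For example, the keywords of the current search are compared with those of the previous search; when the two sets of keywords are identical, the keywords of the current search are determined as the extended keywords, and the extended keywords are then more accurate.
In the keyword expansion method of this embodiment, a search is performed with an initial keyword, the keywords obtained are used as the basis of the next search, and the search is repeated by keyword iteration; when the error between the keywords of two successive searches is within a given range, the keywords obtained are taken as the extended keywords of the initial keyword. In this way, multiple expressions and multiple senses of the initial keyword are obtained, the initial keyword is expanded effectively and reasonably, and the prior-art need to build a lexicon manually is removed; the method is easy to implement and highly accurate.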
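Steps 102 to 106 amount to a fixed-point iteration, which the following Python sketch illustrates. The names `expand_keywords`, `search_step`, `max_error`, and `max_rounds` are hypothetical; `search_step` stands for one complete search-and-extract pass (search, segmentation, counting) and must be supplied by the caller. The round limit is an added safeguard that the patent does not mention.

```python
def expand_keywords(initial, search_step, max_error=0.0, max_rounds=50):
    """Iterate search_step until two successive keyword lists agree closely enough.

    search_step: callable mapping a list of keywords to the list of keywords
                 extracted from the articles retrieved with them (hypothetical).
    max_error:   largest tolerated fraction of changed keywords; 0.0 means the
                 current keywords must all appear in the previous result.
    """
    previous = search_step(list(initial))          # step 102: first search
    for _ in range(max_rounds):
        current = search_step(previous)            # step 104: search again
        previous_set = set(previous)
        changed = sum(1 for kw in current if kw not in previous_set)
        error = changed / len(current) if current else 0.0
        if error <= max_error:                     # step 106: converged
            return current
        previous = current
    return previous                                # safeguard: no convergence
```

Passing `max_error=0.2` would correspond to the alternative in which up to 20% of the keywords may still differ.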
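As an alternative implementation, the keywords obtained by the current search are compared with those obtained by the previous search, and when the proportion of differing keywords among all keywords is below a preset threshold, such as 20%, the keywords of the current search are set as the extended keywords.
Embodiment 2:
(1) Perform a search based on a predetermined initial keyword to obtain current keywords.
(2) Use the current keywords obtained as the basis of the next search and search in a loop by keyword iteration.
In the searches of (1) and (2), the search proceeds as follows: the preset keyword is used to search the article library to obtain highly relevant articles; these articles are segmented into words; stop words are removed after segmentation; the co-occurring words that appear together with the preset keyword are then obtained, which can be done with a sliding-window method; and the co-occurring words are taken as the words obtained by the search.
In this embodiment, the words are obtained by segmentation, stop-word removal, and co-occurrence extraction; this step-by-step filtering removes superfluous words and leaves the useful ones.
The number of words obtained by the search and the occurrences of each word are counted, the words are sorted in descending order of count, and a certain leading proportion, for example 50% (the proportion can be set as the case requires), is taken as the keywords obtained by the search; for example, if 100 words are obtained, the top 20% by count are taken as the keywords.
As another alternative, the counts can be normalized beforehand. To normalize, the sum of the counts of all words in the sequence obtained by a search, sum, is computed; for a word that occurs t times, t/sum is taken as its normalized value; the words are then sorted in descending order of normalized value, and a certain leading number or proportion of them is taken as the keywords.
In this process, the error between the keywords obtained by the current search and those obtained by the previous search is defined as the ratio of the number of keywords that differ between the two searches to the number of keywords obtained by the current search. When the error is less than 10%, the search is considered finished and the keywords obtained by the current search are the extended keywords.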
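The count normalization described in this embodiment (each count t divided by the total sum) can be sketched as follows. The function names are hypothetical and the input is assumed to be a flat list of the words obtained by one search.

```python
from collections import Counter


def normalized_counts(words):
    """Map each word to t / sum, where t is its count and sum is the total count."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def top_by_normalized(words, k):
    """Return the k words with the largest normalized values, in descending order."""
    normalized = normalized_counts(words)
    ranked = sorted(normalized.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]
```

Within a single search the normalized ranking is the same as the raw-count ranking; normalization mainly matters when values from searches of different sizes are compared, which is an observation about the arithmetic rather than a claim made by the patent.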
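As a further alternative, the error can be computed on the first n keywords only, for example the first 5 or the first 10 keywords; when the error is less than 20%, the search is considered finished and the extended keywords have been obtained.
The search is considered finished when the error between the keywords of the current search and those of the previous search is within a given range, at which point the extended keywords have been obtained. Obtaining the required keywords through the convergence of iterative searches speeds up processing and improves efficiency.
Embodiment 3:
FIG. 3 is a structural diagram of one embodiment of the keyword expansion system of the present invention. As shown in FIG. 3, a keyword expansion system comprises:
(1) Acquisition unit 31: performs a search based on a predetermined initial keyword to obtain current keywords. In one embodiment, the acquisition unit comprises: a search-word obtaining module, configured to search an article library with a preset keyword to obtain highly relevant articles, segment those articles into words, and take the segmentation result as the words obtained by the search; and a keyword obtaining module, which counts the occurrences of the words obtained by the search and takes the words whose counts exceed a preset threshold as the current keywords.
As an alternative implementation, the acquisition unit comprises: a search-word obtaining module, configured to search an article library with a preset keyword to obtain highly relevant articles, segment those articles into words, and take the segmentation result as the words obtained by the search; and a comparison-based keyword obtaining module, which counts the number of words obtained by the search and the occurrences of each word, sorts the words in descending order of count, and takes a certain leading proportion of them as the current keywords.
(2) Loop search unit 32: uses the current keywords obtained as the basis of the next search and searches in a loop by keyword iteration.
The search proceeds as follows: the preset keyword is used to search the article library to obtain highly relevant articles, these articles are segmented into words, and the segmentation result is taken as the words obtained by the search. In the keyword expansion system, stop words are also removed after segmentation, the co-occurring words that appear together with the preset keyword are then obtained, and the co-occurring words are taken as the words obtained by the search. The keyword obtaining module or the comparison-based keyword obtaining module then computes statistics on these words to obtain the keywords of the search.
(3) Judgment unit 33: judges whether the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold, for example less than 10%; if so, it ends the loop search of the loop search unit and determines the keywords obtained by the current search as the extended keywords. The error between the keywords of the current search and those of the previous search is the ratio of the number of keywords that differ between the two searches to the number of keywords obtained by the current search. Alternatively, the first n keywords of each search can be taken for the error calculation, for example with 5≤n≤10. In other implementations, to improve search precision, the judgment unit determines the keywords of the current search as the extended keywords only when they are the same as those of the previous search.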
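Embodiment 4:
A specific application example is given.
Suppose the initial keyword "cup" is given. The word "cup" is searched in an article library (500 articles) and, using the search method and the keyword extraction method described above, a series of keywords is obtained: water, kettle, teacup, [illegible], beverage.
The keywords obtained above are used to search again, and the keywords obtained are: water, teacup, kettle, thermos flask, bucket.
The two results are compared and the error is 40%, so the search continues with the above result as keywords. The result of this search is: water, teacup, water cup, glass, kettle.
The result of this search is compared with that of the previous search; the error is 40%, which does not meet the 20% threshold, so the search continues with these keywords. The result of the new search is: water, teacup, water cup, glass, kettle.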
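The error values in this example can be checked by hand. The worked arithmetic below assumes that the error counts the keywords of the current round that are absent from the previous round, and that the fourth round repeats the third round's keywords, as the final keyword set reported for the example suggests; the symbols K_k and e_k are introduced here for illustration only.

```latex
% e_k: share of round-k keywords that are absent from round k-1
% K_k: set of keywords returned by round k (five keywords per round)
e_k = \frac{\lvert K_k \setminus K_{k-1} \rvert}{\lvert K_k \rvert}

% Round 2 vs. round 1: "thermos flask" and "bucket" are new
e_2 = \frac{2}{5} = 40\% \;\ge\; 20\% \quad\Rightarrow\quad \text{continue}

% Round 3 vs. round 2: "water cup" and "glass" are new
e_3 = \frac{2}{5} = 40\% \;\ge\; 20\% \quad\Rightarrow\quad \text{continue}

% Round 4 vs. round 3: no new keyword (assumed identical sets)
e_4 = \frac{0}{5} = 0\% \;<\; 20\% \quad\Rightarrow\quad \text{stop}
```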
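The result of the current search is compared with that of the previous search; the error is less than 20%, which meets the error threshold, so the search ends and the result of the current search, "water, teacup, water cup, glass, kettle", is taken as the extended keywords of the keyword "cup".
Embodiment 5:
This embodiment provides a classified corpus annotation method that uses the keyword expansion method. Its flowchart is shown in FIG. 2, and its steps are:
Step 202: determine one or more initial core keywords for each category;
Step 204: obtain the extended keywords of each category from the initial core keywords by the keyword expansion method described above;
Step 206: search with the extended keywords of a category, select the classified corpus from the results, and annotate it.
Embodiment 6:
FIG. 4 is a structural diagram of one embodiment of the classified corpus annotation system of the present invention. As shown in FIG. 4, a classified corpus annotation system using the keyword expansion system comprises:
Keyword determination unit 41: determines one or more initial core keywords for each category;
Keyword expansion unit 42: obtains the extended keywords of each category from the initial core keywords by means of the keyword expansion system, and comprises:
an acquisition subunit, which performs a search based on the predetermined initial core keywords to obtain current keywords;
a loop search subunit, which uses the current keywords obtained as the basis of the next search and searches in a loop by keyword iteration; and
a judgment subunit, which judges whether the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold and, if so, ends the loop search of the loop search unit and determines the keywords obtained by the current search as the extended keywords.
Annotation unit 43: searches with the extended keywords of a category, selects the classified corpus from the results, and annotates it.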
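Embodiment 7:
The classified corpus annotation method using the keyword expansion method is illustrated with an application example.
S1: Determine one or more initial core keywords for each category.
Suppose the classification system has three categories {military, economy, sports}, and one or more initial core keywords are determined manually for each category. Taking military as an example, the initial core keywords are {war, refugee, casualty}. A full-text article library is built, in which every article comes from a newspaper and periodical database.
S2: Obtain the extended keywords of each category by expanding the initial core keywords.
Step S2 obtains the extended keywords of each category by repeated, iterative searching and comprises the following steps:
S21: Take the initial core keywords of one category and obtain the candidate extended keywords of that category by searching.
S210: Take the initial core keywords of the category military, {war, refugee, casualty};
S211: Search with the core keywords {war, refugee, casualty} and obtain the first 1000 articles ranked by relevance.
In other embodiments, the number of articles is n, where n≥2 is an integer in the range 30≤n≤2000. n can take values such as 50, 100, 500, 700, 1200, 1700, or 2000, chosen according to the user's needs and the category characteristics of the classified information.
S212: Segment the 1000 articles obtained for the category military into words and remove stop words.
In this embodiment, the n articles are segmented and stop words are removed with the NLPIR tokenizer; stop words can be filtered with a stop-word dictionary after segmentation. The NLPIR tokenizer provides Chinese word segmentation, part-of-speech tagging, named-entity recognition, user dictionaries, microblog segmentation, new-word discovery, and keyword extraction, and supports GBK, UTF8, BIG5, and other encodings; it is full-featured, fast, stable, and reliable.
In other embodiments, the n articles are segmented and stop words are removed with the CJK tokenizer or the IK tokenizer; stop words can be filtered with a stop-word dictionary after segmentation. For a Chinese corpus the CJK tokenizer can be chosen; it is designed specifically for processing Chinese documents and is fast, stable, and reliable. The IK tokenizer can also be chosen, with stop words filtered by a stop-word dictionary after segmentation or by configuring the IK tokenizer's own stop-word dictionary; it supports dictionary-based forward and reverse full segmentation as well as forward and reverse maximum-matching segmentation, uses optimized dictionary storage, has a small memory footprint, and is fast, stable, and reliable.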
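Filtering with a stop-word dictionary after segmentation, as mentioned above, can be sketched independently of any particular tokenizer. The sketch below assumes the articles have already been segmented into token lists by some tokenizer (NLPIR, CJK, or IK are not called here), and that the stop-word dictionary is a plain text source with one word per line; both function names are hypothetical.

```python
def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set, preserving order."""
    return [token for token in tokens if token not in stop_words]
```

For example, `load_stop_words(open(path, encoding="utf-8"))` would read a dictionary file, where `path` is whatever location the dictionary is stored at.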
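S213: Using a sliding-window method, obtain the words inside a window of size 7 around the keyword as the candidate extended keywords. That is, take the 3 words before and the 3 words after the core keyword, together with the core keyword itself, as the candidate extended keywords; if fewer than 3 words precede or follow the core keyword, take all the words before or after it. In other embodiments, the 6 words before the core keyword and the core keyword itself can be taken as the candidate extended keywords; or the 4 words before and the 2 words after the keyword together with the core keyword itself; or the 2 words before and the 4 words after the core keyword together with the core keyword itself; and so on. If fewer words precede or follow the core keyword than the number to be taken, all the words before or after the core keyword are taken.
As an alternative implementation, the sliding window has size S, where S≥2 and S is an integer. S takes a value in the range 3≤S≤10, and can be 4, 5, 6, 8, 9, 10, or another value chosen according to the user's needs. The automatic classified-corpus annotation method of the present invention obtains keywords by the sliding-window method, which is controlled by limiting the maximum number of words the window can hold; the algorithm is simple, processing is fast, and accuracy is high.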
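The window extraction of step S213 can be sketched as below. The function name `window_candidates` and its `before`/`after` parameters are hypothetical; the defaults of 3 and 3 reproduce the window of size 7 from this embodiment, and other splits such as 4 and 2 are obtained by changing the arguments. The input is assumed to be one article's token list after segmentation and stop-word removal.

```python
def window_candidates(tokens, core_keywords, before=3, after=3):
    """Collect the words inside a window around every occurrence of a core keyword.

    Each window holds up to `before` preceding words, the core keyword itself,
    and up to `after` following words; windows are clipped at the article edges.
    """
    core = set(core_keywords)
    candidates = []
    for index, token in enumerate(tokens):
        if token in core:
            start = max(0, index - before)
            end = min(len(tokens), index + after + 1)
            candidates.extend(tokens[start:end])
    return candidates
```

The returned list keeps duplicates on purpose, so that the occurrences can be counted in the following step.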
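S22: Use the candidate extended keywords obtained in each round to derive new core keywords and search again, until the candidate extended keywords obtained no longer change; save them as the keyword set.
S221: Count the occurrences of the candidate extended keywords and sort them in descending order of count;
S222: Take the first 10 candidate extended keywords as the new core keywords. In other embodiments, the first m candidate extended keywords are taken as the new core keywords, where m≥2 is an integer in the range 5≤m≤30; m can take values such as 5, 7, 13, 17, 25, 27, or 30, chosen according to the user's needs and the category characteristics of the classified information.
S223: Return to step S211 and search with the new core keywords until the new core keywords no longer change, converging to a specific keyword set.
For the category military, the 10 keywords obtained by expanding the initial core keywords are the extended keywords derived by iteration from the initial core keywords: {refugee, Iraq, war, Africa, homeland, forced, Afghanistan, Jordan, conflict, reception}.
S23: Check the keyword set and delete the keywords that do not fit the category; the remaining keywords are the extended keywords of the category.
Assuming the user is doing military research, the keywords that do not fit the category, {homeland, reception}, can be deleted from the set.
Checking the keyword set and deleting the keywords that do not fit the category makes the resulting set of extended keywords more accurate.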
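S3: Search with the extended keywords of the category, select the classified corpus from the results, and annotate it. This includes the following steps:
S31: Search the full-text library with the extended keywords of the category, {refugee, Iraq, war, Africa, forced, Afghanistan, Jordan, conflict}, and sort the results in descending order of relevance.
S32: Take the first 1000 articles for checking, select the classified corpus from them, and annotate it as "military".
In other embodiments, the first K articles are taken for checking, where K≥10 is a positive integer in the range 100≤K≤2000. K can take values such as 1500, 1700, or 2000, chosen according to the corpus characteristics of the category.
When the first K articles are checked, the articles that do not fit the category are deleted, and the remaining articles are annotated as the corpus of that category.
The automatic classified-corpus annotation method of the present invention limits the number of articles obtained after each search, which reduces the number of articles to process and speeds up processing; it also filters out some articles of low relevance, so that the new core keywords obtained are more accurate. Every search is a full-text search that matches against the full text of each article, so recall is high and the annotated corpus obtained is highly accurate. The corpus retrieved with the extended keywords is checked, the articles that do not fit the category are deleted, and the remaining articles are annotated as the corpus of that category, which makes the annotated corpus more accurate.
Embodiment 8:
This embodiment provides another specific implementation of the classified corpus annotation method.
First step: suppose the classification system has three categories {military, economy, sports}, and one or more core keywords are determined manually for each category. Taking military as an example, the initial core keywords are {war, refugee, casualty}. A full-text article library is built, in which every article comes from a newspaper and periodical database.
Second step: for the category military, a full-text search is performed with the core keywords {war, refugee, casualty}, and the first 1000 articles are obtained.
Third step: the 1000 articles obtained are segmented into words and stop words are removed.
Fourth step: the keywords inside a window of size 6 around the keyword are obtained by the sliding-window method.
Fifth step: the occurrences of the keywords are counted, and the keywords are sorted in descending order of count.
Sixth step: from the keywords of the fifth step, the first 10 are taken as the new core keywords.
Seventh step: steps two through six are repeated until the first 10 keywords no longer change, that is, until they converge to a specific keyword set. The 10 keywords obtained are the extended keywords derived by iteration from the initial core keywords: {refugee, Iraq, war, Africa, homeland, forced, Afghanistan, Jordan, conflict, reception}.
Eighth step: the extended keywords are checked manually, and the keywords that do not fit the category, {homeland, reception}, are deleted.
Ninth step: the extended keywords of the category, {refugee, Iraq, war, Africa, forced, Afghanistan, Jordan, conflict}, are used to search the full-text library. The first 1000 articles are obtained and serve as the candidate corpus for this category.
Tenth step: the 1000 articles are checked manually and the classified corpus is selected from them.
Eleventh step: steps two through ten are repeated for all categories, which yields an annotated corpus for each category in the classification system.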
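The automated part of the eleven steps above (steps two through seven, and the search of step nine) can be put together in a small end-to-end Python sketch. Everything here is a simplification under stated assumptions: the corpus is an in-memory list of already-segmented articles, relevance is approximated by counting keyword occurrences, the window is taken symmetrically with `half` words on each side, and all function names are hypothetical. The manual checks of steps eight and ten are represented only by the `rejected` argument.

```python
from collections import Counter


def search(corpus, keywords, top_n):
    """Return up to top_n articles, ranked by the number of keyword occurrences."""
    wanted = set(keywords)
    scored = [(sum(1 for t in doc if t in wanted), i) for i, doc in enumerate(corpus)]
    hits = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [corpus[i] for _, i in hits[:top_n]]


def next_keywords(docs, keywords, stop_words, half, top_m):
    """Count the words near each keyword occurrence and keep the top_m words."""
    wanted = set(keywords)
    counts = Counter()
    for doc in docs:
        tokens = [t for t in doc if t not in stop_words]
        for i, token in enumerate(tokens):
            if token in wanted:
                counts.update(tokens[max(0, i - half): i + half + 1])
    return [word for word, _ in counts.most_common(top_m)]


def expand(corpus, initial, stop_words, top_n=1000, half=3, top_m=10, max_rounds=20):
    """Repeat search and extraction until the keyword set stops changing."""
    current = list(initial)
    for _ in range(max_rounds):
        docs = search(corpus, current, top_n)
        new = next_keywords(docs, current, stop_words, half, top_m)
        if set(new) == set(current):
            return new
        current = new
    return current  # safeguard: stop after max_rounds even without convergence


def candidate_corpus(corpus, keywords, rejected, top_n=1000):
    """Search with the keywords that survived the manual check."""
    kept = [k for k in keywords if k not in rejected]
    return search(corpus, kept, top_n)
```

A real system would replace `search` with a full-text search engine and the token lists with the output of a Chinese tokenizer; the loop structure is the part this sketch is meant to show.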
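Obviously, the above embodiments are merely examples given for clarity of explanation and do not limit the implementations. Those of ordinary skill in the art can make other changes or variations of different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here, and obvious changes or variations derived from the above remain within the scope of protection of the present invention.
Also provided herein are one or more computer-readable media having computer-executable instructions that, when executed by a computer, perform a keyword expansion method, the method comprising: performing a search based on a predetermined initial keyword to obtain current keywords; using the current keywords obtained as the basis of the next search and searching in a loop by keyword iteration; and, if the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold, ending the loop search and determining the keywords obtained by the current search as the extended keywords.
Also provided herein are one or more computer-readable media having computer-executable instructions that, when executed by a computer, perform the classified corpus annotation method described above.
Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.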

Claims

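1. A keyword expansion method, characterized by comprising:
performing a search based on a predetermined initial keyword to obtain current keywords;
using the current keywords obtained as the basis of the next search and searching in a loop by keyword iteration; and
if the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold, ending the loop search and determining the keywords obtained by the current search as the extended keywords.
2. The keyword expansion method according to claim 1, characterized in that the current keywords are obtained as follows: the occurrences of each word obtained by the search are counted, and the words whose counts exceed a preset threshold are taken as the current keywords.
3. The keyword expansion method according to claim 1, characterized in that the current keywords are obtained as follows: the number of words obtained by the search and the occurrences of each word are counted, the words are sorted in descending order of count, and a certain leading proportion of them is taken as the current keywords.
4. The keyword expansion method according to claim 2 or 3, characterized in that the words are obtained from the search as follows:
a preset keyword is used to search an article library to obtain highly relevant articles; the highly relevant articles are segmented into words, and the segmentation result is taken as the words obtained by the search.
5. The keyword expansion method according to claim 4, characterized in that stop words are also removed after segmentation, the co-occurring words that appear together with the preset keyword are then obtained, and the co-occurring words are taken as the words obtained by the search.
6. The keyword expansion method according to any one of claims 1-5, characterized in that the error between the keywords obtained by the current search and those obtained by the previous search is the ratio of the number of keywords that differ between the two searches to the number of keywords obtained by the current search.
7. The keyword expansion method according to claim 6, characterized in that the first n keywords of the current search and of the previous search are taken for the error calculation, where 5≤n≤10.
8. The keyword expansion method according to claim 1, characterized in that the preset error threshold is less than 20%.
9. The keyword expansion method according to claim 1, characterized in that when the keywords obtained by the current search are the same as those obtained by the previous search, the keywords obtained by the current search are determined as the extended keywords.
10. A classified corpus annotation method, characterized in that its steps comprise:
determining one or more initial core keywords for each category;
obtaining the extended keywords of each category from the initial core keywords by the keyword expansion method according to claims 1-9; and
searching with the extended keywords of a category, selecting the classified corpus from the results, and annotating it.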
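11. A keyword expansion system, characterized by comprising:
an acquisition unit, which performs a search based on a predetermined initial keyword to obtain current keywords;
a loop search unit, which uses the current keywords obtained as the basis of the next search and searches in a loop by keyword iteration; and
a judgment unit, which judges whether the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold and, if so, ends the loop search of the loop search unit and determines the keywords obtained by the current search as the extended keywords.
12. The keyword expansion system according to claim 11, characterized in that the acquisition unit comprises:
a search-word obtaining module, configured to search an article library with a preset keyword to obtain highly relevant articles, segment the highly relevant articles into words, and take the segmentation result as the words obtained by the search; and
a keyword obtaining module, which counts the occurrences of each word obtained by the search and takes the words whose counts exceed a preset threshold as the current keywords.
13. The keyword expansion system according to claim 11, characterized in that the acquisition unit comprises:
a search-word obtaining module, configured to search an article library with a preset keyword to obtain highly relevant articles, segment the highly relevant articles into words, and take the segmentation result as the words obtained by the search; and
a comparison-based keyword obtaining module, which counts the number of words obtained by the search and the occurrences of each word, sorts the words in descending order of count, and takes a certain leading proportion of them as the current keywords.
14. The keyword expansion system according to claim 12 or 13, characterized in that the search-word obtaining module searches the article library with a preset keyword to obtain highly relevant articles, segments the highly relevant articles into words, removes stop words after segmentation, then obtains the co-occurring words that appear together with the preset keyword, and takes the co-occurring words as the words obtained by the search.
15. The keyword expansion system according to any one of claims 11-14, characterized in that the error between the keywords obtained by the current search and those obtained by the previous search is the ratio of the number of keywords that differ between the two searches to the number of keywords obtained by the current search.
16. The keyword expansion system according to claim 15, characterized in that the first n keywords of the current search and of the previous search are taken for the error calculation, where 5≤n≤10.
17. The keyword expansion system according to any one of claims 11-16, characterized in that the preset error threshold is less than 20%.
18. The keyword expansion system according to any one of claims 11-17, characterized in that if the keywords obtained by the current search are the same as those obtained by the previous search, the keywords obtained by the current search are determined as the extended keywords.
19. A classified corpus annotation system, characterized by comprising:
a keyword determination unit, which determines one or more initial core keywords for each category;
a keyword expansion unit, which obtains the extended keywords of each category from the initial core keywords by means of the keyword expansion system according to any one of claims 1-18; and
an annotation unit, which searches with the extended keywords of a category, selects the classified corpus from the results, and annotates it.
20. One or more computer-readable media having computer-executable instructions that, when executed by a computer, perform a keyword expansion method, the method comprising:
performing a search based on a predetermined initial keyword to obtain current keywords;
using the current keywords obtained as the basis of the next search and searching in a loop by keyword iteration; and
if the error between the keywords obtained by the current search and those obtained by the previous search is within a preset error threshold, ending the loop search and determining the keywords obtained by the current search as the extended keywords.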
PCT/CN2013/088586 2013-09-29 2013-12-05 Keyword expansion method and system, and classified corpus annotation method and system WO2015043066A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2016518124A JP6231668B2 (ja) 2013-09-29 2013-12-05 キーワード拡張方法及びシステム並びに分類コーパス注釈方法及びシステム
EP13894407.9A EP3051431A4 (en) 2013-09-29 2013-12-05 Keyword expansion method and system, and classified corpus annotation method and system
US15/025,573 US20160232211A1 (en) 2013-09-29 2013-12-05 Keyword expansion method and system, and classified corpus annotation method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310456381.X 2013-09-29
CN201310456381.XA CN104516903A (zh) 2013-09-29 2013-09-29 Keyword expansion method and system, and classified corpus annotation method and system

Publications (1)

Publication Number Publication Date
WO2015043066A1 true WO2015043066A1 (zh) 2015-04-02

Family

ID=52741911

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/088586 WO2015043066A1 (zh) 2013-09-29 2013-12-05 Keyword expansion method and system, and classified corpus annotation method and system

Country Status (5)

Country Link
US (1) US20160232211A1 (zh)
EP (1) EP3051431A4 (zh)
JP (1) JP6231668B2 (zh)
CN (1) CN104516903A (zh)
WO (1) WO2015043066A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228869A (zh) * 2018-01-15 2018-06-29 北京奇艺世纪科技有限公司 一种文本分类模型的建立方法及装置
CN110704590A (zh) * 2019-09-27 2020-01-17 支付宝(杭州)信息技术有限公司 扩充训练样本的方法和装置

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765862A (zh) * 2015-04-22 2015-07-08 百度在线网络技术(北京)有限公司 文档检索的方法和装置
CN106156372B (zh) * 2016-08-31 2019-07-30 北京北信源软件股份有限公司 一种互联网网站的分类方法及装置
CN106776937B (zh) * 2016-12-01 2020-09-29 腾讯科技(深圳)有限公司 一种确定内链关键词的方法和装置
CN107168943B (zh) * 2017-04-07 2018-07-03 平安科技(深圳)有限公司 话题预警的方法和装置
CN108647225A (zh) * 2018-03-23 2018-10-12 浙江大学 一种电商黑灰产舆情自动挖掘方法和系统
CN110399548A (zh) * 2018-04-20 2019-11-01 北京搜狗科技发展有限公司 一种搜索处理方法、装置、电子设备以及存储介质
CN108984519B (zh) * 2018-06-14 2022-07-05 华东理工大学 基于双模式的事件语料库自动构建方法、装置及存储介质
CN110309355B (zh) * 2018-06-15 2023-05-16 腾讯科技(深圳)有限公司 内容标签的生成方法、装置、设备及存储介质
CN108920467B (zh) * 2018-08-01 2021-04-27 北京三快在线科技有限公司 多义词词义学习方法及装置、搜索结果显示方法
CN111078858B (zh) * 2018-10-19 2023-06-09 阿里巴巴集团控股有限公司 文章搜索方法、装置及电子设备
CN109561211B (zh) * 2018-11-27 2021-07-27 维沃移动通信有限公司 一种信息显示方法及移动终端
US10839802B2 (en) * 2018-12-14 2020-11-17 Motorola Mobility Llc Personalized phrase spotting during automatic speech recognition
CN110162621B (zh) * 2019-02-22 2023-05-23 腾讯科技(深圳)有限公司 分类模型训练方法、异常评论检测方法、装置及设备
CN110134799B (zh) * 2019-05-29 2022-03-01 四川长虹电器股份有限公司 一种基于bm25算法的文本语料库的搭建和优化方法
CN110489526A (zh) * 2019-08-13 2019-11-22 上海市儿童医院 一种用于医学检索的检索词扩展方法、装置及存储介质
CN110619067A (zh) * 2019-08-27 2019-12-27 深圳证券交易所 基于行业分类的检索方法、检索装置及可读存储介质
CN111026884B (zh) * 2019-12-12 2023-06-02 上海益商网络科技有限公司 一种提升人机交互对话语料质量与多样性的对话语料库生成方法
CN112883160B (zh) * 2021-02-25 2023-04-07 江西知本位科技创业发展有限公司 一种用于成果转移转化的捕捉方法及辅助系统

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341899A (zh) * 2000-09-07 2002-03-27 国际商业机器公司 为文字文档自动生成摘要的方法
CN102682119A (zh) * 2012-05-16 2012-09-19 崔志明 一种基于动态知识的深层网页数据获取方法

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020073079A1 (en) * 2000-04-04 2002-06-13 Merijn Terheggen Method and apparatus for searching a database and providing relevance feedback
JP4773003B2 (ja) * 2001-08-20 2011-09-14 株式会社リコー 文書検索装置、文書検索方法、プログラム及びコンピュータに読み取り可能な記憶媒体
JP2004029906A (ja) * 2002-06-21 2004-01-29 Fuji Xerox Co Ltd 文書検索装置および方法
DE502005003997D1 (de) * 2005-06-09 2008-06-19 Sie Ag Surgical Instr Engineer Ophthalmologische Vorrichtung für die Auflösung von Augengewebe
US8266162B2 (en) * 2005-10-31 2012-09-11 Lycos, Inc. Automatic identification of related search keywords
US20080071744A1 (en) * 2006-09-18 2008-03-20 Elad Yom-Tov Method and System for Interactively Navigating Search Results
JP4819628B2 (ja) * 2006-09-19 2011-11-24 ヤフー株式会社 ドキュメントデータを検索する方法、サーバ、およびプログラム
US7974989B2 (en) * 2007-02-20 2011-07-05 Kenshoo Ltd. Computer implemented system and method for enhancing keyword expansion
KR101078864B1 (ko) * 2009-03-26 2011-11-02 한국과학기술원 질의/문서 주제 범주 변화 분석 시스템 및 그 방법과 이를 이용한 질의 확장 기반 정보 검색 시스템 및 그 방법
JP5321258B2 (ja) * 2009-06-09 2013-10-23 日本電気株式会社 情報収集システムおよび情報収集方法ならびにそのプログラム
CN101996200B (zh) * 2009-08-19 2014-03-12 华为技术有限公司 一种搜索文档的方法和装置
JP5751481B2 (ja) * 2011-05-09 2015-07-22 廣川 佐千男 検索方法、検索装置及びプログラム
CA2747145C (en) * 2011-07-22 2018-08-21 Open Text Corporation Methods, systems, and computer-readable media for semantically enriching content and for semantic navigation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3051431A4 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228869A (zh) * 2018-01-15 2018-06-29 北京奇艺世纪科技有限公司 一种文本分类模型的建立方法及装置
CN110704590A (zh) * 2019-09-27 2020-01-17 支付宝(杭州)信息技术有限公司 扩充训练样本的方法和装置
CN110704590B (zh) * 2019-09-27 2022-04-12 支付宝(杭州)信息技术有限公司 扩充训练样本的方法和装置

Also Published As

Publication number Publication date
JP6231668B2 (ja) 2017-11-15
EP3051431A1 (en) 2016-08-03
EP3051431A4 (en) 2017-05-03
US20160232211A1 (en) 2016-08-11
JP2016532175A (ja) 2016-10-13
CN104516903A (zh) 2015-04-15

Similar Documents

Publication Publication Date Title
WO2015043066A1 (zh) 关键词扩展方法及系统、及分类语料标注方法及系统
CN107609121B (zh) 基于LDA和word2vec算法的新闻文本分类方法
JP6526329B2 (ja) ウェブページトレーニング方法及び装置、検索意図識別方法及び装置
Li et al. Twiner: named entity recognition in targeted twitter stream
CN106156204B (zh) 文本标签的提取方法和装置
WO2019218514A1 (zh) 网页目标信息的提取方法、装置及存储介质
WO2021068339A1 (zh) 文本分类方法、装置及计算机可读存储介质
CN110825877A (zh) 一种基于文本聚类的语义相似度分析方法
CN105095204B (zh) 同义词的获取方法及装置
US8001139B2 (en) Using a bipartite graph to model and derive image and text associations
WO2017167067A1 (zh) 网页文本分类的方法和装置,网页文本识别的方法和装置
US20150074112A1 (en) Multimedia Question Answering System and Method
CN104881458B (zh) 一种网页主题的标注方法和装置
TWI554896B (zh) Information Classification Method and Information Classification System Based on Product Identification
CN111104526A (zh) 一种基于关键词语义的金融标签提取方法及系统
CN106547864B (zh) 一种基于查询扩展的个性化信息检索方法
CN103617290B (zh) 中文机器阅读系统
CN108038099B (zh) 基于词聚类的低频关键词识别方法
CN108304509B (zh) 一种基于文本多向量表示相互学习的垃圾评论过滤方法
CN109145180B (zh) 一种基于增量聚类的企业热点事件挖掘方法
CN112256861A (zh) 一种基于搜索引擎返回结果的谣言检测方法及电子装置
CN106021424B (zh) 一种文献作者重名检测方法
Bhutada et al. Semantic latent dirichlet allocation for automatic topic extraction
CN108536665A (zh) 一种确定语句一致性的方法及装置
CN109753646B (zh) 一种文章属性识别方法以及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13894407

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016518124

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 15025573

Country of ref document: US

REEP Request for entry into the european phase

Ref document number: 2013894407

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2013894407

Country of ref document: EP