KR101479040B1 - Method, apparatus, and computer storage medium for automatically adding tags to document - Google Patents

Info

Publication number
KR101479040B1
Authority
KR
South Korea
Prior art keywords
document
words
corpus
characteristic
tag
Prior art date
Application number
KR20147019605A
Other languages
Korean (ko)
Other versions
KR20140093762A (en)
Inventor
시앙 흐어
왕예
펑 지아오
Original Assignee
텐센트 테크놀로지(센젠) 컴퍼니 리미티드
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to CN201210001611.9A (patent CN103198057B)
Application filed by 텐센트 테크놀로지(센젠) 컴퍼니 리미티드
Priority to PCT/CN2012/086733 (WO2013102396A1)
Publication of KR20140093762A
Application granted
Publication of KR101479040B1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

Embodiments of the present invention provide a method and apparatus for automatically adding a tag to a document. The method comprises: determining a plurality of candidate tag words; determining a corpus containing a plurality of texts; selecting commonly used words from the corpus as characteristic words; for each characteristic word and each candidate tag word, determining the coincidence probability that the candidate tag word occurs when the characteristic word occurs; extracting characteristic words from the document and calculating a weight for each extracted characteristic word; for each candidate tag word, calculating, within the corpus, the weighted coincidence probability that the candidate tag word occurs together with all of the characteristic words extracted from the document; and selecting the candidate tag word having the highest weighted coincidence probability as the tag word to be added to the document. Embodiments of the present invention add tags to a document intelligently, and the tags are not limited to keywords occurring in the document.

Description

METHOD, APPARATUS, AND COMPUTER STORAGE MEDIUM FOR AUTOMATICALLY ADDING TAGS TO DOCUMENT

This application claims priority to Chinese Patent Application No. 201210001611.9, filed with the State Intellectual Property Office on January 5, 2012 and entitled "METHOD AND APPARATUS FOR AUTOMATICALLY ADDING TAG TO DOCUMENT", which is incorporated herein by reference in its entirety.

The present invention relates to the field of Internet documents, and more particularly to a method and apparatus for automatically adding tags to a document.

The tags used to organize content on the Internet are key words that are highly related to the document. The tags briefly describe and classify the contents of the document, which facilitates searching and sharing.

Currently, there are three main ways to add tags to a document: 1) a manual tag method, in which a particular tag is manually assigned to the document; 2) a keyword tag method, in which an important keyword, extracted automatically from the document by analyzing its contents, is taken as a tag; and 3) a social tag method, in which users add tags to their own documents themselves. All three approaches have problems: 1) with the manual tag method, tags cannot be added automatically to a large number of documents; 2) with the keyword tag method, only keywords occurring in the document can be selected as tags, and not all of those keywords are suitable as tags; and 3) with the social tag method, each user must add tags to a document individually, so the tags are inconsistent because different users apply different standards.

According to one embodiment of the present invention, a method and apparatus are provided for automatically adding tags to a document, whereby tags that are not limited to keywords in the document can be intelligently added to the document.

The technical solution of an embodiment of the present invention is implemented as follows.

A method for automatically adding tags to a document includes:

determining a plurality of candidate tag words corresponding to the document;

determining a corpus comprising a plurality of texts; selecting commonly used words from the corpus as characteristic words; determining, for each of the characteristic words and each of the candidate tag words, a probability that the candidate tag word coincides with the characteristic word;

extracting characteristic words from the document, and calculating a weight for each of the extracted characteristic words; and

calculating, in the corpus, a weighted probability that each of the candidate tag words occurs simultaneously with all of the characteristic words extracted from the document; and selecting a candidate tag word having a high weighted coincidence probability as a tag word to be added to the document.

An apparatus for automatically adding tags to a document includes:

a candidate tag word determination module configured to determine a plurality of candidate tag words corresponding to the document;

a coincidence probability determination module configured to determine a corpus containing a plurality of texts, to select commonly used words from the corpus as characteristic words, and, for each of the characteristic words and each of the candidate tag words, to determine the probability that the candidate tag word coincides with the characteristic word;

a weight calculation module configured to extract characteristic words from the document and calculate a weight for each of the extracted characteristic words;

a weighted coincidence probability calculation module configured to calculate, in the corpus, a weighted probability that each of the candidate tag words occurs simultaneously with all of the characteristic words extracted from the document; and

a tag word addition module configured to select a candidate tag word having a high weighted coincidence probability as a tag word to be added to the document.

In the method and apparatus for automatically adding a tag to a document according to an embodiment of the present invention, tags that are not limited to keywords in the document can be intelligently added to the document by calculating, in the corpus, the probability that each characteristic word coincides with each candidate tag word, converting the coincidence probability into a vote from the characteristic word for the candidate tag word, and taking the candidate tag word that receives the most votes as the tag word to be added to the document.

Figure 1 is a flow diagram of a method for automatically adding a tag to a document in accordance with one embodiment of the present invention.
Figure 2 is a schematic diagram of the structure of an apparatus for automatically adding tags to a document in accordance with one embodiment of the present invention.

According to one embodiment of the present invention, a method of automatically adding a tag to a document is provided. Figure 1 is a flow chart of the method, which includes the following steps.

In step 101, a plurality of candidate tag words corresponding to the document are determined.

In this step, the plurality of candidate tag words corresponding to the document may be determined by, but not limited to, the following three methods:

1) a manual tag method, in which a particular tag is manually assigned to the document;

2) a keyword tag method, in which an important keyword, extracted automatically from the document by analyzing its contents, is taken as a tag; and

3) a social tag method, in which users add tags to their own documents themselves.

In the case where the candidate tag words are determined by the manual tag method or the social tag method, the candidate tag words are not limited to words occurring in the document.

In step 102, a corpus containing a plurality of texts is determined.

For example, if one million texts are obtained from the Internet, the one million acquired texts are collectively referred to as the corpus.

In step 103, commonly used words are selected from the corpus as characteristic words, and, for each characteristic word and each candidate tag word, the probability that the candidate tag word occurs simultaneously with the characteristic word in the corpus is determined.

In step 104, characteristic words are extracted from the document, and a weight is calculated for each extracted characteristic word.

In step 105, for each candidate tag word, a weighted probability that the candidate tag word occurs simultaneously with all of the characteristic words occurring in the document is calculated in the corpus, and a candidate tag word with a high weighted coincidence probability is selected as the tag word to be added to the document.

In step 103, the coincidence probability is denoted P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus. P(X|Y) can be determined by various methods, such as the following.

In the first scheme, P(X|Y) equals the number of times X coincides with Y in the same text of the corpus divided by the number of times Y occurs in the corpus.
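
As an illustration of the first scheme, the following minimal Python sketch counts, over a toy corpus, the texts in which a candidate tag word X and a characteristic word Y both appear, and divides by the number of texts in which Y appears. It assumes that each text has already been segmented into words and that a word is counted once per text, which is only one possible reading of "the number of times Y occurs in the corpus". The function name and the sample data are hypothetical.

```python
from collections import Counter
from itertools import product


def cooccurrence_probabilities(corpus, tag_words, characteristic_words):
    """Estimate P(X|Y) = (texts containing X and Y) / (texts containing Y)."""
    y_counts = Counter()   # number of texts containing Y
    xy_counts = Counter()  # number of texts containing both X and Y
    for text in corpus:
        words = set(text)  # assumption: count each word once per text
        present_y = [y for y in characteristic_words if y in words]
        present_x = [x for x in tag_words if x in words]
        y_counts.update(present_y)
        xy_counts.update(product(present_x, present_y))
    # Only non-zero probabilities are stored.
    return {(x, y): xy_counts[(x, y)] / y_counts[y] for (x, y) in xy_counts}


corpus = [
    ["spy", "chase", "explosion", "action"],
    ["spy", "romance", "paris"],
    ["chase", "explosion", "action"],
    ["wedding", "romance", "comedy"],
]
probs = cooccurrence_probabilities(
    corpus,
    tag_words={"action", "romance"},
    characteristic_words={"spy", "chase", "explosion", "wedding"},
)
print(probs[("action", "spy")])    # 0.5
print(probs[("action", "chase")])  # 1.0
```

Pairs that never occur in the same text are simply absent from the returned dictionary, which corresponds to a coincidence probability of zero.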

In the second scheme, P(X|Y) is given by

Figure 112014066258084-pct00001
where H(X,Y) represents the joint entropy of X and Y, I(X,Y) represents the mutual information of X and Y, H(X) represents the information entropy of X, and H(Y) represents the information entropy of Y.
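
The quantities named in the second scheme can be estimated from the corpus. The sketch below is an illustration only: it assumes that the occurrence of a word in a text is modelled as a binary random variable, and it uses the standard identity I(X,Y) = H(X) + H(Y) - H(X,Y). It does not reproduce the patent's formula for P(X|Y), which is contained in the figure and is not available in this text. All names and data are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in bits) of an empirical distribution."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_quantities(corpus, x, y):
    """Return H(X), H(Y), H(X,Y) and I(X,Y) for per-text occurrence of x and y."""
    n = len(corpus)
    # Joint distribution over (x present?, y present?) across the texts.
    joint = Counter((x in text, y in text) for text in corpus)
    x_marginal, y_marginal = Counter(), Counter()
    for (x_present, y_present), count in joint.items():
        x_marginal[x_present] += count
        y_marginal[y_present] += count
    h_x = entropy(x_marginal.values(), n)
    h_y = entropy(y_marginal.values(), n)
    h_xy = entropy(joint.values(), n)
    return {"H(X)": h_x, "H(Y)": h_y, "H(X,Y)": h_xy, "I(X,Y)": h_x + h_y - h_xy}


corpus = [
    ["spy", "chase", "explosion", "action"],
    ["spy", "romance", "paris"],
    ["chase", "explosion", "action"],
    ["wedding", "romance", "comedy"],
]
dependent = entropy_quantities(corpus, "action", "chase")
independent = entropy_quantities(corpus, "action", "spy")
print(dependent["I(X,Y)"], independent["I(X,Y)"])
```

In this toy corpus "action" and "chase" always appear together, so their mutual information is one bit, whereas "action" and "spy" occur independently and their mutual information is zero.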

In the third scheme, P(X|Y) is determined by using a lexical database such as WordNet.

In step 104, for each of the extracted characteristic words, the weight for the characteristic word may be calculated based on the number of times the characteristic word occurs in the document and the number of texts in the corpus in which the characteristic word occurs.

The weight for a characteristic word Y extracted from the document is denoted W_Y, where W_Y equals the product of the number of times Y occurs in the document and the number of texts in the corpus in which Y occurs.
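
The weight calculation can be sketched as follows. This minimal example follows the definition stated in the text literally: the count of Y in the document multiplied by the number of corpus texts containing Y. Conventional TF-IDF schemes instead use an inverse document frequency, so that widespread words receive lower weight; swapping that in would change only the line that computes the product. The function name and data are hypothetical.

```python
from collections import Counter


def characteristic_word_weights(document, corpus, characteristic_words):
    """W_Y = (occurrences of Y in the document) * (corpus texts containing Y)."""
    term_counts = Counter(w for w in document if w in characteristic_words)
    weights = {}
    for y, tf in term_counts.items():
        text_count = sum(1 for text in corpus if y in text)
        weights[y] = tf * text_count  # literal reading of the stated definition
    return weights


corpus = [
    ["spy", "chase", "explosion", "action"],
    ["spy", "romance", "paris"],
    ["chase", "explosion", "action"],
    ["wedding", "romance", "comedy"],
]
document = ["spy", "chase", "spy", "paris"]
weights = characteristic_word_weights(
    document, corpus, characteristic_words={"spy", "chase", "explosion", "wedding"}
)
print(weights)  # {'spy': 4, 'chase': 2}
```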

In step 105, the weighted coincidence probability is

$P_X = \sum_{i=1}^{n} W_{Y_i} \cdot P(X \mid Y_i)$

where Y_i represents one of the characteristic words extracted from the document, W_{Y_i} denotes the weight for Y_i, and n denotes the number of characteristic words extracted from the document.

In step 105, the weighted coincidence probability P_X can be calculated only for candidate tag words that coincide with one or more characteristic words extracted from the document, rather than for all candidate tag words.
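
As a worked illustration with hypothetical numbers (not taken from the patent), and assuming that the weighted coincidence probability is the weighted sum of P(X|Y_i) over the extracted characteristic words, suppose the document yields two characteristic words with weights W_{Y_1} = 4 and W_{Y_2} = 2, and the corpus gives P(X|Y_1) = 0.5 and P(X|Y_2) = 1.0 for a candidate tag word X:

```latex
P_X = \sum_{i=1}^{2} W_{Y_i}\, P(X \mid Y_i) = 4 \times 0.5 + 2 \times 1.0 = 4
```

With weights of this kind the result is a ranking score rather than a value confined to [0, 1]; only its order relative to the scores of the other candidate tag words matters.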

Certain embodiments will be introduced in more detail below.

First Embodiment

In step 1, a tag word set is prepared.

A number of candidate tag words corresponding to the document are obtained to construct a tag word set as required. For example, if tags need to be added to documents related to movies, the tag word set may include tag words such as movie genres and celebrities.

In step 2, a corpus is prepared.

A large number of related texts may be collected from the Internet as a corpus, which is used to gather statistics on the co-occurrence relationships between words.

In step 3, characteristic words are extracted from the corpus.

Word segmentation is performed on the texts in the corpus, and the term frequency (TF) of each word is then counted. High-frequency words, stop words, and low-frequency words are removed, and the remaining commonly used words are selected as characteristic words.
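
The selection in step 3 can be sketched as a frequency-band filter. The example below assumes the texts are already segmented; the frequency thresholds and the list of excluded words (for instance stop words) are hypothetical, since the text does not specify any cut-off values.

```python
from collections import Counter


def select_characteristic_words(corpus, excluded_words, min_count, max_count):
    """Keep words whose corpus-wide term frequency lies in [min_count, max_count]."""
    term_frequency = Counter(word for text in corpus for word in text)
    return {
        word
        for word, count in term_frequency.items()
        if min_count <= count <= max_count and word not in excluded_words
    }


corpus = [
    ["the", "spy", "chase"],
    ["the", "spy", "romance"],
    ["the", "chase", "wedding"],
    ["the", "paris"],
]
selected = select_characteristic_words(
    corpus, excluded_words={"the"}, min_count=2, max_count=3
)
print(sorted(selected))  # ['chase', 'spy']
```

Here "the" is dropped both as an excluded word and as a high-frequency word, while "romance", "wedding", and "paris" are dropped as low-frequency words.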

In step 4, the probability P(X|Y) that each of the candidate tag words occurs simultaneously with each of the characteristic words is calculated.

P(X|Y) equals the number of times X and Y occur together in the same text of the corpus divided by the number of times Y occurs in the corpus,

where X denotes one of the candidate tag words, and Y denotes one of the characteristic words.

In step 5, tag words are automatically added to the document. The specific steps are as follows.

In step I, word segmentation is performed on the document.

In step II, all of the characteristic words occurring in the document are extracted according to the word segmentation result, and the weight W_Y for each extracted characteristic word Y is calculated as W_Y = TF × IDF, where TF denotes the number of times Y occurs in the document and IDF denotes the number of texts in the corpus in which Y occurs.

In step III, candidate tag words that coincide with at least one of the extracted characteristic words (i.e., whose coincidence probability is not zero) are extracted based on the coincidence probabilities calculated in step 4.

In step IV, for each of the extracted candidate tag words, the weighted coincidence probability of the candidate tag word with all of the characteristic words extracted from the document is calculated as

$P_X = \sum_{i=1}^{n} W_{Y_i} \cdot P(X \mid Y_i)$

where Y_i represents one of the characteristic words extracted from the document, W_{Y_i} denotes the weight for Y_i, and n denotes the number of characteristic words extracted from the document.

In step V, all of the extracted candidate tag words are ranked in descending order of P_X, and one or more candidate tag words having the highest P_X are selected as tag words to be added to the document.

In this embodiment, a subset of the candidate tag words is first extracted in step III, and the weighted coincidence probability is then calculated only for the extracted candidate tag words. This speeds up computation and saves system resources. According to other embodiments of the invention, the weighted coincidence probability can be calculated for all of the candidate tag words. For a candidate tag word that does not coincide with any of the characteristic words, the calculated weighted coincidence probability is P_X = 0, and the candidate tag word is ranked at the end of the queue of candidate tag words in step V.
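
Steps III to V can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes the weighted coincidence probability is the weighted sum of P(X|Y_i), that the weights and coincidence probabilities have already been computed, and that only non-zero probabilities are stored. The function name, the `top_k` parameter, and the sample values are hypothetical.

```python
def rank_tag_words(weights, probabilities, top_k=1):
    """Rank candidate tag words by P_X = sum_i W_{Y_i} * P(X|Y_i).

    weights:       {characteristic word Y: W_Y} for words found in the document
    probabilities: {(X, Y): P(X|Y)} with only non-zero entries stored
    """
    scores = {}
    for (x, y), p in probabilities.items():
        # Step III: only tag words coinciding with a document word get a score.
        if y in weights and p > 0:
            # Step IV: accumulate the weighted "vote" of Y for X.
            scores[x] = scores.get(x, 0.0) + weights[y] * p
    # Step V: sort by descending score and keep the best top_k tag words.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


weights = {"spy": 4, "chase": 2}
probabilities = {
    ("action", "spy"): 0.5,
    ("action", "chase"): 1.0,
    ("romance", "spy"): 0.5,
    ("romance", "wedding"): 1.0,
}
ranking = rank_tag_words(weights, probabilities, top_k=2)
print(ranking)  # [('action', 4.0), ('romance', 2.0)]
```

Because tag words with no coinciding characteristic word never enter `scores`, they are skipped rather than being scored as zero and sorted to the end, which reflects the resource saving described above.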

In another embodiment of the present invention, the coincidence probability P(X|Y) of the characteristic word and the candidate tag word may be calculated in other manners. For example, P(X|Y) may be calculated as

Figure 112014066258084-pct00006
where H(X,Y) denotes the joint entropy of X and Y, I(X,Y) denotes the mutual information of X and Y, H(X) denotes the information entropy of X, and H(Y) denotes the information entropy of Y. Alternatively, the relationship between the characteristic word and the candidate tag word may be determined by using a lexical database such as WordNet.

According to one embodiment of the present invention, there is further provided an apparatus for automatically adding a tag to a document. Figure 2 is a schematic diagram of the structure of the apparatus, which includes:

a candidate tag word determination module (201) configured to determine a plurality of candidate tag words corresponding to a document;

a coincidence probability determination module (202) configured to determine a corpus containing a plurality of texts, to select commonly used words from the corpus as characteristic words, and, for each of the characteristic words and each of the candidate tag words, to determine the probability that the candidate tag word coincides with the characteristic word;

a weight calculation module (203) configured to extract characteristic words from the document and calculate a weight for each of the extracted characteristic words;

a weighted coincidence probability calculation module (204) configured to calculate, in the corpus, a weighted probability that each of the candidate tag words occurs simultaneously with all of the characteristic words occurring in the document; and

a tag word addition module (205) configured to select a candidate tag word having a high weighted coincidence probability as a tag word to be added to the document.

In the above-described apparatus, the coincidence probability can be expressed as P(X|Y), where X denotes one of the candidate tag words and Y denotes one of the characteristic words occurring in the corpus. The coincidence probability determination module 202 may calculate P(X|Y) as follows.

P(X|Y) equals the number of times X and Y occur together in the same text of the corpus divided by the number of times Y occurs in the corpus.

As an alternative, P(X|Y) is given by

Figure 112014066258084-pct00007
where H(X,Y) denotes the joint entropy of X and Y, and I(X,Y) denotes the mutual information of X and Y.

Alternatively, P(X|Y) is determined by using a lexical database.

In the above-described apparatus, the weight for the characteristic word Y extracted from the document is denoted W_Y and is calculated by the weight calculation module 203: W_Y equals the product of the number of times Y occurs in the document and the number of texts in the corpus in which Y occurs.

In the above-described apparatus, the weighted coincidence probability is

$P_X = \sum_{i=1}^{n} W_{Y_i} \cdot P(X \mid Y_i)$

where Y_i represents one of the characteristic words extracted from the document, W_{Y_i} denotes the weight for Y_i, and n denotes the number of characteristic words extracted from the document.

In the apparatus described above, the weighted coincidence probability calculation module 204 may calculate the weighted coincidence probability only for candidate tag words that coincide with one or more characteristic words extracted from the document.

In conclusion, in the method and apparatus for automatically adding a tag to a document according to embodiments of the present invention, tags that are not limited to keywords occurring in the document can be intelligently added to the document by calculating the probability that each characteristic word coincides with each candidate tag word in the corpus, converting the coincidence probability into a vote from the characteristic word for the candidate tag word, and taking the candidate tag word that receives the most votes as the tag word to be added to the document. Because the selection is based on coincidence probability statistics, the relevance between the tag word and the document is improved.

According to one embodiment of the present invention, there is further provided a machine-readable storage medium storing instructions that enable a machine to perform the method of automatically adding a tag to a document described herein. A system or apparatus may be provided that includes a storage medium on which software program codes implementing the functions of any of the above embodiments are stored, and a computer (or CPU or MPU) of the system or apparatus may read and execute the program codes stored in the storage medium.

In this case, the program codes read from the storage medium may implement any one of the above-described embodiments. Therefore, the program codes and the storage medium storing them constitute a part of the present invention.

Examples of storage media that provide the program codes include a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM), magnetic tape, non-volatile memory, and ROM. Optionally, the program codes may be downloaded from a server computer via a communications network.

Furthermore, as will be appreciated by those skilled in the art, any one of the above-described embodiments may be implemented by means of a computer-readable recording medium.

Furthermore, any one of the above-described embodiments may be implemented by writing the program codes read from the storage medium into a memory provided on an expansion board inserted into the computer, or into a memory provided in an expansion unit connected to the computer, and then instructing a CPU or the like mounted on the expansion board or expansion unit, based on the program codes, to perform some or all of the actual operations.

The above-described preferred embodiments of the present invention are not intended to limit the scope of the invention. Any modifications, equivalents, and improvements that fall within the spirit and principles of the present invention are within the scope of the present invention.

Claims (15)

  1. A method of automatically adding a tag to a document, comprising:
    determining a plurality of candidate tag words corresponding to the document;
    determining a corpus comprising a plurality of texts; selecting commonly used words from the corpus as characteristic words; determining, for each of the characteristic words and each of the candidate tag words, a probability that the candidate tag word coincides with the characteristic word;
    extracting characteristic words from the document and calculating a weight for each of the extracted characteristic words; and
    calculating, in the corpus, a weighted probability that each of the candidate tag words occurs simultaneously with all of the characteristic words extracted from the document; and selecting a candidate tag word having a high weighted coincidence probability as a tag word to be added to the document,
    wherein the weight for a characteristic word Y extracted from the document is represented by W_Y, and W_Y is equal to the product of the number of times Y occurs in the document and the number of texts in the corpus in which Y occurs.
  2. The method according to claim 1,
    wherein the coincidence probability is represented as P(X|Y), where X represents one of the candidate tag words and Y represents one of the characteristic words occurring in the corpus; and
    P(X|Y) is determined as the result of dividing the number of times X and Y occur together in the same text included in the corpus by the number of times Y occurs in the corpus.
  3. The method according to claim 1,
    wherein the coincidence probability is represented as P(X|Y), where X represents one of the candidate tag words and Y represents one of the characteristic words occurring in the corpus; and
    P(X|Y) is determined as
    Figure 112014066258084-pct00010
    where H(X,Y) represents the joint entropy of X and Y, and I(X,Y) represents the mutual information of X and Y.
  4. The method according to claim 1,
    wherein the coincidence probability is represented as P(X|Y), where X represents one of the candidate tag words and Y represents one of the characteristic words occurring in the corpus; and
    P(X|Y) is determined by using a lexical database.
  5. (Deleted)
  6. The method according to claim 1,
    wherein the weighted coincidence probability is
    $P_X = \sum_{i=1}^{n} W_{Y_i} \cdot P(X \mid Y_i)$
    where Y_i represents one of the characteristic words extracted from the document, W_{Y_i} represents the weight for Y_i, and n represents the number of characteristic words extracted from the document.
  7. The method according to claim 1,
    wherein calculating, in the corpus, the weighted probability that each of the candidate tag words occurs simultaneously with all of the characteristic words extracted from the document comprises:
    calculating, in the corpus, the weighted probability for each candidate tag word that coincides with at least one characteristic word extracted from the document.
  8. An apparatus for automatically adding a tag to a document, comprising:
    a candidate tag word determination module configured to determine a plurality of candidate tag words corresponding to the document;
    a coincidence probability determination module configured to determine a corpus containing a plurality of texts, to select commonly used words from the corpus as characteristic words, and, for each of the characteristic words and each of the candidate tag words, to determine a probability that the candidate tag word coincides with the characteristic word;
    a weight calculation module configured to extract characteristic words from the document and to calculate a weight for each of the extracted characteristic words;
    a weighted coincidence probability calculation module configured to calculate, in the corpus, a weighted probability that each of the candidate tag words occurs simultaneously with all of the characteristic words extracted from the document; and
    a tag word addition module configured to select a candidate tag word having a high weighted coincidence probability as a tag word to be added to the document,
    wherein the weight for a characteristic word Y extracted from the document is represented by W_Y, and the weight calculation module is configured to calculate W_Y as the product of the number of times Y occurs in the document and the number of texts in the corpus in which Y occurs.
  9. The apparatus according to claim 8,
    wherein the coincidence probability is represented as P(X|Y), where X represents one of the candidate tag words and Y represents one of the characteristic words occurring in the corpus; and
    the coincidence probability determination module is configured to calculate P(X|Y) as the result of dividing the number of times X and Y occur together in the same text included in the corpus by the number of times Y occurs in the corpus.
  10. The apparatus according to claim 8,
    wherein the coincidence probability is represented as P(X|Y), where X represents one of the candidate tag words and Y represents one of the characteristic words occurring in the corpus; and
    the coincidence probability determination module is configured to determine P(X|Y) as
    Figure 112014066258084-pct00013
    where H(X,Y) denotes the joint entropy of X and Y, and I(X,Y) denotes the mutual information of X and Y.
  11. The apparatus according to claim 8,
    wherein the coincidence probability is represented as P(X|Y), where X represents one of the candidate tag words and Y represents one of the characteristic words occurring in the corpus; and
    the coincidence probability determination module is configured to calculate P(X|Y) by using a lexical database.
  12. (Deleted)
  13. The apparatus according to any one of claims 8 to 11,
    wherein the weighted coincidence probability is
    $P_X = \sum_{i=1}^{n} W_{Y_i} \cdot P(X \mid Y_i)$
    where Y_i represents one of the characteristic words extracted from the document, W_{Y_i} represents the weight for Y_i, and n represents the number of characteristic words extracted from the document.
  14. The apparatus according to any one of claims 8 to 11,
    wherein the weighted coincidence probability calculation module is configured to calculate, in the corpus, the weighted probability for each candidate tag word that coincides with one or more characteristic words extracted from the document.
  15. A computer storage medium storing a computer program for implementing the method according to any one of claims 1 to 4, 6, and 7.
KR20147019605A 2012-01-05 2012-12-17 Method, apparatus, and computer storage medium for automatically adding tags to document KR101479040B1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201210001611.9 2012-01-05
CN201210001611.9A CN103198057B (en) 2012-01-05 2012-01-05 One kind adds tagged method and apparatus to document automatically
PCT/CN2012/086733 WO2013102396A1 (en) 2012-01-05 2012-12-17 Method, apparatus, and computer storage medium for automatically adding tags to document

Publications (2)

Publication Number Publication Date
KR20140093762A (en) 2014-07-28
KR101479040B1 (en) 2015-01-05

Family ID: 48720627

Family Applications (1)

Application Number Title Priority Date Filing Date
KR20147019605A KR101479040B1 (en) 2012-01-05 2012-12-17 Method, apparatus, and computer storage medium for automatically adding tags to document

Country Status (6)

Country Link
US (1) US9146915B2 (en)
EP (1) EP2801917A4 (en)
JP (1) JP2015506515A (en)
KR (1) KR101479040B1 (en)
CN (1) CN103198057B (en)
WO (1) WO2013102396A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199898B (en) * 2014-08-26 2018-05-15 北京小度互娱科技有限公司 Acquisition methods and device, the method for pushing and device of a kind of attribute information
JP6208105B2 (en) * 2014-09-18 2017-10-04 株式会社東芝 Tag assigning apparatus, method, and program
CN105488077B (en) * 2014-10-10 2020-04-28 腾讯科技(深圳)有限公司 Method and device for generating content label
CN104361033B (en) * 2014-10-27 2017-06-09 深圳职业技术学院 A kind of automatic collection method of cancer relevant information and system
CN104462360B (en) * 2014-12-05 2020-02-18 北京奇虎科技有限公司 Method and device for generating semantic identification for text set
CN105989018B (en) * 2015-01-29 2020-04-21 深圳市腾讯计算机系统有限公司 Label generation method and label generation device
US20180075361A1 (en) * 2015-04-10 2018-03-15 Hewlett-Packard Enterprise Development LP Hidden dynamic systems
JP6535858B2 (en) * 2015-04-30 2019-07-03 国立大学法人鳥取大学 Document analyzer, program
WO2017011483A1 (en) * 2015-07-12 2017-01-19 Aravind Musuluri System and method for ranking documents
CN105573968A (en) * 2015-12-10 2016-05-11 天津海量信息技术有限公司 Text indexing method based on rules
CN105740404A (en) * 2016-01-28 2016-07-06 上海晶赞科技发展有限公司 Label association method and device
CN106066870B (en) * 2016-05-27 2019-03-15 南京信息工程大学 A kind of bilingual teaching mode building system of context mark
CN106682149A (en) * 2016-12-22 2017-05-17 湖南科技学院 Label automatic generation method based on meta-search engine
CN107436922A (en) * 2017-07-05 2017-12-05 北京百度网讯科技有限公司 Text label generation method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090045520A (en) * 2007-11-02 2009-05-08 조광현 Method of generating tag word automatically by semantics
KR101011726B1 (en) 2009-06-09 2011-01-28 성균관대학교산학협력단 Apparatus and method for providing snippet

Family Cites Families (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3266246B2 (en) * 1990-06-15 2002-03-18 インターナシヨナル・ビジネス・マシーンズ・コーポレーシヨン Natural language analysis apparatus and method, and knowledge base construction method for natural language analysis
JP3220885B2 (en) * 1993-06-18 2001-10-22 株式会社日立製作所 Keyword assignment system
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
JP2809341B2 (en) * 1994-11-18 1998-10-08 松下電器産業株式会社 Information summarizing method, information summarizing device, weighting method, and teletext receiving device
US6480841B1 (en) * 1997-09-22 2002-11-12 Minolta Co., Ltd. Information processing apparatus capable of automatically setting degree of relevance between keywords, keyword attaching method and keyword auto-attaching apparatus
US6317740B1 (en) * 1998-10-19 2001-11-13 Nec Usa, Inc. Method and apparatus for assigning keywords to media objects
US7130848B2 (en) * 2000-08-09 2006-10-31 Gary Martin Oosta Methods for document indexing and analysis
AU3929702A (en) * 2000-11-16 2002-06-03 Mydtv Inc System and methods for determining the desirability of video programming events
JP4679003B2 (en) 2001-08-24 2011-04-27 ヤフー株式会社 Feature item extraction method from data
AU2003201799A1 (en) * 2002-01-16 2003-07-30 Elucidon Ab Information data retrieval, where the data is organized in terms, documents and document corpora
US7395256B2 (en) * 2003-06-20 2008-07-01 Agency For Science, Technology And Research Method and platform for term extraction from large collection of documents
US20060074900A1 (en) * 2004-09-30 2006-04-06 Nanavati Amit A Selecting keywords representative of a document
TWI254880B (en) * 2004-10-18 2006-05-11 Avectec Com Inc Method for classifying electronic document analysis
KR20070084004A (en) * 2004-11-05 2007-08-24 가부시키가이샤 아이.피.비. Keyword extracting device
JP2006323517A (en) 2005-05-17 2006-11-30 Mitsubishi Electric Corp Text classification device and program
US7711737B2 (en) * 2005-09-12 2010-05-04 Microsoft Corporation Multi-document keyphrase extraction using partial mutual information
US7627559B2 (en) * 2005-12-15 2009-12-01 Microsoft Corporation Context-based key phrase discovery and similarity measurement utilizing search engine query logs
US8856145B2 (en) * 2006-08-04 2014-10-07 Yahoo! Inc. System and method for determining concepts in a content item using context
US7996393B1 (en) * 2006-09-29 2011-08-09 Google Inc. Keywords associated with document categories
US8073850B1 (en) * 2007-01-19 2011-12-06 Wordnetworks, Inc. Selecting key phrases for serving contextually relevant content
JP2009015743A (en) 2007-07-09 2009-01-22 Fujifilm Corp Document creation support system, document creation support method, and document creation support program
US7917355B2 (en) * 2007-08-23 2011-03-29 Google Inc. Word detection
US8280892B2 (en) * 2007-10-05 2012-10-02 Fujitsu Limited Selecting tags for a document by analyzing paragraphs of the document
US9317593B2 (en) * 2007-10-05 2016-04-19 Fujitsu Limited Modeling topics using statistical distributions
WO2009059297A1 (en) * 2007-11-01 2009-05-07 Textdigger, Inc. Method and apparatus for automated tag generation for digital content
US8090724B1 (en) * 2007-11-28 2012-01-03 Adobe Systems Incorporated Document analysis and multi-word term detector
US8055688B2 (en) * 2007-12-07 2011-11-08 Patrick Giblin Method and system for meta-tagging media content and distribution
US8280886B2 (en) * 2008-02-13 2012-10-02 Fujitsu Limited Determining candidate terms related to terms of a query
US20090299998A1 (en) * 2008-02-15 2009-12-03 Wordstream, Inc. Keyword discovery tools for populating a private keyword database
US8606795B2 (en) * 2008-07-01 2013-12-10 Xerox Corporation Frequency based keyword extraction method and system using a statistical measure
CA2638558C (en) * 2008-08-08 2013-03-05 Bloorview Kids Rehab Topic word generation method and system
US20100076976A1 (en) * 2008-09-06 2010-03-25 Zlatko Manolov Sotirov Method of Automatically Tagging Image Data
US8166051B1 (en) * 2009-02-03 2012-04-24 Sandia Corporation Computation of term dominance in text documents
JP2010224622A (en) 2009-03-19 2010-10-07 Nomura Research Institute Ltd Method and program for applying tag
US20110004465A1 (en) * 2009-07-02 2011-01-06 Battelle Memorial Institute Computation and Analysis of Significant Themes
US8370286B2 (en) * 2009-08-06 2013-02-05 Yahoo! Inc. System for personalized term expansion and recommendation
CN101650731A (en) * 2009-08-31 2010-02-17 浙江大学 Method for generating suggested keywords of sponsored search advertisement based on user feedback
US8245135B2 (en) * 2009-09-08 2012-08-14 International Business Machines Corporation Producing a visual summarization of text documents
CN102043791B (en) * 2009-10-10 2014-04-30 深圳市世纪光速信息技术有限公司 Method and device for evaluating word classification
US8266228B2 (en) * 2009-12-08 2012-09-11 International Business Machines Corporation Tagging communication files based on historical association of tags
WO2011127655A1 (en) * 2010-04-14 2011-10-20 Hewlett-Packard Development Company, L.P. Method for keyword extraction
US8463786B2 (en) * 2010-06-10 2013-06-11 Microsoft Corporation Extracting topically related keywords from related documents
CN102081642A (en) 2010-10-28 2011-06-01 华南理工大学 Chinese label extraction method for clustering search results of search engine
US8375022B2 (en) * 2010-11-02 2013-02-12 Hewlett-Packard Development Company, L.P. Keyword determination based on a weight of meaningfulness
CN103201718A (en) * 2010-11-05 2013-07-10 乐天株式会社 Systems and methods regarding keyword extraction
US9483557B2 (en) * 2011-03-04 2016-11-01 Microsoft Technology Licensing Llc Keyword generation for media content
US8700599B2 (en) * 2011-11-21 2014-04-15 Microsoft Corporation Context dependent keyword suggestion for advertising

Also Published As

Publication number Publication date
US20150019951A1 (en) 2015-01-15
JP2015506515A (en) 2015-03-02
WO2013102396A1 (en) 2013-07-11
EP2801917A4 (en) 2015-08-26
CN103198057B (en) 2017-11-07
US9146915B2 (en) 2015-09-29
KR20140093762A (en) 2014-07-28
EP2801917A1 (en) 2014-11-12
CN103198057A (en) 2013-07-10

Similar Documents

Publication Publication Date Title
US20180373742A1 (en) System and method of search indexes using key-value attributes to searchable metadata
Nie et al. Harvesting visual concepts for image search with complex queries
US9317498B2 (en) Systems and methods for generating summaries of documents
Hachey et al. Evaluating entity linking with wikipedia
US8868469B2 (en) System and method for phrase identification
JP4881322B2 (en) Information retrieval system based on multiple indexes
US8554854B2 (en) Systems and methods for identifying terms relevant to web pages using social network messages
JP5647508B2 (en) System and method for identifying short text communication topics
US7424421B2 (en) Word collection method and system for use in word-breaking
KR101098703B1 (en) System and method for identifying related queries for languages with multiple writing systems
US8402036B2 (en) Phrase based snippet generation
US9195738B2 (en) Tokenization platform
JP5540079B2 (en) Knowledge base construction method and apparatus
US8635061B2 (en) Language identification in multilingual text
US7707023B2 (en) Method of finding answers to questions
EP2211280B1 (en) System and method for providing default hierarchical training for social indexing
US6978275B2 (en) Method and system for mining a document containing dirty text
CN106156204B (en) Text label extraction method and device
KR101715432B1 (en) Word pair acquisition device, word pair acquisition method, and recording medium
US8781817B2 (en) Phrase based document clustering with automatic phrase extraction
KR100544514B1 (en) Method and system for determining relation between search terms in the internet search system
US8332434B2 (en) Method and system for finding appropriate semantic web ontology terms from words
TWI506460B (en) System and method for recommending files
US8849787B2 (en) Two stage search
US8356045B2 (en) Method to identify common structures in formatted text documents

Legal Events

Date Code Title Description
A201 Request for examination
A302 Request for accelerated examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant
FPAY Annual fee payment
Payment date: 2017-12-19
Year of fee payment: 4
FPAY Annual fee payment
Payment date: 2018-12-19
Year of fee payment: 5