CN108829679A - Corpus labeling method and device - Google Patents

Corpus labeling method and device

Info

Publication number
CN108829679A
Authority
CN
China
Prior art keywords
corpus
word
converted
marked
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810644479.0A
Other languages
Chinese (zh)
Inventor
吴健君
倪嘉呈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810644479.0A priority Critical patent/CN108829679A/en
Publication of CN108829679A publication Critical patent/CN108829679A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/221 Parsing markup language streams
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a corpus labeling method and device. Each entry in a corpus to be labeled is segmented into words, yielding a plurality of words to be converted corresponding to the corpus to be labeled. A word vector is determined for each word to be converted according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus. A corpus vector is then determined for each entry in the corpus to be labeled according to the word vectors of its words to be converted. The corpus vectors are clustered into a plurality of clusters, and each cluster is labeled. Because the invention labels clusters of corpus vectors rather than labeling each entry individually, labeling is more convenient and faster, which effectively improves the efficiency of corpus labeling.

Description

Corpus labeling method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a corpus labeling method and device.
Background art
Labeling the entries of a corpus is an important technique in the field of natural language processing and is widely used in tasks such as query phrase (Query) analysis.
Existing corpus labeling techniques label each entry of the corpus individually. Because a corpus contains a large number of entries, the labeling efficiency of existing techniques is low.
How to improve the efficiency of corpus labeling therefore remains a technical problem to be solved in this field.
Summary of the invention
In view of this, the present invention provides a corpus labeling method and device to achieve efficient labeling of a corpus. The technical solution is as follows:
A corpus labeling method, including:
segmenting each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
determining a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
determining a corpus vector for each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
clustering the corpus vectors to obtain a plurality of clusters, and labeling each cluster.
Optionally, determining the word vector of each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to the training corpus includes:
for any word to be converted corresponding to the corpus to be labeled: searching for that word among the words corresponding to the training corpus and, if it is found, determining its word vector from the word vectors of the words corresponding to the training corpus.
Optionally, determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled includes:
determining a weight for each word to be converted corresponding to the corpus to be labeled;
determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
Optionally, determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled includes:
computing, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
Optionally, determining the weight of each word to be converted corresponding to the corpus to be labeled includes:
computing, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
Optionally, after determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled, and before clustering the corpus vectors, the method further includes:
normalizing the corpus vector determined for each entry;
and clustering the corpus vectors includes:
clustering the normalized corpus vectors.
A corpus labeling device, including: a word segmentation unit, a word vector determination unit, a corpus vector determination unit and a clustering unit,
the word segmentation unit being configured to segment each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
the word vector determination unit being configured to determine a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
the corpus vector determination unit being configured to determine a corpus vector for each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
the clustering unit being configured to cluster the corpus vectors to obtain a plurality of clusters and to label each cluster.
Optionally, the corpus vector determination unit includes: a weight determination subunit and a corpus determination subunit,
the weight determination subunit being configured to determine a weight for each word to be converted corresponding to the corpus to be labeled;
the corpus determination subunit being configured to determine the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
Optionally, the corpus determination subunit is specifically configured to:
compute, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
Optionally, the weight determination subunit is specifically configured to:
compute, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
With the corpus labeling method and device provided by the embodiments of the present invention, each entry in a corpus to be labeled is segmented into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled; a word vector is determined for each word to be converted according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus; a corpus vector is determined for each entry in the corpus to be labeled according to the word vectors of its words to be converted; and the corpus vectors are clustered to obtain a plurality of clusters, each of which is labeled. Because the present invention labels clusters of corpus vectors, there is no need to label each entry individually, which is more convenient and faster and effectively improves the efficiency of corpus labeling. Moreover, because the present invention first segments each entry into words, obtains the word vector of each word, and then derives the corpus vector from those word vectors, the accuracy of the corpus vector can be improved. In addition, because the word vectors used in the present invention are converted from words obtained by segmenting each entry in the training corpus, the accuracy of the word vectors can be further improved by screening the entries of the training corpus.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below.
Fig. 1 is a kind of flow diagram of corpus labeling method provided in an embodiment of the present invention;
Fig. 2 is the flow diagram of another corpus labeling method provided in an embodiment of the present invention;
Fig. 3 is the flow diagram of another corpus labeling method provided in an embodiment of the present invention;
Fig. 4 is a kind of structural schematic diagram of corpus labeling device provided in an embodiment of the present invention.
Specific embodiment
The present invention discloses a corpus labeling method and device. Those skilled in the art can draw on the content of this disclosure and appropriately adjust process parameters when implementing it. In particular, it should be noted that all similar substitutions and modifications will be apparent to those skilled in the art and are deemed to be included within the present invention. The method and application of the present invention have been described by way of preferred embodiments, and a person skilled in the art may clearly modify, or appropriately change and combine, the method and application described herein without departing from the content, spirit and scope of the present invention in order to implement and apply the technology of the present invention.
As shown in Fig. 1, a corpus labeling method provided by an embodiment of the present invention may include:
S100, segmenting each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
Each entry in the corpus to be labeled may be selected by a technician and added to the corpus to be labeled. In practical applications, the entries in the corpus to be labeled may all relate to a certain topic, for example automobiles; by adding entries relevant to a topic to the corpus to be labeled, a corpus to be labeled that is relevant to that topic is obtained. Specifically, each entry in the corpus to be labeled may be a query phrase (Query). In practice, the query phrases used by users when querying may be collected and/or recorded, or may be obtained from a third party.
Specifically, the embodiments of the present invention may perform word segmentation using a variety of methods. For example, segmentation training data may be prepared and a segmentation model obtained by machine learning on that data; an entry then only needs to be input into the segmentation model, and the words to be converted are obtained from the model's output. Of course, the embodiments of the present invention may also perform segmentation using a string-matching-based method, an understanding-based method, or a statistics-based method.
In practical applications, the embodiments of the present invention may deduplicate the words to be converted obtained for the corpus to be labeled, that is, remove duplicate words to be converted. This reduces the workload of subsequent processing and avoids interference caused by duplicate words to be converted.
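By way of illustration only, step S100 could be sketched as follows with the open-source jieba segmenter standing in for the segmentation model; the function name and the sample query phrases are assumptions, not part of the patent.

```python
# Minimal sketch of step S100: segment each entry and deduplicate the words
# to be converted. jieba and the sample queries are illustrative assumptions.
import jieba

def segment_corpus(entries):
    """Return, for each entry, its deduplicated list of words to be converted."""
    segmented = []
    for entry in entries:
        words = list(jieba.cut(entry))
        seen, unique_words = set(), []
        for w in words:          # deduplicate while preserving order
            if w not in seen:
                seen.add(w)
                unique_words.append(w)
        segmented.append(unique_words)
    return segmented

corpus_to_label = ["发动机排量多大", "自动变速箱怎么样"]  # hypothetical query phrases
print(segment_corpus(corpus_to_label))
```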
S200, determining a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
Specifically, the entries of the training corpus may be obtained in many ways, for example by crawling a large number of entries related to a certain topic from web pages with a crawler, thereby forming the training corpus. It should be understood that, by setting different crawling rules, the embodiments of the present invention can crawl entries related to different topics and thereby obtain training corpora for different topics. The topic may be set and modified according to actual needs, and the embodiments of the present invention are not limited in this respect. For example, entries related to automobiles may be crawled to obtain an automobile training corpus, where the automobile-related entries may include: engine displacement, automatic gearbox, 0-100 km/h acceleration in 7.5 seconds, airbag, and so on.
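Purely as an illustration of how such a crawler might collect topic-related entries, the following sketch uses requests and BeautifulSoup; the URL and the choice of paragraph tags are hypothetical assumptions.

```python
# Hypothetical sketch of collecting training-corpus entries from a topic page;
# the URL and the tag selection are assumptions, not taken from the patent.
import requests
from bs4 import BeautifulSoup

def crawl_entries(url):
    """Fetch a page and extract short text snippets as candidate training entries."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p") if p.get_text(strip=True)]

# training_entries = crawl_entries("https://example.com/automobile-news")  # hypothetical URL
```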
Specifically, the topic of the training corpus may be the same as the topic of the corpus to be labeled, which can effectively improve the accuracy of the word vectors.
The method used to segment the entries of the training corpus may be the same as or different from the method used to segment the entries of the corpus to be labeled.
In practical applications, the embodiments of the present invention may convert each word obtained by segmenting the entries of the training corpus into a word vector. Specifically, the present invention may use the word2vec technique to convert words into word vectors.
In other embodiments of the present invention, step S200, determining the word vector of each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to the training corpus, may include:
for any word to be converted corresponding to the corpus to be labeled: searching for that word among the words corresponding to the training corpus and, if it is found, determining its word vector from the word vectors of the words corresponding to the training corpus.
For ease of understanding, an example follows:
Suppose the training corpus corresponds to five words, A, B, C, D and E, whose word vectors are a, b, c, d and e respectively. When a word to be converted is one of A, B, C, D and E, it can be found among the words corresponding to the training corpus, and its word vector can therefore be found. For instance, if the word to be converted is C, C can be found among the words corresponding to the training corpus, so c is determined to be the word vector of that word.
If a word to be converted is not found among the words corresponding to the training corpus, the embodiments of the present invention may record that word. A technician can then continue to expand the training corpus so that the word is added to the words corresponding to the training corpus. Specifically, a crawler may be used to crawl web pages containing that word to obtain entries that include it, and those entries are then added to the training corpus; the word is thus obtained when the entries of the training corpus are segmented.
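A non-authoritative sketch of the word2vec conversion and the lookup described above, using the gensim library; the vector size, the other training parameters and the sample entries are assumptions.

```python
# Sketch of converting training-corpus words into word vectors with word2vec
# (gensim) and looking up a word to be converted; parameters are assumptions.
from gensim.models import Word2Vec

# Each element is one segmented training-corpus entry (a list of words).
segmented_training_corpus = [
    ["发动机", "排量"],
    ["自动", "变速箱"],
    ["安全", "气囊"],
]

model = Word2Vec(sentences=segmented_training_corpus, vector_size=100,
                 window=5, min_count=1, workers=2)

def lookup_word_vector(word):
    """Return the word vector if the word is among the training-corpus words, else None."""
    if word in model.wv:   # found among the words corresponding to the training corpus
        return model.wv[word]
    return None            # not found: record it so the training corpus can be expanded
```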
S300, determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
Step S300 may determine the corpus vector in several ways, for example by adding up the word vectors of the words to be converted obtained by segmenting an entry and taking the result as the corpus vector of that entry.
In other embodiments of the present invention, weights may be introduced, and the corpus vector may be determined from the weights and the word vectors. Specifically, as shown in Fig. 2, step S300 may include:
S310, determining a weight for each word to be converted corresponding to the corpus to be labeled;
The embodiments of the present invention may determine the weight of each word to be converted in several ways; one of them is given below as an example.
Step S310 may specifically include:
computing, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
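A minimal sketch of this weight computation; computing the inverse document frequency over the segmented corpus to be labeled, and the helper names, are assumptions.

```python
# Sketch of w(t_ij) = freq(t_ij) × idf(t_ij). Using the corpus to be labeled as
# the document collection for idf is an assumption; names are illustrative.
import math
from collections import Counter

def word_weights(segmented_corpus):
    """For each entry, map every word to be converted to its TF-IDF weight."""
    num_entries = len(segmented_corpus)
    df = Counter()                      # in how many entries does each word appear?
    for words in segmented_corpus:
        df.update(set(words))
    weights = []
    for words in segmented_corpus:
        freq = Counter(words)
        total = len(words)
        weights.append({w: (freq[w] / total) * math.log(num_entries / df[w])
                        for w in freq})
    return weights
```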
S320, determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
Further, the embodiments of the present invention may compute a weighted sum of the word vectors of the words to be converted obtained by segmenting an entry, and take the result as the corpus vector of that entry. Therefore, step S320 may specifically include:
computing, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
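The weighted sum of step S320 could be sketched as follows in numpy; returning a zero vector when no word vector is available is an assumption, not something stated in the patent.

```python
# Sketch of v(q_j) = sum_i w(t_ij) * v(t_ij) for one entry; the zero-vector
# fallback and the default dimension are assumptions.
import numpy as np

def corpus_vector(words, weights, lookup_word_vector, dim=100):
    """Weighted sum of the word vectors of one entry's words to be converted."""
    vec = np.zeros(dim)
    for w in words:
        wv = lookup_word_vector(w)
        if wv is not None:
            vec += weights.get(w, 0.0) * np.asarray(wv)
    return vec
```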
S400, clustering the corpus vectors to obtain a plurality of clusters, and labeling each cluster.
Specifically, the embodiments of the present invention may use the K-means clustering technique to cluster the corpus vectors.
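A sketch of this clustering step using the scikit-learn implementation of K-means; the number of clusters is an assumption and would be chosen per application.

```python
# Sketch of step S400: K-means clustering of the corpus vectors.
# n_clusters=10 is an assumed value, not taken from the patent.
import numpy as np
from sklearn.cluster import KMeans

def cluster_corpus_vectors(corpus_vectors, n_clusters=10):
    """Cluster the corpus vectors; returns the cluster index of each entry."""
    X = np.vstack(corpus_vectors)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return kmeans.fit_predict(X)
```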
It should be understood that, after step S400 clusters the corpus vectors and obtains the clusters, the embodiments of the present invention may further classify the clusters, or use the clusters for synonym mining and the like.
Specifically, the clusters may be labeled in a variety of ways, for example each cluster may be labeled according to its industry, and so on.
When each entry in the corpus to be labeled is a query phrase, the present invention, by cluster-labeling the query phrases, can effectively divide or identify the query phrases, which facilitates subsequent use of the divided or identified query phrases, for example determining a user's query intention from the division of the query phrases so as to provide the user with query results relevant to that intention.
With the corpus labeling method provided by the embodiments of the present invention, each entry in a corpus to be labeled is segmented into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled; a word vector is determined for each word to be converted according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus; a corpus vector is determined for each entry in the corpus to be labeled according to the word vectors of its words to be converted; and the corpus vectors are clustered to obtain a plurality of clusters, each of which is labeled. Because the present invention labels clusters of corpus vectors, there is no need to label each entry individually, which is more convenient and faster and effectively improves the efficiency of corpus labeling. Moreover, because the present invention first segments each entry into words, obtains the word vector of each word, and then derives the corpus vector from those word vectors, the accuracy of the corpus vector can be improved. In addition, because the word vectors used in the present invention are converted from words obtained by segmenting each entry in the training corpus, the accuracy of the word vectors can be further improved by screening the entries of the training corpus.
As shown in Fig. 3, another corpus labeling method provided by an embodiment of the present invention may further include, between step S300 and step S400:
S301, normalizing the corpus vector determined for each entry.
Normalization is a data preprocessing method applied before the data is used; it can improve the speed and precision of the data processing result without disturbing the distribution of the data. For the present invention, normalization can improve the speed and precision of clustering the corpus vectors.
On this basis, step S400 may specifically include:
clustering the normalized corpus vectors to obtain a plurality of clusters, and labeling each cluster.
The method for normalizing a corpus vector in the embodiments of the present invention may include:
normalizing the corpus vector v(qj) by the formula
v(qj)′ = v(qj) / ||v(qj)||
to obtain the normalized corpus vector v(qj)′, where ||v(qj)|| denotes the norm of the vector v(qj).
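A one-line sketch of this normalization in numpy; guarding against a zero-norm vector is an assumption.

```python
# Sketch of v(q_j)' = v(q_j) / ||v(q_j)||; the zero-norm guard is an assumption.
import numpy as np

def normalize(vec):
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```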
Corresponding to the above method embodiments, an embodiment of the present invention further provides a corpus labeling device.
As shown in Fig. 4, a corpus labeling device provided by an embodiment of the present invention may include: a word segmentation unit 100, a word vector determination unit 200, a corpus vector determination unit 300 and a clustering unit 400,
the word segmentation unit 100 being configured to segment each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
Each entry in the corpus to be labeled may be selected by a technician and added to the corpus to be labeled. In practical applications, the entries in the corpus to be labeled may all relate to a certain topic, for example automobiles; by adding entries relevant to a topic to the corpus to be labeled, a corpus to be labeled that is relevant to that topic is obtained. Specifically, each entry in the corpus to be labeled may be a query phrase (Query). In practice, the query phrases used by users when querying may be collected and/or recorded, or may be obtained from a third party.
Specifically, the embodiments of the present invention may perform word segmentation using a variety of methods. For example, segmentation training data may be prepared and a segmentation model obtained by machine learning on that data; an entry then only needs to be input into the segmentation model, and the words to be converted are obtained from the model's output. Of course, the embodiments of the present invention may also perform segmentation using a string-matching-based method, an understanding-based method, or a statistics-based method.
In practical applications, the embodiments of the present invention may deduplicate the words to be converted obtained for the corpus to be labeled, that is, remove duplicate words to be converted. This reduces the workload of subsequent processing and avoids interference caused by duplicate words to be converted.
the word vector determination unit 200 being configured to determine a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
Specifically, the entries of the training corpus may be obtained in many ways, for example by crawling a large number of entries related to a certain topic from web pages with a crawler, thereby forming the training corpus. It should be understood that, by setting different crawling rules, the embodiments of the present invention can crawl entries related to different topics and thereby obtain training corpora for different topics. The topic may be set and modified according to actual needs, and the embodiments of the present invention are not limited in this respect. For example, entries related to automobiles may be crawled to obtain an automobile training corpus, where the automobile-related entries may include: engine displacement, automatic gearbox, 0-100 km/h acceleration in 7.5 seconds, airbag, and so on.
Specifically, the topic of the training corpus may be the same as the topic of the corpus to be labeled, which can effectively improve the accuracy of the word vectors.
The method used to segment the entries of the training corpus may be the same as or different from the method used to segment the entries of the corpus to be labeled.
In practical applications, the embodiments of the present invention may convert each word obtained by segmenting the entries of the training corpus into a word vector. Specifically, the present invention may use the word2vec technique to convert words into word vectors.
If a word to be converted is not found among the words corresponding to the training corpus, the embodiments of the present invention may record that word. A technician can then continue to expand the training corpus so that the word is added to the words corresponding to the training corpus. Specifically, a crawler may be used to crawl web pages containing that word to obtain entries that include it, and those entries are then added to the training corpus; the word is thus obtained when the entries of the training corpus are segmented.
The word vector determination unit 200 may specifically be configured to: for any word to be converted corresponding to the corpus to be labeled, search for that word among the words corresponding to the training corpus and, if it is found, determine its word vector from the word vectors of the words corresponding to the training corpus.
the corpus vector determination unit 300 being configured to determine the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
The corpus vector determination unit 300 may determine the corpus vector in several ways, for example by adding up the word vectors of the words to be converted obtained by segmenting an entry and taking the result as the corpus vector of that entry.
In other embodiments of the present invention, weights may be introduced, and the corpus vector may be determined from the weights and the word vectors. Specifically, the corpus vector determination unit 300 may include: a weight determination subunit and a corpus determination subunit,
the weight determination subunit being configured to determine a weight for each word to be converted corresponding to the corpus to be labeled;
the corpus determination subunit being configured to determine the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
The corpus determination subunit may specifically be configured to:
compute, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
The weight determination subunit may specifically be configured to:
compute, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
the clustering unit 400 being configured to cluster the corpus vectors to obtain a plurality of clusters, and to label each cluster.
Specifically, the embodiments of the present invention may use the K-means clustering technique to cluster the corpus vectors.
It should be understood that, after the clustering unit 400 clusters the corpus vectors and obtains the clusters, the embodiments of the present invention may further classify the clusters, or use the clusters for synonym mining and the like.
Specifically, the clusters may be labeled in a variety of ways, for example each cluster may be labeled according to its industry, and so on.
When each entry in the corpus to be labeled is a query phrase, the present invention, by cluster-labeling the query phrases, can effectively divide or identify the query phrases, which facilitates subsequent use of the divided or identified query phrases, for example determining a user's query intention from the division of the query phrases so as to provide the user with query results relevant to that intention.
With the corpus labeling device provided by the embodiments of the present invention, each entry in a corpus to be labeled is segmented into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled; a word vector is determined for each word to be converted according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus; a corpus vector is determined for each entry in the corpus to be labeled according to the word vectors of its words to be converted; and the corpus vectors are clustered to obtain a plurality of clusters, each of which is labeled. Because the present invention labels clusters of corpus vectors, there is no need to label each entry individually, which is more convenient and faster and effectively improves the efficiency of corpus labeling. Moreover, because the present invention first segments each entry into words, obtains the word vector of each word, and then derives the corpus vector from those word vectors, the accuracy of the corpus vector can be improved. In addition, because the word vectors used in the present invention are converted from words obtained by segmenting each entry in the training corpus, the accuracy of the word vectors can be further improved by screening the entries of the training corpus.
On the basis of the embodiment shown in Fig. 4, another corpus labeling device provided by an embodiment of the present invention may further include: a normalization unit, configured to normalize the corpus vector determined for each entry after the corpus vector determination unit determines the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted, and before the clustering unit clusters the corpus vectors.
Normalization is a data preprocessing method applied before the data is used; it can improve the speed and precision of the data processing result without disturbing the distribution of the data. For the present invention, normalization can improve the speed and precision of clustering the corpus vectors.
Further, the clustering unit 400 may specifically be configured to: cluster the normalized corpus vectors to obtain a plurality of clusters, and label each cluster.
The clustering unit 400 may normalize the corpus vector v(qj) by the formula
v(qj)′ = v(qj) / ||v(qj)||
to obtain the normalized corpus vector v(qj)′, where ||v(qj)|| denotes the norm of the vector v(qj).
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A corpus labeling method, characterized by comprising:
segmenting each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
determining a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, wherein the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
determining a corpus vector for each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
clustering the corpus vectors to obtain a plurality of clusters, and labeling each cluster.
2. The method according to claim 1, characterized in that determining the word vector of each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to the training corpus comprises:
for any word to be converted corresponding to the corpus to be labeled: searching for that word among the words corresponding to the training corpus and, if it is found, determining its word vector from the word vectors of the words corresponding to the training corpus.
3. The method according to claim 1 or 2, characterized in that determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled comprises:
determining a weight for each word to be converted corresponding to the corpus to be labeled;
determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
4. The method according to claim 3, characterized in that determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled comprises:
computing, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, wherein j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
5. The method according to claim 3, characterized in that determining the weight of each word to be converted corresponding to the corpus to be labeled comprises:
computing, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, wherein j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
6. The method according to claim 1, characterized in that, after determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled and before clustering the corpus vectors, the method further comprises:
normalizing the corpus vector determined for each entry;
and clustering the corpus vectors comprises:
clustering the normalized corpus vectors.
7. A corpus labeling device, characterized by comprising: a word segmentation unit, a word vector determination unit, a corpus vector determination unit and a clustering unit,
the word segmentation unit being configured to segment each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
the word vector determination unit being configured to determine a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, wherein the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
the corpus vector determination unit being configured to determine a corpus vector for each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
the clustering unit being configured to cluster the corpus vectors to obtain a plurality of clusters and to label each cluster.
8. The device according to claim 7, characterized in that the corpus vector determination unit comprises: a weight determination subunit and a corpus determination subunit,
the weight determination subunit being configured to determine a weight for each word to be converted corresponding to the corpus to be labeled;
the corpus determination subunit being configured to determine the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
9. The device according to claim 8, characterized in that the corpus determination subunit is specifically configured to:
compute, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, wherein j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
10. The device according to claim 8, characterized in that the weight determination subunit is specifically configured to:
compute, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, wherein j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
CN201810644479.0A 2018-06-21 2018-06-21 Corpus labeling method and device Pending CN108829679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810644479.0A CN108829679A (en) 2018-06-21 2018-06-21 Corpus labeling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810644479.0A CN108829679A (en) 2018-06-21 2018-06-21 Corpus labeling method and device

Publications (1)

Publication Number Publication Date
CN108829679A true CN108829679A (en) 2018-11-16

Family

ID=64141923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810644479.0A Pending CN108829679A (en) 2018-06-21 2018-06-21 Corpus labeling method and device

Country Status (1)

Country Link
CN (1) CN108829679A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030191625A1 (en) * 1999-11-05 2003-10-09 Gorin Allen Louis Method and system for creating a named entity language model
CN106528642A (en) * 2016-10-13 2017-03-22 广东广业开元科技有限公司 TF-IDF feature extraction based short text classification method
CN106776713A (en) * 2016-11-03 2017-05-31 中山大学 It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis
CN106611052A (en) * 2016-12-26 2017-05-03 东软集团股份有限公司 Text label determination method and device
CN107766426A (en) * 2017-09-14 2018-03-06 北京百分点信息科技有限公司 A kind of file classification method, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王立梅 et al.: "基于k均值聚类的直推式支持向量机学习算法" (Transductive support vector machine learning algorithm based on k-means clustering), 《计算机工程与应用》 (Computer Engineering and Applications) *

Similar Documents

Publication Publication Date Title
CN105005589B (en) A kind of method and apparatus of text classification
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
Zhang et al. Ad hoc table retrieval using semantic similarity
Li et al. NPRF: A neural pseudo relevance feedback framework for ad-hoc information retrieval
CN110134772A (en) Medical text Relation extraction method based on pre-training model and fine tuning technology
CN107515895A (en) A kind of sensation target search method and system based on target detection
CN109145190A (en) A kind of local quotation recommended method and system based on neural machine translation mothod
Zaw et al. Web document clustering using cuckoo search clustering algorithm based on levy flight
CN107577739A (en) A kind of semi-supervised domain term excavates the method and apparatus with classification
CN106250513A (en) A kind of event personalization sorting technique based on event modeling and system
CN106202211A (en) A kind of integrated microblogging rumour recognition methods based on microblogging type
CN104933181A (en) Mathematical formula searching method and device
WO2005036351A3 (en) Systems and methods for search processing using superunits
CN105468673A (en) Mathematical formula search method and system
CN109977250A (en) Merge the depth hashing image search method of semantic information and multistage similitude
CN110110228A (en) Intelligent real-time professional literature recommendation method and system based on Internet and word bag
CN109492156A (en) A kind of Literature pushing method and device
CN104866517A (en) Method and device for capturing webpage content
CN107656920B (en) Scientific and technological talent recommendation method based on patents
CN103064982A (en) Method for intelligent recommendation of patents in patent searching
CN109684460A (en) A kind of calculation method and system of the negative network public-opinion index based on deep learning
CN111078859A (en) Author recommendation method based on reference times
CN110032619A (en) A kind of segmenter training method and its device based on deep learning
CN108829679A (en) Corpus labeling method and device
CN108021657A (en) A kind of similar author's searching method based on document title semantic information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181116

RJ01 Rejection of invention patent application after publication