CN105808522A - Method and apparatus for semantic association - Google Patents

Method and apparatus for semantic association Download PDF

Info

Publication number
CN105808522A
CN105808522A CN201610130547.2A CN201610130547A CN105808522A CN 105808522 A CN105808522 A CN 105808522A CN 201610130547 A CN201610130547 A CN 201610130547A CN 105808522 A CN105808522 A CN 105808522A
Authority
CN
China
Prior art keywords
word
term vector
conditional probability
document
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610130547.2A
Other languages
Chinese (zh)
Inventor
柳廷娜
王茂帅
高峰
甄教明
于文才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd filed Critical Inspur Software Co Ltd
Priority to CN201610130547.2A priority Critical patent/CN105808522A/en
Publication of CN105808522A publication Critical patent/CN105808522A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and apparatus for semantic association. The method comprises the steps of carrying out word segmentation on a plurality of documents to be processed so as to obtain words to be processed; determining a conditional probability that each word appears in each document, and according to the conditional probability corresponding to each word, generating a word vector corresponding to each word; and according to the word vectors corresponding to any two words, determining a semantic relationship between the any two words. According to the method and apparatus for semantic association, which are provided by the invention, the semantic relationship between the words can be determined.

Description

A kind of method of semantic association and device
Technical field
The present invention relates to field of computer technology, particularly to method and the device of a kind of semantic association.
Background technology
One important branch of computer science is exactly " artificial intelligence ", the essence of intelligence is understood in its attempt, and produce a kind of new intelligent machine can made a response in the way of human intelligence is similar, the most incipient stage in computer science, computer cannot show Chinese, input Chinese, has had the component computer such as the Five-stroke Method, Chinese Card finally can input Chinese display Chinese at " character layer ";At " morphology layer ", there are Chinese word segmentation, full-text search, keyword extraction system, also occurred in that the search giant such as Baidu, Google in this period just;Being " physical layer " toward last layer again, the word of the parts of speech such as name in short, place name, mechanism's name, time, belonging to entity word is express semantic important component part;Be exactly up again " syntactic level " at physical layer computer understanding in short in the implication of independent entity word, but logical relation each other is not known, and syntactic level computer is just understood that the basic meaning in short expressed is appreciated that in short." semantic layer " is then calculate function the meaning of every words of one section of article to be together in series, it is possible to understand that one section of article, is at this time only the boundary being really achieved natural language understanding, it is achieved that the dream of artificial intelligence.The research of artificial intelligence field includes robot, language identification, image recognition, natural language processing and specialist system etc..Wherein, the semantic association related in natural language processing is one of study hotspot of paying close attention at academia and industrial quarters at present.
In prior art, but without a kind of method of semantic association, it is impossible to determine the semantic relation between any two word.
Summary of the invention
Embodiments provide method and the device of a kind of semantic association, it is possible to determine the semantic relation between word.
On the one hand, a kind of method embodiments providing semantic association, including:
S1: multiple pending documents are carried out participle, it is thus achieved that pending word;
S2: determine the conditional probability that each word occurs in each document, according to the conditional probability that each word is corresponding, generates the term vector that each word is corresponding;
S3: according to the term vector that any two word is corresponding, it is determined that the semantic relation between any two word.
Further, in described S2, the described conditional probability determining that each word occurs in each document, including:
According to formula one, it is determined that the conditional probability that each word occurs in each document;
Wherein, described formula one is:
p(wt|contexti)=p (wt|wt-k,wt-k+1,..,wt-1,wt+1,...,wt+k-1,wt+k);
Wherein, p (wt|contexti) represent that current word is at document contextiThe conditional probability of middle appearance, contextiRepresent i-th document, wtRepresent current word, and current word is document contextiIn the t word.
Further, described S3, including:
According to formula two, it is determined that the COS distance between the term vector that any two word is corresponding;
According to the COS distance between the term vector that any two word is corresponding, it is determined that the semantic relation between any two word;
Wherein, described formula two is:
c o s ( θ ) = Σ i = 1 n ( x i × y i ) Σ i = 1 n ( x i ) 2 × Σ i = 1 n ( y i ) 2 ;
Wherein, xiFor the i-th component of term vector X, yiFor the i-th component of term vector Y, cos (θ) is the COS distance between term vector X and term vector Y, term vector X and the term vector that term vector Y is that any two word is corresponding.
Further, the multiple pending word of described acquisition in described S1, including:
The all words obtained after participle are removed stop words, using remaining word as described pending word.
Further, in described S2, the described conditional probability corresponding according to each word, generate the term vector that each word is corresponding, including:
Using each conditional probability corresponding for the current word one-component as term vector corresponding to current word, generate the term vector that current word is corresponding.
Further, after described S1, before described S2, also include:
According to described pending word, generate tf-idf dictionary;
Described S2, including:
According to described tf-idf dictionary and word2evc, generate the term vector that each word is corresponding.
On the other hand, embodiments provide the device of a kind of semantic association, including:
Participle unit, for carrying out participle to multiple pending documents, it is thus achieved that pending word;
Generate unit, for determining the conditional probability that each word occurs in each document, according to the conditional probability that each word is corresponding, generate the term vector that each word is corresponding;
Determine unit, for the term vector corresponding according to any two word, it is determined that the semantic relation between any two word.
Further, described generation unit, perform described determine the conditional probability that each word occurs in each document time, be used for according to formula one, it is determined that the conditional probability that each word occurs in each document;
Wherein, described formula one is:
p(wt|contexti)=p (wt|wt-k,wt-k+1,..,wt-1,wt+1,...,wt+k-1,wt+k);
Wherein, p (wt|contexti) represent that current word is at document contextiThe conditional probability of middle appearance, contextiRepresent i-th document, wtRepresent current word, and current word is document contextiIn the t word.
Further, described determine unit, for according to formula two, it is determined that the COS distance between the term vector that any two word is corresponding, according to the COS distance between the term vector that any two word is corresponding, it is determined that the semantic relation between any two word;
Wherein, described formula two is:
c o s ( θ ) = Σ i = 1 n ( x i × y i ) Σ i = 1 n ( x i ) 2 × Σ i = 1 n ( y i ) 2 ;
Wherein, xiFor the i-th component of term vector X, yiFor the i-th component of term vector Y, cos (θ) is the COS distance between term vector X and term vector Y, term vector X and the term vector that term vector Y is that any two word is corresponding.
Further, described participle unit, when performing the multiple pending word of described acquisition, remove stop words for all words of obtaining after participle, using remaining word as described pending word.
Further, described generation unit, is performing the described conditional probability corresponding according to each word, when generating term vector corresponding to each word, for using each conditional probability corresponding for the current word one-component as term vector corresponding to current word, generating the term vector that current word is corresponding.
Further, also include: dictionary unit, for according to described pending word, generating tf-idf dictionary;
Described generation unit, for according to described tf-idf dictionary and word2evc, generating the term vector that each word is corresponding.
In embodiments of the present invention, pending word is obtained by participle, by the conditional probability that each word occurs in each document, convert each word to term vector, the semantic relation between any two word is determined by the term vector that each word is corresponding, the embodiment of the present invention, by the abstract semantic relation between word, is presented intuitively by the mode of mathematics.
Accompanying drawing explanation
In order to be illustrated more clearly that the embodiment of the present invention or technical scheme of the prior art, the accompanying drawing used required in embodiment or description of the prior art will be briefly described below, apparently, accompanying drawing in the following describes is some embodiments of the present invention, for those of ordinary skill in the art, under the premise not paying creative work, it is also possible to obtain other accompanying drawing according to these accompanying drawings.
Fig. 1 is the flow chart of the method for a kind of semantic association that one embodiment of the invention provides;
Fig. 2 is the flow chart of the method for the another kind of semantic association that one embodiment of the invention provides;
Fig. 3 is the schematic diagram of the device of a kind of semantic association that one embodiment of the invention provides;
Fig. 4 is the schematic diagram of the device of the another kind of semantic association that one embodiment of the invention provides.
Detailed description of the invention
For making the purpose of the embodiment of the present invention, technical scheme and advantage clearly; below in conjunction with the accompanying drawing in the embodiment of the present invention; technical scheme in the embodiment of the present invention is clearly and completely described; obviously; described embodiment is a part of embodiment of the present invention, rather than whole embodiments, based on the embodiment in the present invention; the every other embodiment that those of ordinary skill in the art obtain under the premise not making creative work, broadly falls into the scope of protection of the invention.
As it is shown in figure 1, a kind of method embodiments providing semantic association, the method may comprise steps of:
S1: multiple pending documents are carried out participle, it is thus achieved that pending word;
S2: determine the conditional probability that each word occurs in each document, according to the conditional probability that each word is corresponding, generates the term vector that each word is corresponding;
S3: according to the term vector that any two word is corresponding, it is determined that the semantic relation between any two word.
In embodiments of the present invention, pending word is obtained by participle, by the conditional probability that each word occurs in each document, convert each word to term vector, the semantic relation between any two word is determined by the term vector that each word is corresponding, the embodiment of the present invention, by the abstract semantic relation between word, is presented intuitively by the mode of mathematics.
In a kind of possible implementation, in described S2, the described conditional probability determining that each word occurs in each document, including:
According to formula one, it is determined that the conditional probability that each word occurs in each document;
Wherein, described formula one is:
p(wt|contexti)=p (wt|wt-k,wt-k+1,..,wt-1,wt+1,...,wt+k-1,wt+k);
Wherein, p (wt|contexti) represent that current word is at document contextiThe conditional probability of middle appearance, contextiRepresent i-th document, wtRepresent current word, and current word is document contextiIn the t word.It addition, wt-k,wt-k+1,..,wt-1,wt+1,...,wt+k-1,wt+kFor wtContext in word.
This implementation make use of the context of word, and semantic information enriches more.
In this implementation, word2vec adopts stratification Log-Bilinear language model, specifically, adopts CBOW (ContinuousBag-of-WordsModel) model, by formula one, by documentContext-prediction next one word be wtProbability, it is, wtAt documentThe conditional probability of middle appearance.Wherein, the calculating of CBOW can use hierarchicalSoftmax algorithm, this algorithm combines Huffman coding and builds binary tree, and its time complexity is dropped to O (log2 (| V |), it is possible to be greatly improved arithmetic speed by O (| V |).
In a kind of possible implementation, described S3, including:
According to formula two, it is determined that the COS distance between the term vector that any two word is corresponding;
According to the COS distance between the term vector that any two word is corresponding, it is determined that the semantic relation between any two word;
Wherein, described formula two is:
c o s ( θ ) = Σ i = 1 n ( x i × y i ) Σ i = 1 n ( x i ) 2 × Σ i = 1 n ( y i ) 2 ;
Wherein, xiFor the i-th component of term vector X, yiFor the i-th component of term vector Y, cos (θ) is the COS distance between term vector X and term vector Y, term vector X and the term vector that term vector Y is that any two word is corresponding.
In this implementation, by the calculating of COS distance, reflect the semantic relation between word, thus carrying out the association of semanteme.
For reducing the quantity of calculative word, speed up processing, in a kind of possible implementation, the multiple pending word of described acquisition in described S1, including:
The all words obtained after participle are removed stop words, using remaining word as described pending word.
In this implementation, eliminate stop words, decrease the quantity of pending word, it is possible to accelerate to calculate speed, owing to stop words does not have practical significance, the accuracy of final result will not be impacted.
In a kind of possible implementation, in described S2, the described conditional probability corresponding according to each word, generate the term vector that each word is corresponding, including:
Using each conditional probability corresponding for the current word one-component as term vector corresponding to current word, generate the term vector that current word is corresponding.
For example, current word is " we ", and the conditional probability at document 1 is 0.2, and the conditional probability at document 2 is 0.3, and the conditional probability at document 3 is 0.4, then the term vector that word " we " is corresponding is (0.2,0.3,0.4).
In a kind of possible implementation, after described S1, before described S2, also include:
According to described pending word, generate tf-idf dictionary;
Described S2, including:
According to described tf-idf dictionary and word2evc, generate the term vector that each word is corresponding.
In this implementation, generate tf-idf dictionary by tf-idf method, using tf-idf dictionary as the basic dictionary in document training process.Word2vec uses the term vector representation of Distributedrepresentation, by training, each word is mapped to K dimension real number vector (K is generally the hyper parameter in model).
As in figure 2 it is shown, a kind of method embodiments providing semantic association, the method may comprise steps of:
Step 201: multiple pending documents are carried out participle, removes stop words, using remaining word as described pending word after participle all words obtained.
In this step, it is possible to the MLlib collaborative filtering mechanism based on Spark processes pending document, carry out participle and remove the process such as stop words.
Wherein, pending document is more many, and the result obtained is more accurate.
For example, having statement in pending document is " beautiful house ", after participle, obtain " beauty " " " " house ", when removing stop words, removal " ".It addition, stop words includes:, etc..
Step 202: according to formula one, it is determined that the conditional probability that each word occurs in each document.
Wherein, described formula one is:
p(wt|contexti)=p (wt|wt-k,wt-k+1,..,wt-1,wt+1,...,wt+k-1,wt+k);
Wherein, p (wt|contexti) represent that current word is at document contextiThe conditional probability of middle appearance, contextiRepresent i-th document, wtRepresent current word, and current word is document contextiIn the t word.Wherein, wt-k,wt-k+1,..,wt-1,wt+1,...,wt+k-1,wt+kFor wtContext.For example, k is 3, then wt+1、wt+2、wt+3For wtW in the statement at placetThree words hereinafter, wt-1、wt-2、wt-3For wtW in the statement at placetThree words above.Can as desired to arrange the value of k.
The relation between current word and the context at its place can be reflected by this conditional probability.
Step 203: using each conditional probability corresponding for the current word one-component as term vector corresponding to current word, generate the term vector that current word is corresponding.
Step 204: according to formula two, it is determined that the COS distance between the term vector that any two word is corresponding.
Wherein, described formula two is:
c o s ( θ ) = Σ i = 1 n ( x i × y i ) Σ i = 1 n ( x i ) 2 × Σ i = 1 n ( y i ) 2 ;
Wherein, xiFor the i-th component of term vector X, yiFor the i-th component of term vector Y, cos (θ) is the COS distance between term vector X and term vector Y, term vector X and the term vector that term vector Y is that any two word is corresponding.
The similarity relation between the word of correspondence is reflected by the COS distance between term vector.
Step 205: according to the COS distance between the term vector that any two word is corresponding, it is determined that the semantic relation between any two word.
For each word, it is possible to export predetermined number word according to COS distance.Such as: in order to export 5 most like with current word word, then 5 words that output is maximum with the COS distance of current word.
In embodiments of the present invention, adopt word2vec training pattern, the word in tf-idf dictionary is converted to the form of real number vector, the mathematical requirement in semantic processes can be met.
As shown in Figure 3, Figure 4, the device of a kind of semantic association is embodiments provided.Device embodiment can be realized by software, it is also possible to is realized by the mode of hardware or software and hardware combining.Say from hardware view; as shown in Figure 3; a kind of hardware structure diagram of device place equipment for a kind of semantic association that the embodiment of the present invention provides; except the processor shown in Fig. 3, internal memory, network interface and nonvolatile memory; in embodiment, the equipment at device place generally can also include other hardware, such as the forwarding chip etc. of responsible process message.Implemented in software for example, as shown in Figure 4, as the device on a logical meaning, it is that computer program instructions corresponding in nonvolatile memory is read to run in internal memory and formed by the CPU by its place equipment.The device of a kind of semantic association that the present embodiment provides, including:
Participle unit 401, for carrying out participle to multiple pending documents, it is thus achieved that pending word;
Generate unit 402, for determining the conditional probability that each word occurs in each document, according to the conditional probability that each word is corresponding, generate the term vector that each word is corresponding;
Determine unit 403, for the term vector corresponding according to any two word, it is determined that the semantic relation between any two word.
In a kind of possible implementation, described generation unit 402, perform described determine the conditional probability that each word occurs in each document time, be used for according to formula one, it is determined that the conditional probability that each word occurs in each document;
Wherein, described formula one is:
p(wt|contexti)=p (wt|wt-k,wt-k+1,..,wt-1,wt+1,...,wt+k-1,wt+k);
Wherein, p (wt|contexti) represent that current word is at document contextiThe conditional probability of middle appearance, contextiRepresent i-th document, wtRepresent current word, and current word is document contextiIn the t word.
In a kind of possible implementation, described determine unit 403, for according to formula two, it is determined that the COS distance between the term vector that any two word is corresponding, according to the COS distance between the term vector that any two word is corresponding, it is determined that the semantic relation between any two word;
Wherein, described formula two is:
c o s ( θ ) = Σ i = 1 n ( x i × y i ) Σ i = 1 n ( x i ) 2 × Σ i = 1 n ( y i ) 2 ;
Wherein, xiFor the i-th component of term vector X, yiFor the i-th component of term vector Y, cos (θ) is the COS distance between term vector X and term vector Y, term vector X and the term vector that term vector Y is that any two word is corresponding.
In a kind of possible implementation, described participle unit 401, when performing the multiple pending word of described acquisition, remove stop words for all words of obtaining after participle, using remaining word as described pending word.
In a kind of possible implementation, described generation unit 402, performing the described conditional probability corresponding according to each word, when generating term vector corresponding to each word, for using each conditional probability corresponding for the current word one-component as term vector corresponding to current word, generating the term vector that current word is corresponding.
In a kind of possible implementation, this device also includes:
Dictionary unit, for according to described pending word, generating tf-idf dictionary;
Described generation unit 402, for according to described tf-idf dictionary and word2evc, generating the term vector that each word is corresponding.
The contents such as the information between each unit in said apparatus is mutual, execution process, due to the inventive method embodiment based on same design, particular content referring to the narration in the inventive method embodiment, can repeat no more herein.
The method of a kind of semantic association that the embodiment of the present invention provides and device, at least have the advantages that
1, in embodiments of the present invention, pending word is obtained by participle, by the conditional probability that each word occurs in each document, convert each word to term vector, the semantic relation between any two word is determined by the term vector that each word is corresponding, the embodiment of the present invention, by the abstract semantic relation between word, is presented intuitively by the mode of mathematics.
2, the scheme that the embodiment of the present invention provides, at the social phenomenon such as the analysis of public opinion, focus incident, and the common commercial such as search engine, commercial product recommending field etc. all can be applicable, by Chinese semantic association, maximum possible can infer the semantic relation between vocabulary, it is provided that public sentiment monitoring more accurately and personalized recommendation etc..
3, the scheme that the embodiment of the present invention provides, it is possible to use big data processing shelf Spark and word2vec training pattern realize, and then bottom code is farthest controlled, it is ensured that processing procedure safely controllable.
It should be noted that, in this article, the relational terms of such as first and second etc is used merely to separate an entity or operation with another entity or operating space, and not necessarily requires or imply the relation that there is any this reality between these entities or operation or sequentially.And, term " includes ", " comprising " or its any other variant are intended to comprising of nonexcludability, so that include the process of a series of key element, method, article or equipment not only include those key elements, but also include other key elements being not expressly set out, or also include the key element intrinsic for this process, method, article or equipment.When there is no more restriction, statement " including " key element limited, it is not excluded that there is also other same factor in including the process of described key element, method, article or equipment.
One of ordinary skill in the art will appreciate that: all or part of step realizing said method embodiment can be completed by the hardware that programmed instruction is relevant, aforesaid program can be stored in the storage medium of embodied on computer readable, this program upon execution, performs to include the step of said method embodiment;And aforesaid storage medium includes: in the various media that can store program code such as ROM, RAM, magnetic disc or CD.
Last it should be understood that the foregoing is only presently preferred embodiments of the present invention, it is merely to illustrate technical scheme, is not intended to limit protection scope of the present invention.All make within the spirit and principles in the present invention any amendment, equivalent replacement, improvement etc., be all contained in protection scope of the present invention.

Claims (10)

1. the method for a semantic association, it is characterised in that including:
S1: multiple pending documents are carried out participle, it is thus achieved that pending word;
S2: determine the conditional probability that each word occurs in each document, according to the conditional probability that each word is corresponding, generates the term vector that each word is corresponding;
S3: according to the term vector that any two word is corresponding, it is determined that the semantic relation between any two word.
2. method according to claim 1, it is characterised in that in described S2, the described conditional probability determining that each word occurs in each document, including:
According to formula one, it is determined that the conditional probability that each word occurs in each document;
Wherein, described formula one is:
p(wt|contexti)=p (wt|wt-k,wt-k+1,..,wt-1,wt+1,...,wt+k-1,wt+k);
Wherein, p (wt|contexti) represent that current word is at document contextiThe conditional probability of middle appearance, contextiRepresent i-th document, wtRepresent current word, and current word is document contextiIn the t word.
3. method according to claim 1, it is characterised in that described S3, including:
According to formula two, it is determined that the COS distance between the term vector that any two word is corresponding;
According to the COS distance between the term vector that any two word is corresponding, it is determined that the semantic relation between any two word;
Wherein, described formula two is:
c o s ( θ ) = Σ i = 1 n ( x i × y i ) Σ i = 1 n ( x i ) 2 × Σ i = 1 n ( y i ) 2 ;
Wherein, xiFor the i-th component of term vector X, yiFor the i-th component of term vector Y, cos (θ) is the COS distance between term vector X and term vector Y, term vector X and the term vector that term vector Y is that any two word is corresponding.
4. method according to claim 1, it is characterised in that
The multiple pending word of described acquisition in described S1, including:
The all words obtained after participle are removed stop words, using remaining word as described pending word;
And/or,
In described S2, the described conditional probability corresponding according to each word, generate the term vector that each word is corresponding, including:
Using each conditional probability corresponding for the current word one-component as term vector corresponding to current word, generate the term vector that current word is corresponding.
5. according to described method arbitrary in claim 1-4, it is characterised in that after described S1, before described S2, also include:
According to described pending word, generate tf-idf dictionary;
Described S2, including:
According to described tf-idf dictionary and word2evc, generate the term vector that each word is corresponding.
6. the device of a semantic association, it is characterised in that including:
Participle unit, for carrying out participle to multiple pending documents, it is thus achieved that pending word;
Generate unit, for determining the conditional probability that each word occurs in each document, according to the conditional probability that each word is corresponding, generate the term vector that each word is corresponding;
Determine unit, for the term vector corresponding according to any two word, it is determined that the semantic relation between any two word.
7. device according to claim 6, it is characterised in that described generation unit, perform described determine the conditional probability that each word occurs in each document time, be used for according to formula one, it is determined that the conditional probability that each word occurs in each document;
Wherein, described formula one is:
p(wt|contexti)=p (wt|wt-k,wt-k+1,..,wt-1,wt+1,...,wt+k-1,wt+k);
Wherein, p (wt|contexti) represent that current word is at document contextiThe conditional probability of middle appearance, contextiRepresent i-th document, wtRepresent current word, and current word is document contextiIn the t word.
8. device according to claim 6, it is characterized in that, described determine unit, for according to formula two, determine the COS distance between the term vector that any two word is corresponding, according to the COS distance between the term vector that any two word is corresponding, it is determined that the semantic relation between any two word;
Wherein, described formula two is:
c o s ( θ ) = Σ i = 1 n ( x i × y i ) Σ i = 1 n ( x i ) 2 × Σ i = 1 n ( y i ) 2 ;
Wherein, xiFor the i-th component of term vector X, yiFor the i-th component of term vector Y, cos (θ) is the COS distance between term vector X and term vector Y, term vector X and the term vector that term vector Y is that any two word is corresponding.
9. device according to claim 6, it is characterised in that
Described participle unit, when performing the multiple pending word of described acquisition, removes stop words for all words of obtaining after participle, using remaining word as described pending word;
And/or,
Described generation unit, performing the described conditional probability corresponding according to each word, when generating term vector corresponding to each word, for using each conditional probability corresponding for the current word one-component as term vector corresponding to current word, generating the term vector that current word is corresponding.
10. according to described device arbitrary in claim 6-9, it is characterised in that also include:
Dictionary unit, for according to described pending word, generating tf-idf dictionary;
Described generation unit, for according to described tf-idf dictionary and word2evc, generating the term vector that each word is corresponding.
CN201610130547.2A 2016-03-08 2016-03-08 Method and apparatus for semantic association Pending CN105808522A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610130547.2A CN105808522A (en) 2016-03-08 2016-03-08 Method and apparatus for semantic association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610130547.2A CN105808522A (en) 2016-03-08 2016-03-08 Method and apparatus for semantic association

Publications (1)

Publication Number Publication Date
CN105808522A true CN105808522A (en) 2016-07-27

Family

ID=56466923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610130547.2A Pending CN105808522A (en) 2016-03-08 2016-03-08 Method and apparatus for semantic association

Country Status (1)

Country Link
CN (1) CN105808522A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977355A (en) * 2017-11-17 2018-05-01 四川长虹电器股份有限公司 TV programme suggesting method based on term vector training
CN109829149A (en) * 2017-11-23 2019-05-31 中国移动通信有限公司研究院 A kind of generation method and device, equipment, storage medium of term vector model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101847141A (en) * 2010-06-03 2010-09-29 复旦大学 Method for measuring semantic similarity of Chinese words

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101847141A (en) * 2010-06-03 2010-09-29 复旦大学 Method for measuring semantic similarity of Chinese words

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
佚名: "再谈word2vec", 《HTTP://WWW.XUEBUYUAN.COM/2199643.HTML》 *
姜霖等: "采用连续词袋模型(CBOW)的领域术语自动抽取研究", 《现代图书情报技术》 *
苏增才: "基于word2vec和SVMperf的网络中文文本评论信息情感分类研究", 《万方学位论文数据库》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977355A (en) * 2017-11-17 2018-05-01 四川长虹电器股份有限公司 TV programme suggesting method based on term vector training
CN109829149A (en) * 2017-11-23 2019-05-31 中国移动通信有限公司研究院 A kind of generation method and device, equipment, storage medium of term vector model

Similar Documents

Publication Publication Date Title
AU2018247340B2 (en) Dvqa: understanding data visualizations through question answering
CN106933804B (en) Structured information extraction method based on deep learning
CN107436864B (en) Chinese question-answer semantic similarity calculation method based on Word2Vec
CN111680094B (en) Text structuring method, device and system and non-volatile storage medium
CN110705294A (en) Named entity recognition model training method, named entity recognition method and device
CN110737758A (en) Method and apparatus for generating a model
CN109684476B (en) Text classification method, text classification device and terminal equipment
CN107679082A (en) Question and answer searching method, device and electronic equipment
WO2020010834A1 (en) Faq question and answer library generalization method, apparatus, and device
CN112328800A (en) System and method for automatically generating programming specification question answers
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN117271724A (en) Intelligent question-answering implementation method and system based on large model and semantic graph
CN116595195A (en) Knowledge graph construction method, device and medium
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
CN112800205A (en) Method and device for obtaining question-answer related paragraphs based on semantic change manifold analysis
CN116662488A (en) Service document retrieval method, device, equipment and storage medium
CN111382243A (en) Text category matching method, text category matching device and terminal
CN105808522A (en) Method and apparatus for semantic association
WO2023169301A1 (en) Text processing method and apparatus, and electronic device
CN116881470A (en) Method and device for generating question-answer pairs
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
Sun et al. Chinese microblog sentiment classification based on convolution neural network with content extension method
CN117501283A (en) Text-to-question model system
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160727