CN105808522A - Method and apparatus for semantic association - Google Patents

Method and apparatus for semantic association

Info

Publication number
CN105808522A
CN105808522A CN201610130547.2A
Authority
CN
China
Prior art keywords
word
term vector
conditional probability
document
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610130547.2A
Other languages
Chinese (zh)
Inventor
柳廷娜
王茂帅
高峰
甄教明
于文才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd filed Critical Inspur Software Co Ltd
Priority to CN201610130547.2A priority Critical patent/CN105808522A/en
Publication of CN105808522A publication Critical patent/CN105808522A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and apparatus for semantic association. The method comprises the steps of: performing word segmentation on a plurality of documents to be processed to obtain words to be processed; determining the conditional probability that each word appears in each document, and generating, from the conditional probabilities corresponding to each word, a word vector for that word; and determining, from the word vectors corresponding to any two words, the semantic relationship between those two words. The method and apparatus provided by the invention can thus determine the semantic relationship between words.

Description

Method and device for semantic association
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for semantic association.
Background technology
An important branch of computer science is artificial intelligence, which attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in ways similar to human intelligence. In the earliest stage of computer science, computers could neither display nor input Chinese; with the advent of components such as the Wubi (five-stroke) input method and the Chinese character card, computers could finally input and display Chinese at the "character layer". At the "lexical layer" came Chinese word segmentation, full-text retrieval and keyword extraction systems, and it was in this period that search giants such as Baidu and Google emerged. One layer up is the "entity layer": words such as person names, place names, organization names and times in a sentence are entity words, an important component in expressing semantics. Above the entity layer is the "syntactic layer": at the entity layer a computer understands the meaning of the individual entity words in a sentence but not the logical relations between them, whereas at the syntactic layer it can understand the basic meaning a sentence expresses. The "semantic layer" then lets a computer string together the meaning of every sentence of a passage and so understand the passage as a whole; only at this point is natural language understanding truly achieved and the dream of artificial intelligence realized. Research in the field of artificial intelligence includes robotics, speech recognition, image recognition, natural language processing, expert systems and the like. Among these, semantic association in natural language processing is currently one of the research hotspots in both academia and industry.
The prior art, however, provides no method of semantic association, so the semantic relation between two arbitrary words cannot be determined.
Summary of the invention
Embodiments of the present invention provide a method and device for semantic association that can determine the semantic relation between words.
In one aspect, an embodiment of the present invention provides a method of semantic association, including:
S1: performing word segmentation on a plurality of documents to be processed to obtain words to be processed;
S2: determining the conditional probability that each word appears in each document, and generating, from the conditional probabilities corresponding to each word, the term vector corresponding to that word;
S3: determining, from the term vectors corresponding to any two words, the semantic relation between those two words.
Further, in S2, determining the conditional probability that each word appears in each document includes:
determining, according to Formula 1, the conditional probability that each word appears in each document;
wherein Formula 1 is:
p(w_t | context_i) = p(w_t | w_{t-k}, w_{t-k+1}, …, w_{t-1}, w_{t+1}, …, w_{t+k-1}, w_{t+k});
where p(w_t | context_i) denotes the conditional probability that the current word appears in document context_i, context_i denotes the i-th document, w_t denotes the current word, and the current word is the t-th word in document context_i.
Further, S3 includes:
determining, according to Formula 2, the cosine distance between the term vectors corresponding to any two words;
determining, from the cosine distance between the term vectors corresponding to the two words, the semantic relation between those two words;
wherein Formula 2 is:
cos(θ) = Σ_{i=1}^{n}(x_i × y_i) / (√(Σ_{i=1}^{n} x_i²) × √(Σ_{i=1}^{n} y_i²));
where x_i is the i-th component of term vector X, y_i is the i-th component of term vector Y, cos(θ) is the cosine distance between term vectors X and Y, and X and Y are the term vectors corresponding to the two words.
Further, obtaining the words to be processed in S1 includes:
removing stop words from all the words obtained after segmentation, and taking the remaining words as the words to be processed.
Further, in S2, generating, from the conditional probabilities corresponding to each word, the term vector corresponding to that word includes:
taking each conditional probability corresponding to the current word as one component of the term vector corresponding to the current word, thereby generating the term vector corresponding to the current word.
Further, after S1 and before S2, the method also includes:
generating a tf-idf dictionary from the words to be processed;
and S2 includes:
generating, from the tf-idf dictionary and word2vec, the term vector corresponding to each word.
In another aspect, an embodiment of the present invention provides a device for semantic association, including:
a segmentation unit configured to perform word segmentation on a plurality of documents to be processed to obtain words to be processed;
a generating unit configured to determine the conditional probability that each word appears in each document and to generate, from the conditional probabilities corresponding to each word, the term vector corresponding to that word;
a determining unit configured to determine, from the term vectors corresponding to any two words, the semantic relation between those two words.
Further, the generating unit, when determining the conditional probability that each word appears in each document, is configured to determine that conditional probability according to Formula 1;
wherein Formula 1 is:
p(w_t | context_i) = p(w_t | w_{t-k}, w_{t-k+1}, …, w_{t-1}, w_{t+1}, …, w_{t+k-1}, w_{t+k});
where p(w_t | context_i) denotes the conditional probability that the current word appears in document context_i, context_i denotes the i-th document, w_t denotes the current word, and the current word is the t-th word in document context_i.
Further, the determining unit is configured to determine, according to Formula 2, the cosine distance between the term vectors corresponding to any two words, and to determine, from that cosine distance, the semantic relation between those two words;
wherein Formula 2 is:
cos(θ) = Σ_{i=1}^{n}(x_i × y_i) / (√(Σ_{i=1}^{n} x_i²) × √(Σ_{i=1}^{n} y_i²));
where x_i is the i-th component of term vector X, y_i is the i-th component of term vector Y, cos(θ) is the cosine distance between term vectors X and Y, and X and Y are the term vectors corresponding to the two words.
Further, the segmentation unit, when obtaining the words to be processed, is configured to remove stop words from all the words obtained after segmentation and to take the remaining words as the words to be processed.
Further, the generating unit, when generating from the conditional probabilities corresponding to each word the term vector corresponding to that word, is configured to take each conditional probability corresponding to the current word as one component of the term vector corresponding to the current word, thereby generating that term vector.
Further, the device also includes a dictionary unit configured to generate a tf-idf dictionary from the words to be processed;
and the generating unit is configured to generate, from the tf-idf dictionary and word2vec, the term vector corresponding to each word.
In embodiments of the present invention, the words to be processed are obtained by word segmentation; each word is converted into a term vector from the conditional probabilities with which it appears in each document; and the semantic relation between any two words is determined from the term vectors corresponding to those words. Embodiments of the present invention thus present the abstract semantic relations between words intuitively, by mathematical means.
Brief description of the drawings
In order to describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a method of semantic association provided by an embodiment of the present invention;
Fig. 2 is a flow chart of another method of semantic association provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a device for semantic association provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of another device for semantic association provided by an embodiment of the present invention.
Detailed description of the invention
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a method of semantic association, which may comprise the following steps:
S1: performing word segmentation on a plurality of documents to be processed to obtain words to be processed;
S2: determining the conditional probability that each word appears in each document, and generating, from the conditional probabilities corresponding to each word, the term vector corresponding to that word;
S3: determining, from the term vectors corresponding to any two words, the semantic relation between those two words.
In embodiments of the present invention, the words to be processed are obtained by word segmentation; each word is converted into a term vector from the conditional probabilities with which it appears in each document; and the semantic relation between any two words is determined from the term vectors corresponding to those words. Embodiments of the present invention thus present the abstract semantic relations between words intuitively, by mathematical means.
In one possible implementation, in S2, determining the conditional probability that each word appears in each document includes:
determining, according to Formula 1, the conditional probability that each word appears in each document;
wherein Formula 1 is:
p(w_t | context_i) = p(w_t | w_{t-k}, w_{t-k+1}, …, w_{t-1}, w_{t+1}, …, w_{t+k-1}, w_{t+k});
where p(w_t | context_i) denotes the conditional probability that the current word appears in document context_i, context_i denotes the i-th document, w_t denotes the current word, and the current word is the t-th word in document context_i. In addition, w_{t-k}, w_{t-k+1}, …, w_{t-1}, w_{t+1}, …, w_{t+k-1}, w_{t+k} are the words in the context of w_t.
This implementation makes use of the context of each word, so the semantic information it captures is richer.
In this implementation, word2vec adopts a hierarchical log-bilinear language model; specifically, it adopts the CBOW (Continuous Bag-of-Words) model. Through Formula 1, the probability that the next word predicted from the context of document context_i is w_t is obtained, which is the conditional probability that w_t appears in document context_i. The CBOW computation can use the hierarchical softmax algorithm, which builds a binary tree with Huffman coding and reduces the time complexity from O(|V|) to O(log₂|V|), greatly improving computation speed.
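As a concrete illustration of what the CBOW objective computes, the following Python sketch scores p(w_t | context) with a plain softmax rather than the hierarchical softmax described above; the vocabulary size, embedding dimension and random embeddings are illustrative assumptions, not part of the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 6, 4                      # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, K))   # input (context) embeddings
W_out = rng.normal(size=(V, K))  # output (prediction) embeddings

def cbow_probability(context_ids, target_id):
    """p(w_t | context): average the context embeddings, score every
    vocabulary word against that average, and normalize with a softmax."""
    h = W_in[context_ids].mean(axis=0)     # averaged context vector
    scores = W_out @ h                     # one score per vocabulary word
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return float(probs[target_id])

p = cbow_probability([0, 1, 3, 4], target_id=2)
```

Hierarchical softmax replaces the full normalization over V words with a walk down a Huffman tree, which is what yields the O(log₂|V|) cost cited in the text.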
In one possible implementation, S3 includes:
determining, according to Formula 2, the cosine distance between the term vectors corresponding to any two words;
determining, from the cosine distance between the term vectors corresponding to the two words, the semantic relation between those two words;
wherein Formula 2 is:
cos(θ) = Σ_{i=1}^{n}(x_i × y_i) / (√(Σ_{i=1}^{n} x_i²) × √(Σ_{i=1}^{n} y_i²));
where x_i is the i-th component of term vector X, y_i is the i-th component of term vector Y, cos(θ) is the cosine distance between term vectors X and Y, and X and Y are the term vectors corresponding to the two words.
In this implementation, computing the cosine distance reflects the semantic relation between words, and semantic association is carried out on that basis.
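Formula 2 translates directly into code; this minimal Python function is a sketch of the cosine computation as defined above, with the example vectors chosen purely for illustration.

```python
from math import sqrt

def cosine(x, y):
    """Formula 2: cosine between term vectors X and Y."""
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    denominator = sqrt(sum(xi * xi for xi in x)) * sqrt(sum(yi * yi for yi in y))
    return numerator / denominator

similarity = cosine((0.2, 0.3, 0.4), (0.25, 0.3, 0.45))  # near 1 for similar vectors
```

Vectors pointing in the same direction give a value near 1, orthogonal vectors give 0, so larger values indicate closer semantic relation.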
To reduce the number of words that must be computed and speed up processing, in one possible implementation, obtaining the words to be processed in S1 includes:
removing stop words from all the words obtained after segmentation, and taking the remaining words as the words to be processed.
In this implementation, removing stop words reduces the number of words to be processed and thus speeds up computation; since stop words carry no real meaning, the accuracy of the final result is not affected.
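The stop-word filtering described here amounts to a simple set lookup; the stop-word list below is a hypothetical example, as a real system would load a curated list.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"的", "了", "和", "the", "a", "of"}

def remove_stop_words(words):
    """Keep only the words that are not stop words."""
    return [w for w in words if w not in STOP_WORDS]

remaining = remove_stop_words(["美丽", "的", "房子"])  # drops the particle 的
```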
In one possible implementation, in S2, generating, from the conditional probabilities corresponding to each word, the term vector corresponding to that word includes:
taking each conditional probability corresponding to the current word as one component of the term vector corresponding to the current word, thereby generating the term vector corresponding to the current word.
For example, if the current word is "we" and its conditional probability is 0.2 in document 1, 0.3 in document 2 and 0.4 in document 3, then the term vector corresponding to the word "we" is (0.2, 0.3, 0.4).
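The construction of a term vector from per-document conditional probabilities can be sketched as follows; the `cond_prob` structure (one dictionary per document) is an assumed representation, not one the patent prescribes.

```python
def build_term_vector(word, cond_prob):
    """One component per document: the conditional probability of `word`
    in that document (0.0 if the word does not appear there)."""
    return tuple(doc.get(word, 0.0) for doc in cond_prob)

# Mirrors the example in the text: p = 0.2, 0.3 and 0.4 in documents 1-3.
cond_prob = [{"we": 0.2}, {"we": 0.3}, {"we": 0.4}]
vector = build_term_vector("we", cond_prob)  # -> (0.2, 0.3, 0.4)
```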
In one possible implementation, after S1 and before S2, the method also includes:
generating a tf-idf dictionary from the words to be processed;
and S2 includes:
generating, from the tf-idf dictionary and word2vec, the term vector corresponding to each word.
In this implementation, the tf-idf dictionary is generated by the tf-idf method and serves as the basic dictionary in the document training process. word2vec uses the distributed representation of term vectors and, through training, maps each word to a K-dimensional real-valued vector (K is generally a hyper-parameter of the model).
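The patent does not spell out the exact tf-idf weighting it uses; the sketch below assumes the common variant tf = count/len(doc), idf = log(N/df), keeping each word's highest score over the corpus as its dictionary weight.

```python
from math import log

def tfidf_dictionary(documents):
    """Map each word to its highest tf-idf weight over the corpus,
    assuming tf = count/len(doc) and idf = log(N/df)."""
    n = len(documents)
    df = {}                              # document frequency per word
    for doc in documents:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    best = {}
    for doc in documents:
        for w in set(doc):
            tf = doc.count(w) / len(doc)
            score = tf * log(n / df[w])
            best[w] = max(best.get(w, 0.0), score)
    return best

docs = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
weights = tfidf_dictionary(docs)
```

A word concentrated in few documents ("a") scores higher than one spread across the corpus ("b"), which is the property the dictionary exploits.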
As shown in Fig. 2, an embodiment of the present invention provides a method of semantic association, which may comprise the following steps:
Step 201: performing word segmentation on a plurality of documents to be processed, removing stop words from all the words obtained after segmentation, and taking the remaining words as the words to be processed.
In this step, the documents to be processed can be handled with the Spark-based MLlib collaborative filtering mechanism, which performs word segmentation, stop-word removal and the like.
The more documents are processed, the more accurate the result obtained.
For example, if a document to be processed contains the sentence "beautiful house" (美丽的房子), segmentation yields "beautiful" (美丽), the particle 的 and "house" (房子); when stop words are removed, the particle 的 is discarded. Stop words include function words of this kind, among others.
Step 202: determining, according to Formula 1, the conditional probability that each word appears in each document.
Wherein Formula 1 is:
p(w_t | context_i) = p(w_t | w_{t-k}, w_{t-k+1}, …, w_{t-1}, w_{t+1}, …, w_{t+k-1}, w_{t+k});
where p(w_t | context_i) denotes the conditional probability that the current word appears in document context_i, context_i denotes the i-th document, w_t denotes the current word, and the current word is the t-th word in document context_i. Here w_{t-k}, w_{t-k+1}, …, w_{t-1}, w_{t+1}, …, w_{t+k-1}, w_{t+k} are the context of w_t. For example, if k is 3, then w_{t+1}, w_{t+2} and w_{t+3} are the three words following w_t in the sentence in which it appears, and w_{t-1}, w_{t-2} and w_{t-3} are the three words preceding it. The value of k can be set as needed.
This conditional probability reflects the relation between the current word and the context in which it appears.
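The context window w_{t-k}, …, w_{t-1}, w_{t+1}, …, w_{t+k} described above can be sketched as simple list slicing; the function name and sentence tokens are illustrative.

```python
def context_window(words, t, k):
    """Words w_{t-k}..w_{t-1} and w_{t+1}..w_{t+k} around position t,
    truncated at the sentence boundaries."""
    return words[max(0, t - k):t] + words[t + 1:t + k + 1]

sentence = ["w0", "w1", "w2", "w3", "w4", "w5", "w6"]
window = context_window(sentence, t=3, k=3)  # all six neighbours of w3
```

Near a sentence boundary the window is simply shorter, matching the idea that k is an adjustable parameter rather than a hard requirement.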
Step 203: taking each conditional probability corresponding to the current word as one component of the term vector corresponding to the current word, thereby generating the term vector corresponding to the current word.
Step 204: determining, according to Formula 2, the cosine distance between the term vectors corresponding to any two words.
Wherein Formula 2 is:
cos(θ) = Σ_{i=1}^{n}(x_i × y_i) / (√(Σ_{i=1}^{n} x_i²) × √(Σ_{i=1}^{n} y_i²));
where x_i is the i-th component of term vector X, y_i is the i-th component of term vector Y, cos(θ) is the cosine distance between term vectors X and Y, and X and Y are the term vectors corresponding to the two words.
The cosine distance between term vectors reflects the similarity relation between the corresponding words.
Step 205: determining, from the cosine distance between the term vectors corresponding to any two words, the semantic relation between those two words.
For each word, a predetermined number of words can be output according to cosine distance. For example, to output the 5 words most similar to the current word, the 5 words whose term vectors have the largest cosine distance to that of the current word are output.
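Outputting the most similar words, as in Step 205, reduces to ranking every other word by Formula 2 and keeping the top n; the toy vectors and words below are hypothetical.

```python
from math import sqrt

def cosine(x, y):
    """Formula 2: cosine between two term vectors."""
    num = sum(a * b for a, b in zip(x, y))
    return num / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def most_similar(word, vectors, n=5):
    """Rank every other word by cosine against `word` and keep the top n."""
    target = vectors[word]
    ranked = sorted(
        ((cosine(target, v), w) for w, v in vectors.items() if w != word),
        reverse=True,
    )
    return [w for _, w in ranked[:n]]

vectors = {"we": (0.2, 0.3, 0.4), "us": (0.25, 0.3, 0.45),
           "house": (0.9, 0.1, 0.0), "tree": (0.8, 0.2, 0.1)}
most_similar("we", vectors, n=2)  # -> ["us", "tree"]
```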
In the embodiment of the present invention, the word2vec training model is adopted to convert the words in the tf-idf dictionary into real-valued vectors, which meets the mathematical requirements of semantic processing.
As shown in Fig. 3 and Fig. 4, embodiments of the present invention provide a device for semantic association. The device embodiment can be implemented in software, or in hardware, or in a combination of both. From the hardware point of view, Fig. 3 is a hardware structure diagram of the equipment hosting the device for semantic association provided by the embodiment of the present invention; besides the processor, memory, network interface and non-volatile storage shown in Fig. 3, the equipment hosting the device may also include other hardware, such as a forwarding chip responsible for processing packets. Taking software implementation as an example, as shown in Fig. 4, the device in the logical sense is formed by the CPU of its host equipment reading the corresponding computer program instructions from non-volatile storage into memory and running them. The device for semantic association provided by this embodiment includes:
a segmentation unit 401 configured to perform word segmentation on a plurality of documents to be processed to obtain words to be processed;
a generating unit 402 configured to determine the conditional probability that each word appears in each document and to generate, from the conditional probabilities corresponding to each word, the term vector corresponding to that word;
a determining unit 403 configured to determine, from the term vectors corresponding to any two words, the semantic relation between those two words.
In one possible implementation, the generating unit 402, when determining the conditional probability that each word appears in each document, is configured to determine that conditional probability according to Formula 1;
wherein Formula 1 is:
p(w_t | context_i) = p(w_t | w_{t-k}, w_{t-k+1}, …, w_{t-1}, w_{t+1}, …, w_{t+k-1}, w_{t+k});
where p(w_t | context_i) denotes the conditional probability that the current word appears in document context_i, context_i denotes the i-th document, w_t denotes the current word, and the current word is the t-th word in document context_i.
In one possible implementation, the determining unit 403 is configured to determine, according to Formula 2, the cosine distance between the term vectors corresponding to any two words, and to determine, from that cosine distance, the semantic relation between those two words;
wherein Formula 2 is:
cos(θ) = Σ_{i=1}^{n}(x_i × y_i) / (√(Σ_{i=1}^{n} x_i²) × √(Σ_{i=1}^{n} y_i²));
where x_i is the i-th component of term vector X, y_i is the i-th component of term vector Y, cos(θ) is the cosine distance between term vectors X and Y, and X and Y are the term vectors corresponding to the two words.
In one possible implementation, the segmentation unit 401, when obtaining the words to be processed, is configured to remove stop words from all the words obtained after segmentation and to take the remaining words as the words to be processed.
In one possible implementation, the generating unit 402, when generating from the conditional probabilities corresponding to each word the term vector corresponding to that word, is configured to take each conditional probability corresponding to the current word as one component of the term vector corresponding to the current word, thereby generating that term vector.
In one possible implementation, the device also includes:
a dictionary unit configured to generate a tf-idf dictionary from the words to be processed;
and the generating unit 402 is configured to generate, from the tf-idf dictionary and word2vec, the term vector corresponding to each word.
The information interaction and execution processes between the units of the above device are based on the same concept as the method embodiments of the present invention; for details, refer to the description in the method embodiments, which is not repeated here.
The method and device for semantic association provided by the embodiments of the present invention have at least the following beneficial effects:
1. In the embodiments of the present invention, the words to be processed are obtained by word segmentation; each word is converted into a term vector from the conditional probabilities with which it appears in each document; and the semantic relation between any two words is determined from the term vectors corresponding to those words. The embodiments thus present the abstract semantic relations between words intuitively, by mathematical means.
2. The solution provided by the embodiments of the present invention is applicable to social analysis scenarios such as public-opinion analysis and hot-event tracking, and to common commercial fields such as search engines and product recommendation; through Chinese semantic association, the semantic relations between words can be inferred as far as possible, providing more accurate public-opinion monitoring, personalized recommendation and the like.
3. The solution provided by the embodiments of the present invention can be realized with the big-data processing framework Spark and the word2vec training model, so that the underlying code is controlled to the greatest extent and the processing procedure is guaranteed to be safe and controllable.
It should be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or piece of equipment comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or equipment. Without further limitation, an element qualified by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article or equipment that comprises it.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions; the program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; and the storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disk or optical disc.
Finally, it should be understood that the above are only preferred embodiments of the present invention, intended merely to illustrate its technical solution and not to limit its protection scope. Any modification, equivalent replacement, improvement or the like made within the spirit and principles of the present invention is included in the protection scope of the present invention.

Claims (10)

1. A method of semantic association, characterized by comprising:
S1: performing word segmentation on a plurality of documents to be processed to obtain words to be processed;
S2: determining the conditional probability that each word appears in each document, and generating, from the conditional probabilities corresponding to each word, the term vector corresponding to that word;
S3: determining, from the term vectors corresponding to any two words, the semantic relation between those two words.
2. The method according to claim 1, characterized in that, in S2, determining the conditional probability that each word appears in each document comprises:
determining, according to Formula 1, the conditional probability that each word appears in each document;
wherein Formula 1 is:
p(w_t | context_i) = p(w_t | w_{t-k}, w_{t-k+1}, …, w_{t-1}, w_{t+1}, …, w_{t+k-1}, w_{t+k});
where p(w_t | context_i) denotes the conditional probability that the current word appears in document context_i, context_i denotes the i-th document, w_t denotes the current word, and the current word is the t-th word in document context_i.
3. The method according to claim 1, characterized in that S3 comprises:
determining, according to Formula 2, the cosine distance between the term vectors corresponding to any two words;
determining, from the cosine distance between the term vectors corresponding to the two words, the semantic relation between those two words;
wherein Formula 2 is:
cos(θ) = Σ_{i=1}^{n}(x_i × y_i) / (√(Σ_{i=1}^{n} x_i²) × √(Σ_{i=1}^{n} y_i²));
where x_i is the i-th component of term vector X, y_i is the i-th component of term vector Y, cos(θ) is the cosine distance between term vectors X and Y, and X and Y are the term vectors corresponding to the two words.
4. The method according to claim 1, characterized in that
obtaining the words to be processed in S1 comprises:
removing stop words from all the words obtained after segmentation, and taking the remaining words as the words to be processed;
and/or,
in S2, generating, from the conditional probabilities corresponding to each word, the term vector corresponding to that word comprises:
taking each conditional probability corresponding to the current word as one component of the term vector corresponding to the current word, thereby generating the term vector corresponding to the current word.
5. The method according to any one of claims 1-4, characterized by further comprising, after S1 and before S2:
generating a tf-idf dictionary from the words to be processed;
wherein S2 comprises:
generating, from the tf-idf dictionary and word2vec, the term vector corresponding to each word.
6. A device for semantic association, characterized by comprising:
a segmentation unit configured to perform word segmentation on a plurality of documents to be processed to obtain words to be processed;
a generating unit configured to determine the conditional probability that each word appears in each document and to generate, from the conditional probabilities corresponding to each word, the term vector corresponding to that word;
a determining unit configured to determine, from the term vectors corresponding to any two words, the semantic relation between those two words.
7. The device according to claim 6, wherein said generation unit, when determining the conditional probability of each word appearing in each document, is configured to determine the conditional probability according to formula one;
wherein formula one is:
p(w_t | context_i) = p(w_t | w_{t-k}, w_{t-k+1}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k-1}, w_{t+k});
wherein p(w_t | context_i) denotes the conditional probability of the current word appearing in document context_i, context_i denotes the i-th document, w_t denotes the current word, and the current word is the t-th word in document context_i.
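Formula one conditions the current word on a symmetric window of k neighbours, as in the continuous bag-of-words setting. A small sketch of extracting those (target, context) pairs from a tokenised document; the truncation of the window at document edges is an assumption, since the claim does not address boundary positions:

```python
def context_windows(tokens, k):
    """For each position t, pair the target word w_t with its context
    [w_{t-k}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k}], clipped at the edges."""
    pairs = []
    for t, w in enumerate(tokens):
        left = tokens[max(0, t - k):t]      # up to k words before w_t
        right = tokens[t + 1:t + 1 + k]     # up to k words after w_t
        pairs.append((w, left + right))
    return pairs
```

These pairs are exactly the conditioning events on the right-hand side of formula one; a model such as CBOW would then be trained to estimate p(w_t | context).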
8. The device according to claim 6, wherein said determination unit is configured to determine, according to formula two, the cosine distance between the term vectors corresponding to any two words, and to determine the semantic relation between the two words according to that cosine distance;
wherein formula two is:
cos(θ) = Σ_{i=1}^{n}(x_i × y_i) / (√(Σ_{i=1}^{n} x_i²) × √(Σ_{i=1}^{n} y_i²));
wherein x_i is the i-th component of term vector X, y_i is the i-th component of term vector Y, cos(θ) is the cosine distance between term vector X and term vector Y, and term vector X and term vector Y are the term vectors corresponding to the any two words.
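Formula two is the standard cosine similarity between two term vectors and can be computed directly; a minimal sketch (the function name is illustrative):

```python
from math import sqrt

def cosine_distance(x, y):
    """Cosine of the angle between term vectors x and y, as in formula two:
    dot product over the product of the Euclidean norms."""
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = sqrt(sum(xi * xi for xi in x)) * sqrt(sum(yi * yi for yi in y))
    return num / den
```

A value near 1 indicates semantically related words (nearly parallel vectors); a value near 0 indicates unrelated words (nearly orthogonal vectors).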
9. The device according to claim 6, wherein
said word segmentation unit, when obtaining the plurality of words to be processed, is configured to remove stop words from all words obtained after word segmentation and to take the remaining words as said words to be processed;
and/or,
said generation unit, when generating the term vector corresponding to each word according to the conditional probability corresponding to each word, is configured to take each conditional probability corresponding to the current word as one component of the term vector corresponding to the current word, thereby generating the term vector corresponding to the current word.
10. The device according to any one of claims 6-9, further comprising:
a dictionary unit, configured to generate a tf-idf dictionary according to said words to be processed;
wherein said generation unit is configured to generate the term vector corresponding to each word according to said tf-idf dictionary and word2vec.
CN201610130547.2A 2016-03-08 2016-03-08 Method and apparatus for semantic association Pending CN105808522A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610130547.2A CN105808522A (en) 2016-03-08 2016-03-08 Method and apparatus for semantic association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610130547.2A CN105808522A (en) 2016-03-08 2016-03-08 Method and apparatus for semantic association

Publications (1)

Publication Number Publication Date
CN105808522A true CN105808522A (en) 2016-07-27

Family

ID=56466923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610130547.2A Pending CN105808522A (en) 2016-03-08 2016-03-08 Method and apparatus for semantic association

Country Status (1)

Country Link
CN (1) CN105808522A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977355A (en) * 2017-11-17 2018-05-01 四川长虹电器股份有限公司 TV programme suggesting method based on term vector training
CN109829149A (en) * 2017-11-23 2019-05-31 中国移动通信有限公司研究院 A kind of generation method and device, equipment, storage medium of term vector model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101847141A (en) * 2010-06-03 2010-09-29 复旦大学 Method for measuring semantic similarity of Chinese words

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Anonymous: "More on word2vec", HTTP://WWW.XUEBUYUAN.COM/2199643.HTML *
Jiang Lin et al.: "Research on Automatic Extraction of Domain Terms Using the Continuous Bag-of-Words (CBOW) Model", New Technology of Library and Information Service *
Su Zengcai: "Research on Sentiment Classification of Chinese Web Text Comments Based on word2vec and SVMperf", Wanfang Dissertation Database *

Similar Documents

Publication Publication Date Title
AU2018247340B2 (en) Dvqa: understanding data visualizations through question answering
CN106933804B (en) Structured information extraction method based on deep learning
CN107436864B (en) Chinese question-answer semantic similarity calculation method based on Word2Vec
CN111680094B (en) Text structuring method, device and system and non-volatile storage medium
CN110705294A (en) Named entity recognition model training method, named entity recognition method and device
CN110737758A (en) Method and apparatus for generating a model
CN109684476B (en) Text classification method, text classification device and terminal equipment
CN107679082A (en) Question and answer searching method, device and electronic equipment
WO2020010834A1 (en) Faq question and answer library generalization method, apparatus, and device
CN112328800A (en) System and method for automatically generating programming specification question answers
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN117271724A (en) Intelligent question-answering implementation method and system based on large model and semantic graph
CN116595195A (en) Knowledge graph construction method, device and medium
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
CN112800205A (en) Method and device for obtaining question-answer related paragraphs based on semantic change manifold analysis
CN116662488A (en) Service document retrieval method, device, equipment and storage medium
CN111382243A (en) Text category matching method, text category matching device and terminal
CN105808522A (en) Method and apparatus for semantic association
WO2023169301A1 (en) Text processing method and apparatus, and electronic device
CN116881470A (en) Method and device for generating question-answer pairs
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
Sun et al. Chinese microblog sentiment classification based on convolution neural network with content extension method
CN117501283A (en) Text-to-question model system
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160727